[1/3] KVM: arm64: Don't defer TLB invalidation when zapping table entries

Message ID 20240325185158.8565-2-will@kernel.org (mailing list archive)
State New, archived
Series KVM: arm64: TLBI fixes for the pgtable code

Commit Message

Will Deacon March 25, 2024, 6:51 p.m. UTC
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches. For example,
stage2_unmap_walker() will free page-table pages:

	if (childp)
		mm_ops->put_page(childp);

and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():

	if (stage2_unmap_defer_tlb_flush(pgt))
		/* Perform the deferred TLB invalidations */
		kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);

For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry.

Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kvm/hyp/pgtable.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Oliver Upton March 26, 2024, 8:34 a.m. UTC | #1
Hey Will,

On Mon, Mar 25, 2024 at 06:51:56PM +0000, Will Deacon wrote:
> Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
> unmap") introduced deferred TLB invalidation for the stage-2 page-table
> so that range-based invalidation can be used for the accumulated
> addresses. This works fine if the structure of the page-tables remains
> unchanged, but if entire tables are zapped and subsequently freed then
> we transiently leave the hardware page-table walker with a reference
> to freed memory thanks to the translation walk caches.

Yikes! Well spotted. This is rather unfortunate because the unmap path
has been found to be a massive pain w/o aggregating invalidations. But
sacrificing correctness in the name of performance... No thanks :)

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3fae5830f8d2..de0b667ba296 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -896,9 +896,11 @@ static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
>  	if (kvm_pte_valid(ctx->old)) {
>  		kvm_clear_pte(ctx->ptep);
>  
> -		if (!stage2_unmap_defer_tlb_flush(pgt))
> +		if (!stage2_unmap_defer_tlb_flush(pgt) ||
> +		    kvm_pte_table(ctx->old, ctx->level)) {
>  			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
>  					ctx->addr, ctx->level);
> +		}

I'm not sure this is correct, though. My understanding of TTL is that
we're telling hardware where the *leaf* entry we're invalidating is
found; here, however, we know that the addressed PTE is a table entry.

So maybe in the case of a table PTE this invalidation should use
TLBI_TTL_UNKNOWN.
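
Something along these lines, perhaps (an untested sketch; it assumes the
level argument of __kvm_tlb_flush_vmid_ipa() can carry TLBI_TTL_UNKNOWN
to mean "no TTL hint"):

	if (kvm_pte_table(ctx->old, ctx->level)) {
		/*
		 * TTL can only describe the level of a leaf entry, so
		 * drop the hint when zapping a table entry.
		 */
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
				ctx->addr, TLBI_TTL_UNKNOWN);
	} else if (!stage2_unmap_defer_tlb_flush(pgt)) {
		/* Leaf entries keep the eager, level-hinted TLBI */
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
				ctx->addr, ctx->level);
	}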

>  	}
>  
>  	mm_ops->put_page(ctx->ptep);

At least for the 'normal' MMU where we use RCU, this could be changed to
->free_unlinked_table() which would defer the freeing of memory til
after the invalidation completes. But that still hoses pKVM's stage-2
MMU freeing in-place.
Oliver Upton March 26, 2024, 2:31 p.m. UTC | #2
On Tue, Mar 26, 2024 at 01:34:17AM -0700, Oliver Upton wrote:
> >  	}
> >  
> >  	mm_ops->put_page(ctx->ptep);
> 
> At least for the 'normal' MMU where we use RCU, this could be changed to
> ->free_unlinked_table() which would defer the freeing of memory til
> after the invalidation completes. But that still hoses pKVM's stage-2
> MMU freeing in-place.

How about this (untested) diff? I _think_ it should address the
invalidation issue while leaving the performance optimization in place
for a 'normal' stage-2.

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..896fdc0d157d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -872,14 +872,19 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
 static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
 {
 	/*
-	 * If FEAT_TLBIRANGE is implemented, defer the individual
-	 * TLB invalidations until the entire walk is finished, and
-	 * then use the range-based TLBI instructions to do the
-	 * invalidations. Condition deferred TLB invalidation on the
-	 * system supporting FWB as the optimization is entirely
-	 * pointless when the unmap walker needs to perform CMOs.
+	 * It is possible to use FEAT_TLBIRANGE to do TLB invalidations at the
+	 * end of the walk if certain conditions are met:
+	 *
+	 *  - The stage-2 is for a 'normal' VM (i.e. managed in the kernel
+	 *    context). RCU provides sufficient guarantees to ensure that all
+	 *    hardware and software references on the stage-2 page tables are
+	 *    relinquished before freeing a table page.
+	 *
+	 *  - The system supports FEAT_FWB. Otherwise, KVM needs to do CMOs
+	 *    during the page table walk.
 	 */
-	return system_supports_tlb_range() && stage2_has_fwb(pgt);
+	return !is_hyp_code() && system_supports_tlb_range() &&
+		stage2_has_fwb(pgt);
 }
 
 static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
@@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 					       kvm_granule_size(ctx->level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_unlinked_table(childp, ctx->level);
 
 	return 0;
 }
Will Deacon March 26, 2024, 4:10 p.m. UTC | #3
On Tue, Mar 26, 2024 at 07:31:27AM -0700, Oliver Upton wrote:
> On Tue, Mar 26, 2024 at 01:34:17AM -0700, Oliver Upton wrote:
> > >  	}
> > >  
> > >  	mm_ops->put_page(ctx->ptep);
> > 
> > At least for the 'normal' MMU where we use RCU, this could be changed to
> > ->free_unlinked_table() which would defer the freeing of memory til
> > after the invalidation completes. But that still hoses pKVM's stage-2
> > MMU freeing in-place.
> 
> How about this (untested) diff? I _think_ it should address the
> invalidation issue while leaving the performance optimization in place
> for a 'normal' stage-2.
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3fae5830f8d2..896fdc0d157d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -872,14 +872,19 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
>  static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
>  {
>  	/*
> -	 * If FEAT_TLBIRANGE is implemented, defer the individual
> -	 * TLB invalidations until the entire walk is finished, and
> -	 * then use the range-based TLBI instructions to do the
> -	 * invalidations. Condition deferred TLB invalidation on the
> -	 * system supporting FWB as the optimization is entirely
> -	 * pointless when the unmap walker needs to perform CMOs.
> +	 * It is possible to use FEAT_TLBIRANGE to do TLB invalidations at the
> +	 * end of the walk if certain conditions are met:
> +	 *
> +	 *  - The stage-2 is for a 'normal' VM (i.e. managed in the kernel
> +	 *    context). RCU provides sufficient guarantees to ensure that all
> +	 *    hardware and software references on the stage-2 page tables are
> +	 *    relinquished before freeing a table page.
> +	 *
> +	 *  - The system supports FEAT_FWB. Otherwise, KVM needs to do CMOs
> +	 *    during the page table walk.
>  	 */
> -	return system_supports_tlb_range() && stage2_has_fwb(pgt);
> +	return !is_hyp_code() && system_supports_tlb_range() &&
> +		stage2_has_fwb(pgt);
>  }
>  
>  static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> @@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  					       kvm_granule_size(ctx->level));
>  
>  	if (childp)
> -		mm_ops->put_page(childp);
> +		mm_ops->free_unlinked_table(childp, ctx->level);

Hmm, but doesn't the deferred TLBI still happen after the RCU critical
section?
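
i.e. the ordering today is roughly (paraphrasing kvm_pgtable_stage2_unmap(),
so the names below are just the existing ones from pgtable.c):

	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
		/* -> kvm_pgtable_walk_begin(): rcu_read_lock() for shared walks */
		/* -> unmap walker zaps entries and drops table pages */
		/* -> kvm_pgtable_walk_end(): rcu_read_unlock() */

	if (stage2_unmap_defer_tlb_flush(pgt))
		/* Perform the deferred TLB invalidations */
		kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);

so the range-based invalidation is only issued once we're back outside
the read-side critical section.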

I also think I found another bug, so I'll send a v2 with an extra patch...

Will
Oliver Upton March 26, 2024, 4:14 p.m. UTC | #4
On Tue, Mar 26, 2024 at 04:10:01PM +0000, Will Deacon wrote:
> > @@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >  					       kvm_granule_size(ctx->level));
> >  
> >  	if (childp)
> > -		mm_ops->put_page(childp);
> > +		mm_ops->free_unlinked_table(childp, ctx->level);
> 
> Hmm, but doesn't the deferred TLBI still happen after the RCU critical
> section?

Right... We'd need to change that too :) Although I suppose table
invalidations are relatively infrequent when compared to leaf
invalidations.

> I also think I found another bug, so I'll send a v2 with an extra patch...

And I'm sure there's plenty more to be found too, heh.

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..de0b667ba296 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -896,9 +896,11 @@ static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
 	if (kvm_pte_valid(ctx->old)) {
 		kvm_clear_pte(ctx->ptep);
 
-		if (!stage2_unmap_defer_tlb_flush(pgt))
+		if (!stage2_unmap_defer_tlb_flush(pgt) ||
+		    kvm_pte_table(ctx->old, ctx->level)) {
 			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
 					ctx->addr, ctx->level);
+		}
 	}
 
 	mm_ops->put_page(ctx->ptep);