
[v2,7/7] KVM: arm64: Create a fast stage-2 unmap path

Message ID 20230206172340.2639971-8-rananta@google.com (mailing list archive)
State New, archived
Series KVM: arm64: Add support for FEAT_TLBIRANGE

Commit Message

Raghavendra Rao Ananta Feb. 6, 2023, 5:23 p.m. UTC
The current implementation of the stage-2 unmap walker
traverses the entire page-table to clear and flush the TLBs
for each entry. This could be very expensive, especially if
the VM is not backed by hugepages. The unmap operation could be
made more efficient by disconnecting the table at the very
top (the level at which the largest block mapping can be hosted)
and doing the rest of the unmapping using free_removed_table().
If the system supports FEAT_TLBIRANGE, flush the entire range
that has been disconnected from the rest of the page-table.

Suggested-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

Comments

Oliver Upton March 30, 2023, 12:42 a.m. UTC | #1
On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> The current implementation of the stage-2 unmap walker
> traverses the entire page-table to clear and flush the TLBs
> for each entry. This could be very expensive, especially if
> the VM is not backed by hugepages. The unmap operation could be
> made efficient by disconnecting the table at the very
> top (level at which the largest block mapping can be hosted)
> and do the rest of the unmapping using free_removed_table().
> If the system supports FEAT_TLBIRANGE, flush the entire range
> that has been disconnected from the rest of the page-table.
> 
> Suggested-by: Ricardo Koller <ricarkol@google.com>
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0858d1fa85d6b..af3729d0971f2 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	return 0;
>  }
>  
> +/*
> + * The fast walker executes only if the unmap size is exactly equal to the
> + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> + * be disconnected from the rest of the page-table without the need to
> + * traverse all the PTEs, at all the levels, and unmap each and every one
> + * of them. The disconnected table is freed using free_removed_table().
> + */
> +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +			       enum kvm_pgtable_walk_flags visit)
> +{
> +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> +	kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> +	struct kvm_s2_mmu *mmu = ctx->arg;
> +
> +	if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> +		return 0;
> +
> +	if (!stage2_try_break_pte(ctx, mmu))
> +		return -EAGAIN;
> +
> +	/*
> +	 * Gain back a reference for stage2_unmap_walker() to free
> +	 * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> +	 */
> +	mm_ops->get_page(ctx->ptep);

Doesn't this run the risk of a potential UAF if the refcount was 1 before
calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
the refcount to 0 on the page before this ever gets called.
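To make the ordering concrete, here is a toy userspace model of the
refcount sequence in question (plain C; none of this is KVM code, and
get_page()/put_page() here are just stand-ins for the mm_ops callbacks):

#include <stdio.h>
#include <stdlib.h>

struct page {
	int refcount;
};

static void get_page(struct page *p)
{
	p->refcount++;			/* UAF if p was already freed */
}

static void put_page(struct page **p)
{
	if (--(*p)->refcount == 0) {
		free(*p);		/* the table page goes away here */
		*p = NULL;
	}
}

int main(void)
{
	struct page *table = malloc(sizeof(*table));

	table->refcount = 1;

	/*
	 * Hazardous order: dropping the last reference first frees the
	 * page, so a subsequent get_page(table) would touch freed memory:
	 *
	 *	put_page(&table);
	 *	get_page(table);	<-- use-after-free
	 *
	 * Safe order: pin the page before the step that may drop it to 0.
	 */
	get_page(table);		/* 1 -> 2 */
	put_page(&table);		/* 2 -> 1, page still live */
	printf("refcount after the safe ordering: %d\n", table->refcount);

	put_page(&table);		/* last reference dropped, page freed */
	return 0;
}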

Also, AFAICT this misses the CMOs that are required on systems w/o
FEAT_FWB. Without them it is possible that the host will read something
other than what was most recently written by the guest if it is using
noncacheable memory attributes at stage-1.

I imagine the actual bottleneck is the DSB required after every
CMO/TLBI. Theoretically, the unmap path could be updated to:

 - Perform the appropriate CMOs for every valid leaf entry *without*
   issuing a DSB.

 - Elide TLBIs entirely that take place in the middle of the walk

 - After the walk completes, dsb(ish) to guarantee that the CMOs have
   completed and the invalid PTEs are made visible to the hardware
   walkers. This should be done implicitly by the TLBI implementation

 - Invalidate the [addr, addr + size) range of IPAs

This would also avoid over-invalidating stage-1 since we blast the
entire stage-1 context for every stage-2 invalidation. Thoughts?
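A rough sketch of the ordering listed above, with stub functions
standing in for the real CMO/DSB/TLBI primitives (this is not the
actual pgtable.c code, just an illustration of where each piece of
maintenance would sit relative to the walk):

#include <stdint.h>
#include <stdio.h>

#define NR_PTES		512
#define GRANULE		4096UL

static uint64_t ptes[NR_PTES];

/* Stubs standing in for dcache maintenance, barriers and TLBIs. */
static void cmo_clean_inval(int idx)	{ (void)idx; /* no DSB here */ }
static void dsb_ish(void)		{ printf("dsb(ish)\n"); }
static void tlbi_ipa_range(uint64_t addr, uint64_t size)
{
	printf("range TLBI for IPAs [%#lx, %#lx)\n",
	       (unsigned long)addr, (unsigned long)(addr + size));
}

static void unmap_batched(uint64_t addr, uint64_t size)
{
	/* 1) Clear the PTEs and do per-entry CMOs, no DSB or TLBI yet. */
	for (int i = 0; i < NR_PTES; i++) {
		if (!ptes[i])
			continue;
		ptes[i] = 0;
		cmo_clean_inval(i);
	}

	/*
	 * 2) A single barrier: the CMOs complete and the now-invalid
	 *    PTEs become visible to the hardware walkers.
	 */
	dsb_ish();

	/* 3) One range-based invalidation covering [addr, addr + size). */
	tlbi_ipa_range(addr, size);
}

int main(void)
{
	for (int i = 0; i < NR_PTES; i++)
		ptes[i] = 1;		/* pretend every entry is a valid leaf */

	unmap_batched(0x40000000UL, NR_PTES * GRANULE);
	return 0;
}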

> +	mm_ops->free_removed_table(childp, ctx->level);
> +	return 0;
> +}
> +
Raghavendra Rao Ananta April 4, 2023, 5:52 p.m. UTC | #2
On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > The current implementation of the stage-2 unmap walker
> > traverses the entire page-table to clear and flush the TLBs
> > for each entry. This could be very expensive, especially if
> > the VM is not backed by hugepages. The unmap operation could be
> > made efficient by disconnecting the table at the very
> > top (level at which the largest block mapping can be hosted)
> > and do the rest of the unmapping using free_removed_table().
> > If the system supports FEAT_TLBIRANGE, flush the entire range
> > that has been disconnected from the rest of the page-table.
> >
> > Suggested-by: Ricardo Koller <ricarkol@google.com>
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > ---
> >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0858d1fa85d6b..af3729d0971f2 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >       return 0;
> >  }
> >
> > +/*
> > + * The fast walker executes only if the unmap size is exactly equal to the
> > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > + * be disconnected from the rest of the page-table without the need to
> > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > + * of them. The disconnected table is freed using free_removed_table().
> > + */
> > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > +                            enum kvm_pgtable_walk_flags visit)
> > +{
> > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > +
> > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > +             return 0;
> > +
> > +     if (!stage2_try_break_pte(ctx, mmu))
> > +             return -EAGAIN;
> > +
> > +     /*
> > +      * Gain back a reference for stage2_unmap_walker() to free
> > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > +      */
> > +     mm_ops->get_page(ctx->ptep);
>
> Doesn't this run the risk of a potential UAF if the refcount was 1 before
> calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> the refcount to 0 on the page before this ever gets called.
>
> Also, AFAICT this misses the CMOs that are required on systems w/o
> FEAT_FWB. Without them it is possible that the host will read something
> other than what was most recently written by the guest if it is using
> noncacheable memory attributes at stage-1.
>
> I imagine the actual bottleneck is the DSB required after every
> CMO/TLBI. Theoretically, the unmap path could be updated to:
>
>  - Perform the appropriate CMOs for every valid leaf entry *without*
>    issuing a DSB.
>
>  - Elide TLBIs entirely that take place in the middle of the walk
>
>  - After the walk completes, dsb(ish) to guarantee that the CMOs have
>    completed and the invalid PTEs are made visible to the hardware
>    walkers. This should be done implicitly by the TLBI implementation
>
>  - Invalidate the [addr, addr + size) range of IPAs
>
> This would also avoid over-invalidating stage-1 since we blast the
> entire stage-1 context for every stage-2 invalidation. Thoughts?
>
Correct me if I'm wrong, but if we invalidate the TLB after the walk
is complete, don't you think there's a risk of a race where the guest
can hit in the TLB even though the page was unmapped?

Thanks,
Raghavendra

> > +     mm_ops->free_removed_table(childp, ctx->level);
> > +     return 0;
> > +}
> > +
>
> --
> Thanks,
> Oliver
Oliver Upton April 4, 2023, 7:19 p.m. UTC | #3
On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > The current implementation of the stage-2 unmap walker
> > > traverses the entire page-table to clear and flush the TLBs
> > > for each entry. This could be very expensive, especially if
> > > the VM is not backed by hugepages. The unmap operation could be
> > > made efficient by disconnecting the table at the very
> > > top (level at which the largest block mapping can be hosted)
> > > and do the rest of the unmapping using free_removed_table().
> > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > that has been disconnected from the rest of the page-table.
> > >
> > > Suggested-by: Ricardo Koller <ricarkol@google.com>
> > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > ---
> > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 44 insertions(+)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >       return 0;
> > >  }
> > >
> > > +/*
> > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > + * be disconnected from the rest of the page-table without the need to
> > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > + * of them. The disconnected table is freed using free_removed_table().
> > > + */
> > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > +                            enum kvm_pgtable_walk_flags visit)
> > > +{
> > > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > > +
> > > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > +             return 0;
> > > +
> > > +     if (!stage2_try_break_pte(ctx, mmu))
> > > +             return -EAGAIN;
> > > +
> > > +     /*
> > > +      * Gain back a reference for stage2_unmap_walker() to free
> > > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > +      */
> > > +     mm_ops->get_page(ctx->ptep);
> >
> > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > the refcount to 0 on the page before this ever gets called.
> >
> > Also, AFAICT this misses the CMOs that are required on systems w/o
> > FEAT_FWB. Without them it is possible that the host will read something
> > other than what was most recently written by the guest if it is using
> > noncacheable memory attributes at stage-1.
> >
> > I imagine the actual bottleneck is the DSB required after every
> > CMO/TLBI. Theoretically, the unmap path could be updated to:
> >
> >  - Perform the appropriate CMOs for every valid leaf entry *without*
> >    issuing a DSB.
> >
> >  - Elide TLBIs entirely that take place in the middle of the walk
> >
> >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> >    completed and the invalid PTEs are made visible to the hardware
> >    walkers. This should be done implicitly by the TLBI implementation
> >
> >  - Invalidate the [addr, addr + size) range of IPAs
> >
> > This would also avoid over-invalidating stage-1 since we blast the
> > entire stage-1 context for every stage-2 invalidation. Thoughts?
> >
> Correct me if I'm wrong, but if we invalidate the TLB after the walk
> is complete, don't you think there's a risk of race if the guest can
> hit in the TLB even though the page was unmapped?

Yeah, we'd need to do the CMOs _after_ making the translation invalid in
the page tables and completing the TLB invalidation. Apologies.

Otherwise, the only requirement we need to uphold w/ either the MMU
notifiers or userspace is that the translation has been invalidated at
the time of return.
Raghavendra Rao Ananta April 4, 2023, 9:07 p.m. UTC | #4
On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > The current implementation of the stage-2 unmap walker
> > > > traverses the entire page-table to clear and flush the TLBs
> > > > for each entry. This could be very expensive, especially if
> > > > the VM is not backed by hugepages. The unmap operation could be
> > > > made efficient by disconnecting the table at the very
> > > > top (level at which the largest block mapping can be hosted)
> > > > and do the rest of the unmapping using free_removed_table().
> > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > that has been disconnected from the rest of the page-table.
> > > >
> > > > Suggested-by: Ricardo Koller <ricarkol@google.com>
> > > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > > ---
> > > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 44 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > >       return 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > + * be disconnected from the rest of the page-table without the need to
> > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > + */
> > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > +                            enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > > > +
> > > > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > +             return 0;
> > > > +
> > > > +     if (!stage2_try_break_pte(ctx, mmu))
> > > > +             return -EAGAIN;
> > > > +
> > > > +     /*
> > > > +      * Gain back a reference for stage2_unmap_walker() to free
> > > > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > +      */
> > > > +     mm_ops->get_page(ctx->ptep);
> > >
> > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > the refcount to 0 on the page before this ever gets called.
> > >
> > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > FEAT_FWB. Without them it is possible that the host will read something
> > > other than what was most recently written by the guest if it is using
> > > noncacheable memory attributes at stage-1.
> > >
> > > I imagine the actual bottleneck is the DSB required after every
> > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > >
> > >  - Perform the appropriate CMOs for every valid leaf entry *without*
> > >    issuing a DSB.
> > >
> > >  - Elide TLBIs entirely that take place in the middle of the walk
> > >
> > >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > >    completed and the invalid PTEs are made visible to the hardware
> > >    walkers. This should be done implicitly by the TLBI implementation
> > >
> > >  - Invalidate the [addr, addr + size) range of IPAs
> > >
> > > This would also avoid over-invalidating stage-1 since we blast the
> > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > >
> > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > is complete, don't you think there's a risk of race if the guest can
> > hit in the TLB even though the page was unmapped?
>
> Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> the page tables and completing the TLB invalidation. Apologies.
>
> Otherwise, the only requirement we need to uphold w/ either the MMU
> notifiers or userspace is that the translation has been invalidated at
> the time of return.
>
Actually, my concern about the race was against the hardware. If we
follow the above approach, let's say we invalidated a certain set of
PTEs, but the TLBs aren't yet invalidated. During this window, if
another vCPU accesses the range governed by the invalidated PTEs,
wouldn't it still hit in the TLB? Have I misunderstood you or am I
missing something?

Thank you.
Raghavendra
> --
> Thanks,
> Oliver
Oliver Upton April 4, 2023, 9:30 p.m. UTC | #5
On Tue, Apr 04, 2023 at 02:07:06PM -0700, Raghavendra Rao Ananta wrote:
> On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > >
> > > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > > The current implementation of the stage-2 unmap walker
> > > > > traverses the entire page-table to clear and flush the TLBs
> > > > > for each entry. This could be very expensive, especially if
> > > > > the VM is not backed by hugepages. The unmap operation could be
> > > > > made efficient by disconnecting the table at the very
> > > > > top (level at which the largest block mapping can be hosted)
> > > > > and do the rest of the unmapping using free_removed_table().
> > > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > > that has been disconnected from the rest of the page-table.
> > > > >
> > > > > Suggested-by: Ricardo Koller <ricarkol@google.com>
> > > > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > > > ---
> > > > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 44 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > > + * be disconnected from the rest of the page-table without the need to
> > > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > > + */
> > > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > +                            enum kvm_pgtable_walk_flags visit)
> > > > > +{
> > > > > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > > > > +
> > > > > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > > +             return 0;
> > > > > +
> > > > > +     if (!stage2_try_break_pte(ctx, mmu))
> > > > > +             return -EAGAIN;
> > > > > +
> > > > > +     /*
> > > > > +      * Gain back a reference for stage2_unmap_walker() to free
> > > > > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > > +      */
> > > > > +     mm_ops->get_page(ctx->ptep);
> > > >
> > > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > > the refcount to 0 on the page before this ever gets called.
> > > >
> > > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > > FEAT_FWB. Without them it is possible that the host will read something
> > > > other than what was most recently written by the guest if it is using
> > > > noncacheable memory attributes at stage-1.
> > > >
> > > > I imagine the actual bottleneck is the DSB required after every
> > > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > > >
> > > >  - Perform the appropriate CMOs for every valid leaf entry *without*
> > > >    issuing a DSB.
> > > >
> > > >  - Elide TLBIs entirely that take place in the middle of the walk
> > > >
> > > >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > >    completed and the invalid PTEs are made visible to the hardware
> > > >    walkers. This should be done implicitly by the TLBI implementation
> > > >
> > > >  - Invalidate the [addr, addr + size) range of IPAs
> > > >
> > > > This would also avoid over-invalidating stage-1 since we blast the
> > > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > > >
> > > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > > is complete, don't you think there's a risk of race if the guest can
> > > hit in the TLB even though the page was unmapped?
> >
> > Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> > the page tables and completing the TLB invalidation. Apologies.
> >
> > Otherwise, the only requirement we need to uphold w/ either the MMU
> > notifiers or userspace is that the translation has been invalidated at
> > the time of return.
> >
> Actually, my concern about the race was against the hardware. If we
> follow the above approach, let's say we invalidated a certain set of
> PTEs, but the TLBs aren't yet invalidated. During this point if
> another vCPU accesses the range governed by the invalidated PTEs,
> wouldn't it still hit in the TLB? Have I misunderstood you or am I
> missing something?

Yep, that's exactly what would happen. There is no way to eliminate the
race you mention; there will always be a window of time where the page
tables no longer contain a particular translation but the TLBs may still
be holding a valid entry.

This race is benign so long as we guarantee that all translations for
the affected address (i.e. in the page tables, cached in a TLB) have
been invalidated before returning to the caller. For example, MM cannot
start swapping out guest memory until it is guaranteed that the guest is
no longer writing to it.
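
As a toy model of that contract (again, not kernel code): the race
window is tolerated inside the unmap operation, but by the time it
returns, neither the page tables nor the TLBs translate the range, so a
caller such as an MMU-notifier-driven swap path can rely on the guest
no longer touching the memory:

#include <stdbool.h>
#include <stdio.h>

static bool pte_valid = true;
static bool tlb_entry_valid = true;

static void unmap_range(void)
{
	pte_valid = false;
	/*
	 * Benign window: the PTE is gone but a stale TLB entry may still
	 * translate the address for another vCPU.
	 */
	tlb_entry_valid = false;	/* TLBI + DSB in the real code */
}

int main(void)
{
	unmap_range();

	/* The invariant the caller relies on at the point of return: */
	if (!pte_valid && !tlb_entry_valid)
		printf("no translation left; safe to swap out the page\n");
	return 0;
}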
Raghavendra Rao Ananta April 4, 2023, 9:45 p.m. UTC | #6
On Tue, Apr 4, 2023 at 2:31 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Apr 04, 2023 at 02:07:06PM -0700, Raghavendra Rao Ananta wrote:
> > On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > > > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > >
> > > > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > > > The current implementation of the stage-2 unmap walker
> > > > > > traverses the entire page-table to clear and flush the TLBs
> > > > > > for each entry. This could be very expensive, especially if
> > > > > > the VM is not backed by hugepages. The unmap operation could be
> > > > > > made efficient by disconnecting the table at the very
> > > > > > top (level at which the largest block mapping can be hosted)
> > > > > > and do the rest of the unmapping using free_removed_table().
> > > > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > > > that has been disconnected from the rest of the page-table.
> > > > > >
> > > > > > Suggested-by: Ricardo Koller <ricarkol@google.com>
> > > > > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > > > > ---
> > > > > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 44 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > +/*
> > > > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > > > + * be disconnected from the rest of the page-table without the need to
> > > > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > > > + */
> > > > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > > +                            enum kvm_pgtable_walk_flags visit)
> > > > > > +{
> > > > > > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > > > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > > > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > > > > > +
> > > > > > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > > > +             return 0;
> > > > > > +
> > > > > > +     if (!stage2_try_break_pte(ctx, mmu))
> > > > > > +             return -EAGAIN;
> > > > > > +
> > > > > > +     /*
> > > > > > +      * Gain back a reference for stage2_unmap_walker() to free
> > > > > > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > > > +      */
> > > > > > +     mm_ops->get_page(ctx->ptep);
> > > > >
> > > > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > > > the refcount to 0 on the page before this ever gets called.
> > > > >
> > > > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > > > FEAT_FWB. Without them it is possible that the host will read something
> > > > > other than what was most recently written by the guest if it is using
> > > > > noncacheable memory attributes at stage-1.
> > > > >
> > > > > I imagine the actual bottleneck is the DSB required after every
> > > > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > > > >
> > > > >  - Perform the appropriate CMOs for every valid leaf entry *without*
> > > > >    issuing a DSB.
> > > > >
> > > > >  - Elide TLBIs entirely that take place in the middle of the walk
> > > > >
> > > > >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > > >    completed and the invalid PTEs are made visible to the hardware
> > > > >    walkers. This should be done implicitly by the TLBI implementation
> > > > >
> > > > >  - Invalidate the [addr, addr + size) range of IPAs
> > > > >
> > > > > This would also avoid over-invalidating stage-1 since we blast the
> > > > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > > > >
> > > > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > > > is complete, don't you think there's a risk of race if the guest can
> > > > hit in the TLB even though the page was unmapped?
> > >
> > > Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> > > the page tables and completing the TLB invalidation. Apologies.
> > >
> > > Otherwise, the only requirement we need to uphold w/ either the MMU
> > > notifiers or userspace is that the translation has been invalidated at
> > > the time of return.
> > >
> > Actually, my concern about the race was against the hardware. If we
> > follow the above approach, let's say we invalidated a certain set of
> > PTEs, but the TLBs aren't yet invalidated. During this point if
> > another vCPU accesses the range governed by the invalidated PTEs,
> > wouldn't it still hit in the TLB? Have I misunderstood you or am I
> > missing something?
>
> Yep, that's exactly what would happen. There is no way to eliminate the
> race you mention, there will always be a window of time where the page
> tables no longer contain a particular translation but the TLBs may still
> be holding a valid entry.
>
This is new to me :)

> This race is benign so long as we guarantee that all translations for
> the affected address (i.e. in the page tables, cached in a TLB) have
> been invalidated before returning to the caller. For example, MM cannot
> start swapping out guest memory until it is guaranteed that the guest is
> no longer writing to it.
>
Well, if you feel that the risk is acceptable, we can probably defer
the invalidations until the unmap walk is finished.

Thank you.
Raghavendra

> --
> Thanks,
> Oliver

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0858d1fa85d6b..af3729d0971f2 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1017,6 +1017,49 @@  static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	return 0;
 }
 
+/*
+ * The fast walker executes only if the unmap size is exactly equal to the
+ * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
+ * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
+ * be disconnected from the rest of the page-table without the need to
+ * traverse all the PTEs, at all the levels, and unmap each and every one
+ * of them. The disconnected table is freed using free_removed_table().
+ */
+static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
+			       enum kvm_pgtable_walk_flags visit)
+{
+	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
+	struct kvm_s2_mmu *mmu = ctx->arg;
+
+	if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
+		return 0;
+
+	if (!stage2_try_break_pte(ctx, mmu))
+		return -EAGAIN;
+
+	/*
+	 * Gain back a reference for stage2_unmap_walker() to free
+	 * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
+	 */
+	mm_ops->get_page(ctx->ptep);
+
+	mm_ops->free_removed_table(childp, ctx->level);
+	return 0;
+}
+
+static void kvm_pgtable_try_fast_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb	= fast_stage2_unmap_walker,
+		.arg	= pgt->mmu,
+		.flags	= KVM_PGTABLE_WALK_TABLE_PRE,
+	};
+
+	if (size == kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL))
+		kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
 int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
 	struct kvm_pgtable_walker walker = {
@@ -1025,6 +1068,7 @@  int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
 	};
 
+	kvm_pgtable_try_fast_stage2_unmap(pgt, addr, size);
 	return kvm_pgtable_walk(pgt, addr, size, &walker);
 }