
[RFC,06/17] KVM: arm64: Implement break-before-make sequence for parallel walks

Message ID 20220415215901.1737897-7-oupton@google.com (mailing list archive)
State New, archived
Series KVM: arm64: Parallelize stage 2 fault handling

Commit Message

Oliver Upton April 15, 2022, 9:58 p.m. UTC
The ARM architecture requires that software use the 'break-before-make'
sequence whenever memory is being remapped. An additional requirement of
parallel page walks is a mechanism to ensure exclusive access to a pte,
thereby avoiding two threads changing the pte and invariably stomping on
one another.

Roll the two concepts together into a new helper to implement the
'break' sequence. Use a special invalid pte value to indicate that the
pte is under the exclusive control of a thread. If software walkers are
traversing the tables in parallel, use an atomic compare-exchange to
break the pte. Retry execution on a failed attempt to break the pte, in
the hopes that either the instruction will succeed or the pte lock will
be successfully acquired.

Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
evicted pte was valid. For counted non-table ptes drop the reference
immediately. Otherwise, references on tables are dropped in post-order
traversal as the walker must recurse on the pruned subtree.

All of the new atomics do nothing (for now), as there are a few other
bits of the map walker that need to be addressed before actually walking
in parallel.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 172 +++++++++++++++++++++++++++++------
 1 file changed, 146 insertions(+), 26 deletions(-)
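
In condensed form, the 'break' and 'make' halves added by this patch work
roughly as follows (a simplified sketch of the diff below, with error
handling and the table/non-table TLBI distinction trimmed):

	kvm_pte_t old = READ_ONCE(*ptep);

	/* 'break': take exclusive ownership of the pte via a locked sentinel */
	if (stage2_pte_is_locked(old) ||
	    !kvm_try_set_pte(ptep, old, KVM_INVALID_PTE_LOCKED, shared))
		return -EAGAIN;		/* another walker owns the pte; retry */

	/* only a previously valid pte needs the DSB + TLBI */
	if (kvm_pte_valid(old)) {
		dsb(ishst);
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
	}

	/* ... prepare the new mapping ... */

	/* 'make': publish the new pte, releasing ownership of it */
	stage2_make_pte(ptep, new, data->mm_ops);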

Comments

Quentin Perret April 20, 2022, 4:55 p.m. UTC | #1
On Friday 15 Apr 2022 at 21:58:50 (+0000), Oliver Upton wrote:
> +/*
> + * Used to indicate a pte for which a 'make-before-break' sequence is in

'break-before-make' presumably :-) ?

<snip>
> +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> +{
> +	/* Yikes! We really shouldn't install to an entry we don't own. */
> +	WARN_ON(!stage2_pte_is_locked(*ptep));
> +
> +	if (stage2_pte_is_counted(new))
> +		mm_ops->get_page(ptep);
> +
> +	if (kvm_pte_valid(new)) {
> +		WRITE_ONCE(*ptep, new);
> +		dsb(ishst);
> +	} else {
> +		smp_store_release(ptep, new);
> +	}
> +}

I'm struggling a bit to understand this pattern. We currently use
smp_store_release() to install valid mappings, which this patch seems
to change. Is the behaviour change intentional? If so, could you please
share some details about the reasoning that applies here?

Thanks,
Quentin
Oliver Upton April 20, 2022, 5:06 p.m. UTC | #2
On Wed, Apr 20, 2022 at 9:55 AM Quentin Perret <qperret@google.com> wrote:
>
> On Friday 15 Apr 2022 at 21:58:50 (+0000), Oliver Upton wrote:
> > +/*
> > + * Used to indicate a pte for which a 'make-before-break' sequence is in
>
> 'break-before-make' presumably :-) ?

Gosh, I'd certainly hope so! ;)

> <snip>
> > +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> > +{
> > +     /* Yikes! We really shouldn't install to an entry we don't own. */
> > +     WARN_ON(!stage2_pte_is_locked(*ptep));
> > +
> > +     if (stage2_pte_is_counted(new))
> > +             mm_ops->get_page(ptep);
> > +
> > +     if (kvm_pte_valid(new)) {
> > +             WRITE_ONCE(*ptep, new);
> > +             dsb(ishst);
> > +     } else {
> > +             smp_store_release(ptep, new);
> > +     }
> > +}
>
> I'm struggling a bit to understand this pattern. We currently use
> smp_store_release() to install valid mappings, which this patch seems
> to change. Is the behaviour change intentional? If so, could you please
> share some details about the reasoning that applies here?

This is unintentional. We still need to do smp_store_release(),
especially considering we acquire a lock on the PTE in the break
pattern. In fact, I believe the compare-exchange could be loosened to
have only acquire semantics. What I had really meant to do here (but
goofed) is to avoid the DSB when changing between invalid PTEs.
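
In other words, the intended behaviour would look something like this
(illustrative sketch only, not posted code):

	static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new,
				    struct kvm_pgtable_mm_ops *mm_ops)
	{
		WARN_ON(!stage2_pte_is_locked(*ptep));

		if (stage2_pte_is_counted(new))
			mm_ops->get_page(ptep);

		/* release semantics publish the pte and drop the 'lock' */
		smp_store_release(ptep, new);

		/* skip the DSB for invalid->invalid transitions */
		if (kvm_pte_valid(new))
			dsb(ishst);
	}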

Thanks for the review!

--
Best,
Oliver
Ben Gardon April 21, 2022, 4:57 p.m. UTC | #3
On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> The ARM architecture requires that software use the 'break-before-make'
> sequence whenever memory is being remapped. An additional requirement of
> parallel page walks is a mechanism to ensure exclusive access to a pte,
> thereby avoiding two threads changing the pte and invariably stomping on
> one another.
>
> Roll the two concepts together into a new helper to implement the
> 'break' sequence. Use a special invalid pte value to indicate that the
> pte is under the exclusive control of a thread. If software walkers are
> traversing the tables in parallel, use an atomic compare-exchange to
> break the pte. Retry execution on a failed attempt to break the pte, in
> the hopes that either the instruction will succeed or the pte lock will
> be successfully acquired.
>
> Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> evicted pte was valid. For counted non-table ptes drop the reference
> immediately. Otherwise, references on tables are dropped in post-order
> traversal as the walker must recurse on the pruned subtree.
>
> All of the new atomics do nothing (for now), as there are a few other
> bits of the map walker that need to be addressed before actually walking
> in parallel.

I want to make sure I understand the break-before-make / PTE locking
patterns here.
Please check my understanding of the following cases:

Case 1: Change a leaf PTE (for some reason)
1. Traverse the page table to the leaf
2. Invalidate the leaf PTE, replacing it with a locked PTE
3. Flush TLBs
4. Replace the locked PTE with the new value

In this case, no need to lock the parent SPTEs, right? This is pretty simple.

Case 2:  Drop a page table
1. Traverse to some non-leaf PTE
2. Lock the PTE, invalidating it
3. Recurse into the child page table
4. Lock the PTEs in the child page table. (We need to lock ALL the
PTEs here right? I don't think we'd get away with locking only the
valid ones)
5. Flush TLBs
6. Unlock the PTE from 2
7. Free the child page after an RCU grace period (via callback)

Case 3: Drop a range of leaf PTEs
1. Traverse the page table to the first leaf
2. For each leaf in the range:
        a. Invalidate the leaf PTE, replacing it with a locked PTE
3. Flush TLBs
4. unlock the locked PTEs

In this case we have to lock ALL PTEs in the range too, right? My
worry about the whole locking scheme is making sure each thread
correctly remembers which PTEs it locked versus which might have been
locked by other threads.
On x86 we solved this by only locking one SPTE at a time, flushing,
then fixing it, but if you're locking a bunch at once it might get
complicated.
Making this locking scheme work without demolishing performance seems hard.

>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 172 +++++++++++++++++++++++++++++------
>  1 file changed, 146 insertions(+), 26 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index bf46d6d24951..059ebb921125 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -49,6 +49,12 @@
>  #define KVM_INVALID_PTE_OWNER_MASK     GENMASK(9, 2)
>  #define KVM_MAX_OWNER_ID               1
>
> +/*
> + * Used to indicate a pte for which a 'make-before-break' sequence is in
> + * progress.
> + */
> +#define KVM_INVALID_PTE_LOCKED         BIT(10)
> +
>  struct kvm_pgtable_walk_data {
>         struct kvm_pgtable              *pgt;
>         struct kvm_pgtable_walker       *walker;
> @@ -707,6 +713,122 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
>         return kvm_pte_valid(pte) || kvm_invalid_pte_owner(pte);
>  }
>
> +static bool stage2_pte_is_locked(kvm_pte_t pte)
> +{
> +       return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
> +}
> +
> +static inline bool kvm_try_set_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new, bool shared)
> +{
> +       if (!shared) {
> +               WRITE_ONCE(*ptep, new);
> +               return true;
> +       }
> +
> +       return cmpxchg(ptep, old, new) == old;
> +}
> +
> +/**
> + * stage2_try_break_pte() - Invalidates a pte according to the
> + *                         'break-before-make' sequence.
> + *
> + * @ptep: Pointer to the pte to break
> + * @old: The previously observed value of the pte; used for compare-exchange in
> + *      a parallel walk
> + * @addr: IPA corresponding to the pte
> + * @level: Table level of the pte
> + * @shared: true if the tables are shared by multiple software walkers
> + * @data: pointer to the map walker data
> + *
> + * Returns: true if the pte was successfully broken.
> + *
> + * If the removed pt was valid, performs the necessary DSB and TLB flush for
> + * the old value. Drops references to the page table if a non-table entry was
> + * removed. Otherwise, the table reference is preserved as the walker must also
> + * recurse through the child tables.
> + *
> + * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
> + * on the 'break-before-make' sequence.
> + */
> +static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old, u64 addr, u32 level, bool shared,
> +                                struct stage2_map_data *data)
> +{
> +       /*
> +        * Another thread could have already visited this pte and taken
> +        * ownership.
> +        */
> +       if (stage2_pte_is_locked(old)) {
> +               /*
> +                * If the table walker has exclusive access to the page tables
> +                * then no other software walkers should have locked the pte.
> +                */
> +               WARN_ON(!shared);
> +               return false;
> +       }
> +
> +       if (!kvm_try_set_pte(ptep, old, KVM_INVALID_PTE_LOCKED, shared))
> +               return false;
> +
> +       /*
> +        * If we removed a valid pte, break-then-make rules are in effect as a
> +        * translation may have been cached that traversed this entry.
> +        */
> +       if (kvm_pte_valid(old)) {
> +               dsb(ishst);
> +
> +               if (kvm_pte_table(old, level))
> +                       /*
> +                        * Invalidate the whole stage-2, as we may have numerous leaf
> +                        * entries below us which would otherwise need invalidating
> +                        * individually.
> +                        */
> +                       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +               else
> +                       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> +       }
> +
> +       /*
> +        * Don't drop the reference on table entries yet, as the walker must
> +        * first recurse on the unlinked subtree to unlink and drop references
> +        * to child tables.
> +        */
> +       if (!kvm_pte_table(old, level) && stage2_pte_is_counted(old))
> +               data->mm_ops->put_page(ptep);
> +
> +       return true;
> +}
> +
> +/**
> + * stage2_make_pte() - Installs a new pte according to the 'break-before-make'
> + *                    sequence.
> + *
> + * @ptep: pointer to the pte to make
> + * @new: new pte value to install
> + *
> + * Assumes that the pte addressed by ptep has already been broken and is under
> + * the ownership of the table walker. If the new pte to be installed is a valid
> + * entry, perform a DSB to make the write visible. Raise the reference count on
> + * the table if the new pte requires a reference.
> + *
> + * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
> + * on the 'break-before-make' sequence.
> + */
> +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> +{
> +       /* Yikes! We really shouldn't install to an entry we don't own. */
> +       WARN_ON(!stage2_pte_is_locked(*ptep));
> +
> +       if (stage2_pte_is_counted(new))
> +               mm_ops->get_page(ptep);
> +
> +       if (kvm_pte_valid(new)) {
> +               WRITE_ONCE(*ptep, new);
> +               dsb(ishst);
> +       } else {
> +               smp_store_release(ptep, new);
> +       }
> +}
> +
>  static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
>                            u32 level, struct kvm_pgtable_mm_ops *mm_ops)
>  {
> @@ -760,18 +882,17 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         else
>                 new = kvm_init_invalid_leaf_owner(data->owner_id);
>
> -       if (stage2_pte_is_counted(old)) {
> -               /*
> -                * Skip updating the PTE if we are trying to recreate the exact
> -                * same mapping or only change the access permissions. Instead,
> -                * the vCPU will exit one more time from guest if still needed
> -                * and then go through the path of relaxing permissions.
> -                */
> -               if (!stage2_pte_needs_update(old, new))
> -                       return -EAGAIN;
> +       /*
> +        * Skip updating the PTE if we are trying to recreate the exact same
> +        * mapping or only change the access permissions. Instead, the vCPU will
> +        * exit one more time from the guest if still needed and then go through
> +        * the path of relaxing permissions.
> +        */
> +       if (!stage2_pte_needs_update(old, new))
> +               return -EAGAIN;
>
> -               stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> -       }
> +       if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
> +               return -EAGAIN;
>
>         /* Perform CMOs before installation of the guest stage-2 PTE */
>         if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
> @@ -781,9 +902,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
>                 mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>
> -       smp_store_release(ptep, new);
> -       if (stage2_pte_is_counted(new))
> -               mm_ops->get_page(ptep);
> +       stage2_make_pte(ptep, new, data->mm_ops);
>         if (kvm_phys_is_valid(phys))
>                 data->phys += granule;
>         return 0;
> @@ -800,15 +919,10 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>         if (!stage2_leaf_mapping_allowed(addr, end, level, data))
>                 return 0;
>
> -       data->childp = kvm_pte_follow(*old, data->mm_ops);
> -       kvm_clear_pte(ptep);
> +       if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
> +               return -EAGAIN;
>
> -       /*
> -        * Invalidate the whole stage-2, as we may have numerous leaf
> -        * entries below us which would otherwise need invalidating
> -        * individually.
> -        */
> -       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +       data->childp = kvm_pte_follow(*old, data->mm_ops);
>         data->anchor = ptep;
>         return 0;
>  }
> @@ -837,18 +951,24 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>         if (!data->memcache)
>                 return -ENOMEM;
>
> +       if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
> +               return -EAGAIN;
> +
>         childp = mm_ops->zalloc_page(data->memcache);
> -       if (!childp)
> +       if (!childp) {
> +               /*
> +                * Release the pte if we were unable to install a table to allow
> +                * another thread to make an attempt.
> +                */
> +               stage2_make_pte(ptep, 0, data->mm_ops);
>                 return -ENOMEM;
> +       }
>
>         /*
>          * If we've run into an existing block mapping then replace it with
>          * a table. Accesses beyond 'end' that fall within the new table
>          * will be mapped lazily.
>          */
> -       if (stage2_pte_is_counted(*old))
> -               stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> -
>         kvm_set_table_pte(ptep, childp, mm_ops);
>         mm_ops->get_page(ptep);
>         *old = *ptep;
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>
Oliver Upton April 21, 2022, 6:52 p.m. UTC | #4
On Thu, Apr 21, 2022 at 09:57:32AM -0700, Ben Gardon wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > The ARM architecture requires that software use the 'break-before-make'
> > sequence whenever memory is being remapped. An additional requirement of
> > parallel page walks is a mechanism to ensure exclusive access to a pte,
> > thereby avoiding two threads changing the pte and invariably stomping on
> > one another.
> >
> > Roll the two concepts together into a new helper to implement the
> > 'break' sequence. Use a special invalid pte value to indicate that the
> > pte is under the exclusive control of a thread. If software walkers are
> > traversing the tables in parallel, use an atomic compare-exchange to
> > break the pte. Retry execution on a failed attempt to break the pte, in
> > the hopes that either the instruction will succeed or the pte lock will
> > be successfully acquired.
> >
> > Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> > evicted pte was valid. For counted non-table ptes drop the reference
> > immediately. Otherwise, references on tables are dropped in post-order
> > traversal as the walker must recurse on the pruned subtree.
> >
> > All of the new atomics do nothing (for now), as there are a few other
> > bits of the map walker that need to be addressed before actually walking
> > in parallel.
> 
> I want to make sure I understand the break-before-make / PTE locking
> patterns here.
> Please check my understanding of the following cases:
> 
> Case 1: Change a leaf PTE (for some reason)
> 1. Traverse the page table to the leaf
> 2. Invalidate the leaf PTE, replacing it with a locked PTE
> 3. Flush TLBs
> 4. Replace the locked PTE with the new value
> 
> In this case, no need to lock the parent SPTEs, right? This is pretty simple.

Right, if we're changing the OA of a leaf PTE. If we are just adjusting
attributes on a leaf we go through stage2_attr_walker(), which skips
step 2 and does the rest in this order: 1, 4, 3.
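
That is, for an attribute-only change the ordering is roughly (illustrative
only, not the exact stage2_attr_walker() code):

	/* step 1: traverse to the leaf, then: */
	WRITE_ONCE(*ptep, new);		/* step 4: install the new attributes */
	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);	/* step 3 */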

> Case 2:  Drop a page table
> 1. Traverse to some non-leaf PTE
> 2. Lock the PTE, invalidating it
> 3. Recurse into the child page table
> 4. Lock the PTEs in the child page table. (We need to lock ALL the
> PTEs here right? I don't think we'd get away with locking only the
> valid ones)

Right. We can just skip some of the TLBI/DSB dance when making an
invalid->invalid transition.

> 5. Flush TLBs
> 6. Unlock the PTE from 2
> 7. Free the child page after an RCU grace period (via callback)
> 
> Case 3: Drop a range of leaf PTEs
> 1. Traverse the page table to the first leaf
> 2. For each leaf in the range:
>         a. Invalidate the leaf PTE, replacing it with a locked PTE
> 3. Flush TLBs
> 4. unlock the locked PTEs
> 
> In this case we have to lock ALL PTEs in the range too, right? My
> worry about the whole locking scheme is making sure each thread
> correctly remembers which PTEs it locked versus which might have been
> locked by other threads.

Isn't exclusivity accomplished by checking what you get back from the
xchg()? If I get a locked PTE back, some other thread owns the PTE. If I
get anything else, then I've taken ownership of that PTE.
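
Concretely, ownership falls out of the compare-exchange return value (a
sketch of the pattern kvm_try_set_pte() uses above):

	if (cmpxchg(ptep, old, KVM_INVALID_PTE_LOCKED) != old)
		return false;	/* pte is owned (or was changed) by another walker */

	/* success: this walker owns the pte until it installs a new value */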

> On x86 we solved this by only locking one SPTE at a time, flushing,
> then fixing it, but if you're locking a bunch at once it might get
> complicated.
> Making this locking scheme work without demolishing performance seems hard.

We only change at most a single active PTE per fault on the stage 2 MMU.
We do one of three things on that path:

 1. Install a page/block PTE to an empty PTE
 2. Replace a table PTE with a block PTE
 3. Replace a block PTE with a table PTE

1 is pretty cheap and can skip flushes altogether.

2 only requires a single TLBI (a big, painful flush of the stage 2 context),
but child PTEs needn't be flushed.

3 also requires a single TLBI, but can be done with an IPA and level
hint.

Perhaps the answer is to push teardown into the rcu callback altogether,
IOW don't mess with links in the subtree until then. At that point
there's no need for TLBIs nor atomics.
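
Rough shape of that idea (hypothetical sketch; the callback and its plumbing
are not part of this series):

	static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/*
		 * Walk the detached subtree here, dropping references and
		 * freeing child tables; no TLBIs or atomics are needed since
		 * no walker can still observe the unlinked entries.
		 */
	}

	/* after breaking the table pte and flushing: */
	call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);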

--
Thanks,
Oliver
Sean Christopherson April 25, 2022, 3:13 p.m. UTC | #5
On Fri, Apr 15, 2022, Oliver Upton wrote:
> The ARM architecture requires that software use the 'break-before-make'
> sequence whenever memory is being remapped.

What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
huge page?
Oliver Upton April 25, 2022, 4:53 p.m. UTC | #6
On Mon, Apr 25, 2022 at 8:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 15, 2022, Oliver Upton wrote:
> > The ARM architecture requires that software use the 'break-before-make'
> > sequence whenever memory is being remapped.
>
> What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
> huge page?

Both, but in the case of this series it is mostly concerned with
promotion/demotion. I'll make this language a bit more precise next
time around.

--
Thanks,
Oliver
Sean Christopherson April 25, 2022, 6:16 p.m. UTC | #7
On Mon, Apr 25, 2022, Oliver Upton wrote:
> On Mon, Apr 25, 2022 at 8:13 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Apr 15, 2022, Oliver Upton wrote:
> > > The ARM architecture requires that software use the 'break-before-make'
> > > sequence whenever memory is being remapped.
> >
> > What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
> > huge page?
> 
> Both, but in the case of this series it is mostly concerned with
> promotion/demotion. I'll make this language a bit more precise next
> time around.

Please be very precise :-)  It matters because it should be impossible for KVM to
actually change a PFN in a valid PTE.  Callers of mmu_notifier_change_pte() are
required to bookend it with mmu_notifier_invalidate_range_start/end(), i.e. KVM
should have zapped all PTEs and should not establish new PTEs.  I'd actually like
to drop mmu_notifier_change_pte() altogether, because for all intents and purposes,
it's dead code.  But convincing "everyone" that dropping it instead of trying to
salvage it for KSM is too much work :-)
Ben Gardon April 26, 2022, 9:32 p.m. UTC | #8
On Thu, Apr 21, 2022 at 11:52 AM Oliver Upton <oupton@google.com> wrote:
>
> On Thu, Apr 21, 2022 at 09:57:32AM -0700, Ben Gardon wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > The ARM architecture requires that software use the 'break-before-make'
> > > sequence whenever memory is being remapped. An additional requirement of
> > > parallel page walks is a mechanism to ensure exclusive access to a pte,
> > > thereby avoiding two threads changing the pte and invariably stomping on
> > > one another.
> > >
> > > Roll the two concepts together into a new helper to implement the
> > > 'break' sequence. Use a special invalid pte value to indicate that the
> > > pte is under the exclusive control of a thread. If software walkers are
> > > traversing the tables in parallel, use an atomic compare-exchange to
> > > break the pte. Retry execution on a failed attempt to break the pte, in
> > > the hopes that either the instruction will succeed or the pte lock will
> > > be successfully acquired.
> > >
> > > Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> > > evicted pte was valid. For counted non-table ptes drop the reference
> > > immediately. Otherwise, references on tables are dropped in post-order
> > > traversal as the walker must recurse on the pruned subtree.
> > >
> > > All of the new atomics do nothing (for now), as there are a few other
> > > bits of the map walker that need to be addressed before actually walking
> > > in parallel.
> >
> > I want to make sure I understand the break-before-make / PTE locking
> > patterns here.
> > Please check my understanding of the following cases:
> >
> > Case 1: Change a leaf PTE (for some reason)
> > 1. Traverse the page table to the leaf
> > 2. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. Replace the locked PTE with the new value
> >
> > In this case, no need to lock the parent SPTEs, right? This is pretty simple.
>
> Right, if we're changing the OA of a leaf PTE. If we are just adjusting
> attributes on a leaf we go through stage2_attr_walker(), which skips
> step 2 and does the rest in this order: 1, 4, 3.
>
> > Case 2:  Drop a page table
> > 1. Traverse to some non-leaf PTE
> > 2. Lock the PTE, invalidating it
> > 3. Recurse into the child page table
> > 4. Lock the PTEs in the child page table. (We need to lock ALL the
> > PTEs here right? I don't think we'd get away with locking only the
> > valid ones)
>
> Right. We can just skip some of the TLBI/DSB dance when making an
> invalid->invalid transition.
>
> > 5. Flush TLBs
> > 6. Unlock the PTE from 2
> > 7. Free the child page after an RCU grace period (via callback)
> >
> > Case 3: Drop a range of leaf PTEs
> > 1. Traverse the page table to the first leaf
> > 2. For each leaf in the range:
> >         a. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. unlock the locked PTEs
> >
> > In this case we have to lock ALL PTEs in the range too, right? My
> > worry about the whole locking scheme is making sure each thread
> > correctly remembers which PTEs it locked versus which might have been
> > locked by other threads.
>
> Isn't exclusivity accomplished by checking what you get back from the
> xchg()? If I get a locked PTE back, some other thread owns the PTE. If I
> get anything else, then I've taken ownership of that PTE.

That's true if you only modify one PTE at a time, but if you want to
batch flushes by:
1. Locking a bunch of PTEs
2. TLB invalidate
3. Set them to some new value (e.g. 0)
Then you need to track which ones you locked. If you locked an entire
contiguous region that works, but you need some way to ensure you don't
mistake a pte locked by another thread for one you locked yourself.
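
For illustration, a batched version would need per-entry bookkeeping along
these lines (hypothetical sketch; nr_ptes, locked[], old[] and ptep[] are
made up for the example):

	for (i = 0; i < nr_ptes; i++)
		locked[i] = kvm_try_set_pte(&ptep[i], old[i],
					    KVM_INVALID_PTE_LOCKED, true);

	/* one batched invalidation instead of a TLBI per pte */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

	for (i = 0; i < nr_ptes; i++)
		if (locked[i])		/* only unlock ptes *this* walker locked */
			stage2_make_pte(&ptep[i], 0, data->mm_ops);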

>
> > On x86 we solved this by only locking one SPTE at a time, flushing,
> > then fixing it, but if you're locking a bunch at once it might get
> > complicated.
> > Making this locking scheme work without demolishing performance seems hard.
>
> We only change at most a single active PTE per fault on the stage 2 MMU.
> We do one of three things on that path:
>
>  1. Install a page/block PTE to an empty PTE
>  2. Replace a table PTE with a block PTE
>  3. Replace a block PTE with a table PTE
>
> 1 is pretty cheap and can skip flushes altogether.
>
> 2 only requires a single TLBI (a big, painful flush of the stage 2 context),
> but child PTEs needn't be flushed.
>
> 3 also requires a single TLBI, but can be done with an IPA and level
> hint.
>
> Perhaps the answer is to push teardown into the rcu callback altogether,
> IOW don't mess with links in the subtree until then. At that point
> there's no need for TLBIs nor atomics.
>
> --
> Thanks,
> Oliver

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bf46d6d24951..059ebb921125 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -49,6 +49,12 @@ 
 #define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
 #define KVM_MAX_OWNER_ID		1
 
+/*
+ * Used to indicate a pte for which a 'make-before-break' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED		BIT(10)
+
 struct kvm_pgtable_walk_data {
 	struct kvm_pgtable		*pgt;
 	struct kvm_pgtable_walker	*walker;
@@ -707,6 +713,122 @@  static bool stage2_pte_is_counted(kvm_pte_t pte)
 	return kvm_pte_valid(pte) || kvm_invalid_pte_owner(pte);
 }
 
+static bool stage2_pte_is_locked(kvm_pte_t pte)
+{
+	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
+}
+
+static inline bool kvm_try_set_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new, bool shared)
+{
+	if (!shared) {
+		WRITE_ONCE(*ptep, new);
+		return true;
+	}
+
+	return cmpxchg(ptep, old, new) == old;
+}
+
+/**
+ * stage2_try_break_pte() - Invalidates a pte according to the
+ *			    'break-before-make' sequence.
+ *
+ * @ptep: Pointer to the pte to break
+ * @old: The previously observed value of the pte; used for compare-exchange in
+ *	 a parallel walk
+ * @addr: IPA corresponding to the pte
+ * @level: Table level of the pte
+ * @shared: true if the tables are shared by multiple software walkers
+ * @data: pointer to the map walker data
+ *
+ * Returns: true if the pte was successfully broken.
+ *
+ * If the removed pt was valid, performs the necessary DSB and TLB flush for
+ * the old value. Drops references to the page table if a non-table entry was
+ * removed. Otherwise, the table reference is preserved as the walker must also
+ * recurse through the child tables.
+ *
+ * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
+ * on the 'break-before-make' sequence.
+ */
+static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old, u64 addr, u32 level, bool shared,
+				 struct stage2_map_data *data)
+{
+	/*
+	 * Another thread could have already visited this pte and taken
+	 * ownership.
+	 */
+	if (stage2_pte_is_locked(old)) {
+		/*
+		 * If the table walker has exclusive access to the page tables
+		 * then no other software walkers should have locked the pte.
+		 */
+		WARN_ON(!shared);
+		return false;
+	}
+
+	if (!kvm_try_set_pte(ptep, old, KVM_INVALID_PTE_LOCKED, shared))
+		return false;
+
+	/*
+	 * If we removed a valid pte, break-then-make rules are in effect as a
+	 * translation may have been cached that traversed this entry.
+	 */
+	if (kvm_pte_valid(old)) {
+		dsb(ishst);
+
+		if (kvm_pte_table(old, level))
+			/*
+			 * Invalidate the whole stage-2, as we may have numerous leaf
+			 * entries below us which would otherwise need invalidating
+			 * individually.
+			 */
+			kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+		else
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+	}
+
+	/*
+	 * Don't drop the reference on table entries yet, as the walker must
+	 * first recurse on the unlinked subtree to unlink and drop references
+	 * to child tables.
+	 */
+	if (!kvm_pte_table(old, level) && stage2_pte_is_counted(old))
+		data->mm_ops->put_page(ptep);
+
+	return true;
+}
+
+/**
+ * stage2_make_pte() - Installs a new pte according to the 'break-before-make'
+ *		       sequence.
+ *
+ * @ptep: pointer to the pte to make
+ * @new: new pte value to install
+ *
+ * Assumes that the pte addressed by ptep has already been broken and is under
+ * the ownership of the table walker. If the new pte to be installed is a valid
+ * entry, perform a DSB to make the write visible. Raise the reference count on
+ * the table if the new pte requires a reference.
+ *
+ * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
+ * on the 'break-before-make' sequence.
+ */
+static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
+{
+	/* Yikes! We really shouldn't install to an entry we don't own. */
+	WARN_ON(!stage2_pte_is_locked(*ptep));
+
+	if (stage2_pte_is_counted(new))
+		mm_ops->get_page(ptep);
+
+	if (kvm_pte_valid(new)) {
+		WRITE_ONCE(*ptep, new);
+		dsb(ishst);
+	} else {
+		smp_store_release(ptep, new);
+	}
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 			   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -760,18 +882,17 @@  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	else
 		new = kvm_init_invalid_leaf_owner(data->owner_id);
 
-	if (stage2_pte_is_counted(old)) {
-		/*
-		 * Skip updating the PTE if we are trying to recreate the exact
-		 * same mapping or only change the access permissions. Instead,
-		 * the vCPU will exit one more time from guest if still needed
-		 * and then go through the path of relaxing permissions.
-		 */
-		if (!stage2_pte_needs_update(old, new))
-			return -EAGAIN;
+	/*
+	 * Skip updating the PTE if we are trying to recreate the exact same
+	 * mapping or only change the access permissions. Instead, the vCPU will
+	 * exit one more time from the guest if still needed and then go through
+	 * the path of relaxing permissions.
+	 */
+	if (!stage2_pte_needs_update(old, new))
+		return -EAGAIN;
 
-		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
-	}
+	if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
+		return -EAGAIN;
 
 	/* Perform CMOs before installation of the guest stage-2 PTE */
 	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
@@ -781,9 +902,7 @@  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
 		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
 
-	smp_store_release(ptep, new);
-	if (stage2_pte_is_counted(new))
-		mm_ops->get_page(ptep);
+	stage2_make_pte(ptep, new, data->mm_ops);
 	if (kvm_phys_is_valid(phys))
 		data->phys += granule;
 	return 0;
@@ -800,15 +919,10 @@  static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
 
-	data->childp = kvm_pte_follow(*old, data->mm_ops);
-	kvm_clear_pte(ptep);
+	if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
+		return -EAGAIN;
 
-	/*
-	 * Invalidate the whole stage-2, as we may have numerous leaf
-	 * entries below us which would otherwise need invalidating
-	 * individually.
-	 */
-	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+	data->childp = kvm_pte_follow(*old, data->mm_ops);
 	data->anchor = ptep;
 	return 0;
 }
@@ -837,18 +951,24 @@  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	if (!data->memcache)
 		return -ENOMEM;
 
+	if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
+		return -EAGAIN;
+
 	childp = mm_ops->zalloc_page(data->memcache);
-	if (!childp)
+	if (!childp) {
+		/*
+		 * Release the pte if we were unable to install a table to allow
+		 * another thread to make an attempt.
+		 */
+		stage2_make_pte(ptep, 0, data->mm_ops);
 		return -ENOMEM;
+	}
 
 	/*
 	 * If we've run into an existing block mapping then replace it with
 	 * a table. Accesses beyond 'end' that fall within the new table
 	 * will be mapped lazily.
 	 */
-	if (stage2_pte_is_counted(*old))
-		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
-
 	kvm_set_table_pte(ptep, childp, mm_ops);
 	mm_ops->get_page(ptep);
 	*old = *ptep;