diff mbox series

[v3,5/7] KVM: x86: Participate in bitmap-based PTE aging

Message ID 20240401232946.1837665-6-jthoughton@google.com (mailing list archive)
State New
Headers show
Series mm/kvm: Improve parallelism for access bit harvesting | expand

Commit Message

James Houghton April 1, 2024, 11:29 p.m. UTC
Only handle the TDP MMU case for now. In other cases, if a bitmap was
not provided, fallback to the slowpath that takes mmu_lock, or, if a
bitmap was provided, inform the caller that the bitmap is unreliable.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
 arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
 3 files changed, 37 insertions(+), 3 deletions(-)

Comments

David Matlack April 11, 2024, 5:08 p.m. UTC | #1
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
>  arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
>  3 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3b58e2306621..c30918d0887e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
>   */
>  #define KVM_EXIT_HYPERCALL_MBZ		GENMASK_ULL(31, 1)
>  
> +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> +{
> +	/*
> +	 * Indicate that we support bitmap-based aging when using the TDP MMU
> +	 * and the accessed bit is available in the TDP page tables.
> +	 *
> +	 * We have no other preparatory work to do here, so we do not need to
> +	 * redefine kvm_arch_finish_bitmap_age().
> +	 */
> +	return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> +					 && shadow_accessed_mask;
> +}
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..fae1a75750bb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
>  
> -	if (kvm_memslots_have_rmaps(kvm))
> +	if (kvm_memslots_have_rmaps(kvm)) {
> +		if (range->lockless) {
> +			kvm_age_set_unreliable(range);
> +			return false;
> +		}

If a VM has TDP MMU enabled, supports A/D bits, and is using nested
virtualization, MGLRU will effectively be blind to all accesses made by
the VM.

kvm_arch_prepare_bitmap_age() will return true indicating that the
bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
return false immediately and indicate the bitmap is unreliable because a
shadow root is allocate. The notfier will then return
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
consumed or used. So I think MGLRU will assume all memory is
unaccessed?

One way to improve the situation would be to re-order the TDP MMU
function first and return young instead of false, so that way MGLRU at
least has visibility into accesses made by L1 (and L2 if EPT is disable
in L2). But that still means MGLRU is blind to accesses made by L2.

What about grabbing the mmu_lock if there's a shadow root allocated and
get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?

	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

The TDP MMU walk would still be lockless. KVM only has to take the
mmu_lock to collect accesses made by L2.

kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
as well, but that seems relatively simple with the helper functions.
David Matlack April 11, 2024, 5:28 p.m. UTC | #2
On 2024-04-11 10:08 AM, David Matlack wrote:
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
> > 
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> >  arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
> >  3 files changed, 37 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 3b58e2306621..c30918d0887e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> >   */
> >  #define KVM_EXIT_HYPERCALL_MBZ		GENMASK_ULL(31, 1)
> >  
> > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > +{
> > +	/*
> > +	 * Indicate that we support bitmap-based aging when using the TDP MMU
> > +	 * and the accessed bit is available in the TDP page tables.
> > +	 *
> > +	 * We have no other preparatory work to do here, so we do not need to
> > +	 * redefine kvm_arch_finish_bitmap_age().
> > +	 */
> > +	return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > +					 && shadow_accessed_mask;
> > +}
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 992e651540e8..fae1a75750bb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >  	bool young = false;
> >  
> > -	if (kvm_memslots_have_rmaps(kvm))
> > +	if (kvm_memslots_have_rmaps(kvm)) {
> > +		if (range->lockless) {
> > +			kvm_age_set_unreliable(range);
> > +			return false;
> > +		}
> 
> If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> virtualization, MGLRU will effectively be blind to all accesses made by
> the VM.
> 
> kvm_arch_prepare_bitmap_age() will return true indicating that the
> bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> return false immediately and indicate the bitmap is unreliable because a
> shadow root is allocate. The notfier will then return
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> consumed or used. So I think MGLRU will assume all memory is
> unaccessed?
> 
> One way to improve the situation would be to re-order the TDP MMU
> function first and return young instead of false, so that way MGLRU at
> least has visibility into accesses made by L1 (and L2 if EPT is disable
> in L2). But that still means MGLRU is blind to accesses made by L2.
> 
> What about grabbing the mmu_lock if there's a shadow root allocated and
> get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> 
> 	if (kvm_memslots_have_rmaps(kvm)) {
> 		write_lock(&kvm->mmu_lock);
> 		young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> 		write_unlock(&kvm->mmu_lock);
> 	}
> 
> The TDP MMU walk would still be lockless. KVM only has to take the
> mmu_lock to collect accesses made by L2.
> 
> kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> as well, but that seems relatively simple with the helper functions.

Wait, even simpler, just check kvm_memslots_have_rmaps() in
kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
bitmap request.

i.e.

static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct mmu_notifier *mn)
{
	/*
	 * Indicate that we support bitmap-based aging when using the TDP MMU
	 * and the accessed bit is available in the TDP page tables.
	 *
	 * We have no other preparatory work to do here, so we do not need to
	 * redefine kvm_arch_finish_bitmap_age().
	 */
	return IS_ENABLED(CONFIG_X86_64)
		&& tdp_mmu_enabled
		&& shadow_accessed_mask
		&& !kvm_memslots_have_rmaps(kvm);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young = false;

        if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
                young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);

        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

        return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young = false;

        if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
                young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);

        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

        return young;
}

Sure this could race with the creation of a shadow root but so can the
non-bitmap code.
David Matlack April 11, 2024, 6 p.m. UTC | #3
On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao <yuzhao@google.com>
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > >  arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
> > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > >   */
> > >  #define KVM_EXIT_HYPERCALL_MBZ             GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > +   /*
> > > +    * Indicate that we support bitmap-based aging when using the TDP MMU
> > > +    * and the accessed bit is available in the TDP page tables.
> > > +    *
> > > +    * We have no other preparatory work to do here, so we do not need to
> > > +    * redefine kvm_arch_finish_bitmap_age().
> > > +    */
> > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > +                                    && shadow_accessed_mask;
> > > +}
> > > +
> > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >     bool young = false;
> > >
> > > -   if (kvm_memslots_have_rmaps(kvm))
> > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > +           if (range->lockless) {
> > > +                   kvm_age_set_unreliable(range);
> > > +                   return false;
> > > +           }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocate. The notfier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

The control flow of all this and naming of functions and macros is
overall confusing. args.unreliable and
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE for one. Also I now realize
kvm_arch_prepare/finish_bitmap_age() are used even when the bitmap is
_not_ provided, so those names are also misleading.
David Matlack April 11, 2024, 6:07 p.m. UTC | #4
On Thu, Apr 11, 2024 at 11:00 AM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > > >
> > > > Suggested-by: Yu Zhao <yuzhao@google.com>
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > > >  arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
> > > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 3b58e2306621..c30918d0887e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > > >   */
> > > >  #define KVM_EXIT_HYPERCALL_MBZ             GENMASK_ULL(31, 1)
> > > >
> > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > > +{
> > > > +   /*
> > > > +    * Indicate that we support bitmap-based aging when using the TDP MMU
> > > > +    * and the accessed bit is available in the TDP page tables.
> > > > +    *
> > > > +    * We have no other preparatory work to do here, so we do not need to
> > > > +    * redefine kvm_arch_finish_bitmap_age().
> > > > +    */
> > > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > > +                                    && shadow_accessed_mask;
> > > > +}
> > > > +
> > > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 992e651540e8..fae1a75750bb 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > >  {
> > > >     bool young = false;
> > > >
> > > > -   if (kvm_memslots_have_rmaps(kvm))
> > > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > > +           if (range->lockless) {
> > > > +                   kvm_age_set_unreliable(range);
> > > > +                   return false;
> > > > +           }
> > >
> > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > > virtualization, MGLRU will effectively be blind to all accesses made by
> > > the VM.
> > >
> > > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > > return false immediately and indicate the bitmap is unreliable because a
> > > shadow root is allocate. The notfier will then return
> > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
> return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

Nope, wrong again. Just ignore me while I try to figure out how this
actually works :)
David Matlack April 12, 2024, 8:44 p.m. UTC | #5
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.

I think this patch will trigger a lockdep assert in

  kvm_tdp_mmu_age_gfn_range
    kvm_tdp_mmu_handle_gfn
      for_each_tdp_mmu_root
        __for_each_tdp_mmu_root
          kvm_lockdep_assert_mmu_lock_held

... because it walks tdp_mmu_roots without holding mmu_lock.

Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
similar here and also update the comment above tdp_mmu_roots describing
how tdp_mmu_roots can be read locklessly.

[1] https://lore.kernel.org/kvmarm/ZItX64Bbx5vdjo9M@google.com/
James Houghton April 19, 2024, 8:47 p.m. UTC | #6
On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao <yuzhao@google.com>
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > >  arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 10 +++++++++-
> > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > >   */
> > >  #define KVM_EXIT_HYPERCALL_MBZ             GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > +   /*
> > > +    * Indicate that we support bitmap-based aging when using the TDP MMU
> > > +    * and the accessed bit is available in the TDP page tables.
> > > +    *
> > > +    * We have no other preparatory work to do here, so we do not need to
> > > +    * redefine kvm_arch_finish_bitmap_age().
> > > +    */
> > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > +                                    && shadow_accessed_mask;
> > > +}
> > > +
> > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >     bool young = false;
> > >
> > > -   if (kvm_memslots_have_rmaps(kvm))
> > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > +           if (range->lockless) {
> > > +                   kvm_age_set_unreliable(range);
> > > +                   return false;
> > > +           }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocate. The notfier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> >
> > Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> > consumed or used. So I think MGLRU will assume all memory is
> > unaccessed?
> >
> > One way to improve the situation would be to re-order the TDP MMU
> > function first and return young instead of false, so that way MGLRU at
> > least has visibility into accesses made by L1 (and L2 if EPT is disable
> > in L2). But that still means MGLRU is blind to accesses made by L2.
> >
> > What about grabbing the mmu_lock if there's a shadow root allocated and
> > get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> >
> >       if (kvm_memslots_have_rmaps(kvm)) {
> >               write_lock(&kvm->mmu_lock);
> >               young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >               write_unlock(&kvm->mmu_lock);
> >       }
> >
> > The TDP MMU walk would still be lockless. KVM only has to take the
> > mmu_lock to collect accesses made by L2.
> >
> > kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> > as well, but that seems relatively simple with the helper functions.
>
> Wait, even simpler, just check kvm_memslots_have_rmaps() in
> kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
> bitmap request.
>
> i.e.
>
> static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct mmu_notifier *mn)
> {
>         /*
>          * Indicate that we support bitmap-based aging when using the TDP MMU
>          * and the accessed bit is available in the TDP page tables.
>          *
>          * We have no other preparatory work to do here, so we do not need to
>          * redefine kvm_arch_finish_bitmap_age().
>          */
>         return IS_ENABLED(CONFIG_X86_64)
>                 && tdp_mmu_enabled
>                 && shadow_accessed_mask
>                 && !kvm_memslots_have_rmaps(kvm);
> }
>
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
>         bool young = false;
>
>         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
>                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>
>         if (tdp_mmu_enabled)
>                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
>
>         return young;
> }
>
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
>         bool young = false;
>
>         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
>                 young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
>
>         if (tdp_mmu_enabled)
>                 young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
>
>         return young;


Yeah I think this is the right thing to do. Given your other
suggestions (on patch 3), I think this will look something like this
-- let me know if I've misunderstood something:

bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm);

if (check_rmap)
  KVM_MMU_LOCK(kvm);

rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?

if (check_rmap)
  kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)

if (tdp_mmu_enabled)
  kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe

rcu_read_unlock();
if (check_rmap)
  KVM_MMU_UNLOCK(kvm);

> }
>
> Sure this could race with the creation of a shadow root but so can the
> non-bitmap code.
James Houghton April 19, 2024, 8:54 p.m. UTC | #7
On Fri, Apr 12, 2024 at 1:44 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
>
> I think this patch will trigger a lockdep assert in
>
>   kvm_tdp_mmu_age_gfn_range
>     kvm_tdp_mmu_handle_gfn
>       for_each_tdp_mmu_root
>         __for_each_tdp_mmu_root
>           kvm_lockdep_assert_mmu_lock_held
>
> ... because it walks tdp_mmu_roots without holding mmu_lock.

Indeed, thanks. I'll make sure to build with CONFIG_LOCKDEP for the
future versions and check for errors.

>
> Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
> similar here and also update the comment above tdp_mmu_roots describing
> how tdp_mmu_roots can be read locklessly.

I'll add the macro / function to do the lockless walk of tdp_mmu_roots
and explain why it is safe. Thanks for pointing out this big mistake.

> [1] https://lore.kernel.org/kvmarm/ZItX64Bbx5vdjo9M@google.com/
David Matlack April 19, 2024, 9:06 p.m. UTC | #8
On 2024-04-19 01:47 PM, James Houghton wrote:
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> >         bool young = false;
> >
> >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >
> >         if (tdp_mmu_enabled)
> >                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> >         return young;
> > }
> >
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> >         bool young = false;
> >
> >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> >                 young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> >
> >         if (tdp_mmu_enabled)
> >                 young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> >
> >         return young;
> 
> 
> Yeah I think this is the right thing to do. Given your other
> suggestions (on patch 3), I think this will look something like this
> -- let me know if I've misunderstood something:
> 
> bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm);
> 
> if (check_rmap)
>   KVM_MMU_LOCK(kvm);
> 
> rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> 
> if (check_rmap)
>   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> 
> if (tdp_mmu_enabled)
>   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> 
> rcu_read_unlock();
> if (check_rmap)
>   KVM_MMU_UNLOCK(kvm);

I was thinking a little different. If you follow my suggestion to first
make the TDP MMU aging lockless, you'll end up with something like this
prior to adding bitmap support (note: the comments are just for
demonstrative purposes):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young = false;

	/* Shadow MMU aging holds write-lock. */
        if (kvm_memslots_have_rmaps(kvm)) {
                write_lock(&kvm->mmu_lock);
                young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
                write_unlock(&kvm->mmu_lock);
        }

	/* TDM MMU aging is lockless. */
        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

        return young;
}

Then when you add bitmap support it would look something like this:

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        unsigned long *bitmap = range->arg.metadata->bitmap;
        bool young = false;

	/* SHadow MMU aging holds write-lock and does not support bitmap. */
        if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
                write_lock(&kvm->mmu_lock);
                young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
                write_unlock(&kvm->mmu_lock);
        }

	/* TDM MMU aging is lockless and supports bitmap. */
        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

        return young;
}

rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().

That brings up a question I've been wondering about. If KVM only
advertises support for the bitmap lookaround when shadow roots are not
allocated, does that mean MGLRU will be blind to accesses made by L2
when nested virtualization is enabled? And does that mean the Linux MM
will think all L2 memory is cold (i.e. good candidate for swapping)
because it isn't seeing accesses made by L2?
James Houghton April 19, 2024, 9:48 p.m. UTC | #9
On Fri, Apr 19, 2024 at 2:07 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-04-19 01:47 PM, James Houghton wrote:
> > On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
> > > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > >         bool young = false;
> > >
> > >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > >
> > >         if (tdp_mmu_enabled)
> > >                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> > >
> > >         return young;
> > > }
> > >
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > >         bool young = false;
> > >
> > >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > >                 young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> > >
> > >         if (tdp_mmu_enabled)
> > >                 young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> > >
> > >         return young;
> >
> >
> > Yeah I think this is the right thing to do. Given your other
> > suggestions (on patch 3), I think this will look something like this
> > -- let me know if I've misunderstood something:
> >
> > bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm);
> >
> > if (check_rmap)
> >   KVM_MMU_LOCK(kvm);
> >
> > rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> >
> > if (check_rmap)
> >   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> >
> > if (tdp_mmu_enabled)
> >   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> >
> > rcu_read_unlock();
> > if (check_rmap)
> >   KVM_MMU_UNLOCK(kvm);
>
> I was thinking a little different. If you follow my suggestion to first
> make the TDP MMU aging lockless, you'll end up with something like this
> prior to adding bitmap support (note: the comments are just for
> demonstrative purposes):
>
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
>         bool young = false;
>
>         /* Shadow MMU aging holds write-lock. */
>         if (kvm_memslots_have_rmaps(kvm)) {
>                 write_lock(&kvm->mmu_lock);
>                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>                 write_unlock(&kvm->mmu_lock);
>         }
>
>         /* TDM MMU aging is lockless. */
>         if (tdp_mmu_enabled)
>                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
>
>         return young;
> }
>
> Then when you add bitmap support it would look something like this:
>
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
>         unsigned long *bitmap = range->arg.metadata->bitmap;
>         bool young = false;
>
>         /* SHadow MMU aging holds write-lock and does not support bitmap. */
>         if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
>                 write_lock(&kvm->mmu_lock);
>                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>                 write_unlock(&kvm->mmu_lock);
>         }
>
>         /* TDM MMU aging is lockless and supports bitmap. */
>         if (tdp_mmu_enabled)
>                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
>
>         return young;
> }
>
> rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().

Oh yes this is a lot better. I hope I would have seen this when it
came time to actually update this patch. Thanks.

>
> That brings up a question I've been wondering about. If KVM only
> advertises support for the bitmap lookaround when shadow roots are not
> allocated, does that mean MGLRU will be blind to accesses made by L2
> when nested virtualization is enabled? And does that mean the Linux MM
> will think all L2 memory is cold (i.e. good candidate for swapping)
> because it isn't seeing accesses made by L2?

Yes, I think so (for both questions). That's better than KVM not
participating in MGLRU aging at all, which is the case today (IIUC --
also ignoring the case where KVM accesses guest memory directly). We
could have MGLRU always invoke the mmu notifiers, but frequently
taking the MMU lock for writing might be worse than evicting when we
shouldn't. Maybe Yu tried this at some point, but I can't find any
results for this.
Yu Zhao April 21, 2024, 12:19 a.m. UTC | #10
On Fri, Apr 19, 2024 at 3:48 PM James Houghton <jthoughton@google.com> wrote:
>
> On Fri, Apr 19, 2024 at 2:07 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On 2024-04-19 01:47 PM, James Houghton wrote:
> > > On Thu, Apr 11, 2024 at 10:28 AM David Matlack <dmatlack@google.com> wrote:
> > > > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > >         bool young = false;
> > > >
> > > >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > > >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > >
> > > >         if (tdp_mmu_enabled)
> > > >                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> > > >
> > > >         return young;
> > > > }
> > > >
> > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > >         bool young = false;
> > > >
> > > >         if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > > >                 young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> > > >
> > > >         if (tdp_mmu_enabled)
> > > >                 young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> > > >
> > > >         return young;
> > >
> > >
> > > Yeah I think this is the right thing to do. Given your other
> > > suggestions (on patch 3), I think this will look something like this
> > > -- let me know if I've misunderstood something:
> > >
> > > bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm);
> > >
> > > if (check_rmap)
> > >   KVM_MMU_LOCK(kvm);
> > >
> > > rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> > >
> > > if (check_rmap)
> > >   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> > >
> > > if (tdp_mmu_enabled)
> > >   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> > >
> > > rcu_read_unlock();
> > > if (check_rmap)
> > >   KVM_MMU_UNLOCK(kvm);
> >
> > I was thinking a little different. If you follow my suggestion to first
> > make the TDP MMU aging lockless, you'll end up with something like this
> > prior to adding bitmap support (note: the comments are just for
> > demonstrative purposes):
> >
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> >         bool young = false;
> >
> >         /* Shadow MMU aging holds write-lock. */
> >         if (kvm_memslots_have_rmaps(kvm)) {
> >                 write_lock(&kvm->mmu_lock);
> >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >                 write_unlock(&kvm->mmu_lock);
> >         }
> >
> >         /* TDM MMU aging is lockless. */
> >         if (tdp_mmu_enabled)
> >                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> >         return young;
> > }
> >
> > Then when you add bitmap support it would look something like this:
> >
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> >         unsigned long *bitmap = range->arg.metadata->bitmap;
> >         bool young = false;
> >
> >         /* SHadow MMU aging holds write-lock and does not support bitmap. */
> >         if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
> >                 write_lock(&kvm->mmu_lock);
> >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >                 write_unlock(&kvm->mmu_lock);
> >         }
> >
> >         /* TDM MMU aging is lockless and supports bitmap. */
> >         if (tdp_mmu_enabled)
> >                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> >         return young;
> > }
> >
> > rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().
>
> Oh yes this is a lot better. I hope I would have seen this when it
> came time to actually update this patch. Thanks.
>
> >
> > That brings up a question I've been wondering about. If KVM only
> > advertises support for the bitmap lookaround when shadow roots are not
> > allocated, does that mean MGLRU will be blind to accesses made by L2
> > when nested virtualization is enabled? And does that mean the Linux MM
> > will think all L2 memory is cold (i.e. good candidate for swapping)
> > because it isn't seeing accesses made by L2?
>
> Yes, I think so (for both questions). That's better than KVM not
> participating in MGLRU aging at all, which is the case today (IIUC --
> also ignoring the case where KVM accesses guest memory directly). We
> could have MGLRU always invoke the mmu notifiers, but frequently
> taking the MMU lock for writing might be worse than evicting when we
> shouldn't. Maybe Yu tried this at some point, but I can't find any
> results for this.

No, in this case only the fast path (page table scanning) is disabled.
MGLRU still sees the A-bit from L2 using the rmap, i.e., the slow path
calling folio_check_references().
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b58e2306621..c30918d0887e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2324,4 +2324,18 @@  int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
  */
 #define KVM_EXIT_HYPERCALL_MBZ		GENMASK_ULL(31, 1)
 
+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+	/*
+	 * Indicate that we support bitmap-based aging when using the TDP MMU
+	 * and the accessed bit is available in the TDP page tables.
+	 *
+	 * We have no other preparatory work to do here, so we do not need to
+	 * redefine kvm_arch_finish_bitmap_age().
+	 */
+	return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
+					 && shadow_accessed_mask;
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..fae1a75750bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1674,8 +1674,14 @@  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		if (range->lockless) {
+			kvm_age_set_unreliable(range);
+			return false;
+		}
+
 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1687,8 +1693,14 @@  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		if (range->lockless) {
+			kvm_age_set_unreliable(range);
+			return false;
+		}
+
 		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..edea01bc145f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1217,6 +1217,9 @@  static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 	if (!is_accessed_spte(iter->old_spte))
 		return false;
 
+	if (!kvm_gfn_should_age(range, iter->gfn))
+		return false;
+
 	if (spte_ad_enabled(iter->old_spte)) {
 		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
 							 iter->old_spte,
@@ -1250,7 +1253,12 @@  bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 			 struct kvm_gfn_range *range)
 {
-	return is_accessed_spte(iter->old_spte);
+	bool young = is_accessed_spte(iter->old_spte);
+
+	if (young)
+		kvm_gfn_record_young(range, iter->gfn);
+
+	return young;
 }
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)