
[20/22] kvm: mmu: NX largepage recovery for TDP MMU

Message ID 20200925212302.3979661-21-bgardon@google.com
State New, archived
Series Introduce the TDP MMU

Commit Message

Ben Gardon Sept. 25, 2020, 9:23 p.m. UTC
When KVM maps a largepage-backed region at a lower level in order to
make it executable (i.e. NX large page shattering), it reduces the TLB
performance of that region. In order to avoid making this degradation
permanent, KVM must periodically reclaim shattered NX largepages by
zapping them and allowing them to be rebuilt in the page fault handler.

With this patch, the TDP MMU does not respect KVM's rate limiting on
reclaim. It traverses the entire TDP structure every time. This will be
addressed in a future patch.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++---
 arch/x86/kvm/mmu/mmu_internal.h |  4 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 66 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  2 +
 5 files changed, 97 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini Sept. 26, 2020, 1:14 a.m. UTC | #1
On 25/09/20 23:23, Ben Gardon wrote:
> +
> +	if (!kvm->arch.tdp_mmu_enabled)
> +		return err;
> +
> +	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
> +			"kvm-nx-lpage-tdp-mmu-recovery",
> +			&kvm->arch.nx_lpage_tdp_mmu_recovery_thread);

Any reason to have two threads?

Paolo
Paolo Bonzini Sept. 29, 2020, 6:24 p.m. UTC | #2
On 25/09/20 23:23, Ben Gardon wrote:
> +	struct list_head tdp_mmu_lpage_disallowed_pages;

This list is never INIT_LIST_HEAD-ed, but I see other issues if I do so
(or maybe it's just too late).
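
For reference, the missing initialization would look roughly like this (a
sketch only; it assumes the kvm_mmu_init_tdp_mmu() helper from earlier in the
series is the natural place, next to the other TDP MMU lists):

	/* Sketch: set up the new list alongside the existing TDP MMU lists. */
	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_lpage_disallowed_pages);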

Paolo

> +	u64 tdp_mmu_lpage_disallowed_page_count;
Sean Christopherson Sept. 30, 2020, 6:15 p.m. UTC | #3
On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> +/*
> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> + * exist in order to allow execute access on a region that would otherwise be
> + * mapped as a large page.
> + */
> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> +{
> +	struct kvm_mmu_page *sp;
> +	bool flush;
> +	int rcu_idx;
> +	unsigned int ratio;
> +	ulong to_zap;
> +	u64 old_spte;
> +
> +	rcu_idx = srcu_read_lock(&kvm->srcu);
> +	spin_lock(&kvm->mmu_lock);
> +
> +	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> +	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;

This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
MMU never increments nx_lpage_splits, it instead has its own counter,
tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
guaranteed to be zero and thus this is completely untested.

I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
will be collisions with other code incrementing nx_lpage_splits.   And the TDP
MMU should be updating stats anyways.
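
A rough sketch of that direction (untested): have the TDP MMU page fault path
bump the shared stat when it links a disallowed page, instead of the private
counter, with the matching decrement on the unlink path:

	if (iter.level <= max_level && account_disallowed_nx_lpage) {
		list_add(&sp->lpage_disallowed_link,
			 &vcpu->kvm->arch.tdp_mmu_lpage_disallowed_pages);
		/* Shared stat, so the to_zap ratio calculation sees TDP pages too. */
		++vcpu->kvm->stat.nx_lpage_splits;
	}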

> +
> +	while (to_zap &&
> +	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> +		/*
> +		 * We use a separate list instead of just using active_mmu_pages
> +		 * because the number of lpage_disallowed pages is expected to
> +		 * be relatively small compared to the total.
> +		 */
> +		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> +				      struct kvm_mmu_page,
> +				      lpage_disallowed_link);
> +
> +		old_spte = *sp->parent_sptep;
> +		*sp->parent_sptep = 0;
> +
> +		list_del(&sp->lpage_disallowed_link);
> +		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> +
> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> +				    old_spte, 0, sp->role.level + 1);
> +
> +		flush = true;
> +
> +		if (!--to_zap || need_resched() ||
> +		    spin_needbreak(&kvm->mmu_lock)) {
> +			flush = false;
> +			kvm_flush_remote_tlbs(kvm);
> +			if (to_zap)
> +				cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, rcu_idx);
> +}
> +
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 2ecb047211a6d..45ea2d44545db 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  
>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot, gfn_t gfn);
> +
> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
>
Paolo Bonzini Sept. 30, 2020, 7:56 p.m. UTC | #4
On 30/09/20 20:15, Sean Christopherson wrote:
> On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
>> +/*
>> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
>> + * exist in order to allow execute access on a region that would otherwise be
>> + * mapped as a large page.
>> + */
>> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
>> +{
>> +	struct kvm_mmu_page *sp;
>> +	bool flush;
>> +	int rcu_idx;
>> +	unsigned int ratio;
>> +	ulong to_zap;
>> +	u64 old_spte;
>> +
>> +	rcu_idx = srcu_read_lock(&kvm->srcu);
>> +	spin_lock(&kvm->mmu_lock);
>> +
>> +	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
>> +	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
> 
> This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> MMU never increments nx_lpage_splits, it instead has its own counter,
> tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> guaranteed to be zero and thus this is completely untested.

Except if you do shadow paging (through nested EPT) and then it bombs
immediately. :)

> I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> MMU should be updating stats anyways.

This is true, but having two counters is necessary (in the current
implementation) because otherwise you zap more than the requested ratio
of pages.

The simplest solution is to add a "bool tdp_page" to struct
kvm_mmu_page, so that you can have a single list of
lpage_disallowed_pages and a single thread.  The while loop can then
dispatch to the right "zapper" code.
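
Roughly (a sketch only, reusing the existing kvm->arch.lpage_disallowed_mmu_pages
list; kvm_tdp_mmu_zap_sp() is a hypothetical helper standing in for whatever
the TDP MMU zapping path ends up being):

	sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
			      struct kvm_mmu_page, lpage_disallowed_link);
	if (sp->tdp_mmu_page)
		/* Hypothetical helper: zap the page's parent SPTE via the TDP MMU. */
		kvm_tdp_mmu_zap_sp(kvm, sp);
	else
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);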

Anyway this patch is completely broken, so let's kick it away to the
next round.

Paolo

>> +
>> +	while (to_zap &&
>> +	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
>> +		/*
>> +		 * We use a separate list instead of just using active_mmu_pages
>> +		 * because the number of lpage_disallowed pages is expected to
>> +		 * be relatively small compared to the total.
>> +		 */
>> +		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
>> +				      struct kvm_mmu_page,
>> +				      lpage_disallowed_link);
>> +
>> +		old_spte = *sp->parent_sptep;
>> +		*sp->parent_sptep = 0;
>> +
>> +		list_del(&sp->lpage_disallowed_link);
>> +		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
>> +
>> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
>> +				    old_spte, 0, sp->role.level + 1);
>> +
>> +		flush = true;
>> +
>> +		if (!--to_zap || need_resched() ||
>> +		    spin_needbreak(&kvm->mmu_lock)) {
>> +			flush = false;
>> +			kvm_flush_remote_tlbs(kvm);
>> +			if (to_zap)
>> +				cond_resched_lock(&kvm->mmu_lock);
>> +		}
>> +	}
>> +
>> +	if (flush)
>> +		kvm_flush_remote_tlbs(kvm);
>> +
>> +	spin_unlock(&kvm->mmu_lock);
>> +	srcu_read_unlock(&kvm->srcu, rcu_idx);
>> +}
>> +
>> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
>> index 2ecb047211a6d..45ea2d44545db 100644
>> --- a/arch/x86/kvm/mmu/tdp_mmu.h
>> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
>> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>  
>>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>>  				   struct kvm_memory_slot *slot, gfn_t gfn);
>> +
>> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
>>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
>> -- 
>> 2.28.0.709.gb0816b6eb0-goog
>>
>
Ben Gardon Sept. 30, 2020, 10:23 p.m. UTC | #5
On Fri, Sep 25, 2020 at 6:15 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:23, Ben Gardon wrote:
> > +
> > +     if (!kvm->arch.tdp_mmu_enabled)
> > +             return err;
> > +
> > +     err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
> > +                     "kvm-nx-lpage-tdp-mmu-recovery",
> > +                     &kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
>
> Any reason to have two threads?
>
> Paolo

At some point it felt cleaner. In this patch set NX reclaim is pretty
similar between the "shadow MMU" and TDP MMU so they don't really need
to be separate threads.
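
If they were folded into one, the single worker could just cover both MMUs,
something like (a sketch, not what's in this patch):

	/* Sketch: one recovery thread services both the legacy and TDP MMU. */
	kvm_recover_nx_lpages(kvm);
	if (kvm->arch.tdp_mmu_enabled)
		kvm_tdp_mmu_recover_nx_lpages(kvm);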

>
Ben Gardon Sept. 30, 2020, 10:27 p.m. UTC | #6
On Wed, Sep 30, 2020 at 11:16 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> > +/*
> > + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> > + * exist in order to allow execute access on a region that would otherwise be
> > + * mapped as a large page.
> > + */
> > +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     bool flush;
> > +     int rcu_idx;
> > +     unsigned int ratio;
> > +     ulong to_zap;
> > +     u64 old_spte;
> > +
> > +     rcu_idx = srcu_read_lock(&kvm->srcu);
> > +     spin_lock(&kvm->mmu_lock);
> > +
> > +     ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> > +     to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
>
> This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> MMU never increments nx_lpage_splits, it instead has its own counter,
> tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> guaranteed to be zero and thus this is completely untested.

Good catch, I should write some NX reclaim selftests.

>
> I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> MMU should be updating stats anyways.

A VM actually can have both the legacy MMU and the TDP MMU, by design. The
legacy MMU handles the nested case. Eventually I'd like the TDP MMU to be
responsible for building nested shadow TDP tables, but I haven't
implemented that yet.

>
> > +
> > +     while (to_zap &&
> > +            !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> > +             /*
> > +              * We use a separate list instead of just using active_mmu_pages
> > +              * because the number of lpage_disallowed pages is expected to
> > +              * be relatively small compared to the total.
> > +              */
> > +             sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> > +                                   struct kvm_mmu_page,
> > +                                   lpage_disallowed_link);
> > +
> > +             old_spte = *sp->parent_sptep;
> > +             *sp->parent_sptep = 0;
> > +
> > +             list_del(&sp->lpage_disallowed_link);
> > +             kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> > +
> > +             handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> > +                                 old_spte, 0, sp->role.level + 1);
> > +
> > +             flush = true;
> > +
> > +             if (!--to_zap || need_resched() ||
> > +                 spin_needbreak(&kvm->mmu_lock)) {
> > +                     flush = false;
> > +                     kvm_flush_remote_tlbs(kvm);
> > +                     if (to_zap)
> > +                             cond_resched_lock(&kvm->mmu_lock);
> > +             }
> > +     }
> > +
> > +     if (flush)
> > +             kvm_flush_remote_tlbs(kvm);
> > +
> > +     spin_unlock(&kvm->mmu_lock);
> > +     srcu_read_unlock(&kvm->srcu, rcu_idx);
> > +}
> > +
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 2ecb047211a6d..45ea2d44545db 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >
> >  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >                                  struct kvm_memory_slot *slot, gfn_t gfn);
> > +
> > +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >
Ben Gardon Sept. 30, 2020, 10:33 p.m. UTC | #7
On Wed, Sep 30, 2020 at 12:56 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 30/09/20 20:15, Sean Christopherson wrote:
> > On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> >> +/*
> >> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> >> + * exist in order to allow execute access on a region that would otherwise be
> >> + * mapped as a large page.
> >> + */
> >> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> >> +{
> >> +    struct kvm_mmu_page *sp;
> >> +    bool flush;
> >> +    int rcu_idx;
> >> +    unsigned int ratio;
> >> +    ulong to_zap;
> >> +    u64 old_spte;
> >> +
> >> +    rcu_idx = srcu_read_lock(&kvm->srcu);
> >> +    spin_lock(&kvm->mmu_lock);
> >> +
> >> +    ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> >> +    to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
> >
> > This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> > MMU never increments nx_lpage_splits, it instead has its own counter,
> > tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> > guaranteed to be zero and thus this is completely untested.
>
> Except if you do shadow paging (through nested EPT) and then it bombs
> immediately. :)
>
> > I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> > a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> > will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> > MMU should be updating stats anyways.
>
> This is true, but having two counters is necessary (in the current
> implementation) because otherwise you zap more than the requested ratio
> of pages.
>
> The simplest solution is to add a "bool tdp_page" to struct
> kvm_mmu_page, so that you can have a single list of
> lpage_disallowed_pages and a single thread.  The while loop can then
> dispatch to the right "zapper" code.

I actually did add that bool in patch 4: kvm: mmu: Allocate and free
TDP MMU roots.
I'm a little nervous about putting them in the same list, but I agree
it would definitely simplify the implementation of reclaim.

>
> Anyway this patch is completely broken, so let's kick it away to the
> next round.

Understood, sorry I didn't test this one better. I'll incorporate your
feedback and include it in the next series.

>
> Paolo
>
> >> +
> >> +    while (to_zap &&
> >> +           !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> >> +            /*
> >> +             * We use a separate list instead of just using active_mmu_pages
> >> +             * because the number of lpage_disallowed pages is expected to
> >> +             * be relatively small compared to the total.
> >> +             */
> >> +            sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> >> +                                  struct kvm_mmu_page,
> >> +                                  lpage_disallowed_link);
> >> +
> >> +            old_spte = *sp->parent_sptep;
> >> +            *sp->parent_sptep = 0;
> >> +
> >> +            list_del(&sp->lpage_disallowed_link);
> >> +            kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> >> +
> >> +            handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> >> +                                old_spte, 0, sp->role.level + 1);
> >> +
> >> +            flush = true;
> >> +
> >> +            if (!--to_zap || need_resched() ||
> >> +                spin_needbreak(&kvm->mmu_lock)) {
> >> +                    flush = false;
> >> +                    kvm_flush_remote_tlbs(kvm);
> >> +                    if (to_zap)
> >> +                            cond_resched_lock(&kvm->mmu_lock);
> >> +            }
> >> +    }
> >> +
> >> +    if (flush)
> >> +            kvm_flush_remote_tlbs(kvm);
> >> +
> >> +    spin_unlock(&kvm->mmu_lock);
> >> +    srcu_read_unlock(&kvm->srcu, rcu_idx);
> >> +}
> >> +
> >> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> >> index 2ecb047211a6d..45ea2d44545db 100644
> >> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> >> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> >> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >>
> >>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >>                                 struct kvm_memory_slot *slot, gfn_t gfn);
> >> +
> >> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
> >>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> >> --
> >> 2.28.0.709.gb0816b6eb0-goog
> >>
> >
>

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a76bcb51d43d8..cf00b1c837708 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -963,6 +963,7 @@  struct kvm_arch {
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
+	struct task_struct *nx_lpage_tdp_mmu_recovery_thread;
 
 	/*
 	 * Whether the TDP MMU is enabled for this VM. This contains a
@@ -977,6 +978,8 @@  struct kvm_arch {
 	struct list_head tdp_mmu_roots;
 	/* List of struct tdp_mmu_pages not being used as roots */
 	struct list_head tdp_mmu_pages;
+	struct list_head tdp_mmu_lpage_disallowed_pages;
+	u64 tdp_mmu_lpage_disallowed_page_count;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e6f5093ba8f6f..6101c696e92d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -54,12 +54,12 @@ 
 
 extern bool itlb_multihit_kvm_mitigation;
 
-static int __read_mostly nx_huge_pages = -1;
+int __read_mostly nx_huge_pages = -1;
 #ifdef CONFIG_PREEMPT_RT
 /* Recovery can cause latency spikes, disable it for PREEMPT_RT.  */
-static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
+uint __read_mostly nx_huge_pages_recovery_ratio = 0;
 #else
-static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
+uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif
 
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
@@ -6455,7 +6455,7 @@  static long get_nx_lpage_recovery_timeout(u64 start_time)
 		: MAX_SCHEDULE_TIMEOUT;
 }
 
-static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
+static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t tdp_mmu)
 {
 	u64 start_time;
 	long remaining_time;
@@ -6476,7 +6476,10 @@  static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
 		if (kthread_should_stop())
 			return 0;
 
-		kvm_recover_nx_lpages(kvm);
+		if (tdp_mmu)
+			kvm_tdp_mmu_recover_nx_lpages(kvm);
+		else
+			kvm_recover_nx_lpages(kvm);
 	}
 }
 
@@ -6489,6 +6492,17 @@  int kvm_mmu_post_init_vm(struct kvm *kvm)
 					  &kvm->arch.nx_lpage_recovery_thread);
 	if (!err)
 		kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
+	else
+		return err;
+
+	if (!kvm->arch.tdp_mmu_enabled)
+		return err;
+
+	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
+			"kvm-nx-lpage-tdp-mmu-recovery",
+			&kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
+	if (!err)
+		kthread_unpark(kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
 
 	return err;
 }
@@ -6497,4 +6511,7 @@  void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 {
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
+
+	if (kvm->arch.nx_lpage_tdp_mmu_recovery_thread)
+		kthread_stop(kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1a777ccfde44e..567e119da424f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -43,6 +43,7 @@  struct kvm_mmu_page {
 	atomic_t write_flooding_count;
 
 	bool tdp_mmu_page;
+	u64 *parent_sptep;
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
@@ -154,4 +155,7 @@  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 u64 mark_spte_for_access_track(u64 spte);
 u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
+extern int nx_huge_pages;
+extern uint nx_huge_pages_recovery_ratio;
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 931cb469b1f2f..b83c18e29f9c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -578,10 +578,18 @@  int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
+			if (iter.level <= max_level &&
+			    account_disallowed_nx_lpage) {
+				list_add(&sp->lpage_disallowed_link,
+					 &vcpu->kvm->arch.tdp_mmu_lpage_disallowed_pages);
+				vcpu->kvm->arch.tdp_mmu_lpage_disallowed_page_count++;
+			}
+
 			*iter.sptep = new_spte;
 			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
 					    iter.old_spte, new_spte,
 					    iter.level);
+			sp->parent_sptep = iter.sptep;
 		}
 	}
 
@@ -1218,3 +1226,61 @@  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	return spte_set;
 }
 
+/*
+ * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
+ * exist in order to allow execute access on a region that would otherwise be
+ * mapped as a large page.
+ */
+void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp;
+	bool flush;
+	int rcu_idx;
+	unsigned int ratio;
+	ulong to_zap;
+	u64 old_spte;
+
+	rcu_idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+
+	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
+
+	while (to_zap &&
+	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
+		/*
+		 * We use a separate list instead of just using active_mmu_pages
+		 * because the number of lpage_disallowed pages is expected to
+		 * be relatively small compared to the total.
+		 */
+		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
+				      struct kvm_mmu_page,
+				      lpage_disallowed_link);
+
+		old_spte = *sp->parent_sptep;
+		*sp->parent_sptep = 0;
+
+		list_del(&sp->lpage_disallowed_link);
+		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
+
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
+				    old_spte, 0, sp->role.level + 1);
+
+		flush = true;
+
+		if (!--to_zap || need_resched() ||
+		    spin_needbreak(&kvm->mmu_lock)) {
+			flush = false;
+			kvm_flush_remote_tlbs(kvm);
+			if (to_zap)
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, rcu_idx);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2ecb047211a6d..45ea2d44545db 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -43,4 +43,6 @@  void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn);
+
+void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */