
[04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected

Message ID 20240809190319.1710470-5-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86: Fix multiple #PF RO infinite loop bugs

Commit Message

Sean Christopherson Aug. 9, 2024, 7:03 p.m. UTC
When doing "fast unprotection" of nested TDP page tables, skip emulation
if and only if at least one gfn was unprotected, i.e. continue with
emulation if simply resuming is likely to hit the same fault and risk
putting the vCPU into an infinite loop.

Note, it's entirely possible to get a false negative, e.g. if a different
vCPU faults on the same gfn and unprotects the gfn first, but that's a
relatively rare edge case, and emulating is still functionally ok, i.e.
the risk of causing a livelock for the vCPU is negligible.

Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
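
The "skip emulation iff something was zapped" decision keys off the return
value of kvm_mmu_unprotect_page().  For context, a sketch of that helper's
contract, simplified from arch/x86/kvm/mmu/mmu.c (the iterator name and
locking details vary across kernel versions):

  /*
   * Zap any shadow pages that KVM created to shadow the guest PTEs in the
   * page at @gfn.  Returns nonzero if at least one shadow page was zapped,
   * i.e. if the gfn was actually unprotected, and zero if there was nothing
   * to unprotect.  With this patch, the write-protect fault handler resumes
   * the guest only in the former case.
   */
  int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
  {
  	struct kvm_mmu_page *sp;
  	LIST_HEAD(invalid_list);
  	int r = 0;

  	write_lock(&kvm->mmu_lock);
  	for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
  		r = 1;
  		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
  	}
  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
  	write_unlock(&kvm->mmu_lock);

  	return r;
  }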

Comments

Yuan Yao Aug. 14, 2024, 2:22 p.m. UTC | #1
On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> When doing "fast unprotection" of nested TDP page tables, skip emulation
> if and only if at least one gfn was unprotected, i.e. continue with
> emulation if simply resuming is likely to hit the same fault and risk
> putting the vCPU into an infinite loop.
>
> Note, it's entirely possible to get a false negative, e.g. if a different
> vCPU faults on the same gfn and unprotects the gfn first, but that's a
> relatively rare edge case, and emulating is still functionally ok, i.e.
> the risk of putting the vCPU isn't an infinite loop isn't justified.
>
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3aa04c498ea..95058ac4b78c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	bool direct = vcpu->arch.mmu->root_role.direct;
>
>  	/*
> -	 * Before emulating the instruction, check if the error code
> -	 * was due to a RO violation while translating the guest page.
> -	 * This can occur when using nested virtualization with nested
> -	 * paging in both guests. If true, we simply unprotect the page
> -	 * and resume the guest.
> +	 * Before emulating the instruction, check to see if the access may be
> +	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> +	 * gfn being written is for gPTEs that KVM is shadowing and has write-
> +	 * protected.  Because AMD CPUs walk nested page table using a write
> +	 * operation, walking NPT entries in L1 can trigger write faults even
> +	 * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> +	 * excessive number of L1 instructions without triggering KVM's write-
> +	 * flooding detection, i.e. without unprotecting the gfn.
> +	 *
> +	 * If the error code was due to a RO violation while translating the
> +	 * guest page, the current MMU is direct (L1 is active), and KVM has
> +	 * shadow pages, then the above scenario is likely being hit.  Try to
> +	 * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> +	 * its NPT entries without triggering emulation.  If one or more shadow
> +	 * pages was zapped, skip emulation and resume L1 to let it natively
> +	 * execute the instruction.  If no shadow pages were zapped, then the
> +	 * write-fault is due to something else entirely, i.e. KVM needs to
> +	 * emulate, as resuming the guest will put it into an infinite loop.
>  	 */

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>  	if (direct &&
> -	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> -		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> +	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> +	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
>  		return RET_PF_FIXED;
> -	}
>
>  	/*
>  	 * The gfn is write-protected, but if emulation fails we can still
> --
> 2.46.0.76.ge559c4bf1a-goog
>
>
Paolo Bonzini Aug. 14, 2024, 4:47 p.m. UTC | #2
On 8/9/24 21:03, Sean Christopherson wrote:
> When doing "fast unprotection" of nested TDP page tables, skip emulation
> if and only if at least one gfn was unprotected, i.e. continue with
> emulation if simply resuming is likely to hit the same fault and risk
> putting the vCPU into an infinite loop.
> 
> Note, it's entirely possible to get a false negative, e.g. if a different
> vCPU faults on the same gfn and unprotects the gfn first, but that's a
> relatively rare edge case, and emulating is still functionally ok, i.e.
> the risk of putting the vCPU isn't an infinite loop isn't justified.

English snafu - "the risk of causing a livelock for the vCPU is 
negligible", perhaps?

Paolo

> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
>   1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3aa04c498ea..95058ac4b78c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   	bool direct = vcpu->arch.mmu->root_role.direct;
>   
>   	/*
> -	 * Before emulating the instruction, check if the error code
> -	 * was due to a RO violation while translating the guest page.
> -	 * This can occur when using nested virtualization with nested
> -	 * paging in both guests. If true, we simply unprotect the page
> -	 * and resume the guest.
> +	 * Before emulating the instruction, check to see if the access may be
> +	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> +	 * gfn being written is for gPTEs that KVM is shadowing and has write-
> +	 * protected.  Because AMD CPUs walk nested page table using a write
> +	 * operation, walking NPT entries in L1 can trigger write faults even
> +	 * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> +	 * excessive number of L1 instructions without triggering KVM's write-
> +	 * flooding detection, i.e. without unprotecting the gfn.
> +	 *
> +	 * If the error code was due to a RO violation while translating the
> +	 * guest page, the current MMU is direct (L1 is active), and KVM has
> +	 * shadow pages, then the above scenario is likely being hit.  Try to
> +	 * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> +	 * its NPT entries without triggering emulation.  If one or more shadow
> +	 * pages was zapped, skip emulation and resume L1 to let it natively
> +	 * execute the instruction.  If no shadow pages were zapped, then the
> +	 * write-fault is due to something else entirely, i.e. KVM needs to
> +	 * emulate, as resuming the guest will put it into an infinite loop.
>   	 */
>   	if (direct &&
> -	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> -		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> +	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> +	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
>   		return RET_PF_FIXED;
> -	}
>   
>   	/*
>   	 * The gfn is write-protected, but if emulation fails we can still
Yao Yuan Aug. 15, 2024, 11:31 p.m. UTC | #3
On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > if and only if at least one gfn was unprotected, i.e. continue with
> > emulation if simply resuming is likely to hit the same fault and risk
> > putting the vCPU into an infinite loop.
> >
> > Note, it's entirely possible to get a false negative, e.g. if a different
> > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > relatively rare edge case, and emulating is still functionally ok, i.e.
> > the risk of putting the vCPU isn't an infinite loop isn't justified.
> >
> > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> >  1 file changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e3aa04c498ea..95058ac4b78c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  	bool direct = vcpu->arch.mmu->root_role.direct;
> >
> >  	/*
> > -	 * Before emulating the instruction, check if the error code
> > -	 * was due to a RO violation while translating the guest page.
> > -	 * This can occur when using nested virtualization with nested
> > -	 * paging in both guests. If true, we simply unprotect the page
> > -	 * and resume the guest.
> > +	 * Before emulating the instruction, check to see if the access may be
> > +	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > +	 * gfn being written is for gPTEs that KVM is shadowing and has write-
> > +	 * protected.  Because AMD CPUs walk nested page table using a write

Hi Sean,

I just want to ask how often this happens with EPT:

PFERR_GUEST_PAGE_MASK is set when an EPT violation happens in
the middle of walking the guest CR3 page tables, and the guest
CR3 page table page is write-protected in EPT01.  But how often
are these guest CR3 page table pages also EPT12 page table
pages?  I would think that most of the time they are data pages
in the guest CR3 tables, accessed by L1 via L1 GVAs; if so,
PFERR_GUEST_FINAL_MASK should be set rather than
PFERR_GUEST_PAGE_MASK.

> > +	 * operation, walking NPT entries in L1 can trigger write faults even
> > +	 * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> > +	 * excessive number of L1 instructions without triggering KVM's write-
> > +	 * flooding detection, i.e. without unprotecting the gfn.
> > +	 *
> > +	 * If the error code was due to a RO violation while translating the
> > +	 * guest page, the current MMU is direct (L1 is active), and KVM has
> > +	 * shadow pages, then the above scenario is likely being hit.  Try to
> > +	 * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> > +	 * its NPT entries without triggering emulation.  If one or more shadow
> > +	 * pages was zapped, skip emulation and resume L1 to let it natively
> > +	 * execute the instruction.  If no shadow pages were zapped, then the
> > +	 * write-fault is due to something else entirely, i.e. KVM needs to
> > +	 * emulate, as resuming the guest will put it into an infinite loop.
> >  	 */
>
> Reviewed-by: Yuan Yao <yuan.yao@intel.com>
>
> >  	if (direct &&
> > -	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> > -		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> > +	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> > +	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> >  		return RET_PF_FIXED;
> > -	}
> >
> >  	/*
> >  	 * The gfn is write-protected, but if emulation fails we can still
> > --
> > 2.46.0.76.ge559c4bf1a-goog
> >
> >
>
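
For reference, the PFERR_GUEST_PAGE vs. PFERR_GUEST_FINAL distinction being
discussed above maps to the following error-code bits, sketched from
arch/x86/kvm/mmu.h (exact spellings differ across kernel versions):

  #define PFERR_PRESENT_MASK	BIT(0)		/* translation was present */
  #define PFERR_WRITE_MASK	BIT(1)		/* fault was a write */
  #define PFERR_GUEST_FINAL_MASK	BIT_ULL(32)	/* fault on the final GPA */
  #define PFERR_GUEST_PAGE_MASK	BIT_ULL(33)	/* fault while walking guest PTEs */

  /* Write fault on a present guest page-table page, i.e. the case at hand. */
  #define PFERR_NESTED_GUEST_PAGE	(PFERR_GUEST_PAGE_MASK | \
  					 PFERR_WRITE_MASK |	 \
  					 PFERR_PRESENT_MASK)
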
Sean Christopherson Aug. 23, 2024, 12:39 a.m. UTC | #4
+Tom

On Fri, Aug 16, 2024, Yao Yuan wrote:
> On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> > On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > > if and only if at least one gfn was unprotected, i.e. continue with
> > > emulation if simply resuming is likely to hit the same fault and risk
> > > putting the vCPU into an infinite loop.
> > >
> > > Note, it's entirely possible to get a false negative, e.g. if a different
> > > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > > relatively rare edge case, and emulating is still functionally ok, i.e.
> > > the risk of putting the vCPU isn't an infinite loop isn't justified.
> > >
> > > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> > >  1 file changed, 20 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e3aa04c498ea..95058ac4b78c 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  	bool direct = vcpu->arch.mmu->root_role.direct;
> > >
> > >  	/*
> > > -	 * Before emulating the instruction, check if the error code
> > > -	 * was due to a RO violation while translating the guest page.
> > > -	 * This can occur when using nested virtualization with nested
> > > -	 * paging in both guests. If true, we simply unprotect the page
> > > -	 * and resume the guest.
> > > +	 * Before emulating the instruction, check to see if the access may be
> > > +	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > > +	 * gfn being written is for gPTEs that KVM is shadowing and has write-
> > > +	 * protected.  Because AMD CPUs walk nested page table using a write
> 
> Hi Sean,
> 
> I just want to ask how often this happens with EPT:
> 
> PFERR_GUEST_PAGE_MASK is set when an EPT violation happens in
> the middle of walking the guest CR3 page tables, and the guest
> CR3 page table page is write-protected in EPT01.  But how often
> are these guest CR3 page table pages also EPT12 page table
> pages?  I would think that most of the time they are data pages
> in the guest CR3 tables, accessed by L1 via L1 GVAs; if so,
> PFERR_GUEST_FINAL_MASK should be set rather than
> PFERR_GUEST_PAGE_MASK.

Hmm, now I'm confused too.  I thought this was handling the case where L1 is
accessing EPT12/NPT12 entries, but as you say, that doesn't make sense because
those accesses would be PFERR_GUEST_FINAL_MASK.  And the EPT12/NPT12 entries
should never be used by hardware.

The only thing that I can think of is if L1 stops using the pages for EPT/NPT
entries, and instead reallocates them for IA32 page tables.  But that doesn't
make much sense either, because KVM's write-flooding detection should kick in
when L1 repurposes the page for its own page tables, before L1 actually starts
using the page tables.

I'll try to reproduce the excessive emulation that this code is handling, because
I think I'm missing something too.
Sean Christopherson Aug. 23, 2024, 11:46 p.m. UTC | #5
On Thu, Aug 22, 2024, Sean Christopherson wrote:
> On Fri, Aug 16, 2024, Yao Yuan wrote:
> > On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> > > On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > > > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > > > if and only if at least one gfn was unprotected, i.e. continue with
> > > > emulation if simply resuming is likely to hit the same fault and risk
> > > > putting the vCPU into an infinite loop.
> > > >
> > > > Note, it's entirely possible to get a false negative, e.g. if a different
> > > > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > > > relatively rare edge case, and emulating is still functionally ok, i.e.
> > > > the risk of putting the vCPU isn't an infinite loop isn't justified.
> > > >
> > > > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> > > >  1 file changed, 20 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e3aa04c498ea..95058ac4b78c 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > >  	bool direct = vcpu->arch.mmu->root_role.direct;
> > > >
> > > >  	/*
> > > > -	 * Before emulating the instruction, check if the error code
> > > > -	 * was due to a RO violation while translating the guest page.
> > > > -	 * This can occur when using nested virtualization with nested
> > > > -	 * paging in both guests. If true, we simply unprotect the page
> > > > -	 * and resume the guest.
> > > > +	 * Before emulating the instruction, check to see if the access may be
> > > > +	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > > > +	 * gfn being written is for gPTEs that KVM is shadowing and has write-
> > > > +	 * protected.  Because AMD CPUs walk nested page table using a write
> > 
> > Hi Sean,
> > 
> > I just want to ask how often this happens with EPT:
> > 
> > PFERR_GUEST_PAGE_MASK is set when an EPT violation happens in
> > the middle of walking the guest CR3 page tables, and the guest
> > CR3 page table page is write-protected in EPT01.  But how often
> > are these guest CR3 page table pages also EPT12 page table
> > pages?  I would think that most of the time they are data pages
> > in the guest CR3 tables, accessed by L1 via L1 GVAs; if so,
> > PFERR_GUEST_FINAL_MASK should be set rather than
> > PFERR_GUEST_PAGE_MASK.
> 
> Hmm, now I'm confused too.  I thought this was handling the case where L1 is
> accessing EPT12/NPT12 entries, but as you say, that doesn't make sense because
> those accesses would be PFERR_GUEST_FINAL_MASK.  And the EPT12/NPT12 entries
> should never be used by hardware.
> 
> The only thing that I can think of is if L1 stops using the pages for EPT/NPT
> entries, and instead reallocates them for IA32 page tables.  But that doesn't
> make much sense either, because KVM's write-flooding detection should kick in
> when L1 repurposes the page for its own page tables, before L1 actually starts
> using the page tables.
> 
> I'll try to reproduce the excessive emulation that this code is handling, because
> I think I'm missing something too.

*sigh*

After poking for an hour and never being able to hit that error code, _period_,
even without the vcpu->arch.mmu->root_role.direct guard, I came to the conclusion
that the only way to hit this is if L1 is reusing its own page tables for L2.

And then I finally wised up and went hunting on lore, and indeed that appears to
be what this is trying to handle[1]:

 : I think this patch is necessary for functional reasons (not just perf),
 : because we added the other patch to look at the GPA and stop walking the
 : guest page tables on a NPF.
 : 
 : The issue I think was that hardware has taken an NPF because the page
 : table is marked RO, and it saves the GPA in the VMCB.  KVM was then
 : going and emulating the instruction and it saw that a GPA was available.
 : But that GPA was not the GPA of the instruction it is emulating, since
 : it was the GPA of the tablewalk page that had the fault. It was debugged
 : that at the time and realized that emulating the instruction was
 : unnecessary so we added this new code in there which fixed the functional
 : issue and helps perf.
 : 
 : I don't have any data on how much perf, as I recall it was most effective
 : when the L1 guest page tables and L2 nested page tables were exactly the
 : same.  In that case, it avoided emulations for code that L1 executes which
 : I think could be as much as one emulation per 4kb code page.

To muddy the waters more, the vcpu->arch.mmu->root_role.direct check was a
proactive suggestion from Paolo[2], i.e. not in response to an actual failure
that someone encountered.

Further adding to the confusion was Paolo's commit that extended the behavior
to Intel CPUs.  The AuthorDate vs. CommitDate is particularly interesting, as
it suggests that Paolo wrote the patch in response to the SVM series being
posted:

  Subject: [PATCH v2 1/3] kvm: svm: Add support for additional SVM NPF error codes
  Date: Wed, 23 Nov 2016 12:01:38 -0500	[thread overview]

  commit eebed2438923f8df465c27f8fa41303771fdb2e8
  Author:     Paolo Bonzini <pbonzini@redhat.com>
  AuthorDate: Mon Nov 28 14:39:58 2016 +0100

and then committed it after asking many of the same questions we just asked, with
Brijesh's reply[1] coming just a few days before:

  Subject: Re: [PATCH v2 1/3] kvm: svm: Add support for additional SVM NPF error codes
  Date: Wed, 2 Aug 2017 12:42:20 +0200	[thread overview]


  Commit:     Paolo Bonzini <pbonzini@redhat.com>
  CommitDate: Thu Aug 10 16:44:04 2017 +0200

    kvm: nVMX: Add support for fast unprotection of nested guest page tables
    
    This is the same as commit 147277540bbc ("kvm: svm: Add support for
    additional SVM NPF error codes", 2016-11-23), but for Intel processors.
    In this case, the exit qualification field's bit 8 says whether the
    EPT violation occurred while translating the guest's final physical
    address or rather while translating the guest page tables.


Intel support is especially misleading, because sharing page tables between EPT
and IA32 is rather nonsensical due to them having different formats.  I.e. I doubt
Paolo had a use case for the VMX changes, and was just providing parity with SVM.
Of course, reusing L1's page tables as the NPT tables for L2 is quite crazy too,
but at least the PTE formats are identical. 
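
For what it's worth, the "bit 8" behavior described in that commit message is
what feeds the PFERR_GUEST_* flags: a sketch (not verbatim kernel code) of the
translation done by handle_ept_violation() in arch/x86/kvm/vmx/vmx.c, where
bit 8 of the exit qualification is EPT_VIOLATION_GVA_TRANSLATED:

  /*
   * Bit 8 is set when the EPT violation occurred on the final guest
   * physical address, and clear when it occurred while the CPU was
   * walking the guest's own page tables.
   */
  #define EPT_VIOLATION_GVA_TRANSLATED	BIT(8)

  static u64 ept_violation_to_pferr(unsigned long exit_qualification)
  {
  	return (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
  	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
  }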

Given that the patches were originally posted as part of the SEV enabling
series[3], my best guess is that the AMD folks were doing some prototyping or
emulation hackery that involved running a nested guest by simply reusing CR3
as hCR3.

So, I'll rewrite the comment to make it clear that practically speaking, this
scenario can be hit if and only if L1 is reusing its own page tables for L2's NPT.

[1] https://lore.kernel.org/all/f8a2c258-0b53-b33c-1dae-a6f6e0e68239@amd.com
[2] https://lore.kernel.org/all/21b9f4db-f929-80f6-6ad2-6fa3b77f82c0@redhat.com
[3] https://lore.kernel.org/all/147190822443.9523.7814744422402462127.stgit__43469.155036337$1471909238$gmane$org@brijesh-build-machine
Sean Christopherson Aug. 26, 2024, 8:28 p.m. UTC | #6
On Fri, Aug 23, 2024, Sean Christopherson wrote:
> On Thu, Aug 22, 2024, Sean Christopherson wrote:
> Intel support is especially misleading, because sharing page tables between EPT
> and IA32 is rather nonsensical due to them having different formats.  I.e. I doubt
> Paolo had a use case for the VMX changes, and was just providing parity with SVM.
> Of course, reusing L1's page tables as the NPT tables for L2 is quite crazy too,

Actually, it's not _that_ crazy, e.g. KVM s390 does this for last-level page tables
so that changes to the host userspace mappings don't require mmu_notifier-induced
changes in KVM.

> but at least the PTE formats are identical.

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e3aa04c498ea..95058ac4b78c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5967,17 +5967,29 @@  static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	bool direct = vcpu->arch.mmu->root_role.direct;
 
 	/*
-	 * Before emulating the instruction, check if the error code
-	 * was due to a RO violation while translating the guest page.
-	 * This can occur when using nested virtualization with nested
-	 * paging in both guests. If true, we simply unprotect the page
-	 * and resume the guest.
+	 * Before emulating the instruction, check to see if the access may be
+	 * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
+	 * gfn being written is for gPTEs that KVM is shadowing and has write-
+	 * protected.  Because AMD CPUs walk nested page table using a write
+	 * operation, walking NPT entries in L1 can trigger write faults even
+	 * when L1 isn't modifying PTEs, and thus result in KVM emulating an
+	 * excessive number of L1 instructions without triggering KVM's write-
+	 * flooding detection, i.e. without unprotecting the gfn.
+	 *
+	 * If the error code was due to a RO violation while translating the
+	 * guest page, the current MMU is direct (L1 is active), and KVM has
+	 * shadow pages, then the above scenario is likely being hit.  Try to
+	 * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
+	 * its NPT entries without triggering emulation.  If one or more shadow
+	 * pages was zapped, skip emulation and resume L1 to let it natively
+	 * execute the instruction.  If no shadow pages were zapped, then the
+	 * write-fault is due to something else entirely, i.e. KVM needs to
+	 * emulate, as resuming the guest will put it into an infinite loop.
 	 */
 	if (direct &&
-	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
+	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
+	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
 		return RET_PF_FIXED;
-	}
 
 	/*
 	 * The gfn is write-protected, but if emulation fails we can still