diff mbox series

[15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path

Message ID 20240809190319.1710470-16-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86: Fix multiple #PF RO infinite loop bugs | expand

Commit Message

Sean Christopherson Aug. 9, 2024, 7:03 p.m. UTC
Move the event re-injection unprotect+retry logic into
kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
the #PF actually hit a write-protected gfn.  Note, there is a small
possibility that the gfn was unprotected by a different tasking between
hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
the guest immediately anyways because KVM will treat the fault as spurious.

As a bonus, unprotecting _after_ handling the page fault also addresses the
case where the installing a SPTE to handle fault encounters a shadowed PTE,
i.e. *creates* a read-only SPTE.

Opportunstically add a comment explaining what on earth the intent of the
code is, as based on the changelog from commit 577bdc496614 ("KVM: Avoid
instruction emulation when event delivery is pending").

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

Comments

Paolo Bonzini Aug. 14, 2024, 5:43 p.m. UTC | #1
On 8/9/24 21:03, Sean Christopherson wrote:
> Move the event re-injection unprotect+retry logic into
> kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
> the #PF actually hit a write-protected gfn.  Note, there is a small
> possibility that the gfn was unprotected by a different tasking between
> hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
> the guest immediately anyways because KVM will treat the fault as spurious.
> 
> As a bonus, unprotecting _after_ handling the page fault also addresses the
> case where the installing a SPTE to handle fault encounters a shadowed PTE,
> i.e. *creates* a read-only SPTE.
> 
> Opportunstically add a comment explaining what on earth the intent of the
> code is, as based on the changelog from commit 577bdc496614 ("KVM: Avoid
> instruction emulation when event delivery is pending").
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
>   1 file changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f64ad36ca9e0..d3c0220ff7ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2753,23 +2753,6 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
>   	return r;
>   }
>   
> -static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
> -{
> -	gpa_t gpa;
> -	int r;
> -
> -	if (vcpu->arch.mmu->root_role.direct)
> -		return 0;
> -
> -	gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
> -	if (gpa == INVALID_GPA)
> -		return 0;
> -
> -	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> -
> -	return r;
> -}
> -
>   static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>   {
>   	trace_kvm_mmu_unsync_page(sp);
> @@ -4640,8 +4623,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>   	if (!flags) {
>   		trace_kvm_page_fault(vcpu, fault_address, error_code);
>   
> -		if (kvm_event_needs_reinjection(vcpu))
> -			kvm_mmu_unprotect_page_virt(vcpu, fault_address);
>   		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
>   				insn_len);
>   	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   	 * execute the instruction.  If no shadow pages were zapped, then the
>   	 * write-fault is due to something else entirely, i.e. KVM needs to
>   	 * emulate, as resuming the guest will put it into an infinite loop.
> +	 *
> +	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> +	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
> +	 * KVM emulates multiple instructions before completing even injection,
> +	 * the event could be delayed beyond what is architecturally allowed,
> +	 * e.g. KVM could inject an IRQ after the TPR has been raised.

This paragraph should go before the description of
kvm_mmu_unprotect_gfn_and_retry:

	 * There are two cases in which we try to unprotect the page here
	 * preemptively, i.e. zap any shadow pages, before emulating the
	 * instruction.
	 *
	 * First, the access may be due to L1 accessing nested NPT/EPT entries
	 * used for L2, i.e. if the gfn being written is for gPTEs that KVM is
	 * shadowing and has write-protected.  In this case, because AMD CPUs
	 * walk nested page table using a write operation, walking NPT entries
	 * in L1 can trigger write faults even when L1 isn't modifying PTEs.
	 * KVM would then emulate an excessive number of L1 instructions
	 * without triggering KVM's write-flooding detection, i.e. without
	 * unprotecting the gfn.  This is detected as a RO violation while
	 * translating the guest page when the current MMU is direct.
	 *
	 * Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
	 * unprotect the gfn and reenter the guest if an event is awaiting
	 * reinjection.  If KVM emulates multiple instructions before completing
	 * event injection, the event could be delayed beyond what is
	 * architecturally allowed, e.g. KVM could inject an IRQ after the
	 * TPR has been raised.
	 *
	 * In both cases, if one or more shadow pages were zapped, skip
	 * emulation and resume L1 to let it natively execute the instruction.
	 * If no shadow pages were zapped, then the write-fault is due to
	 * something else entirely and KVM needs to emulate, as resuming
	 * the guest will put it into an infinite loop.

Thanks,

Paolo

>   	 */
> -	if (direct && (is_write_to_guest_page_table(error_code)) &&
> +	if (((direct && is_write_to_guest_page_table(error_code)) ||
> +	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
>   	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
>   		return RET_PF_FIXED;
>
Sean Christopherson Aug. 26, 2024, 9:52 p.m. UTC | #2
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >   	 * execute the instruction.  If no shadow pages were zapped, then the
> >   	 * write-fault is due to something else entirely, i.e. KVM needs to
> >   	 * emulate, as resuming the guest will put it into an infinite loop.
> > +	 *
> > +	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> > +	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
> > +	 * KVM emulates multiple instructions before completing even injection,
> > +	 * the event could be delayed beyond what is architecturally allowed,
> > +	 * e.g. KVM could inject an IRQ after the TPR has been raised.
> 
> This paragraph should go before the description of
> kvm_mmu_unprotect_gfn_and_retry:

Hmm, I disagree.  The comment ends up being disconnected from the code, especially
by the end of the series.  E.g. when reading kvm_mmu_write_protect_fault(), someone
would have to jump twice (to kvm_mmu_unprotect_gfn_and_retry and then
__kvm_mmu_unprotect_gfn_and_retry()) in order to understand the checks buried
in kvm_mmu_write_protect_fault().

And the comment also becomes stale when kvm_mmu_unprotect_gfn_and_retry() is
used by x86_emulate_instruction().  That's obviously solvable by extending the
function comment, but then we end up with a rather massive function comment that
documents its callers, not the function itself.

> 
> 	 * There are two cases in which we try to unprotect the page here
> 	 * preemptively, i.e. zap any shadow pages, before emulating the
> 	 * instruction.
> 	 *
> 	 * First, the access may be due to L1 accessing nested NPT/EPT entries
> 	 * used for L2, i.e. if the gfn being written is for gPTEs that KVM is
> 	 * shadowing and has write-protected.  In this case, because AMD CPUs
> 	 * walk nested page table using a write operation, walking NPT entries
> 	 * in L1 can trigger write faults even when L1 isn't modifying PTEs.
> 	 * KVM would then emulate an excessive number of L1 instructions
> 	 * without triggering KVM's write-flooding detection, i.e. without
> 	 * unprotecting the gfn.  This is detected as a RO violation while
> 	 * translating the guest page when the current MMU is direct.
> 	 *
> 	 * Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
> 	 * unprotect the gfn and reenter the guest if an event is awaiting
> 	 * reinjection.  If KVM emulates multiple instructions before completing
> 	 * event injection, the event could be delayed beyond what is
> 	 * architecturally allowed, e.g. KVM could inject an IRQ after the
> 	 * TPR has been raised.
> 	 *
> 	 * In both cases, if one or more shadow pages were zapped, skip
> 	 * emulation and resume L1 to let it natively execute the instruction.
> 	 * If no shadow pages were zapped, then the write-fault is due to
> 	 * something else entirely and KVM needs to emulate, as resuming
> 	 * the guest will put it into an infinite loop.
> 
> Thanks,
> 
> Paolo
> 
> >   	 */
> > -	if (direct && (is_write_to_guest_page_table(error_code)) &&
> > +	if (((direct && is_write_to_guest_page_table(error_code)) ||
> > +	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
> >   	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
> >   		return RET_PF_FIXED;
>
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f64ad36ca9e0..d3c0220ff7ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2753,23 +2753,6 @@  bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
 	return r;
 }
 
-static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
-{
-	gpa_t gpa;
-	int r;
-
-	if (vcpu->arch.mmu->root_role.direct)
-		return 0;
-
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
-	if (gpa == INVALID_GPA)
-		return 0;
-
-	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
-
-	return r;
-}
-
 static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	trace_kvm_mmu_unsync_page(sp);
@@ -4640,8 +4623,6 @@  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	if (!flags) {
 		trace_kvm_page_fault(vcpu, fault_address, error_code);
 
-		if (kvm_event_needs_reinjection(vcpu))
-			kvm_mmu_unprotect_page_virt(vcpu, fault_address);
 		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
 				insn_len);
 	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
@@ -6037,8 +6018,15 @@  static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * execute the instruction.  If no shadow pages were zapped, then the
 	 * write-fault is due to something else entirely, i.e. KVM needs to
 	 * emulate, as resuming the guest will put it into an infinite loop.
+	 *
+	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
+	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
+	 * KVM emulates multiple instructions before completing even injection,
+	 * the event could be delayed beyond what is architecturally allowed,
+	 * e.g. KVM could inject an IRQ after the TPR has been raised.
 	 */
-	if (direct && (is_write_to_guest_page_table(error_code)) &&
+	if (((direct && is_write_to_guest_page_table(error_code)) ||
+	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
 	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
 		return RET_PF_FIXED;