| Message ID | 20240809190319.1710470-16-seanjc@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: x86: Fix multiple #PF RO infinite loop bugs |
On 8/9/24 21:03, Sean Christopherson wrote:
> Move the event re-injection unprotect+retry logic into
> kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
> the #PF actually hit a write-protected gfn.  Note, there is a small
> possibility that the gfn was unprotected by a different task between
> hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
> the guest immediately anyways because KVM will treat the fault as spurious.
>
> As a bonus, unprotecting _after_ handling the page fault also addresses the
> case where installing a SPTE to handle the fault encounters a shadowed PTE,
> i.e. *creates* a read-only SPTE.
>
> Opportunistically add a comment explaining what on earth the intent of the
> code is, based on the changelog of commit 577bdc496614 ("KVM: Avoid
> instruction emulation when event delivery is pending").
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
>  1 file changed, 8 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f64ad36ca9e0..d3c0220ff7ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2753,23 +2753,6 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
>  	return r;
>  }
>
> -static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
> -{
> -	gpa_t gpa;
> -	int r;
> -
> -	if (vcpu->arch.mmu->root_role.direct)
> -		return 0;
> -
> -	gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
> -	if (gpa == INVALID_GPA)
> -		return 0;
> -
> -	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> -
> -	return r;
> -}
> -
>  static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
>  	trace_kvm_mmu_unsync_page(sp);
> @@ -4640,8 +4623,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	if (!flags) {
>  		trace_kvm_page_fault(vcpu, fault_address, error_code);
>
> -		if (kvm_event_needs_reinjection(vcpu))
> -			kvm_mmu_unprotect_page_virt(vcpu, fault_address);
>  		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
>  				       insn_len);
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	 * execute the instruction.  If no shadow pages were zapped, then the
>  	 * write-fault is due to something else entirely, i.e. KVM needs to
>  	 * emulate, as resuming the guest will put it into an infinite loop.
> +	 *
> +	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> +	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
> +	 * KVM emulates multiple instructions before completing event injection,
> +	 * the event could be delayed beyond what is architecturally allowed,
> +	 * e.g. KVM could inject an IRQ after the TPR has been raised.

This paragraph should go before the description of
kvm_mmu_unprotect_gfn_and_retry():

 * There are two cases in which we try to unprotect the page here
 * preemptively, i.e. zap any shadow pages, before emulating the
 * instruction.
 *
 * First, the access may be due to L1 accessing nested NPT/EPT entries
 * used for L2, i.e. if the gfn being written is for gPTEs that KVM is
 * shadowing and has write-protected.  In this case, because AMD CPUs
 * walk nested page tables using a write operation, walking NPT entries
 * in L1 can trigger write faults even when L1 isn't modifying PTEs.
 * KVM would then emulate an excessive number of L1 instructions
 * without triggering KVM's write-flooding detection, i.e. without
 * unprotecting the gfn.  This is detected as a RO violation while
 * translating the guest page when the current MMU is direct.
 *
 * Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
 * unprotect the gfn and reenter the guest if an event is awaiting
 * reinjection.  If KVM emulates multiple instructions before completing
 * event injection, the event could be delayed beyond what is
 * architecturally allowed, e.g. KVM could inject an IRQ after the
 * TPR has been raised.
 *
 * In both cases, if one or more shadow pages were zapped, skip
 * emulation and resume L1 to let it natively execute the instruction.
 * If no shadow pages were zapped, then the write-fault is due to
 * something else entirely and KVM needs to emulate, as resuming
 * the guest will put it into an infinite loop.

Thanks,

Paolo

>  	 */
> -	if (direct && (is_write_to_guest_page_table(error_code)) &&
> +	if (((direct && is_write_to_guest_page_table(error_code)) ||
> +	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
>  	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
>  		return RET_PF_FIXED;
>
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  	 * execute the instruction.  If no shadow pages were zapped, then the
> >  	 * write-fault is due to something else entirely, i.e. KVM needs to
> >  	 * emulate, as resuming the guest will put it into an infinite loop.
> > +	 *
> > +	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> > +	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
> > +	 * KVM emulates multiple instructions before completing event injection,
> > +	 * the event could be delayed beyond what is architecturally allowed,
> > +	 * e.g. KVM could inject an IRQ after the TPR has been raised.
>
> This paragraph should go before the description of
> kvm_mmu_unprotect_gfn_and_retry():

Hmm, I disagree.  The comment ends up being disconnected from the code,
especially by the end of the series.  E.g. when reading
kvm_mmu_write_protect_fault(), someone would have to jump twice (to
kvm_mmu_unprotect_gfn_and_retry() and then __kvm_mmu_unprotect_gfn_and_retry())
in order to understand the checks buried in kvm_mmu_write_protect_fault().

And the comment also becomes stale when kvm_mmu_unprotect_gfn_and_retry() is
used by x86_emulate_instruction().  That's obviously solvable by extending the
function comment, but then we end up with a rather massive function comment
that documents its callers, not the function itself.

>  * There are two cases in which we try to unprotect the page here
>  * preemptively, i.e. zap any shadow pages, before emulating the
>  * instruction.
>  *
>  * First, the access may be due to L1 accessing nested NPT/EPT entries
>  * used for L2, i.e. if the gfn being written is for gPTEs that KVM is
>  * shadowing and has write-protected.  In this case, because AMD CPUs
>  * walk nested page tables using a write operation, walking NPT entries
>  * in L1 can trigger write faults even when L1 isn't modifying PTEs.
>  * KVM would then emulate an excessive number of L1 instructions
>  * without triggering KVM's write-flooding detection, i.e. without
>  * unprotecting the gfn.  This is detected as a RO violation while
>  * translating the guest page when the current MMU is direct.
>  *
>  * Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
>  * unprotect the gfn and reenter the guest if an event is awaiting
>  * reinjection.  If KVM emulates multiple instructions before completing
>  * event injection, the event could be delayed beyond what is
>  * architecturally allowed, e.g. KVM could inject an IRQ after the
>  * TPR has been raised.
>  *
>  * In both cases, if one or more shadow pages were zapped, skip
>  * emulation and resume L1 to let it natively execute the instruction.
>  * If no shadow pages were zapped, then the write-fault is due to
>  * something else entirely and KVM needs to emulate, as resuming
>  * the guest will put it into an infinite loop.
>
> Thanks,
>
> Paolo
>
> >  	 */
> > -	if (direct && (is_write_to_guest_page_table(error_code)) &&
> > +	if (((direct && is_write_to_guest_page_table(error_code)) ||
> > +	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
> >  	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
> >  		return RET_PF_FIXED;
> >
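For reference, the check under discussion, as it reads after this patch, is reproduced below from the kvm_mmu_write_protect_fault() hunk in the diff, with the two cases from the thread spelled out as comments. The annotations summarize the discussion and are not part of the posted patch.

```c
	/*
	 * Annotations summarizing the thread (not in the posted patch):
	 *
	 * Case 1 (direct MMU): the write hit a gfn holding gPTEs that KVM
	 * has write-protected, e.g. L1 touching nested NPT/EPT entries
	 * used for L2; AMD CPUs walk nested page tables with write
	 * accesses, so this can fire even when L1 isn't modifying PTEs.
	 *
	 * Case 2 (indirect MMU): an event is awaiting reinjection, so
	 * unprotect and resume the guest rather than emulate multiple
	 * instructions and delay the event beyond what the architecture
	 * allows, e.g. injecting an IRQ after the TPR has been raised.
	 */
	if (((direct && is_write_to_guest_page_table(error_code)) ||
	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
		return RET_PF_FIXED;
```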
```diff
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f64ad36ca9e0..d3c0220ff7ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2753,23 +2753,6 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
 	return r;
 }
 
-static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
-{
-	gpa_t gpa;
-	int r;
-
-	if (vcpu->arch.mmu->root_role.direct)
-		return 0;
-
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
-	if (gpa == INVALID_GPA)
-		return 0;
-
-	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
-
-	return r;
-}
-
 static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	trace_kvm_mmu_unsync_page(sp);
@@ -4640,8 +4623,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	if (!flags) {
 		trace_kvm_page_fault(vcpu, fault_address, error_code);
 
-		if (kvm_event_needs_reinjection(vcpu))
-			kvm_mmu_unprotect_page_virt(vcpu, fault_address);
 		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
 				       insn_len);
 	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
@@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * execute the instruction.  If no shadow pages were zapped, then the
 	 * write-fault is due to something else entirely, i.e. KVM needs to
 	 * emulate, as resuming the guest will put it into an infinite loop.
+	 *
+	 * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
+	 * unprotect the gfn and retry if an event is awaiting reinjection.  If
+	 * KVM emulates multiple instructions before completing event injection,
+	 * the event could be delayed beyond what is architecturally allowed,
+	 * e.g. KVM could inject an IRQ after the TPR has been raised.
 	 */
-	if (direct && (is_write_to_guest_page_table(error_code)) &&
+	if (((direct && is_write_to_guest_page_table(error_code)) ||
+	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
 	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
 		return RET_PF_FIXED;
 
```
Move the event re-injection unprotect+retry logic into
kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
the #PF actually hit a write-protected gfn.  Note, there is a small
possibility that the gfn was unprotected by a different task between
hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
the guest immediately anyways because KVM will treat the fault as spurious.

As a bonus, unprotecting _after_ handling the page fault also addresses the
case where installing a SPTE to handle the fault encounters a shadowed PTE,
i.e. *creates* a read-only SPTE.

Opportunistically add a comment explaining what on earth the intent of the
code is, based on the changelog of commit 577bdc496614 ("KVM: Avoid
instruction emulation when event delivery is pending").

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)
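To make the commit message concrete, here is a condensed before/after view of where the event-reinjection unprotect happens, assembled only from the hunks in the diff above; all surrounding code is elided and this is not a complete excerpt of either function.

```c
/*
 * Before: kvm_handle_page_fault() unprotected pre-emptively, keyed only on a
 * pending event and a best-effort GVA->GPA write translation, before the
 * fault was handled (this call is removed by the patch).
 */
	if (kvm_event_needs_reinjection(vcpu))
		kvm_mmu_unprotect_page_virt(vcpu, fault_address);
	r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn, insn_len);

/*
 * After: kvm_mmu_write_protect_fault() unprotects and retries only once the
 * #PF is known to have hit a write-protected gfn, for either the direct-MMU
 * write-to-gPTE case or the indirect-MMU pending-event case.
 */
	if (((direct && is_write_to_guest_page_table(error_code)) ||
	     (!direct && kvm_event_needs_reinjection(vcpu))) &&
	    kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
		return RET_PF_FIXED;
```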