Message ID | 20220415103414.86555-1-jiangshanlai@gmail.com (mailing list archive)
---|---
State | New, archived
Series | kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode
On Fri, Apr 15, 2022 at 6:33 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> When NPT is enabled and L1 is using PAE paging, vcpu->arch.mmu->get_pdptrs(),
> which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTE from memory
> unconditionally on each call.
>
> The guest PAE root page is not write-protected.
>
> The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different values
> every time, or differ from the return value of mmu->get_pdptrs() in
> mmu_alloc_shadow_roots().
>
> This can cause FNAME(fetch) to install the spte in a wrong sp, or to link a
> sp to a wrong parent, since FNAME(gpte_changed) can't detect these kinds of
> changes.
>
> Cache the PDPTEs and the problem is resolved. The guest is responsible for
> informing the host if its PAE root page is updated, which will cause a
> nested vmexit, and the host updates the cache on the next nested run.
>
> Commit e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from
> guest memory") fixes the same problem for the non-nested case.
>
> Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Ping
On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> When NPT is enabled and L1 is using PAE paging, vcpu->arch.mmu->get_pdptrs(),
> which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTE from memory
> unconditionally on each call.
>
> The guest PAE root page is not write-protected.
>
> The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different values
> every time, or differ from the return value of mmu->get_pdptrs() in
> mmu_alloc_shadow_roots().
>
> This can cause FNAME(fetch) to install the spte in a wrong sp, or to link a
> sp to a wrong parent, since FNAME(gpte_changed) can't detect these kinds of
> changes.
>
> Cache the PDPTEs and the problem is resolved. The guest is responsible for
> informing the host if its PAE root page is updated, which will cause a
> nested vmexit, and the host updates the cache on the next nested run.

Hmm, no, the guest is responsible for invalidating translations that can be
cached in the TLB, but the guest is not responsible for a full reload of
PDPTEs. Per the APM, the PDPTEs can be cached like regular PTEs:

  Under SVM, however, when the processor is in guest mode with PAE enabled,
  the guest PDPT entries are not cached or validated at this point, but
  instead are loaded and checked on demand in the normal course of address
  translation, just like page directory and page table entries. Any
  reserved bit violations are detected at the point of use, and result in a
  page-fault (#PF) exception rather than a general-protection (#GP)
  exception.

So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and
valid), then any active L2 vCPUs should recognize the new PDPTE without a
nested VM-Exit, because the old entry can't have been cached in the TLB.

In practice, snapshotting at nested VMRUN would likely work, but
architecturally it's wrong and could cause problems if L1+L2 are engaged in
paravirt shenanigans, e.g. async #PF comes to mind.

I believe the correct way to fix this is to write-protect nNPT PDPTEs like
all other shadow pages, which shouldn't be too awful to do as part of your
series to route PDPTEs through kvm_mmu_get_page().
On Tue, May 17, 2022 at 4:45 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > When NPT is enabled and L1 is using PAE paging, vcpu->arch.mmu->get_pdptrs(),
> > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTE from memory
> > unconditionally on each call.
> >
> > The guest PAE root page is not write-protected.
> >
> > The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different values
> > every time, or differ from the return value of mmu->get_pdptrs() in
> > mmu_alloc_shadow_roots().
> >
> > This can cause FNAME(fetch) to install the spte in a wrong sp, or to link a
> > sp to a wrong parent, since FNAME(gpte_changed) can't detect these kinds of
> > changes.
> >
> > Cache the PDPTEs and the problem is resolved. The guest is responsible for
> > informing the host if its PAE root page is updated, which will cause a
> > nested vmexit, and the host updates the cache on the next nested run.
>
> Hmm, no, the guest is responsible for invalidating translations that can be
> cached in the TLB, but the guest is not responsible for a full reload of
> PDPTEs. Per the APM, the PDPTEs can be cached like regular PTEs:
>
>   Under SVM, however, when the processor is in guest mode with PAE enabled,
>   the guest PDPT entries are not cached or validated at this point, but
>   instead are loaded and checked on demand in the normal course of address
>   translation, just like page directory and page table entries. Any
>   reserved bit violations are detected at the point of use, and result in a
>   page-fault (#PF) exception rather than a general-protection (#GP)
>   exception.
>
> So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and
> valid), then any active L2 vCPUs should recognize the new PDPTE without a
> nested VM-Exit, because the old entry can't have been cached in the TLB.

In this case, it is still !PRESENT in the shadow page, and it will cause a
vmexit when L2 tries to use the translation. I can't see anything wrong in
the TLB or the vTLB (shadow pages).

But I think some code is needed to reload the cached PDPTEs when
guest_mmu->get_pdptrs() returns !PRESENT, and to reload the mmu if the
cache has changed (plus code to avoid an infinite loop).

The patch fails to reload the mmu if the cache is changed, which leaves the
problem described in the changelog only partially resolved.

Maybe we need to add the mmu parameter back to load_pdptrs() for this.

> In practice, snapshotting at nested VMRUN would likely work, but
> architecturally it's wrong and could cause problems if L1+L2 are engaged in
> paravirt shenanigans, e.g. async #PF comes to mind.
>
> I believe the correct way to fix this is to write-protect nNPT PDPTEs like
> all other shadow pages, which shouldn't be too awful to do as part of your
> series to route PDPTEs through kvm_mmu_get_page().

In the one-off special shadow page (to be renamed one-off local shadow
page) patchset, PAE PDPTEs are not write-protected. Write-protecting them
causes nasty code.
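The reload rule proposed in this reply — re-read a cached !PRESENT PDPTE
from guest memory and force an mmu reload when its value turns out to have
changed, without looping forever — can be sketched in plain C. Everything
below is an illustrative model (hypothetical names, flat arrays standing in
for guest memory and mmu->pdptrs), not actual KVM code:

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_PRESENT_MASK 1ULL

static uint64_t guest_root[4]; /* stands in for the guest's PAE root page */
static uint64_t cached[4];     /* stands in for the cached PDPTEs */

/*
 * If the cached PDPTE is !PRESENT, re-read it from guest memory; if the
 * value changed, refresh the cache and report that the mmu must be
 * reloaded.  Present entries are never re-read, and an unchanged entry
 * never triggers a reload, which is what bounds the retry loop.
 */
static bool pdptr_needs_reload(int index)
{
	if (cached[index] & PT_PRESENT_MASK)
		return false;                 /* present: keep the cache */
	if (guest_root[index] == cached[index])
		return false;                 /* still !PRESENT, no change */
	cached[index] = guest_root[index];    /* refresh the cache... */
	return true;                          /* ...and ask for an mmu reload */
}
```

A caller (the page-fault path, in this sketch) would invoke
pdptr_needs_reload() once per fault and only pay for a full mmu reload when
the guest actually rewrote a previously-!PRESENT entry.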
On Tue, May 17, 2022 at 9:02 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> On Tue, May 17, 2022 at 4:45 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > When NPT is enabled and L1 is using PAE paging,
> > > vcpu->arch.mmu->get_pdptrs(), which is nested_svm_get_tdp_pdptr(),
> > > reads the guest NPT's PDPTE from memory unconditionally on each call.
> > >
> > > The guest PAE root page is not write-protected.
> > >
> > > The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different
> > > values every time, or differ from the return value of
> > > mmu->get_pdptrs() in mmu_alloc_shadow_roots().
> > >
> > > This can cause FNAME(fetch) to install the spte in a wrong sp, or to
> > > link a sp to a wrong parent, since FNAME(gpte_changed) can't detect
> > > these kinds of changes.
> > >
> > > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > > for informing the host if its PAE root page is updated, which will
> > > cause a nested vmexit, and the host updates the cache on the next
> > > nested run.
> >
> > Hmm, no, the guest is responsible for invalidating translations that can
> > be cached in the TLB, but the guest is not responsible for a full reload
> > of PDPTEs. Per the APM, the PDPTEs can be cached like regular PTEs:
> >
> >   Under SVM, however, when the processor is in guest mode with PAE
> >   enabled, the guest PDPT entries are not cached or validated at this
> >   point, but instead are loaded and checked on demand in the normal
> >   course of address translation, just like page directory and page table
> >   entries. Any reserved bit violations are detected at the point of use,
> >   and result in a page-fault (#PF) exception rather than a
> >   general-protection (#GP) exception.
> >
> > So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and
> > valid), then any active L2 vCPUs should recognize the new PDPTE without
> > a nested VM-Exit, because the old entry can't have been cached in the TLB.
>
> In this case, it is still !PRESENT in the shadow page, and it will cause a
> vmexit when L2 tries to use the translation. I can't see anything wrong in
> the TLB or the vTLB (shadow pages).
>
> But I think some code is needed to reload the cached PDPTEs when
> guest_mmu->get_pdptrs() returns !PRESENT, and to reload the mmu if the
> cache has changed (plus code to avoid an infinite loop).
>
> The patch fails to reload the mmu if the cache is changed, which leaves
> the problem described in the changelog only partially resolved.
>
> Maybe we need to add the mmu parameter back to load_pdptrs() for this.

It is still too complicated; I will try to do a direct check in
FNAME(fetch) instead of (re-)using the cache.

> > In practice, snapshotting at nested VMRUN would likely work, but
> > architecturally it's wrong and could cause problems if L1+L2 are engaged
> > in paravirt shenanigans, e.g. async #PF comes to mind.
> >
> > I believe the correct way to fix this is to write-protect nNPT PDPTEs
> > like all other shadow pages, which shouldn't be too awful to do as part
> > of your series to route PDPTEs through kvm_mmu_get_page().
>
> In the one-off special shadow page (to be renamed one-off local shadow
> page) patchset, PAE PDPTEs are not write-protected. Write-protecting them
> causes nasty code.
On Mon, May 16, 2022 at 2:06 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > When NPT is enabled and L1 is using PAE paging, vcpu->arch.mmu->get_pdptrs(),
> > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTE from memory
> > unconditionally on each call.
> >
> > The guest PAE root page is not write-protected.
> >
> > The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different values
> > every time, or differ from the return value of mmu->get_pdptrs() in
> > mmu_alloc_shadow_roots().
> >
> > This can cause FNAME(fetch) to install the spte in a wrong sp, or to link a
> > sp to a wrong parent, since FNAME(gpte_changed) can't detect these kinds of
> > changes.
> >
> > Cache the PDPTEs and the problem is resolved. The guest is responsible for
> > informing the host if its PAE root page is updated, which will cause a
> > nested vmexit, and the host updates the cache on the next nested run.
>
> Hmm, no, the guest is responsible for invalidating translations that can be
> cached in the TLB, but the guest is not responsible for a full reload of
> PDPTEs. Per the APM, the PDPTEs can be cached like regular PTEs:
>
>   Under SVM, however, when the processor is in guest mode with PAE enabled,
>   the guest PDPT entries are not cached or validated at this point, but
>   instead are loaded and checked on demand in the normal course of address
>   translation, just like page directory and page table entries. Any
>   reserved bit violations are detected at the point of use, and result in a
>   page-fault (#PF) exception rather than a general-protection (#GP)
>   exception.

This paragraph from the APM describes the behavior of CR3 loads while in
SVM guest mode. But this patch is changing how KVM emulates SVM host mode
(i.e. L1), right? It seems like AMD makes no guarantee whether or not CR3
loads pre-load PDPTEs while in SVM host mode. (Although the APM does say
that "modern processors" do not pre-load PDPTEs.)

> So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and
> valid), then any active L2 vCPUs should recognize the new PDPTE without a
> nested VM-Exit, because the old entry can't have been cached in the TLB.
>
> In practice, snapshotting at nested VMRUN would likely work, but
> architecturally it's wrong and could cause problems if L1+L2 are engaged in
> paravirt shenanigans, e.g. async #PF comes to mind.
>
> I believe the correct way to fix this is to write-protect nNPT PDPTEs like
> all other shadow pages, which shouldn't be too awful to do as part of your
> series to route PDPTEs through kvm_mmu_get_page().
On Thu, May 26, 2022 at 5:45 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, May 16, 2022 at 2:06 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > When NPT is enabled and L1 is using PAE paging,
> > > vcpu->arch.mmu->get_pdptrs(), which is nested_svm_get_tdp_pdptr(),
> > > reads the guest NPT's PDPTE from memory unconditionally on each call.
> > >
> > > The guest PAE root page is not write-protected.
> > >
> > > The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different
> > > values every time, or differ from the return value of
> > > mmu->get_pdptrs() in mmu_alloc_shadow_roots().
> > >
> > > This can cause FNAME(fetch) to install the spte in a wrong sp, or to
> > > link a sp to a wrong parent, since FNAME(gpte_changed) can't detect
> > > these kinds of changes.
> > >
> > > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > > for informing the host if its PAE root page is updated, which will
> > > cause a nested vmexit, and the host updates the cache on the next
> > > nested run.
> >
> > Hmm, no, the guest is responsible for invalidating translations that can
> > be cached in the TLB, but the guest is not responsible for a full reload
> > of PDPTEs. Per the APM, the PDPTEs can be cached like regular PTEs:
> >
> >   Under SVM, however, when the processor is in guest mode with PAE
> >   enabled, the guest PDPT entries are not cached or validated at this
> >   point, but instead are loaded and checked on demand in the normal
> >   course of address translation, just like page directory and page table
> >   entries. Any reserved bit violations are detected at the point of use,
> >   and result in a page-fault (#PF) exception rather than a
> >   general-protection (#GP) exception.
>
> This paragraph from the APM describes the behavior of CR3 loads while
> in SVM guest mode. But this patch is changing how KVM emulates SVM
> host mode (i.e. L1), right? It seems like AMD makes no guarantee
> whether or not CR3 loads pre-load PDPTEs while in SVM host mode.
> (Although the APM does say that "modern processors" do not pre-load
> PDPTEs.)

Oh, I also missed the fact that L1 is the host when emulating it.

The code is for host mode (L1)'s nested_cr3, which uses the traditional
PAE PDPTE loading and checking.

So using caches is the only correct way, right? If so, I can update this
patch only (not adding it to the patchset of one-off local shadow pages)
and add some checks to see if the loaded caches have changed.

Maybe I should just ignore it, since I'm not familiar enough with SVM.
I hope it serves as a bug report.

Thanks
Lai
On 5/26/22 10:38, Lai Jiangshan wrote:
>> (Although the APM does say that "modern processors" do not pre-load
>> PDPTEs.)

This changed between the Oct 2020 and Nov 2021 revisions, so I suppose the
change was done in Zen 3.

> Oh, I also missed the fact that L1 is the host when emulating it.
>
> The code is for host mode (L1)'s nested_cr3, which uses the traditional
> PAE PDPTE loading and checking.
>
> So using caches is the only correct way, right?

The caching behavior for NPT PDPTEs does not matter too much. What matters
is that a PDPTE with reserved bits should cause a #NPF at usage time rather
than a VMentry failure or a #NPF immediately after VMentry.

Paolo
On Thu, May 26, 2022 at 5:33 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/26/22 10:38, Lai Jiangshan wrote:
> >> (Although the APM does say that "modern processors" do not pre-load
> >> PDPTEs.)
>
> This changed between the Oct 2020 and Nov 2021 revisions, so I suppose the
> change was done in Zen 3.
>
> > Oh, I also missed the fact that L1 is the host when emulating it.
> >
> > The code is for host mode (L1)'s nested_cr3, which uses the traditional
> > PAE PDPTE loading and checking.
> >
> > So using caches is the only correct way, right?
>
> The caching behavior for NPT PDPTEs does not matter too much. What matters
> is that a PDPTE with reserved bits should cause a #NPF at usage time
> rather than a VMentry failure or a #NPF immediately after VMentry.

Since there is an mmu->get_pdptrs() call in mmu_alloc_shadow_roots(), you
can't conform to this now. It will be easier to cause a #NPF only at usage
time after the one-off local shadow page patchset.
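Paolo's constraint amounts to deferring PDPTE validation to the point of
translation: nothing is checked at VMRUN, and a bad entry only faults when
it is used. A minimal C model of such a deferred check follows; the
reserved-bit mask here (bits 1-2 and 5-8) is illustrative only, since the
real set also depends on the CPU's MAXPHYADDR, and the function name is
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative PAE PDPTE reserved-bit mask: bits 1-2 and 5-8.  The real
 * mask also covers the bits above MAXPHYADDR, which this sketch ignores.
 */
#define PDPTE_PRESENT   0x1ULL
#define PDPTE_RSVD_MASK 0x1E6ULL

/*
 * Deferred check: the PDPTE is not validated when it is loaded; a bad
 * entry is only reported (as a #PF/#NPF) when the translation is used.
 */
static bool pdpte_faults_on_use(uint64_t pdpte)
{
	if (!(pdpte & PDPTE_PRESENT))
		return true;                    /* ordinary not-present fault */
	return (pdpte & PDPTE_RSVD_MASK) != 0;  /* reserved bit set */
}
```

The point of deferring the check is exactly what the APM paragraph quoted
earlier in the thread describes: a reserved-bit violation surfaces as a
page fault on use, never as a VMentry failure.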
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d736ec6514ca..a34983d2dc07 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -72,18 +72,22 @@ static void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_excep
 	}
 }
 
-static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
+static void nested_svm_cache_tdp_pdptrs(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u64 cr3 = svm->nested.ctl.nested_cr3;
-	u64 pdpte;
+	u64 *pdptrs = vcpu->arch.mmu->pdptrs;
 	int ret;
 
-	ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
-				       offset_in_page(cr3) + index * 8, 8);
+	ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), pdptrs,
+				       offset_in_page(cr3), 8 * 4);
 	if (ret)
-		return 0;
-	return pdpte;
+		memset(pdptrs, 0, 8 * 4);
+}
+
+static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+	return vcpu->arch.mmu->pdptrs[index];
 }
 
 static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu)
@@ -109,6 +113,8 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
 	kvm_init_shadow_npt_mmu(vcpu, X86_CR0_PG, svm->vmcb01.ptr->save.cr4,
 				svm->vmcb01.ptr->save.efer,
 				svm->nested.ctl.nested_cr3);
+	if (vcpu->arch.mmu->root_level == PT32E_ROOT_LEVEL)
+		nested_svm_cache_tdp_pdptrs(vcpu);
 	vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
 	vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr;
 	vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit;
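For readers without the tree at hand, the net effect of the patch can be
modeled in a few lines of C: the four PDPTEs are snapshotted from guest
memory once (at nested VMRUN) and later lookups are served from that
snapshot, so a concurrent guest write can no longer change the value
observed between FNAME(walk_addr_generic) and mmu_alloc_shadow_roots().
The names and flat arrays below are illustrative stand-ins for guest memory
and mmu->pdptrs, not KVM's actual API:

```c
#include <stdint.h>
#include <string.h>

#define PDPTE_COUNT 4

static uint64_t guest_pae_root[PDPTE_COUNT]; /* stands in for guest memory */
static uint64_t cached_pdptrs[PDPTE_COUNT];  /* stands in for mmu->pdptrs */

/* Models nested_svm_cache_tdp_pdptrs(): snapshot all four entries at once. */
static void cache_tdp_pdptrs(void)
{
	memcpy(cached_pdptrs, guest_pae_root, sizeof(cached_pdptrs));
}

/* Models the patched nested_svm_get_tdp_pdptr(): serve from the snapshot. */
static uint64_t get_tdp_pdptr(int index)
{
	return cached_pdptrs[index];
}
```

Two reads of the same index now always agree, even if the guest rewrites
its PAE root page in between, which is the invariant FNAME(gpte_changed)
relies on and which the unconditional re-read broke.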