| Message ID | 20210707183616.5620-38-brijesh.singh@amd.com |
| --- | --- |
| State | Not Applicable |
| Delegated to | Herbert Xu |
| Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support |
On Wed, Jul 07, 2021, Brijesh Singh wrote:
> Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
> resolve the RMP violation encountered during the NPT table walk.

Heh, please elaborate on exactly what that recommendation is.  A recommendation
isn't exactly architectural, i.e. is subject to change :-)

And, do we have to follow the APM's recommendation?  Specifically, can KVM treat
#NPF RMP violations as guest errors, or is that not allowed by the GHCB spec?
I.e. can we mandate accesses be preceded by page state change requests?  It would
simplify KVM (albeit not much of a simplification) and would also make debugging
easier since transitions would require an explicit guest request and guest bugs
would result in errors instead of random corruption/weirdness.

> index 46323af09995..117e2e08d7ed 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
>
>  	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>  	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
> +
> +	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
> +				     int level, u64 error_code);
>  };
>
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e60f54455cdc..b6a676ba1862 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  }
>
> +static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> +{
> +	kvm_pfn_t pfn;
> +	int level;
> +
> +	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
> +		return RET_PF_RETRY;
> +
> +	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
> +	return RET_PF_RETRY;
> +}
> +
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>  		       void *insn, int insn_len)
>  {
> @@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>  			goto emulate;
>  	}
>
> +	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
> +		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);

Adding a kvm_x86_ops hook is silly; there's literally one path, npf_interception(),
that can encounter RMP violations.  Just invoke snp_handle_rmp_page_fault() from
there.  That works even if kvm_mmu_get_tdp_walk() stays around since it was
exported earlier.

> +		if (r == RET_PF_RETRY)
> +			return 1;
> +		else
> +			return r;
> +	}
> +
>  	if (r == RET_PF_INVALID) {
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
>  					  lower_32_bits(error_code), false);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 839cf321c6dd..53a60edc810e 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3519,3 +3519,60 @@ void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn
>  		BUG_ON(rc != 0);
>  	}
>  }
> +
> +int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
> +			      int level, u64 error_code)
> +{
> +	struct rmpentry *e;
> +	int rlevel, rc = 0;
> +	bool private;
> +	gfn_t gfn;
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rlevel);
> +	if (!e)
> +		return 1;
> +
> +	private = !!(error_code & PFERR_GUEST_ENC_MASK);
> +
> +	/*
> +	 * See APM section 15.36.11 on how to handle the RMP fault for the large pages.

Please do not punt the reader to the APM for things like this.  It's ok when
there are gory details about CPU behavior that aren't worth commenting, but
under no circumstance should KVM's software implementation be "documented" in
a CPU spec.

> +	 *
> +	 *  npt    rmp    access    action
> +	 *  --------------------------------------------------
> +	 *  4k     2M     C=1       psmash
> +	 *  x      x      C=1       if page is not private then add a new RMP entry
> +	 *  x      x      C=0       if page is private then make it shared
> +	 *  2M     4k     C=x       zap
> +	 */
> +	if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
> +	    ((level == PG_LEVEL_4K) && (rlevel == PG_LEVEL_2M) && private)) {
> +		rc = snp_rmptable_psmash(vcpu, pfn);
> +		goto zap_gfn;
> +	}
> +
> +	/*
> +	 * If it's a private access, and the page is not assigned in the RMP table, create a
> +	 * new private RMP entry.
> +	 */
> +	if (!rmpentry_assigned(e) && private) {
> +		rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
> +		goto zap_gfn;
> +	}
> +
> +	/*
> +	 * If it's a shared access, then make the page shared in the RMP table.
> +	 */
> +	if (rmpentry_assigned(e) && !private)
> +		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);

Hrm, this really feels like it needs to be protected by mmu_lock.  Functionally,
it might all work out in the end after enough RMP violations, but it's extremely
difficult to reason about and probably even more difficult if multiple vCPUs end
up fighting over a gfn.

My gut reaction is that this is also backwards, i.e. KVM should update the RMP
to match its TDP SPTEs, not the other way around.

The one big complication is that the TDP MMU only takes mmu_lock for read.  A few
options come to mind but none of them are all that pretty.  I'll wait to hear back
on whether or not we can make PSC requests mandatory before thinking too hard on
this one.

> +zap_gfn:
> +	/*
> +	 * Now that we have updated the RMP pagesize, zap the existing rmaps for
> +	 * large entry ranges so that nested page table gets rebuilt with the updated RMP
> +	 * pagesize.
> +	 */
> +	gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +	kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + 512);
> +
> +	return 0;
> +}
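To make the suggestion above concrete, here is a minimal sketch of handling the
RMP violation directly in npf_interception(), assuming the 5.14-era SVM exit
handler and the helpers introduced by this patch (kvm_mmu_get_tdp_walk(),
snp_handle_rmp_page_fault()); the exact return-value plumbing is illustrative
only and is not part of the posted series:

    static int npf_interception(struct kvm_vcpu *vcpu)
    {
            struct vcpu_svm *svm = to_svm(vcpu);
            u64 fault_address = svm->vmcb->control.exit_info_2;
            u64 error_code = svm->vmcb->control.exit_info_1;

            trace_kvm_page_fault(fault_address, error_code);

            /* Fix up the RMP here and let the vCPU retry the faulting access. */
            if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
                    kvm_pfn_t pfn;
                    int level;

                    if (kvm_mmu_get_tdp_walk(vcpu, fault_address, &pfn, &level))
                            snp_handle_rmp_page_fault(vcpu, fault_address, pfn,
                                                      level, error_code);
                    return 1;
            }

            return kvm_mmu_page_fault(vcpu, fault_address, error_code,
                                      static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
                                      svm->vmcb->control.insn_bytes : NULL,
                                      svm->vmcb->control.insn_len);
    }

The point of the suggestion is that the RMP-specific logic never needs to round-trip
through kvm_mmu_page_fault() or a kvm_x86_ops hook, since npf_interception() is the
only path that can see PFERR_GUEST_RMP_MASK.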
On 7/19/21 7:10 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
>> resolve the RMP violation encountered during the NPT table walk.
>
> Heh, please elaborate on exactly what that recommendation is.  A recommendation
> isn't exactly architectural, i.e. is subject to change :-)

I will try to expand it :)

> And, do we have to follow the APM's recommendation?

Yes, unless we want to be very strict on what a guest can do.

> Specifically, can KVM treat
> #NPF RMP violations as guest errors, or is that not allowed by the GHCB spec?

The GHCB spec does not say anything about the #NPF RMP violation error. And
not every #NPF RMP violation is a guest error (mainly the size-mismatch cases,
etc.).

> I.e. can we mandate accesses be preceded by page state change requests?

This is a good question, the GHCB spec does not enforce that a guest *must*
use page state changes. If the page state change is not done by the guest then
it will cause #NPF and it's up to the hypervisor to decide on what it wants to
do.

> It would
> simplify KVM (albeit not much of a simplification) and would also make debugging
> easier since transitions would require an explicit guest request and guest bugs
> would result in errors instead of random corruption/weirdness.

I am good with enforcing this from the KVM. But the question is, what fault
we should inject in the guest when KVM detects that the guest has not issued
the page state change.

>> index 46323af09995..117e2e08d7ed 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
>>
>>  	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>>  	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>> +
>> +	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
>> +				     int level, u64 error_code);
>>  };
>>
>>  struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index e60f54455cdc..b6a676ba1862 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>>  	write_unlock(&vcpu->kvm->mmu_lock);
>>  }
>>
>> +static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
>> +{
>> +	kvm_pfn_t pfn;
>> +	int level;
>> +
>> +	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
>> +		return RET_PF_RETRY;
>> +
>> +	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
>> +	return RET_PF_RETRY;
>> +}
>> +
>>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>>  		       void *insn, int insn_len)
>>  {
>> @@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>>  			goto emulate;
>>  	}
>>
>> +	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
>> +		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);
>
> Adding a kvm_x86_ops hook is silly; there's literally one path, npf_interception(),
> that can encounter RMP violations.  Just invoke snp_handle_rmp_page_fault() from
> there.  That works even if kvm_mmu_get_tdp_walk() stays around since it was
> exported earlier.

Noted.

>> +
>> +	/*
>> +	 * If it's a shared access, then make the page shared in the RMP table.
>> +	 */
>> +	if (rmpentry_assigned(e) && !private)
>> +		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
>
> Hrm, this really feels like it needs to be protected by mmu_lock.  Functionally,
> it might all work out in the end after enough RMP violations, but it's extremely
> difficult to reason about and probably even more difficult if multiple vCPUs end
> up fighting over a gfn.

Let's see what your thoughts are on enforcing the page state change from KVM.
If we want the guest to issue the page state change before the access, then
this case will simply need to inject an error into the guest and we can remove
all of it.

> My gut reaction is that this is also backwards, i.e. KVM should update the RMP
> to match its TDP SPTEs, not the other way around.
>
> The one big complication is that the TDP MMU only takes mmu_lock for read.  A few
> options come to mind but none of them are all that pretty.  I'll wait to hear back
> on whether or not we can make PSC requests mandatory before thinking too hard on
> this one.
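As a rough illustration of the mmu_lock concern quoted above (and only that; it
does not resolve the point that the TDP MMU faults with mmu_lock held for read),
the shared/private conversion could be serialized under mmu_lock so that vCPUs
racing on the same gfn are ordered, with the zap left outside the critical
section. The helper names are the ones from this patch, and whether this locking
is actually acceptable is exactly what is being debated:

    /* Sketch only: serialize the RMP fixup against other vCPUs. */
    write_lock(&vcpu->kvm->mmu_lock);
    if (!rmpentry_assigned(e) && private)
            rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
    else if (rmpentry_assigned(e) && !private)
            rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
    write_unlock(&vcpu->kvm->mmu_lock);

    /* kvm_zap_gfn_range() takes mmu_lock itself, so it must stay outside. */
    gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
    kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));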
On Tue, Jul 20, 2021, Brijesh Singh wrote:
>
> On 7/19/21 7:10 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > > Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
> > > resolve the RMP violation encountered during the NPT table walk.
> >
> > Heh, please elaborate on exactly what that recommendation is.  A recommendation
> > isn't exactly architectural, i.e. is subject to change :-)
>
> I will try to expand it :)
>
> > And, do we have to follow the APM's recommendation?
>
> Yes, unless we want to be very strict on what a guest can do.
>
> > Specifically, can KVM treat #NPF RMP violations as guest errors, or is that
> > not allowed by the GHCB spec?
>
> The GHCB spec does not say anything about the #NPF RMP violation error. And
> not every #NPF RMP violation is a guest error (mainly the size-mismatch cases,
> etc.).
>
> > I.e. can we mandate accesses be preceded by page state change requests?
>
> This is a good question, the GHCB spec does not enforce that a guest *must*
> use page state changes. If the page state change is not done by the guest then
> it will cause #NPF and it's up to the hypervisor to decide on what it wants to
> do.

Drat.  Is there any hope of pushing through a GHCB change to require the guest
to use PSC?

> > It would simplify KVM (albeit not much of a simplification) and would also
> > make debugging easier since transitions would require an explicit guest
> > request and guest bugs would result in errors instead of random
> > corruption/weirdness.
>
> I am good with enforcing this from the KVM. But the question is, what fault
> we should inject in the guest when KVM detects that the guest has not issued
> the page state change.

Injecting a fault, at least from KVM, isn't an option since there's no architectural
behavior we can leverage.  E.g. a guest that isn't enlightened enough to properly
use PSC isn't going to do anything useful with a #MC or #VC.

Sadly, as is, I think our only options are to either automatically convert RMP
entries as needed, or to punt the exit to userspace.  Maybe we could do both, e.g.
have a module param to control the behavior?  The problem with punting to userspace
is that KVM would also need a way for userspace to fix the issue, otherwise we're
just taking longer to kill the guest :-/
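For illustration only, the "module param to control the behavior" idea could
look roughly like the sketch below. The parameter name, its default, the
handle_rmp_violation() wrapper, and the choice of KVM_EXIT_INTERNAL_ERROR for
the userspace punt are all assumptions made for this example, not anything
proposed in the thread:

    /*
     * Hypothetical knob: auto-convert RMP entries on an RMP #NPF (default),
     * or punt the fault to userspace instead.
     */
    static bool snp_auto_convert_rmp = true;
    module_param(snp_auto_convert_rmp, bool, 0444);

    static int handle_rmp_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                    kvm_pfn_t pfn, int level, u64 error_code)
    {
            if (snp_auto_convert_rmp)
                    return snp_handle_rmp_page_fault(vcpu, gpa, pfn, level,
                                                     error_code);

            /*
             * Punt to userspace.  As noted above, without a mechanism for
             * userspace to repair the RMP this only delays killing the guest.
             */
            vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
            vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
            vcpu->run->internal.ndata = 2;
            vcpu->run->internal.data[0] = gpa;
            vcpu->run->internal.data[1] = error_code;
            return 0;
    }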
On 7/20/21 5:31 PM, Sean Christopherson wrote:
...
>> This is a good question, the GHCB spec does not enforce that a guest *must*
>> use page state changes. If the page state change is not done by the guest then
>> it will cause #NPF and it's up to the hypervisor to decide on what it wants to
>> do.
> Drat.  Is there any hope of pushing through a GHCB change to require the guest
> to use PSC?

Well, I am not sure if we can push it through the GHCB. Other hypervisors also
need to agree to it. We would need to define some architectural way for the
hypervisor to detect the violation and notify the guest about it.

>>> It would simplify KVM (albeit not much of a simplification) and would also
>>> make debugging easier since transitions would require an explicit guest
>>> request and guest bugs would result in errors instead of random
>>> corruption/weirdness.
>> I am good with enforcing this from the KVM. But the question is, what fault
>> we should inject in the guest when KVM detects that the guest has not issued
>> the page state change.
> Injecting a fault, at least from KVM, isn't an option since there's no architectural
> behavior we can leverage.  E.g. a guest that isn't enlightened enough to properly
> use PSC isn't going to do anything useful with a #MC or #VC.
>
> Sadly, as is, I think our only options are to either automatically convert RMP
> entries as needed, or to punt the exit to userspace.  Maybe we could do both, e.g.
> have a module param to control the behavior?  The problem with punting to userspace
> is that KVM would also need a way for userspace to fix the issue, otherwise we're
> just taking longer to kill the guest :-/

I think we should automatically convert the RMP entries at fault time; it's
possible that a non-Linux guest may access the page without going through
the PSC.

thanks
On Tue, Jul 20, 2021, Brijesh Singh wrote:
>
> On 7/20/21 5:31 PM, Sean Christopherson wrote:
> ...
> >> This is a good question, the GHCB spec does not enforce that a guest *must*
> >> use page state changes. If the page state change is not done by the guest then
> >> it will cause #NPF and it's up to the hypervisor to decide on what it wants to
> >> do.
> > Drat.  Is there any hope of pushing through a GHCB change to require the guest
> > to use PSC?
>
> Well, I am not sure if we can push it through the GHCB. Other hypervisors also
> need to agree to it. We would need to define some architectural way for the
> hypervisor to detect the violation and notify the guest about it.

And other guests, too :-/

> >>> It would simplify KVM (albeit not much of a simplification) and would also
> >>> make debugging easier since transitions would require an explicit guest
> >>> request and guest bugs would result in errors instead of random
> >>> corruption/weirdness.
> >> I am good with enforcing this from the KVM. But the question is, what fault
> >> we should inject in the guest when KVM detects that the guest has not issued
> >> the page state change.
> > Injecting a fault, at least from KVM, isn't an option since there's no architectural
> > behavior we can leverage.  E.g. a guest that isn't enlightened enough to properly
> > use PSC isn't going to do anything useful with a #MC or #VC.
> >
> > Sadly, as is, I think our only options are to either automatically convert RMP
> > entries as needed, or to punt the exit to userspace.  Maybe we could do both, e.g.
> > have a module param to control the behavior?  The problem with punting to userspace
> > is that KVM would also need a way for userspace to fix the issue, otherwise we're
> > just taking longer to kill the guest :-/
>
> I think we should automatically convert the RMP entries at fault time; it's
> possible that a non-Linux guest may access the page without going through
> the PSC.

Agreed.  I don't love that KVM will disallow automatic conversions when the host
is accessing guest memory, but not when the guest is accessing memory.  On the
other hand, auto-converting when accessing from the host is far, far worse.  And
FWIW, IIRC this also aligns with the expected/proposed TDX behavior, so that's
a plus.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 46323af09995..117e2e08d7ed 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
 
 	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+
+	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+				     int level, u64 error_code);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e60f54455cdc..b6a676ba1862 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
+static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	kvm_pfn_t pfn;
+	int level;
+
+	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
+		return RET_PF_RETRY;
+
+	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
+	return RET_PF_RETRY;
+}
+
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len)
 {
@@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 			goto emulate;
 	}
 
+	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
+		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);
+		if (r == RET_PF_RETRY)
+			return 1;
+		else
+			return r;
+	}
+
 	if (r == RET_PF_INVALID) {
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
 					  lower_32_bits(error_code), false);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 839cf321c6dd..53a60edc810e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3519,3 +3519,60 @@ void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn
 		BUG_ON(rc != 0);
 	}
 }
+
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code)
+{
+	struct rmpentry *e;
+	int rlevel, rc = 0;
+	bool private;
+	gfn_t gfn;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rlevel);
+	if (!e)
+		return 1;
+
+	private = !!(error_code & PFERR_GUEST_ENC_MASK);
+
+	/*
+	 * See APM section 15.36.11 on how to handle the RMP fault for the large pages.
+	 *
+	 *  npt    rmp    access    action
+	 *  --------------------------------------------------
+	 *  4k     2M     C=1       psmash
+	 *  x      x      C=1       if page is not private then add a new RMP entry
+	 *  x      x      C=0       if page is private then make it shared
+	 *  2M     4k     C=x       zap
+	 */
+	if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
+	    ((level == PG_LEVEL_4K) && (rlevel == PG_LEVEL_2M) && private)) {
+		rc = snp_rmptable_psmash(vcpu, pfn);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a private access, and the page is not assigned in the RMP table, create a
+	 * new private RMP entry.
+	 */
+	if (!rmpentry_assigned(e) && private) {
+		rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a shared access, then make the page shared in the RMP table.
+	 */
+	if (rmpentry_assigned(e) && !private)
+		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
+
+zap_gfn:
+	/*
+	 * Now that we have updated the RMP pagesize, zap the existing rmaps for
+	 * large entry ranges so that nested page table gets rebuilt with the updated RMP
+	 * pagesize.
+	 */
+	gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+	kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + 512);
+
+	return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4ff6fc86dd18..32e35d396508 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4579,6 +4579,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
 	.write_page_begin = sev_snp_write_page_begin,
+
+	.handle_rmp_page_fault = snp_handle_rmp_page_fault,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e0276ad8a1ae..ccdaaa4e1fb1 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -577,6 +577,8 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code);
 
 /* vmenter.S */
Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
resolve the RMP violation encountered during the NPT table walk.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/mmu/mmu.c          | 20 ++++++++++++
 arch/x86/kvm/svm/sev.c          | 57 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/svm/svm.h          |  2 ++
 5 files changed, 84 insertions(+)