| Message ID | 20240510015822.503071-1-michael.roth@amd.com |
|---|---|
| State | Not Applicable |
| Delegated to | Herbert Xu |
| Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support |
On Thu, May 09, 2024, Michael Roth wrote:
> ---
>  arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 62ad38b2a8c9..cecd8360378f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>  	return RET_PF_CONTINUE;
>  }
>
> -static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	/*
>  	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
> @@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
>  	if (fault->rsvd)
>  		return false;
>
> +	/*
> +	 * For hardware-protected VMs, certain conditions like attempting to
> +	 * perform a write to a page which is not in the state that the guest
> +	 * expects it to be in can result in a nested/extended #PF. In this
> +	 * case, the below code might misconstrue this situation as being the
> +	 * result of a write-protected access, and treat it as a spurious case
> +	 * rather than taking any action to satisfy the real source of the #PF
> +	 * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
> +	 * guest spinning on a #PF indefinitely.
> +	 *
> +	 * For now, just skip the fast path for hardware-protected VMs since
> +	 * they don't currently utilize any of this machinery anyway. In the
> +	 * future, these considerations will need to be taken into account if
> +	 * there's any need/desire to re-enable the fast path for
> +	 * hardware-protected VMs.
> +	 *
> +	 * Since software-protected VMs don't have a notion of a shared vs.
> +	 * private that's separate from what KVM is tracking, the above
> +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> +	 * special handling for that case for now.

Very technically, it can occur if userspace _just_ modified the attributes.  And
as I've said multiple times, at least for now, I want to avoid special casing
SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

> +	 */
> +	if (kvm_slot_can_be_private(fault->slot) &&
> +	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> +	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))

Heh, !(x && y) kills me, I misread this like 4 times.

Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
cause in any reasonable way.  Can't this simply be?

	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
		return false;

Which is much, much more self-explanatory.
On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > +	 * Since software-protected VMs don't have a notion of a shared vs.
> > +	 * private that's separate from what KVM is tracking, the above
> > +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > +	 * special handling for that case for now.
>
> Very technically, it can occur if userspace _just_ modified the attributes.  And
> as I've said multiple times, at least for now, I want to avoid special casing
> SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

Yep, it is not like they have to be optimized.

> > +	 */
> > +	if (kvm_slot_can_be_private(fault->slot) &&
> > +	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > +	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
>
> Heh, !(x && y) kills me, I misread this like 4 times.
>
> Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> cause in any reasonable way.  Can't this simply be?
>
> 	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> 		return false;

You beat me to it by seconds. And it can also be guarded by a check on
kvm->arch.has_private_mem to avoid the attributes lookup.

> Which is much, much more self-explanatory.

Both more self-explanatory and more correct.

Paolo
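[Editor's note: for reference, a minimal sketch of the check the thread converges on, with Paolo's suggested kvm->arch.has_private_mem guard folded in. This is illustrative only; the exact form that landed is in the follow-up patch Mike links in the next message.]

	/*
	 * Sketch of the reworked page_fault_can_be_fast() check: bail out of
	 * the fast path whenever the fault's shared/private state disagrees
	 * with KVM's memory attributes, and use has_private_mem to skip the
	 * attributes lookup entirely for VMs without private memory.
	 */
	if (vcpu->kvm->arch.has_private_mem &&
	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
		return false;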
On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +	 * Since software-protected VMs don't have a notion of a shared vs.
> > > +	 * private that's separate from what KVM is tracking, the above
> > > +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > +	 * special handling for that case for now.
> >
> > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > as I've said multiple times, at least for now, I want to avoid special casing
> > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
>
> Yep, it is not like they have to be optimized.

Ok, I thought there were maybe some future plans to use sw-protected VMs
to get some added protections from userspace. But even then there'd
probably still be extra considerations for how to handle access tracking,
so white-listing them probably isn't right anyway.

I was also partly tempted to take this route because it would cover this
TDX patch as well:

  https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

and avoid any weirdness about checking kvm_mem_is_private() without
checking mmu_invalidate_seq, but I think those cases all end up
resolving themselves eventually, and I've added some comments around that.

> > > +	 */
> > > +	if (kvm_slot_can_be_private(fault->slot) &&
> > > +	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > > +	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
> >
> > Heh, !(x && y) kills me, I misread this like 4 times.
> >
> > Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> > cause in any reasonable way.  Can't this simply be?
> >
> > 	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > 		return false;
>
> You beat me to it by seconds. And it can also be guarded by a check on
> kvm->arch.has_private_mem to avoid the attributes lookup.

I re-tested with things implemented this way and everything seems to
look good. It's not clear to me whether this would cover the cases the
above-mentioned TDX patch handles, but no biggie if that's still needed.

The new version of the patch is here:

  https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

And I've updated my branches to replace the old patch and also
incorporate Sean's suggestions for patch 22:

  https://github.com/mdroth/linux/commits/snp-host-v15c3-unsquashed

and have them here with things already squashed in/relocated:

  https://github.com/mdroth/linux/commits/snp-host-v15c3

Thanks for the feedback Sean, Paolo.

-Mike

> > Which is much, much more self-explanatory.
>
> Both more self-explanatory and more correct.
>
> Paolo
On Fri, May 10, 2024, Michael Roth wrote:
> On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > +	 * Since software-protected VMs don't have a notion of a shared vs.
> > > > +	 * private that's separate from what KVM is tracking, the above
> > > > +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > +	 * special handling for that case for now.
> > >
> > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > as I've said multiple times, at least for now, I want to avoid special casing
> > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> >
> > Yep, it is not like they have to be optimized.
>
> Ok, I thought there were maybe some future plans to use sw-protected VMs
> to get some added protections from userspace. But even then there'd
> probably still be extra considerations for how to handle access tracking,
> so white-listing them probably isn't right anyway.
>
> I was also partly tempted to take this route because it would cover this
> TDX patch as well:
>
>   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
i.e. it should be impossible to have a write-fault to a present S-EPT entry.

And if TDX is running afoul of this code:

	if (!fault->present)
		return !kvm_ad_enabled();

then KVM should do the sane thing and require A/D support be enabled for TDX.

And if it's something else entirely, that changelog has some explaining to do.

> and avoid any weirdness about checking kvm_mem_is_private() without
> checking mmu_invalidate_seq, but I think those cases all end up
> resolving themselves eventually, and I've added some comments around that.

Yep, checking state that is protected by mmu_invalidate_seq outside of mmu_lock
is definitely allowed, e.g. the entire fast page fault path operates outside of
mmu_lock and thus outside of mmu_invalidate_seq's purview.

It's a-ok because the SPTE updates are done with an atomic CMPXCHG, and so KVM
only needs to ensure its page tables aren't outright _freed_.  If the zap
triggered by the attributes change "wins", then the fast #PF path will fail the
CMPXCHG and be an expensive NOP.  If the fast #PF wins, the zap will pave over
the fast #PF fix, and the IPI+flush that is needed for all zaps, to ensure vCPUs
don't have stale references, does the rest.

And if there's an attributes race that causes the fast #PF to bail early, the
vCPU will see the correct state on the next page fault.
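[Editor's note: Sean's lockless-update argument is easiest to see in code. Below is a condensed sketch modeled on KVM's fast_pf_fix_direct_spte(); the helper name and simplified signature here are illustrative, not the exact upstream function. try_cmpxchg64() is the real kernel primitive from <linux/atomic.h>.]

	/*
	 * Condensed sketch of the commit step in the fast #PF path: the fix
	 * is only installed via an atomic compare-and-exchange.  If a zap
	 * (e.g. one triggered by an attributes change) already replaced
	 * old_spte, the cmpxchg fails and the fast path degrades to an
	 * expensive NOP instead of clobbering the new state.
	 */
	static bool fast_pf_commit_spte(u64 *sptep, u64 old_spte, u64 new_spte)
	{
		/* Returns true only if *sptep still held old_spte. */
		return try_cmpxchg64(sptep, &old_spte, new_spte);
	}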
On Fri, May 10, 2024 at 08:59:09AM -0700, Sean Christopherson <seanjc@google.com> wrote:
> On Fri, May 10, 2024, Michael Roth wrote:
> > On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > +	 * Since software-protected VMs don't have a notion of a shared vs.
> > > > > +	 * private that's separate from what KVM is tracking, the above
> > > > > +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > > +	 * special handling for that case for now.
> > > >
> > > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > > as I've said multiple times, at least for now, I want to avoid special casing
> > > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> > >
> > > Yep, it is not like they have to be optimized.
> >
> > Ok, I thought there were maybe some future plans to use sw-protected VMs
> > to get some added protections from userspace. But even then there'd
> > probably still be extra considerations for how to handle access tracking,
> > so white-listing them probably isn't right anyway.
> >
> > I was also partly tempted to take this route because it would cover this
> > TDX patch as well:
> >
> >   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/
>
> Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
> fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
> i.e. it should be impossible to have a write-fault to a present S-EPT entry.
>
> And if TDX is running afoul of this code:
>
> 	if (!fault->present)
> 		return !kvm_ad_enabled();
>
> then KVM should do the sane thing and require A/D support be enabled for TDX.
>
> And if it's something else entirely, that changelog has some explaining to do.

Yes, it's for the KVM_EXIT_MEMORY_FAULT case. Because a Secure-EPT entry is
either non-present or allows all of RWX, the fast page fault path always
returns RET_PF_INVALID via the is_shadow_present_pte() check.

I lightly tested the patch at [1] and it works for TDX KVM.

[1] https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

Just in case for that patch,

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
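[Editor's note: for context, the RET_PF_INVALID bail-out Isaku describes comes from the SPTE walk in fast_page_fault(); the following is an abridged sketch of that flow, not the verbatim arch/x86/kvm/mmu/mmu.c code.]

	int ret = RET_PF_INVALID;
	u64 spte;
	u64 *sptep;

	/*
	 * Walk to the last SPTE for the faulting address.  A Secure-EPT
	 * entry is either non-present or grants full RWX, so for TDX a
	 * fault can only hit a non-present entry: the shadow-present
	 * check fails and the fault falls through to the slow path with
	 * ret still RET_PF_INVALID.
	 */
	sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
	if (!is_shadow_present_pte(spte))
		return ret;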
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 62ad38b2a8c9..cecd8360378f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	return RET_PF_CONTINUE;
 }
 
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	/*
 	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
@@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 	if (fault->rsvd)
 		return false;
 
+	/*
+	 * For hardware-protected VMs, certain conditions like attempting to
+	 * perform a write to a page which is not in the state that the guest
+	 * expects it to be in can result in a nested/extended #PF. In this
+	 * case, the below code might misconstrue this situation as being the
+	 * result of a write-protected access, and treat it as a spurious case
+	 * rather than taking any action to satisfy the real source of the #PF
+	 * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
+	 * guest spinning on a #PF indefinitely.
+	 *
+	 * For now, just skip the fast path for hardware-protected VMs since
+	 * they don't currently utilize any of this machinery anyway. In the
+	 * future, these considerations will need to be taken into account if
+	 * there's any need/desire to re-enable the fast path for
+	 * hardware-protected VMs.
+	 *
+	 * Since software-protected VMs don't have a notion of a shared vs.
+	 * private that's separate from what KVM is tracking, the above
+	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
+	 * special handling for that case for now.
+	 */
+	if (kvm_slot_can_be_private(fault->slot) &&
+	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
+		return false;
+
 	/*
 	 * #PF can be fast if:
 	 *
@@ -3407,7 +3433,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 *sptep;
 	uint retry_count = 0;
 
-	if (!page_fault_can_be_fast(fault))
+	if (!page_fault_can_be_fast(vcpu, fault))
 		return ret;
 
 	walk_shadow_page_lockless_begin(vcpu);
For hardware-protected VMs like SEV-SNP guests, certain conditions like
attempting to perform a write to a page which is not in the state that
the guest expects it to be in can result in a nested/extended #PF which
can only be satisfied by the host performing an implicit page state
change to transition the page into the expected shared/private state.
This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event
that gets forwarded to userspace to handle via
KVM_SET_MEMORY_ATTRIBUTES.

However, the fast_page_fault() code might misconstrue this situation as
being the result of a write-protected access, and treat it as a spurious
case when it sees that writes are already allowed for the sPTE. This
results in the KVM MMU trying to resume the guest rather than taking any
action to satisfy the real source of the #PF such as generating a
KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs.

For now, just skip the fast path for hardware-protected VMs since they
don't currently utilize any of this access-tracking machinery anyway. In
the future, these considerations will need to be taken into account if
there's any need/desire to re-enable the fast path for
hardware-protected VMs.

Since software-protected VMs don't have a notion of a shared vs. private
that's separate from what KVM is tracking, the above
KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the special
handling for that case for now.

Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)
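[Editor's note: to make the changelog's KVM_EXIT_MEMORY_FAULT flow concrete, here is a hypothetical sketch of the userspace (VMM) side of the implicit page-state change, using the memory-attributes uAPI from <linux/kvm.h>. It assumes <sys/ioctl.h> and <err.h>, a VM fd in vm_fd, and the mmap'd kvm_run in run; error handling and the vCPU run loop are elided.]

	/* Hypothetical VMM handling of KVM_EXIT_MEMORY_FAULT: flip the
	 * faulting range's attributes to match what the guest accessed,
	 * then re-enter the guest so the fault can be satisfied. */
	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		struct kvm_memory_attributes attrs = {
			.address    = run->memory_fault.gpa,
			.size       = run->memory_fault.size,
			.attributes = (run->memory_fault.flags &
				       KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
				      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
		};

		if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
			err(1, "KVM_SET_MEMORY_ATTRIBUTES");
	}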