Message ID | 20240215235405.368539-9-amoorthy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults | expand |
On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote: [...] > + if (is_error_noslot_pfn(pfn)) { > + kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE, > + write_fault, exec_fault, false); Hmm... Reinterpreting the fault context into something that wants to be arch-neutral might make this a bit difficult for userspace to understand. The CPU can take an instruction abort on an S1PTW due to missing write permissions, i.e. hardware cannot write to the stage-1 descriptor for an AF or DBM update. In this case HPFAR points to the IPA of the stage-1 descriptor that took the fault, not the target page. It would seem this gets expressed to userspace as an intent to write and execute on the stage-1 page tables, no?
On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote: > On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote: > > [...] > > > + if (is_error_noslot_pfn(pfn)) { > > + kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE, > > + write_fault, exec_fault, false); > > Hmm... Reinterpreting the fault context into something that wants to be > arch-neutral might make this a bit difficult for userspace to > understand. > > The CPU can take an instruction abort on an S1PTW due to missing write > permissions, i.e. hardware cannot write to the stage-1 descriptor for an > AF or DBM update. In this case HPFAR points to the IPA of the stage-1 > descriptor that took the fault, not the target page. > > It would seem this gets expressed to userspace as an intent to write and > execute on the stage-1 page tables, no? Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should* shake out as a write fault on the stage-1 descriptor. With that said, an architecture-neutral UAPI may not be able to capture the nuance of a fault. This UAPI will become much more load-bearing in the future, and the loss of granularity could become an issue. Marc had some ideas about forwarding the register state to userspace directly, which should be the right level of information for _any_ fault taken to userspace.
On Mon, Mar 04, 2024, Oliver Upton wrote: > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote: > > On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote: > > > > [...] > > > > > + if (is_error_noslot_pfn(pfn)) { > > > + kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE, > > > + write_fault, exec_fault, false); > > > > Hmm... Reinterpreting the fault context into something that wants to be > > arch-neutral might make this a bit difficult for userspace to > > understand. > > > > The CPU can take an instruction abort on an S1PTW due to missing write > > permissions, i.e. hardware cannot write to the stage-1 descriptor for an > > AF or DBM update. In this case HPFAR points to the IPA of the stage-1 > > descriptor that took the fault, not the target page. > > > > It would seem this gets expressed to userspace as an intent to write and > > execute on the stage-1 page tables, no? > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should* > shake out as a write fault on the stage-1 descriptor. > > With that said, an architecture-neutral UAPI may not be able to capture > the nuance of a fault. This UAPI will become much more load-bearing in > the future, and the loss of granularity could become an issue. What is the possible fallout from loss of granularity/nuance? E.g. if the worst case scenario is that KVM may exit to userspace multiple times in order to resolve the problem, IMO that's an acceptable cost for having "dumb", common uAPI. The intent/contract of the exit to userspace isn't for userspace to be able to completely understand what fault occurred, but rather for KVM to communicate what action userspace needs to take in order for KVM to make forward progress. > Marc had some ideas about forwarding the register state to userspace > directly, which should be the right level of information for _any_ fault > taken to userspace. I don't know enough about ARM to weigh in on that side of things, but for x86 this definitely doesn't hold true. E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting fault captures the "reserved bits set" information in register state. But that's purely an (optional) imlementation detail of KVM that should never be exposed to userspace. Ditto for things like access tracking on hardware without A/D bits, and shadow paging, which again can generate fault state that is inscrutable/misleading without context that only KVM knows (and shouldn't expose to userspace).
On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote: > On Mon, Mar 04, 2024, Oliver Upton wrote: > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote: [...] > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should* > > shake out as a write fault on the stage-1 descriptor. > > > > With that said, an architecture-neutral UAPI may not be able to capture > > the nuance of a fault. This UAPI will become much more load-bearing in > > the future, and the loss of granularity could become an issue. > > What is the possible fallout from loss of granularity/nuance? E.g. if the worst > case scenario is that KVM may exit to userspace multiple times in order to resolve > the problem, IMO that's an acceptable cost for having "dumb", common uAPI. > > The intent/contract of the exit to userspace isn't for userspace to be able to > completely understand what fault occurred, but rather for KVM to communicate what > action userspace needs to take in order for KVM to make forward progress. For one, the stage-2 page tables can describe permissions beyond RWX. MTE tag allocation can be controlled at stage-2, which (confusingly) desribes if the guest can insert tags in an opaque, physical space not described by HPFAR. There is a corresponding bit in ESR_EL2 that describes this at the time of a fault, and R/W/X flags aren't enough to convey the right corrective action. > > Marc had some ideas about forwarding the register state to userspace > > directly, which should be the right level of information for _any_ fault > > taken to userspace. > > I don't know enough about ARM to weigh in on that side of things, but for x86 > this definitely doesn't hold true. We tend to directly model the CPU architecture wherever possible, as it is the only way to create something intelligible. That same rationale applies to a huge portion of KVM UAPI; it is architecture-dependent by design. > E.g. on the x86 side, KVM intentionally sets > reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting > fault captures the "reserved bits set" information in register state. But that's > purely an (optional) imlementation detail of KVM that should never be exposed to > userspace. MMIO accesses would show up elsewhere though, right? If these magic SPTEs were causing -EFAULT exits then something must've gone sideways. Either way, I have no issues whatsoever if the direction for x86 is to provide abstracted fault information.
On Mon, Mar 04, 2024, Oliver Upton wrote: > On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote: > > On Mon, Mar 04, 2024, Oliver Upton wrote: > > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote: > > [...] > > > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with > > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should* > > > shake out as a write fault on the stage-1 descriptor. > > > > > > With that said, an architecture-neutral UAPI may not be able to capture > > > the nuance of a fault. This UAPI will become much more load-bearing in > > > the future, and the loss of granularity could become an issue. > > > > What is the possible fallout from loss of granularity/nuance? E.g. if the worst > > case scenario is that KVM may exit to userspace multiple times in order to resolve > > the problem, IMO that's an acceptable cost for having "dumb", common uAPI. > > > > The intent/contract of the exit to userspace isn't for userspace to be able to > > completely understand what fault occurred, but rather for KVM to communicate what > > action userspace needs to take in order for KVM to make forward progress. > > For one, the stage-2 page tables can describe permissions beyond RWX. > MTE tag allocation can be controlled at stage-2, which (confusingly) > desribes if the guest can insert tags in an opaque, physical space not > described by HPFAR. > > There is a corresponding bit in ESR_EL2 that describes this at the time > of a fault, and R/W/X flags aren't enough to convey the right corrective > action. > > > > Marc had some ideas about forwarding the register state to userspace > > > directly, which should be the right level of information for _any_ fault > > > taken to userspace. > > > > I don't know enough about ARM to weigh in on that side of things, but for x86 > > this definitely doesn't hold true. > > We tend to directly model the CPU architecture wherever possible, as it > is the only way to create something intelligible. That same rationale > applies to a huge portion of KVM UAPI; it is architecture-dependent by > design. Heh, "by design" :-) I'm not saying "no arch-specific code in memory_fault", all I'm saying is that stuff that can be arch-neutral, should be arch-neutral. And AFAIK, basic RWX information is common across all architectures. E.g. if KVM needs to communicate MTE information on top of basic RWX info, why not add a flag to memory_fault.flags that communicates that MTE is enabled and relevant info can be found in an "extended" data field? The presense of MTE stuff shouldn't affect the fundamental access information, e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE irrespective of whether or not MTE is in play. The one thing we may want to squeak in before 6.8 is released is a placeholder in memory_fault, though I don't think that's strictly necessary since the union as a whole is padded to 256 bytes. I suppose userspace could allocate based on sizeof(kvm_run.memory_fault), but that's a bit of a stretch. > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for > > "caching" emulated MMIO accesses, and the resulting fault captures the > > "reserved bits set" information in register state. But that's purely an > > (optional) imlementation detail of KVM that should never be exposed to > > userspace. > > MMIO accesses would show up elsewhere though, right? Yes, but I don't see how that's relevant. Maybe I'm just misunderstanding what you're saying/asking. > If these magic SPTEs were causing -EFAULT exits then something must've gone > sideways. More or less. This scenario can happen if the guest re-accesses a GFN that doesn't have a memslot, but in the interim userspace made the GFN private. It's likely a misbehaving userspace, but that really doesn't matter. KVM's contract is to report that KVM exited to userspace because the guest was trying to access GFN X as shared, but the GFN is configured as private by userspace. My point was that dumping fault/register information straight to userspace in this scenario, without massaging/filtering that information, is not a sane option on x86. > Either way, I have no issues whatsoever if the direction for x86 is to > provide abstracted fault information. I don't understand how ARM can get away with NOT providing a layer of abstraction. Copying fault state verbatim to userspace will bleed KVM implementation details into userspace, and risks breakage of KVM's ABI due to changes in hardware. Abstracting gory hardware details from userspace is one of the main roles of the kernel. A concrete example of hardware throwing a wrench in things is AMD's upcoming "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable CPUs for *any* VM that supports guest-controlled encrypted memory. If KVM reported the page fault error code directly to userspace, then running the same VM on different hardware generations, e.g. after live migration, would generate different error codes. Are we talking past each other? I'm genuinely confused by the pushback on capturing RWX information. Yes, the RWX info may be insufficient in some cases, but its existence doesn't preclude KVM from providing more information as needed.
On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote: [...] > The presense of MTE stuff shouldn't affect the fundamental access information, "When FEAT_MTE is implemented, for a synchronous Data Abort on an instruction that directly accesses Allocation Tags, ISV is 0." If there is no instruction syndrome, there's insufficient fault context to determine if the guest was doing a read or a write. > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE > irrespective of whether or not MTE is in play. When the MMU generates such an abort, it *is not* read, write, or execute. It is a NoTagAccess fault. There is no sane way to describe this in terms of RWX. > The one thing we may want to squeak in before 6.8 is released is a placeholder > in memory_fault, though I don't think that's strictly necessary since the union > as a whole is padded to 256 bytes. I suppose userspace could allocate based on > sizeof(kvm_run.memory_fault), but that's a bit of a stretch. Strictly speaking, that isn't ABI any more, but compile-time brittleness to header changes. IOW, old userspace could still run on new kernel b/c it compiled against the old structure size and only knows about the fields present at that time. > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for > > > "caching" emulated MMIO accesses, and the resulting fault captures the > > > "reserved bits set" information in register state. But that's purely an > > > (optional) imlementation detail of KVM that should never be exposed to > > > userspace. > > > > MMIO accesses would show up elsewhere though, right? > > Yes, but I don't see how that's relevant. Maybe I'm just misunderstanding what > you're saying/asking. If "reserved" EPT violations found their way to userspace via the "memory fault" exit structure then that'd likely be due to a KVM bug. The only expected flows in the near term are this and CoCo crap. > > Either way, I have no issues whatsoever if the direction for x86 is to > > provide abstracted fault information. > > I don't understand how ARM can get away with NOT providing a layer of abstraction. > Copying fault state verbatim to userspace will bleed KVM implementation details > into userspace, The memslot flag already bleeds KVM implementation detail into userspace to a degree. The event we're trying to let userspace handle is at the intersection of a specific hardware/software state. > Abstracting gory hardware details from userspace is one of the main roles of the > kernel. Where it can be accomplished without a loss (or misrepresentation) of information, agreed. But KVM UAPI is so architecture-specific that it seems arbitrary to draw the line here. > A concrete example of hardware throwing a wrench in things is AMD's upcoming > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable > CPUs for *any* VM that supports guest-controlled encrypted memory. If KVM reported > the page fault error code directly to userspace, then running the same VM on > different hardware generations, e.g. after live migration, would generate different > error codes. > > Are we talking past each other? I'm genuinely confused by the pushback on > capturing RWX information. Yes, the RWX info may be insufficient in some cases, > but its existence doesn't preclude KVM from providing more information as needed. My pushback isn't exactly on RWX (even though I noted the MTE quirk above). What I'm poking at here is the general infrastructure for reflecting faults into userspace, which is aggressively becoming more relevant.
On Tue, Mar 05, 2024, Oliver Upton wrote: > On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote: > > [...] > > > The presense of MTE stuff shouldn't affect the fundamental access information, > > "When FEAT_MTE is implemented, for a synchronous Data Abort on an > instruction that directly accesses Allocation Tags, ISV is 0." > > If there is no instruction syndrome, there's insufficient fault context > to determine if the guest was doing a read or a write. > > > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE > > irrespective of whether or not MTE is in play. > > When the MMU generates such an abort, it *is not* read, write, or execute. > It is a NoTagAccess fault. There is no sane way to describe this in > terms of RWX. RWX=0, with KVM_MEMORY_EXIT_FLAG_MTE seems like a reasonable way to communicate that, no? > > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for > > > > "caching" emulated MMIO accesses, and the resulting fault captures the > > > > "reserved bits set" information in register state. But that's purely an > > > > (optional) imlementation detail of KVM that should never be exposed to > > > > userspace. > > > > > > MMIO accesses would show up elsewhere though, right? > > > > Yes, but I don't see how that's relevant. Maybe I'm just misunderstanding what > > you're saying/asking. > > If "reserved" EPT violations found their way to userspace via the > "memory fault" exit structure then that'd likely be due to a KVM bug. Heh, sadly no. Userspace can convert a GFN to private at any time, and the TDX and SNP specs allow for implicit converstion "requests" from the guest, i.e. KVM isn't allowed to kill the guest if the guest accesses a GFN with the "wrong" private/shared attribute. This particular case would likely be hit only if there's a userspace and/or guest bug, but whether the setup is broken or just unique isn't KVM's call to make. > The only expected flows in the near term are this and CoCo crap. > > > > Either way, I have no issues whatsoever if the direction for x86 is to > > > provide abstracted fault information. > > > > I don't understand how ARM can get away with NOT providing a layer of abstraction. > > Copying fault state verbatim to userspace will bleed KVM implementation details > > into userspace, > > The memslot flag already bleeds KVM implementation detail into userspace > to a degree. The event we're trying to let userspace handle is at the > intersection of a specific hardware/software state. Yes, but IMO there's a huge difference between userspace knowing that KVM uses gup() and memslots to translate gfn=>hva=>pfn, or even knowing that KVM plays games with reserved stage-2 PTE bits, and userspace knowing exactly how KVM configures stage-2 PTEs. Another example would be dirty logging on Intel CPUs. The *admin* can decide whether to use a write-protection scheme or page-modification logging, but KVM's ABI with userspace provides a layer of abstraction (dirty ring or bitmap) so that the userspace VMM doesn't need to do X for write-protection and Y for PML. > > Abstracting gory hardware details from userspace is one of the main roles of the > > kernel. > > Where it can be accomplished without a loss (or misrepresentation) of > information, agreed. But KVM UAPI is so architecture-specific that it > seems arbitrary to draw the line here. I don't think it's arbitrary. KVM's existing uAPI for mapping memory into the guest is almost entirely arch-neutral, and I want to preserve that for related functionality unless it's literally impossible to do so. > > A concrete example of hardware throwing a wrench in things is AMD's upcoming > > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable > > CPUs for *any* VM that supports guest-controlled encrypted memory. If KVM reported > > the page fault error code directly to userspace, then running the same VM on > > different hardware generations, e.g. after live migration, would generate different > > error codes. > > > > Are we talking past each other? I'm genuinely confused by the pushback on > > capturing RWX information. Yes, the RWX info may be insufficient in some cases, > > but its existence doesn't preclude KVM from providing more information as needed. > > My pushback isn't exactly on RWX (even though I noted the MTE quirk > above). What I'm poking at here is the general infrastructure for > reflecting faults into userspace, which is aggressively becoming more > relevant. But the purpose of memory_fault isn't to reflect faults into userspace, it's to alert userspace that KVM has encountered a memory fault that requires userspace action to resolve. That distinction matters because there are and will be MMU features that KVM supports, and that can generate novel faults, but such faults won't be punted to userspace unless KVM provides a way for userspace to explicitly control the MMU feature. And if KVM lets userspace control a feature, then KVM needs new uAPI to expose the controls. Which means that we should always have an opportunity to expand memory_fault, e.g. with new flags, to support such features.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index d52757f9e1cb..7012f40332b3 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8031,7 +8031,7 @@ unavailable to host or other VMs. 7.34 KVM_CAP_MEMORY_FAULT_INFO ------------------------------ -:Architectures: x86 +:Architectures: x86, arm64 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. The presence of this capability indicates that KVM_RUN will fill diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index a25265aca432..ca4617f53250 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -240,6 +240,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ARM_SYSTEM_SUSPEND: case KVM_CAP_IRQFD_RESAMPLE: case KVM_CAP_COUNTER_OFFSET: + case KVM_CAP_MEMORY_FAULT_INFO: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index dfe0cbb5937c..5b740ddfcc8e 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1492,8 +1492,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, kvm_send_hwpoison_signal(hva, vma_shift); return 0; } - if (is_error_noslot_pfn(pfn)) + if (is_error_noslot_pfn(pfn)) { + kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE, + write_fault, exec_fault, false); return -EFAULT; + } if (kvm_is_device_pfn(pfn)) { /*
At the moment the only intended use case for KVM_CAP_MEMORY_FAULT_INFO on arm64 is to annotate EFAULTs from the stage-2 fault handler, so add that annotation now. Signed-off-by: Anish Moorthy <amoorthy@google.com> --- Documentation/virt/kvm/api.rst | 2 +- arch/arm64/kvm/arm.c | 1 + arch/arm64/kvm/mmu.c | 5 ++++- 3 files changed, 6 insertions(+), 2 deletions(-)