diff mbox series

[v7,08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler

Message ID 20240215235405.368539-9-amoorthy@google.com (mailing list archive)
State New, archived
Headers show
Series Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults | expand

Commit Message

Anish Moorthy Feb. 15, 2024, 11:53 p.m. UTC
At the moment the only intended use case for KVM_CAP_MEMORY_FAULT_INFO
on arm64 is to annotate EFAULTs from the stage-2 fault handler, so
add that annotation now.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/arm64/kvm/arm.c           | 1 +
 arch/arm64/kvm/mmu.c           | 5 ++++-
 3 files changed, 6 insertions(+), 2 deletions(-)

Comments

Oliver Upton March 4, 2024, 8 p.m. UTC | #1
On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:

[...]

> +	if (is_error_noslot_pfn(pfn)) {
> +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> +					      write_fault, exec_fault, false);

Hmm... Reinterpreting the fault context into something that wants to be
arch-neutral might make this a bit difficult for userspace to
understand.

The CPU can take an instruction abort on an S1PTW due to missing write
permissions, i.e. hardware cannot write to the stage-1 descriptor for an
AF or DBM update. In this case HPFAR points to the IPA of the stage-1
descriptor that took the fault, not the target page.

It would seem this gets expressed to userspace as an intent to write and
execute on the stage-1 page tables, no?
Oliver Upton March 4, 2024, 8:10 p.m. UTC | #2
On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:
> 
> [...]
> 
> > +	if (is_error_noslot_pfn(pfn)) {
> > +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > +					      write_fault, exec_fault, false);
> 
> Hmm... Reinterpreting the fault context into something that wants to be
> arch-neutral might make this a bit difficult for userspace to
> understand.
> 
> The CPU can take an instruction abort on an S1PTW due to missing write
> permissions, i.e. hardware cannot write to the stage-1 descriptor for an
> AF or DBM update. In this case HPFAR points to the IPA of the stage-1
> descriptor that took the fault, not the target page.
> 
> It would seem this gets expressed to userspace as an intent to write and
> execute on the stage-1 page tables, no?

Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
shake out as a write fault on the stage-1 descriptor.

With that said, an architecture-neutral UAPI may not be able to capture
the nuance of a fault. This UAPI will become much more load-bearing in
the future, and the loss of granularity could become an issue.

Marc had some ideas about forwarding the register state to userspace
directly, which should be the right level of information for _any_ fault
taken to userspace.
Sean Christopherson March 4, 2024, 8:32 p.m. UTC | #3
On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> > On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:
> > 
> > [...]
> > 
> > > +	if (is_error_noslot_pfn(pfn)) {
> > > +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > > +					      write_fault, exec_fault, false);
> > 
> > Hmm... Reinterpreting the fault context into something that wants to be
> > arch-neutral might make this a bit difficult for userspace to
> > understand.
> > 
> > The CPU can take an instruction abort on an S1PTW due to missing write
> > permissions, i.e. hardware cannot write to the stage-1 descriptor for an
> > AF or DBM update. In this case HPFAR points to the IPA of the stage-1
> > descriptor that took the fault, not the target page.
> > 
> > It would seem this gets expressed to userspace as an intent to write and
> > execute on the stage-1 page tables, no?
> 
> Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> shake out as a write fault on the stage-1 descriptor.
> 
> With that said, an architecture-neutral UAPI may not be able to capture
> the nuance of a fault. This UAPI will become much more load-bearing in
> the future, and the loss of granularity could become an issue.

What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
case scenario is that KVM may exit to userspace multiple times in order to resolve
the problem, IMO that's an acceptable cost for having "dumb", common uAPI.

The intent/contract of the exit to userspace isn't for userspace to be able to
completely understand what fault occurred, but rather for KVM to communicate what
action userspace needs to take in order for KVM to make forward progress.

> Marc had some ideas about forwarding the register state to userspace
> directly, which should be the right level of information for _any_ fault
> taken to userspace.

I don't know enough about ARM to weigh in on that side of things, but for x86
this definitely doesn't hold true.  E.g. on the x86 side, KVM intentionally sets
reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting
fault captures the "reserved bits set" information in register state.  But that's
purely an (optional) imlementation detail of KVM that should never be exposed to
userspace.

Ditto for things like access tracking on hardware without A/D bits, and shadow
paging, which again can generate fault state that is inscrutable/misleading
without context that only KVM knows (and shouldn't expose to userspace).
Oliver Upton March 4, 2024, 9:03 p.m. UTC | #4
On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Oliver Upton wrote:
> > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:

[...]

> > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > shake out as a write fault on the stage-1 descriptor.
> > 
> > With that said, an architecture-neutral UAPI may not be able to capture
> > the nuance of a fault. This UAPI will become much more load-bearing in
> > the future, and the loss of granularity could become an issue.
> 
> What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> case scenario is that KVM may exit to userspace multiple times in order to resolve
> the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> 
> The intent/contract of the exit to userspace isn't for userspace to be able to
> completely understand what fault occurred, but rather for KVM to communicate what
> action userspace needs to take in order for KVM to make forward progress.

For one, the stage-2 page tables can describe permissions beyond RWX.
MTE tag allocation can be controlled at stage-2, which (confusingly)
desribes if the guest can insert tags in an opaque, physical space not
described by HPFAR.

There is a corresponding bit in ESR_EL2 that describes this at the time
of a fault, and R/W/X flags aren't enough to convey the right corrective
action.

> > Marc had some ideas about forwarding the register state to userspace
> > directly, which should be the right level of information for _any_ fault
> > taken to userspace.
> 
> I don't know enough about ARM to weigh in on that side of things, but for x86
> this definitely doesn't hold true.

We tend to directly model the CPU architecture wherever possible, as it
is the only way to create something intelligible. That same rationale
applies to a huge portion of KVM UAPI; it is architecture-dependent by
design.

> E.g. on the x86 side, KVM intentionally sets
> reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting
> fault captures the "reserved bits set" information in register state.  But that's
> purely an (optional) imlementation detail of KVM that should never be exposed to
> userspace.

MMIO accesses would show up elsewhere though, right? If these magic
SPTEs were causing -EFAULT exits then something must've gone sideways.

Either way, I have no issues whatsoever if the direction for x86 is to
provide abstracted fault information.
Sean Christopherson March 4, 2024, 10:49 p.m. UTC | #5
On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Oliver Upton wrote:
> > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> 
> [...]
> 
> > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > > shake out as a write fault on the stage-1 descriptor.
> > > 
> > > With that said, an architecture-neutral UAPI may not be able to capture
> > > the nuance of a fault. This UAPI will become much more load-bearing in
> > > the future, and the loss of granularity could become an issue.
> > 
> > What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> > case scenario is that KVM may exit to userspace multiple times in order to resolve
> > the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> > 
> > The intent/contract of the exit to userspace isn't for userspace to be able to
> > completely understand what fault occurred, but rather for KVM to communicate what
> > action userspace needs to take in order for KVM to make forward progress.
> 
> For one, the stage-2 page tables can describe permissions beyond RWX.
> MTE tag allocation can be controlled at stage-2, which (confusingly)
> desribes if the guest can insert tags in an opaque, physical space not
> described by HPFAR.
> 
> There is a corresponding bit in ESR_EL2 that describes this at the time
> of a fault, and R/W/X flags aren't enough to convey the right corrective
> action.
>
> > > Marc had some ideas about forwarding the register state to userspace
> > > directly, which should be the right level of information for _any_ fault
> > > taken to userspace.
> > 
> > I don't know enough about ARM to weigh in on that side of things, but for x86
> > this definitely doesn't hold true.
> 
> We tend to directly model the CPU architecture wherever possible, as it
> is the only way to create something intelligible. That same rationale
> applies to a huge portion of KVM UAPI; it is architecture-dependent by
> design.

Heh, "by design" :-)

I'm not saying "no arch-specific code in memory_fault", all I'm saying is that
stuff that can be arch-neutral, should be arch-neutral.  And AFAIK, basic RWX
information is common across all architectures.

E.g. if KVM needs to communicate MTE information on top of basic RWX info, why
not add a flag to memory_fault.flags that communicates that MTE is enabled and
relevant info can be found in an "extended" data field?

The presense of MTE stuff shouldn't affect the fundamental access information,
e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
irrespective of whether or not MTE is in play.

The one thing we may want to squeak in before 6.8 is released is a placeholder
in memory_fault, though I don't think that's strictly necessary since the union
as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
sizeof(kvm_run.memory_fault), but that's a bit of a stretch.

> > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > "caching" emulated MMIO accesses, and the resulting fault captures the
> > "reserved bits set" information in register state.  But that's purely an
> > (optional) imlementation detail of KVM that should never be exposed to
> > userspace.
> 
> MMIO accesses would show up elsewhere though, right?

Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
you're saying/asking.

> If these magic SPTEs were causing -EFAULT exits then something must've gone
> sideways.

More or less.  This scenario can happen if the guest re-accesses a GFN that
doesn't have a memslot, but in the interim userspace made the GFN private.  It's
likely a misbehaving userspace, but that really doesn't matter.  KVM's contract
is to report that KVM exited to userspace because the guest was trying to access
GFN X as shared, but the GFN is configured as private by userspace.

My point was that dumping fault/register information straight to userspace in this
scenario, without massaging/filtering that information, is not a sane option on
x86.
 
> Either way, I have no issues whatsoever if the direction for x86 is to
> provide abstracted fault information.

I don't understand how ARM can get away with NOT providing a layer of abstraction.
Copying fault state verbatim to userspace will bleed KVM implementation details
into userspace, and risks breakage of KVM's ABI due to changes in hardware.
Abstracting gory hardware details from userspace is one of the main roles of the
kernel.

A concrete example of hardware throwing a wrench in things is AMD's upcoming
"encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
the page fault error code directly to userspace, then running the same VM on
different hardware generations, e.g. after live migration, would generate different
error codes.
 
Are we talking past each other?  I'm genuinely confused by the pushback on
capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
but its existence doesn't preclude KVM from providing more information as needed.
Oliver Upton March 5, 2024, 1:01 a.m. UTC | #6
On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:

[...]

> The presense of MTE stuff shouldn't affect the fundamental access information,

  "When FEAT_MTE is implemented, for a synchronous Data Abort on an
  instruction that directly accesses Allocation Tags, ISV is 0."

If there is no instruction syndrome, there's insufficient fault context
to determine if the guest was doing a read or a write.

> e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> irrespective of whether or not MTE is in play.

When the MMU generates such an abort, it *is not* read, write, or execute.
It is a NoTagAccess fault. There is no sane way to describe this in
terms of RWX.

> The one thing we may want to squeak in before 6.8 is released is a placeholder
> in memory_fault, though I don't think that's strictly necessary since the union
> as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
> sizeof(kvm_run.memory_fault), but that's a bit of a stretch.

Strictly speaking, that isn't ABI any more, but compile-time brittleness
to header changes. IOW, old userspace could still run on new kernel b/c
it compiled against the old structure size and only knows about the
fields present at that time.

> > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > "reserved bits set" information in register state.  But that's purely an
> > > (optional) imlementation detail of KVM that should never be exposed to
> > > userspace.
> > 
> > MMIO accesses would show up elsewhere though, right?
> 
> Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
> you're saying/asking.

If "reserved" EPT violations found their way to userspace via the
"memory fault" exit structure then that'd likely be due to a KVM bug.
The only expected flows in the near term are this and CoCo crap.

> > Either way, I have no issues whatsoever if the direction for x86 is to
> > provide abstracted fault information.
> 
> I don't understand how ARM can get away with NOT providing a layer of abstraction.
> Copying fault state verbatim to userspace will bleed KVM implementation details
> into userspace,

The memslot flag already bleeds KVM implementation detail into userspace
to a degree. The event we're trying to let userspace handle is at the
intersection of a specific hardware/software state.

> Abstracting gory hardware details from userspace is one of the main roles of the
> kernel.

Where it can be accomplished without a loss (or misrepresentation) of
information, agreed. But KVM UAPI is so architecture-specific that it
seems arbitrary to draw the line here.

> A concrete example of hardware throwing a wrench in things is AMD's upcoming
> "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
> the page fault error code directly to userspace, then running the same VM on
> different hardware generations, e.g. after live migration, would generate different
> error codes.
>  
> Are we talking past each other?  I'm genuinely confused by the pushback on
> capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
> but its existence doesn't preclude KVM from providing more information as needed.

My pushback isn't exactly on RWX (even though I noted the MTE quirk
above). What I'm poking at here is the general infrastructure for
reflecting faults into userspace, which is aggressively becoming more
relevant.
Sean Christopherson March 5, 2024, 3:39 p.m. UTC | #7
On Tue, Mar 05, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:
> 
> [...]
> 
> > The presense of MTE stuff shouldn't affect the fundamental access information,
> 
>   "When FEAT_MTE is implemented, for a synchronous Data Abort on an
>   instruction that directly accesses Allocation Tags, ISV is 0."
> 
> If there is no instruction syndrome, there's insufficient fault context
> to determine if the guest was doing a read or a write.
> 
> > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> > irrespective of whether or not MTE is in play.
> 
> When the MMU generates such an abort, it *is not* read, write, or execute.
> It is a NoTagAccess fault. There is no sane way to describe this in
> terms of RWX.

RWX=0, with KVM_MEMORY_EXIT_FLAG_MTE seems like a reasonable way to communicate
that, no?

> > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > > "reserved bits set" information in register state.  But that's purely an
> > > > (optional) imlementation detail of KVM that should never be exposed to
> > > > userspace.
> > > 
> > > MMIO accesses would show up elsewhere though, right?
> > 
> > Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
> > you're saying/asking.
> 
> If "reserved" EPT violations found their way to userspace via the
> "memory fault" exit structure then that'd likely be due to a KVM bug.

Heh, sadly no.  Userspace can convert a GFN to private at any time, and the
TDX and SNP specs allow for implicit converstion "requests" from the guest, i.e.
KVM isn't allowed to kill the guest if the guest accesses a GFN with the "wrong"
private/shared attribute.

This particular case would likely be hit only if there's a userspace and/or guest
bug, but whether the setup is broken or just unique isn't KVM's call to make.

> The only expected flows in the near term are this and CoCo crap.
> 
> > > Either way, I have no issues whatsoever if the direction for x86 is to
> > > provide abstracted fault information.
> > 
> > I don't understand how ARM can get away with NOT providing a layer of abstraction.
> > Copying fault state verbatim to userspace will bleed KVM implementation details
> > into userspace,
> 
> The memslot flag already bleeds KVM implementation detail into userspace
> to a degree. The event we're trying to let userspace handle is at the
> intersection of a specific hardware/software state.

Yes, but IMO there's a huge difference between userspace knowing that KVM uses gup()
and memslots to translate gfn=>hva=>pfn, or even knowing that KVM plays games with
reserved stage-2 PTE bits, and userspace knowing exactly how KVM configures stage-2
PTEs.

Another example would be dirty logging on Intel CPUs.  The *admin* can decide
whether to use a write-protection scheme or page-modification logging, but KVM's
ABI with userspace provides a layer of abstraction (dirty ring or bitmap) so that
the userspace VMM doesn't need to do X for write-protection and Y for PML.

> > Abstracting gory hardware details from userspace is one of the main roles of the
> > kernel.
> 
> Where it can be accomplished without a loss (or misrepresentation) of
> information, agreed. But KVM UAPI is so architecture-specific that it
> seems arbitrary to draw the line here.

I don't think it's arbitrary.  KVM's existing uAPI for mapping memory into the
guest is almost entirely arch-neutral, and I want to preserve that for related
functionality unless it's literally impossible to do so.

> > A concrete example of hardware throwing a wrench in things is AMD's upcoming
> > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> > CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
> > the page fault error code directly to userspace, then running the same VM on
> > different hardware generations, e.g. after live migration, would generate different
> > error codes.
> >  
> > Are we talking past each other?  I'm genuinely confused by the pushback on
> > capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
> > but its existence doesn't preclude KVM from providing more information as needed.
> 
> My pushback isn't exactly on RWX (even though I noted the MTE quirk
> above). What I'm poking at here is the general infrastructure for
> reflecting faults into userspace, which is aggressively becoming more
> relevant.

But the purpose of memory_fault isn't to reflect faults into userspace, it's to
alert userspace that KVM has encountered a memory fault that requires userspace
action to resolve.

That distinction matters because there are and will be MMU features that KVM
supports, and that can generate novel faults, but such faults won't be punted to
userspace unless KVM provides a way for userspace to explicitly control the MMU
feature.

And if KVM lets userspace control a feature, then KVM needs new uAPI to expose
the controls.  Which means that we should always have an opportunity to expand
memory_fault, e.g. with new flags, to support such features.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index d52757f9e1cb..7012f40332b3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8031,7 +8031,7 @@  unavailable to host or other VMs.
 7.34 KVM_CAP_MEMORY_FAULT_INFO
 ------------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that KVM_RUN will fill
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a25265aca432..ca4617f53250 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -240,6 +240,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dfe0cbb5937c..5b740ddfcc8e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1492,8 +1492,11 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
 	}
-	if (is_error_noslot_pfn(pfn))
+	if (is_error_noslot_pfn(pfn)) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
+					      write_fault, exec_fault, false);
 		return -EFAULT;
+	}
 
 	if (kvm_is_device_pfn(pfn)) {
 		/*