diff mbox series

[5/8] kvm: Add cap/kvm_run field for memory fault exits

Message ID 20230215011614.725983-6-amoorthy@google.com (mailing list archive)
State New, archived
Headers show
Series Add memory fault exits to avoid slow GUP | expand

Commit Message

Anish Moorthy Feb. 15, 2023, 1:16 a.m. UTC
This new KVM exit allows userspace to handle missing memory. It
indicates that the pages in the range [gpa, gpa + size) must be mapped.

The "flags" field actually goes unused in this series: it's included for
forward compatibility with [1], should this series happen to go in
first.

[1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h       | 13 +++++++++++
 include/uapi/linux/kvm.h       | 13 ++++++++++-
 tools/include/uapi/linux/kvm.h |  7 ++++++
 virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
 5 files changed, 100 insertions(+), 1 deletion(-)

Comments

Marc Zyngier Feb. 15, 2023, 8:41 a.m. UTC | #1
On Wed, 15 Feb 2023 01:16:11 +0000,
Anish Moorthy <amoorthy@google.com> wrote:
> 
> This new KVM exit allows userspace to handle missing memory. It
> indicates that the pages in the range [gpa, gpa + size) must be mapped.
> 
> The "flags" field actually goes unused in this series: it's included for
> forward compatibility with [1], should this series happen to go in
> first.
> 
> [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h       | 13 +++++++++++
>  include/uapi/linux/kvm.h       | 13 ++++++++++-
>  tools/include/uapi/linux/kvm.h |  7 ++++++
>  virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
>  5 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9807b05a1b571..4b06e60668686 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.137 KVM_SET_MEM_FAULT_NOWAIT
> +------------------------------
> +
> +:Capability: KVM_CAP_MEM_FAULT_NOWAIT
> +:Architectures: x86, arm64
> +:Type: vm ioctl
> +:Parameters: bool state (in)

Huh. See towards the end of this patch why I think this is a pretty
bad idea.

> +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.

Please pick a meaningful error code instead of -1. And if you intended
this as -EPERM, please explain the rationale (-EINVAL would seem more
appropriate).

> +
> +Enables (state=true) or disables (state=false) waitless memory faults. For more
> +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
> +
>  5. The kvm_run structure
>  ========================
>  
> @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +which userspace may choose to handle.

No, the vcpu hasn't "encountered a memory error". The hypervisor has
taken a page fault, which is very different. And it isn't that KVM
couldn't handle it (or we wouldn't have a hypervisor at all). From
what I understand of this series (possibly very little), userspace has
to *ask* for these, and they are delivered in specific circumstances.
Which are?

> +
> +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> +may handle the error and return to KVM to retry the previous memory access.

What are these *exactly*? In what unit? What guarantees that the
process eventually converges? How is userspace supposed to handle the
fault (which is *not* an error)? Don't you need to communicate other
information, such as the type of fault (read, write, permission or
translation fault...)?

> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> @@ -7577,6 +7604,21 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>  
> +7.34 KVM_CAP_MEM_FAULT_NOWAIT
> +-----------------------------
> +
> +:Architectures: x86, arm64
> +:Target: VM
> +:Parameters: None
> +:Returns: 0 on success, or -EINVAL if capability is not supported.
> +
> +The presence of this capability indicates that userspace can enable/disable
> +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
> +
> +When waitless memory faults are enabled, fast get_user_pages failures when
> +handling EPT/Shadow Page Table violations will cause a vCPU exit
> +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.

Do you really expect a random VMM hacker to understand what GUP is?
Also, the x86 verbiage makes zero sense to me. Please write this in a
way that is useful for userspace people, because they are the ones
consuming this documentation. I really can't imagine anyone being able
to write something useful based on this.

> +
>  8. Other capabilities.
>  ======================
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 109b18e2789c4..9352e7f8480fb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -801,6 +801,9 @@ struct kvm {
>  	bool vm_bugged;
>  	bool vm_dead;
>  
> +	rwlock_t mem_fault_nowait_lock;
> +	bool mem_fault_nowait;

A full-fat rwlock to protect a single bool? What benefits do you
expect from a rwlock? Why is it preferable to an atomic access, or a
simple bitop?

> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
>  #endif
> @@ -2278,4 +2281,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +static inline bool memory_faults_enabled(struct kvm *kvm)
> +{
> +	bool ret;
> +
> +	read_lock(&kvm->mem_fault_nowait_lock);
> +	ret = kvm->mem_fault_nowait;
> +	read_unlock(&kvm->mem_fault_nowait_lock);
> +	return ret;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 55155e262646e..064fbfed97f01 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;

Sigh... This doesn't even match the documentation.

>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1175,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>  #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> +#define KVM_CAP_MEM_FAULT_NOWAIT 226
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1658,7 +1666,7 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> -/* Available with  KVM_CAP_S390_VCPU_RESETS */
> +/* Available with KVM_CAP_S390_VCPU_RESETS */

Unrelated change?

>  #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
>  #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
>  
> @@ -2228,4 +2236,7 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Available with KVM_CAP_MEM_FAULT_NOWAIT */
> +#define KVM_SET_MEM_FAULT_NOWAIT _IOWR(KVMIO, 0xd2, bool)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 20522d4ba1e0d..5d9e3f48a9634 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dae5f48151032..8e5bfc00d1181 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1149,6 +1149,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	INIT_LIST_HEAD(&kvm->devices);
>  	kvm->max_vcpus = KVM_MAX_VCPUS;
>  
> +	rwlock_init(&kvm->mem_fault_nowait_lock);
> +	kvm->mem_fault_nowait = false;
> +
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
>  	/*
> @@ -2313,6 +2316,16 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +static int kvm_vm_ioctl_set_mem_fault_nowait(struct kvm *kvm, bool state)
> +{
> +	if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MEM_FAULT_NOWAIT))
> +		return -1;
> +	write_lock(&kvm->mem_fault_nowait_lock);
> +	kvm->mem_fault_nowait = state;
> +	write_unlock(&kvm->mem_fault_nowait_lock);
> +	return 0;
> +}
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4675,6 +4688,10 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEM_FAULT_NOWAIT:
> +		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
> +			return -EINVAL;
> +		return 0;
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -4892,6 +4909,15 @@ static long kvm_vm_ioctl(struct file *filp,
>  		r = 0;
>  		break;
>  	}
> +	case KVM_SET_MEM_FAULT_NOWAIT: {
> +		bool state;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&state, argp, sizeof(state)))

A copy_from_user for... a bool? Why don't you directly treat argp as
the boolean itself? Of even better, make it a more interesting
quantity so that the API can evolve in the future. As it stands, it is
not extensible, which means it is already dead, if Linux APIs are
anything to go by.

On a related subject: is there a set of patches for any of the main
VMMs making any use of this?

	M.
Oliver Upton Feb. 15, 2023, 8:59 a.m. UTC | #2
On Wed, Feb 15, 2023 at 01:16:11AM +0000, Anish Moorthy wrote:
> This new KVM exit allows userspace to handle missing memory. It
> indicates that the pages in the range [gpa, gpa + size) must be mapped.
> 
> The "flags" field actually goes unused in this series: it's included for
> forward compatibility with [1], should this series happen to go in
> first.
> 
> [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h       | 13 +++++++++++
>  include/uapi/linux/kvm.h       | 13 ++++++++++-
>  tools/include/uapi/linux/kvm.h |  7 ++++++
>  virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
>  5 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9807b05a1b571..4b06e60668686 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.137 KVM_SET_MEM_FAULT_NOWAIT
> +------------------------------
> +
> +:Capability: KVM_CAP_MEM_FAULT_NOWAIT
> +:Architectures: x86, arm64
> +:Type: vm ioctl
> +:Parameters: bool state (in)
> +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
> +
> +Enables (state=true) or disables (state=false) waitless memory faults. For more
> +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.

Why not use KVM_ENABLE_CAP instead for this? Its a preexisting UAPI for
toggling KVM behaviors.

>  5. The kvm_run structure
>  ========================
>  
> @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +which userspace may choose to handle.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> +may handle the error and return to KVM to retry the previous memory access.

How is userspace expected to differentiate the gup_fast() failed exit
from the guest-private memory exit? I don't think flags are a good idea
for this, as it comes with the illusion that both events can happen on a
single exit. In reality, these are mutually exclusive.

A fault type/code would be better here, with the option to add flags at
a later date that could be used to further describe the exit (if
needed).
Sean Christopherson Feb. 15, 2023, 5:07 p.m. UTC | #3
On Wed, Feb 15, 2023, Marc Zyngier wrote:
> On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy <amoorthy@google.com> wrote:
> >  8. Other capabilities.
> >  ======================
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 109b18e2789c4..9352e7f8480fb 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -801,6 +801,9 @@ struct kvm {
> >  	bool vm_bugged;
> >  	bool vm_dead;
> >  
> > +	rwlock_t mem_fault_nowait_lock;
> > +	bool mem_fault_nowait;
> 
> A full-fat rwlock to protect a single bool? What benefits do you
> expect from a rwlock? Why is it preferable to an atomic access, or a
> simple bitop?

There's no need to have any kind off dedicated atomicity.  The only readers are
in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.
Anish Moorthy Feb. 16, 2023, 6:53 p.m. UTC | #4
On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Why not use KVM_ENABLE_CAP instead for this? Its a preexisting UAPI for
> toggling KVM behaviors.

Oh I wrote it this way because, even with the "args" field in "struct
kvm_enable_cap" to express the enable/disable intent, it felt weird to allow
disabling the feature through KVM_ENABLE_CAP. But it seems like that's the
convention, so I'll make the change.

> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > +
> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
> > +
>
> How is userspace expected to differentiate the gup_fast() failed exit
> from the guest-private memory exit? I don't think flags are a good idea
> for this, as it comes with the illusion that both events can happen on a
> single exit. In reality, these are mutually exclusive.
>
> A fault type/code would be better here, with the option to add flags at
> a later date that could be used to further describe the exit (if
> needed).

Agreed. Something like this, then?

+    struct {
+        __u32 fault_code;
+        __u64 reserved;
+        __u64 gpa;
+        __u64 size;
+    } memory_fault;

The "reserved" field is meant to be the placeholder for a future "flags" field.
Let me know if there's a better/more conventional way to achieve this.

On Wed, Feb 15, 2023 at 9:07 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 15, 2023, Marc Zyngier wrote:
> > On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy <amoorthy@google.com> wrote:
> > >  8. Other capabilities.
> > >  ======================
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 109b18e2789c4..9352e7f8480fb 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -801,6 +801,9 @@ struct kvm {
> > >     bool vm_bugged;
> > >     bool vm_dead;
> > >
> > > +   rwlock_t mem_fault_nowait_lock;
> > > +   bool mem_fault_nowait;
> >
> > A full-fat rwlock to protect a single bool? What benefits do you
> > expect from a rwlock? Why is it preferable to an atomic access, or a
> > simple bitop?
>
> There's no need to have any kind off dedicated atomicity.  The only readers are
> in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.

I think we do need atomicity here. When KVM_CAP_MEM_FAULT_NOWAIT is enabled
async page faults are essentially disabled: so userspace will likely want to
disable the cap at some point (such as the end of live migration post-copy).
Since we want to support this without having to pause vCPUs, there's an
atomicity requirement.

Marc, I used an rwlock simply because it seemed like the easiest correct thing
to do. I had a hunch that I'd be asked to change this to an atomic, so I can go
ahead and do that barring any other suggestions.

On Wed, Feb 15, 2023 at 12:41 AM Marc Zyngier <maz@kernel.org> wrote:
>
> > +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
>
> Please pick a meaningful error code instead of -1. And if you intended
> this as -EPERM, please explain the rationale (-EINVAL would seem more
> appropriate).

As I mention earlier, I'll be switching to toggling the capability via
KVM_ENABLE_CAP so this KVM_SET_MEM_FAULT_NOWAIT ioctl is going to go away. I
will make sure to set an appropriate error code if the user passes nonsensical
"args" to KVM_ENABLE_CAP though, whatever that ends up meaning.

> > +
> > +Enables (state=true) or disables (state=false) waitless memory faults. For more
> > +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
> > +
> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +which userspace may choose to handle.
>
> No, the vcpu hasn't "encountered a memory error". The hypervisor has
> taken a page fault, which is very different. And it isn't that KVM
> couldn't handle it (or we wouldn't have a hypervisor at all). From
> what I understand of this series (possibly very little), userspace has
> to *ask* for these, and they are delivered in specific circumstances.
> Which are?

Thanks for the correction: "encountered a memory error" was incorrect phrasing.
What is your opinion on the following, hopefully more accurate, documentation?

"An exit reason of KVM_EXIT_MEMORY_FAULT indicates that the vCPU encountered a
memory fault for which the userspace page tables did not contain a present
mapping. This exit is only generated when KVM_CAP_MEM_FAULT_NOWAIT is enabled:
otherwise KVM will attempt to resolve the fault without exiting to userspace,
returning -1 with errno=EFAULT when it cannot."

It might also help to note that the vCPUs exit in exactly the same cases where
userspace would get a page fault when trying to access the memory through the
VMA/mapping provided to the memslot. Please let me know if you feel that this
information should be included in the documentation, or if you see anything else
I've gotten wrong.

As for handling these faults: userspace should consider its own knowledge of the
guest and determine how best to resolve each fault. For instance if userspace
hasn't UFFDIO_COPY/CONTINUEd a faulting page yet, then it must do so to
establish the mapping. If it already did, then the next step would be to fault
in the page via MADV_POPULATE_READ|WRITE.

A naive userspace could just always MADV_POPULATE_READ|WRITE in response to
these faults: that would basically yield KVM's behavior *without* the capability
enabled, just with extra steps, worse performance, and no chance for async page
faults.

A final note: this is just how the fault applies in the context of
userfaultfd-based post-copy live migration. Other uses might come up in the
future, in which case userspace would want to take some other sort of action.

> > +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> > +may handle the error and return to KVM to retry the previous memory access.
>
> What are these *exactly*? In what unit?

Sorry, I'll add some descriptive comments here. "gpa" is the guest physical
address of the faulting page (rounded down to be aligned with "size"), and
"size" is just the size in bytes of the faulting page.

> ... What guarantees that the
> process eventually converges? How is userspace supposed to handle the
> fault (which is *not* an error)? Don't you need to communicate other
> information, such as the type of fault (read, write, permission or
> translation fault...)?

Ok, two parts to the response here.

1. As Oliver touches on earlier, we'll probably want to use this same field for
   different classes of memory fault in the future (such as the ones which Chao
   is introducing in [1]): so it does make sense to add "code" and "flags"
   fields which can be used to communicate more information to the user (and
   which can just be set to MEM_FAULT_NOWAIT/0 in this series).

2. As to why a code/flags of MEM_FAULT_NOWAIT/0 should be enough information to
   convey to the user here: the assumption behind this series is that
   MADV_POPULATE_READ|WRITE will always be enough to ensure that the vCPU can
   run again. This goes back to the earlier claim about vCPUs exiting in the
   exact same cases as when userspace would get a page fault trying to access
   the same memory. MADV_POPULATE_READ|WRITE is intended to resolve exactly
   these cases, so userspace shouldn't need anything more.

Again, a VMM enabling the new exit to make post-copy faster must determine what
action to take depending on whether it has UFFDIO_COPY|CONTINUEd the faulting
page (at least for now, perhaps there will be more uses in the future): we can
keep discussing if this seems fragile. Just keeping things limited to
userfaultfd, having KVM communicate to userspace which action is required could
get difficult. For UFFDIO_CONTINUE we'd have to check if the VMA had
VM_UFFD_MINOR, and for UFFDIO_COPY we'd have to at least check the page cache.

> >
> > +7.34 KVM_CAP_MEM_FAULT_NOWAIT
> > +-----------------------------
> > +
> > +:Architectures: x86, arm64
> > +:Target: VM
> > +:Parameters: None
> > +:Returns: 0 on success, or -EINVAL if capability is not supported.
> > +
> > +The presence of this capability indicates that userspace can enable/disable
> > +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
> > +
> > +When waitless memory faults are enabled, fast get_user_pages failures when
> > +handling EPT/Shadow Page Table violations will cause a vCPU exit
> > +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.
>
> Do you really expect a random VMM hacker to understand what GUP is?
> Also, the x86 verbiage makes zero sense to me. Please write this in a
> way that is useful for userspace people, because they are the ones
> consuming this documentation. I really can't imagine anyone being able
> to write something useful based on this.

My bad: obviously I shouldn't be exposing the kernel's implementation details
here. As for what the documentation should actually say, I'm thinking I can
adapt the revised description of the KVM exit from earlier ("An exit reason of
KVM_EXIT_MEMORY_FAULT indicates..."). Again, please let me know if you see
anything which should be changed there.

> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 flags;
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
>
> Sigh... This doesn't even match the documentation.

Ack! Ack.

> > -/* Available with  KVM_CAP_S390_VCPU_RESETS */
> > +/* Available with KVM_CAP_S390_VCPU_RESETS */
>
> Unrelated change?

Yes, I'll drop it from the next revision.

> > +             r = -EFAULT;
> > +             if (copy_from_user(&state, argp, sizeof(state)))
>
> A copy_from_user for... a bool? Why don't you directly treat argp as
> the boolean itself? Of even better, make it a more interesting
> quantity so that the API can evolve in the future. As it stands, it is
> not extensible, which means it is already dead, if Linux APIs are
> anything to go by.

I'll keep that in mind for the future, but this is now moot since I'll be
removing KVM_SET_MEM_FAULT_NOWAIT.

> On a related subject: is there a set of patches for any of the main
> VMMs making any use of this?

No. For what it's worth, the demand_paging_selftest demonstrates the basics of
what userspace can do with the new exit. Would it be helpful to see how QEMU
could use it as well?

[1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
Sean Christopherson Feb. 16, 2023, 9:38 p.m. UTC | #5
On Thu, Feb 16, 2023, Anish Moorthy wrote:
> On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 109b18e2789c4..9352e7f8480fb 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -801,6 +801,9 @@ struct kvm {
> > > >     bool vm_bugged;
> > > >     bool vm_dead;
> > > >
> > > > +   rwlock_t mem_fault_nowait_lock;
> > > > +   bool mem_fault_nowait;
> > >
> > > A full-fat rwlock to protect a single bool? What benefits do you
> > > expect from a rwlock? Why is it preferable to an atomic access, or a
> > > simple bitop?
> >
> > There's no need to have any kind off dedicated atomicity.  The only readers are
> > in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.
> 
> I think we do need atomicity here.

Atomicity, yes.  Mutually exclusivity, no.  AFAICT, nothing will break if userspace
has multiple in-flight calls to toggled the flag.  And if we do want to guarantee
there's only one writer, then kvm->lock or kvm->slots_lock will suffice.

> When KVM_CAP_MEM_FAULT_NOWAIT is enabled async page faults are essentially
> disabled: so userspace will likely want to disable the cap at some point
> (such as the end of live migration post-copy).

Ah, this is a dynamic thing and not a set-and-forget thing.

> Since we want to support this without having to pause vCPUs, there's an
> atomicity requirement.

Ensuring that vCPUs "see" the new value and not corrupting memory are two very
different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
they're not even necessary to avoid memory corruption, e.g. the result of concurrent
writes to a bool is non-deterministic, but so is the order of two tasks contending
for a lock, so it's a moot point.

I think what you really want to achieve is that vCPUs observe the NOWAIT flag
before KVM returns to userspace.  There are a variety of ways to make that happen,
but since this all about accessing guest memory, the simplest is likely to
"protect" the flag with kvm->srcu, i.e. require SRCU be held by readers and then
do a synchronize_srcu() to ensure all vCPUs have picked up the new value.

Speaking of SRCU (which protect memslots), why not make this a memslot flag?  If
the goal is to be able to turn the behavior on/off dynamically, wouldn't it be
beneficial to turn off the NOWAIT behavior when a memslot is fully transfered?

A memslot flag would likely be simpler to implement as it would piggyback all of
the existing infrastructure to handle memslot updates.
Anish Moorthy Feb. 17, 2023, 7:14 p.m. UTC | #6
On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > How is userspace expected to differentiate the gup_fast() failed exit
> > from the guest-private memory exit? I don't think flags are a good idea
> > for this, as it comes with the illusion that both events can happen on a
> > single exit. In reality, these are mutually exclusive.
> >
> > A fault type/code would be better here, with the option to add flags at
> > a later date that could be used to further describe the exit (if
> > needed).
>
> Agreed. Something like this, then?
>
> +    struct {
> +        __u32 fault_code;
> +        __u64 reserved;
> +        __u64 gpa;
> +        __u64 size;
> +    } memory_fault;
>
> The "reserved" field is meant to be the placeholder for a future "flags" field.
> Let me know if there's a better/more conventional way to achieve this.

On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> 1. As Oliver touches on earlier, we'll probably want to use this same field for
>    different classes of memory fault in the future (such as the ones which Chao
>    is introducing in [1]): so it does make sense to add "code" and "flags"
>    fields which can be used to communicate more information to the user (and
>    which can just be set to MEM_FAULT_NOWAIT/0 in this series).

Let me walk back my responses here: I took a closer look at Chao's series, and
it doesn't seem that I should be trying to share KVM_EXIT_MEMORY_FAULT with it
in the first place. As far as I can understand (not that much, to be clear :)
we're signaling unrelated things, so it makes more sense to use different exits
(ie, rename mine to KVM_EXIT_MEMORY_FAULT_NOWAIT). That would prevent any
potential confusion about mutual exclusivity.

That also removes the need for a "fault_code" field in "memory_fault", which I
could rename to something more general like "guest_memory_range". As for the
"reserved" field, we could keep it around if we care about reusing
"guest_memory_range" between exits, or discard it if we don't.

On Thu, Feb 16, 2023 at 1:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 16, 2023, Anish Moorthy wrote:
> > On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index 109b18e2789c4..9352e7f8480fb 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -801,6 +801,9 @@ struct kvm {
> > > > >     bool vm_bugged;
> > > > >     bool vm_dead;
> > > > >
> > > > > +   rwlock_t mem_fault_nowait_lock;
> > > > > +   bool mem_fault_nowait;
> > > >
> > > > A full-fat rwlock to protect a single bool? What benefits do you
> > > > expect from a rwlock? Why is it preferable to an atomic access, or a
> > > > simple bitop?
> > >
> > > There's no need to have any kind off dedicated atomicity.  The only readers are
> > > in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.
> >
> > I think we do need atomicity here.
>
> Atomicity, yes.  Mutually exclusivity, no.  AFAICT, nothing will break if userspace
> has multiple in-flight calls to toggled the flag.  And if we do want to guarantee
> there's only one writer, then kvm->lock or kvm->slots_lock will suffice.
>
> > Since we want to support this without having to pause vCPUs, there's an
> > atomicity requirement.
>
> Ensuring that vCPUs "see" the new value and not corrupting memory are two very
> different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
> nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
> they're not even necessary to avoid memory corruption...

Oh, that's news to me- I've learned to treat any unprotected concurrent accesses
to memory as undefined behavior: guess there's always more to learn.

> I think what you really want to achieve is that vCPUs observe the NOWAIT flag
> before KVM returns to userspace.  There are a variety of ways to make that happen,
> but since this all about accessing guest memory, the simplest is likely to
> "protect" the flag with kvm->srcu, i.e. require SRCU be held by readers and then
> do a synchronize_srcu() to ensure all vCPUs have picked up the new value.
>
> Speaking of SRCU (which protect memslots), why not make this a memslot flag?  If
> the goal is to be able to turn the behavior on/off dynamically, wouldn't it be
> beneficial to turn off the NOWAIT behavior when a memslot is fully transfered?

Thanks for the suggestions, I never considered making this a memslot flag
actually. so I'll go look into that.
Sean Christopherson Feb. 17, 2023, 8:33 p.m. UTC | #7
On Fri, Feb 17, 2023, Anish Moorthy wrote:
> On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > How is userspace expected to differentiate the gup_fast() failed exit
> > > from the guest-private memory exit?

Sorry, missed this comment first time around.  I don't see any reason why userspace
needs to uniquely identify the gup_fast() case.  Similar to page faults from
hardware, KVM should describe the access in sufficient detail so that userspace
can resolve the issue, whatever that may be, but KVM should make any assumptions
about _why_ the failure occurred.  

gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
KVM will exit if and only if the guest accesses a valid page that hasn't yet been
transfered from the source, but the _actual_ behavior is that KVM will exit if
the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
speculative to some degree, as that suggests the access may have succeeded if
KVM "waited", which may or may not be true.

E.g. see the virtiofs trunctionation use case that planted the original seed for
this approach: https://lore.kernel.org/kvm/20201005161620.GC11938@linux.intel.com.

> > > I don't think flags are a good idea for this, as it comes with the
> > > illusion that both events can happen on a single exit. In reality, these
> > > are mutually exclusive.

They aren't mutually exclusive.  Obviously KVM will only detect one other the
other, but it's entirely possible that a guest could be accessing the "wrong"
flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
should see that (a) it needs to change the memory attributes and (b) that it
needs to demand the to-be-installed page from the source.

> > > A fault type/code would be better here, with the option to add flags at
> > > a later date that could be used to further describe the exit (if
> > > needed).
> >
> > Agreed.

Not agreed :-)

> > +    struct {
> > +        __u32 fault_code;
> > +        __u64 reserved;
> > +        __u64 gpa;
> > +        __u64 size;
> > +    } memory_fault;
> >
> > The "reserved" field is meant to be the placeholder for a future "flags" field.
> > Let me know if there's a better/more conventional way to achieve this.
> 
> On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > 1. As Oliver touches on earlier, we'll probably want to use this same field for
> >    different classes of memory fault in the future (such as the ones which Chao
> >    is introducing in [1]): so it does make sense to add "code" and "flags"
> >    fields which can be used to communicate more information to the user (and
> >    which can just be set to MEM_FAULT_NOWAIT/0 in this series).
> 
> Let me walk back my responses here: I took a closer look at Chao's series, and
> it doesn't seem that I should be trying to share KVM_EXIT_MEMORY_FAULT with it
> in the first place. As far as I can understand (not that much, to be clear :)
> we're signaling unrelated things, so it makes more sense to use different exits
> (ie, rename mine to KVM_EXIT_MEMORY_FAULT_NOWAIT). That would prevent any
> potential confusion about mutual exclusivity.

Hard "no" on a separate exit reason unless someone comes up with a very compelling
argument.

Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
functionality in Chao's series is lacking and/or will conflict with this UFFD,
then we can and should address those issues _before_ it gets merged.
Sean Christopherson Feb. 17, 2023, 8:47 p.m. UTC | #8
On Fri, Feb 17, 2023, Anish Moorthy wrote:
> On Thu, Feb 16, 2023 at 1:38 PM Sean Christopherson <seanjc@google.com> wrote:
> > Ensuring that vCPUs "see" the new value and not corrupting memory are two very
> > different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
> > nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
> > they're not even necessary to avoid memory corruption...
> 
> Oh, that's news to me- I've learned to treat any unprotected concurrent accesses
> to memory as undefined behavior: guess there's always more to learn.

To expand a bit, undefined and non-deterministic are two different things.  In
this case, the behavior is not deterministic, but it _is_ defined.  Readers will
either see %true or %false, but they will not see %beagle. 

And more importantly _KVM_ doesn't care whether or not the behavior is deterministic,
that's userspace's responsibility.  E.g. if KVM were endangered by the flag changing
then some form of locking absolutely would be needed, but that's not the case here.
Anish Moorthy Feb. 23, 2023, 1:16 a.m. UTC | #9
On Fri, Feb 17, 2023 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote
> > > > I don't think flags are a good idea for this, as it comes with the
> > > > illusion that both events can happen on a single exit. In reality, these
> > > > are mutually exclusive.
>
> They aren't mutually exclusive.  Obviously KVM will only detect one other the
> other, but it's entirely possible that a guest could be accessing the "wrong"
> flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
> should see that (a) it needs to change the memory attributes and (b) that it
> needs to demand the to-be-installed page from the source.
>
> > > > A fault type/code would be better here, with the option to add flags at
> > > > a later date that could be used to further describe the exit (if
> > > > needed).
> > >
> > > Agreed.
>
> Not agreed :-)
> ...
> Hard "no" on a separate exit reason unless someone comes up with a very compelling
> argument.
>
> Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
> functionality in Chao's series is lacking and/or will conflict with this UFFD,
> then we can and should address those issues _before_ it gets merged.

Ok so I have a v2 of the series basically ready to go, but I realized
that I should
probably have brought up my modified API here to make sure it was
sane: so, I'll do
that first

In v2, I've
(1)  renamed the kvm cap from KVM_CAP_MEM_FAULT_NOWAIT to
KVM_CAP_MEMORY_FAULT_EXIT due to Sean's earlier comment

> gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
> KVM will exit if and only if the guest accesses a valid page that hasn't yet been
> transfered from the source, but the _actual_ behavior is that KVM will exit if
> the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
> speculative to some degree, as that suggests the access may have succeeded if
> KVM "waited", which may or may not be true.

(2) kept the definition of kvm_run.memory_fault as
struct {
    __u64 flags;
    __u64 gpa;
    __ u64 len;
} memory_fault;
which, apart from the name of the "len" field, is exactly what Chao
has in their series.
flags remains a bitfield describing the reason for the memory fault:
in the two places
this series generates this fault, it sets a bit in flags. Userspace is
meant to check whether
a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

(3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
indicates whether this flag is supported. As such, trying to enable
the new capability
directly generates an EINVAL, which is meant to make the incorrect usage clear.

Hopefully this all appears sane?
Sean Christopherson Feb. 23, 2023, 8:55 p.m. UTC | #10
On Wed, Feb 22, 2023, Anish Moorthy wrote:
> On Fri, Feb 17, 2023 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote
> > > > > I don't think flags are a good idea for this, as it comes with the
> > > > > illusion that both events can happen on a single exit. In reality, these
> > > > > are mutually exclusive.
> >
> > They aren't mutually exclusive.  Obviously KVM will only detect one other the
> > other, but it's entirely possible that a guest could be accessing the "wrong"
> > flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
> > should see that (a) it needs to change the memory attributes and (b) that it
> > needs to demand the to-be-installed page from the source.
> >
> > > > > A fault type/code would be better here, with the option to add flags at
> > > > > a later date that could be used to further describe the exit (if
> > > > > needed).
> > > >
> > > > Agreed.
> >
> > Not agreed :-)
> > ...
> > Hard "no" on a separate exit reason unless someone comes up with a very compelling
> > argument.
> >
> > Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
> > functionality in Chao's series is lacking and/or will conflict with this UFFD,
> > then we can and should address those issues _before_ it gets merged.
> 
> Ok so I have a v2 of the series basically ready to go, but I realized
> that I should
> probably have brought up my modified API here to make sure it was
> sane: so, I'll do
> that first
> 
> In v2, I've
> (1)  renamed the kvm cap from KVM_CAP_MEM_FAULT_NOWAIT to
> KVM_CAP_MEMORY_FAULT_EXIT due to Sean's earlier comment
> 
> > gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
> > KVM will exit if and only if the guest accesses a valid page that hasn't yet been
> > transfered from the source, but the _actual_ behavior is that KVM will exit if
> > the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
> > speculative to some degree, as that suggests the access may have succeeded if
> > KVM "waited", which may or may not be true.
> 
> (2) kept the definition of kvm_run.memory_fault as
> struct {
>     __u64 flags;
>     __u64 gpa;
>     __ u64 len;
> } memory_fault;
> which, apart from the name of the "len" field, is exactly what Chao
> has in their series.

Off-topic, please adjust whatever email client you're using to not wrap so
agressively and at seeming random times.

As written, this makes my eyes bleed, whereas formatting like so does not :-)

  Ok so I have a v2 of the series basically ready to go, but I realized that I
  should probably have brought up my modified API here to make sure it was sane:
  so, I'll do that first

  ...

  which, apart from the name of the "len" field, is exactly what Chao
  has in their series.

  Flags remains a bitfield describing the reason for the memory fault:
  in the two places this series generates this fault, it sets a bit in flags.
  Userspace is meant to check whether a memory_fault was generated due to
  KVM_CAP_MEMORY_FAULT_EXIT using the KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

> flags remains a bitfield describing the reason for the memory fault:
> in the two places
> this series generates this fault, it sets a bit in flags. Userspace is
> meant to check whether
> a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
> KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

Before sending a new version, let's bottom out on whether or not a
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT flag is necessary.  I'm not dead set against
a flag, but as I called out earlier[*], it can have false positives.  I.e. userspace
needs to be able to suss out the real problem anyways.  And if userspace needs to
be that smart, what's the point of the flag?

[*] https://lore.kernel.org/all/Y+%2FkgMxQPOswAz%2F2@google.com

> 
> (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> indicates whether this flag is supported.

The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
guest-memory -EFAULTS to KVM_CAP_MEMORY_FAULT_EXIT.  All in quotes because I would
likely be ok with a partial conversion for the initial implementation if there
are paths that would require an absurd amount of work to convert.

> As such, trying to enable the new capability directly generates an EINVAL,
> which is meant to make the incorrect usage clear.
> 
> Hopefully this all appears sane?
Anish Moorthy Feb. 23, 2023, 11:03 p.m. UTC | #11
On Thu, Feb 23, 2023 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> Off-topic, please adjust whatever email client you're using to not wrap so
> agressively and at seeming random times.
> As written, this makes my eyes bleed, whereas formatting like so does not :-)

Oof, that *is* pretty bad. I think I have the cause figured out
though, so I'll be careful about that from now on :)

>   Ok so I have a v2 of the series basically ready to go, but I realized that I
>   should probably have brought up my modified API here to make sure it was sane:
>   so, I'll do that first
>
>   ...
>
>   which, apart from the name of the "len" field, is exactly what Chao
>   has in their series.
>
>   Flags remains a bitfield describing the reason for the memory fault:
>   in the two places this series generates this fault, it sets a bit in flags.
>   Userspace is meant to check whether a memory_fault was generated due to
>   KVM_CAP_MEMORY_FAULT_EXIT using the KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.
>
> > flags remains a bitfield describing the reason for the memory fault:
> > in the two places
> > this series generates this fault, it sets a bit in flags. Userspace is
> > meant to check whether
> > a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
> > KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.
>
> Before sending a new version, let's bottom out on whether or not a
> KVM_MEMORY_FAULT_EXIT_REASON_ABSENT flag is necessary.  I'm not dead set against
> a flag, but as I called out earlier[*], it can have false positives.  I.e. userspace
> needs to be able to suss out the real problem anyways.  And if userspace needs to
> be that smart, what's the point of the flag?
>
> [*] https://lore.kernel.org/all/Y+%2FkgMxQPOswAz%2F2@google.com

My understanding of your previous message was off: I didn't realize
that fast gup would also fail for present mappings where the
read/write permission was incorrect. So I'd agree that
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT should be dropped: best not to
mislead with false positives.

> >
> > (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> > indicates whether this flag is supported.
>
> The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
> KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
> guest-memory -EFAULTS to KVM_CAP_MEMORY_FAULT_EXIT.  All in quotes because I would
> likely be ok with a partial conversion for the initial implementation if there
> are paths that would require an absurd amount of work to convert.

I'm actually not sure where all of the sources of guest-memory
-EFAULTs are or how I'd go about finding them. Is the standalone
behavior of KVM_CAP_MEMORY_FAULT_EXIT something you're suggesting
because that new name is too broad for what it does right now? If so
then I'd rather just rename it again: but if that functionality should
be included with this series, then I'll take a look at the required
work given a couple of pointers :)

I will say, a partial implementation seems worse than no
implementation: isn't there a risk that someone ends up depending on
the incomplete behavior?
Sean Christopherson Feb. 24, 2023, 12:01 a.m. UTC | #12
On Thu, Feb 23, 2023, Anish Moorthy wrote:
> On Thu, Feb 23, 2023 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > > (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> > > indicates whether this flag is supported.
> >
> > The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
> > KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
> > guest-memory -EFAULTS to KVM_CAP_MEMORY_FAULT_EXIT.  All in quotes because I would
> > likely be ok with a partial conversion for the initial implementation if there
> > are paths that would require an absurd amount of work to convert.
> 
> I'm actually not sure where all of the sources of guest-memory -EFAULTs are
> or how I'd go about finding them.

Heh, if anyone can says they can list _all_ sources of KVM accesses to guest
memory of the top of their head, they're lying.

Finding the sources is a bit tedious, but shouldn't be too hard.  The scope is
is -EFAULT in the context of KVM_RUN that gets returned to userspace.  There are
150+ references to -EFAULT to wade through, but the vast majority are obviously
not in scope, e.g. are for uaccess failures in other ioctls().

> Is the standalone behavior of KVM_CAP_MEMORY_FAULT_EXIT something you're
> suggesting because that new name is too broad for what it does right now?

No, I want a standalone thing because I want to start driving toward KVM never
returning -EFAULT to host userspace for accesses to guest memory in the context
of ioctl(KVM_RUN).  E.g. I want to replace the "return -EFAULT" in
kvm_handle_error_pfn() with something like "return kvm_handle_memory_fault_exit(...)".

My hope/goal is that return useful information will allow userspace to do more
interesting things with guest memory without needing invasive, one-off changes
to KVM.  At the very least, it should get us to the point where a memory fault
from KVM_RUN exits to userspace with sane, helpful information, not a generic
"-EFAULT, have fun debugging!".

> If so then I'd rather just rename it again: but if that functionality should
> be included with this series, then I'll take a look at the required work
> given a couple of pointers :)
> 
> I will say, a partial implementation seems worse than no
> implementation: isn't there a risk that someone ends up depending on
> the incomplete behavior?

In this case, I don't think so.  We definitely would need to document that KVM
may still return -EFAULT in certain scenarios, but we have at least one known
use case (this one) where catching the common cases is sufficient.  And if/when
use cases come along that need 100% accuracy, we can hunt down and fix the KVM
remaining "bugs".
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9807b05a1b571..4b06e60668686 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5937,6 +5937,18 @@  delivery must be provided via the "reg_aen" struct.
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.
 
+4.137 KVM_SET_MEM_FAULT_NOWAIT
+------------------------------
+
+:Capability: KVM_CAP_MEM_FAULT_NOWAIT
+:Architectures: x86, arm64
+:Type: vm ioctl
+:Parameters: bool state (in)
+:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
+
+Enables (state=true) or disables (state=false) waitless memory faults. For more
+information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
+
 5. The kvm_run structure
 ========================
 
@@ -6544,6 +6556,21 @@  array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+which userspace may choose to handle.
+
+'gpa' and 'size' indicate the memory range the error occurs at. Userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7577,6 +7604,21 @@  This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEM_FAULT_NOWAIT
+-----------------------------
+
+:Architectures: x86, arm64
+:Target: VM
+:Parameters: None
+:Returns: 0 on success, or -EINVAL if capability is not supported.
+
+The presence of this capability indicates that userspace can enable/disable
+waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
+
+When waitless memory faults are enabled, fast get_user_pages failures when
+handling EPT/Shadow Page Table violations will cause a vCPU exit
+(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 109b18e2789c4..9352e7f8480fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -801,6 +801,9 @@  struct kvm {
 	bool vm_bugged;
 	bool vm_dead;
 
+	rwlock_t mem_fault_nowait_lock;
+	bool mem_fault_nowait;
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
@@ -2278,4 +2281,14 @@  static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline bool memory_faults_enabled(struct kvm *kvm)
+{
+	bool ret;
+
+	read_lock(&kvm->mem_fault_nowait_lock);
+	ret = kvm->mem_fault_nowait;
+	read_unlock(&kvm->mem_fault_nowait_lock);
+	return ret;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 55155e262646e..064fbfed97f01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1175,6 +1182,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
+#define KVM_CAP_MEM_FAULT_NOWAIT 226
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1658,7 +1666,7 @@  struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
-/* Available with  KVM_CAP_S390_VCPU_RESETS */
+/* Available with KVM_CAP_S390_VCPU_RESETS */
 #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
 #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
 
@@ -2228,4 +2236,7 @@  struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEM_FAULT_NOWAIT */
+#define KVM_SET_MEM_FAULT_NOWAIT _IOWR(KVMIO, 0xd2, bool)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 20522d4ba1e0d..5d9e3f48a9634 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dae5f48151032..8e5bfc00d1181 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1149,6 +1149,9 @@  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	INIT_LIST_HEAD(&kvm->devices);
 	kvm->max_vcpus = KVM_MAX_VCPUS;
 
+	rwlock_init(&kvm->mem_fault_nowait_lock);
+	kvm->mem_fault_nowait = false;
+
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
 	/*
@@ -2313,6 +2316,16 @@  static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+static int kvm_vm_ioctl_set_mem_fault_nowait(struct kvm *kvm, bool state)
+{
+	if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MEM_FAULT_NOWAIT))
+		return -1;
+	write_lock(&kvm->mem_fault_nowait_lock);
+	kvm->mem_fault_nowait = state;
+	write_unlock(&kvm->mem_fault_nowait_lock);
+	return 0;
+}
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4675,6 +4688,10 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_MEM_FAULT_NOWAIT:
+		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
+			return -EINVAL;
+		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -4892,6 +4909,15 @@  static long kvm_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_SET_MEM_FAULT_NOWAIT: {
+		bool state;
+
+		r = -EFAULT;
+		if (copy_from_user(&state, argp, sizeof(state)))
+			goto out;
+		r = kvm_vm_ioctl_set_mem_fault_nowait(kvm, state);
+		break;
+	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;