diff mbox series

[RFC,14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT

Message ID 20240710234222.2333120-15-jthoughton@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: Post-copy live migration for guest_memfd | expand

Commit Message

James Houghton July 10, 2024, 11:42 p.m. UTC
It is possible that KVM wants to access a userfault-enabled GFN in a
path where it is difficult to return out to userspace with the fault
information. For these cases, add a mechanism for KVM to wait for a GFN
to not be userfault-enabled.

The mechanism introduced in this patch uses an eventfd to signal that a
userfault is ready to be read. Userspace then reads the userfault with
KVM_READ_USERFAULT. The fault itself is stored in a list, and KVM will
busy-wait for the gfn to not be userfault-enabled.

The implementation of this mechanism is certain to change before KVM
Userfault could possibly be merged. The main open questions are whether
this kind of asynchronous userfault system is required at all and
whether the UAPI for reading faults is workable.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/kvm_host.h |  7 +++
 include/uapi/linux/kvm.h |  7 +++
 virt/kvm/kvm_main.c      | 92 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 104 insertions(+), 2 deletions(-)

Comments

David Matlack July 11, 2024, 11:52 p.m. UTC | #1
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
>
> +       case KVM_READ_USERFAULT: {
> +               struct kvm_fault fault;
> +               gfn_t gfn;
> +
> +               r = kvm_vm_ioctl_read_userfault(kvm, &gfn);
> +               if (r)
> +                       goto out;
> +
> +               fault.address = gfn;
> +
> +               /* TODO: if this fails, this gfn is lost. */
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &fault, sizeof(fault)))

You could do the copy under the spin_lock() with
copy_to_user_nofault() to avoid losing gfn.
Nikita Kalyazin July 26, 2024, 4:50 p.m. UTC | #2
Hi James,

On 11/07/2024 00:42, James Houghton wrote:
> It is possible that KVM wants to access a userfault-enabled GFN in a
> path where it is difficult to return out to userspace with the fault
> information. For these cases, add a mechanism for KVM to wait for a GFN
> to not be userfault-enabled.
In this patch series, an asynchronous notification mechanism is used 
only in cases "where it is difficult to return out to userspace with the 
fault information". However, we (AWS) have a use case where we would 
like to be notified asynchronously about _all_ faults. Firecracker can 
restore a VM from a memory snapshot where the guest memory is supplied 
via a Userfaultfd by a process separate from the VMM itself [1]. While 
it looks technically possible for the VMM process to handle exits via 
forwarding the faults to the other process, that would require building 
a complex userspace protocol on top and likely introduce extra latency 
on the critical path. This also implies that a KVM API 
(KVM_READ_USERFAULT) is not suitable, because KVM checks that the ioctls 
are performed specifically by the VMM process [2]:
	if (kvm->mm != current->mm || kvm->vm_dead)
		return -EIO;

> The implementation of this mechanism is certain to change before KVM
> Userfault could possibly be merged.
How do you envision resolving faults in userspace? Copying the page in 
(provided that userspace mapping of guest_memfd is supported [3]) and 
clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look 
sufficient to resolve the fault because an attempt to copy the page 
directly in userspace will trigger a fault on its own and may lead to a 
deadlock in the case where the original fault was caused by the VMM. An 
interface similar to UFFDIO_COPY is needed that would allocate a page, 
copy the content in and update page tables.

[1] Firecracker snapshot restore via UserfaultFD: 
https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
[2] KVM ioctl check for the address space: 
https://elixir.bootlin.com/linux/v6.10.1/source/virt/kvm/kvm_main.c#L5083
[3] mmap() of guest_memfd: 
https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4

Thanks,
Nikita
James Houghton July 26, 2024, 6 p.m. UTC | #3
On Fri, Jul 26, 2024 at 9:50 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> Hi James,
>
> On 11/07/2024 00:42, James Houghton wrote:
> > It is possible that KVM wants to access a userfault-enabled GFN in a
> > path where it is difficult to return out to userspace with the fault
> > information. For these cases, add a mechanism for KVM to wait for a GFN
> > to not be userfault-enabled.
> In this patch series, an asynchronous notification mechanism is used
> only in cases "where it is difficult to return out to userspace with the
> fault information". However, we (AWS) have a use case where we would
> like to be notified asynchronously about _all_ faults. Firecracker can
> restore a VM from a memory snapshot where the guest memory is supplied
> via a Userfaultfd by a process separate from the VMM itself [1]. While
> it looks technically possible for the VMM process to handle exits via
> forwarding the faults to the other process, that would require building
> a complex userspace protocol on top and likely introduce extra latency
> on the critical path.
> This also implies that a KVM API
> (KVM_READ_USERFAULT) is not suitable, because KVM checks that the ioctls
> are performed specifically by the VMM process [2]:
>         if (kvm->mm != current->mm || kvm->vm_dead)
>                 return -EIO;

If it would be useful, we could absolutely have a flag to have all
faults go through the asynchronous mechanism. :) It's meant to just be
an optimization. For me, it is a necessary optimization.

Userfaultfd doesn't scale particularly well: we have to grab two locks
to work with the wait_queues. You could create several userfaultfds,
but the underlying issue is still there. KVM Userfault, if it uses a
wait_queue for the async fault mechanism, will have the same
bottleneck. Anish and I worked on making userfaults more scalable for
KVM[1], and we ended up with a scheme very similar to what we have in
this KVM Userfault series.

My use case already requires using a reasonably complex API for
interacting with a separate userland process for fetching memory, and
it's really fast. I've never tried to hook userfaultfd into this other
process, but I'm quite certain that [1] + this process's interface
scale better than userfaultfd does. Perhaps userfaultfd, for
not-so-scaled-up cases, could be *slightly* faster, but I mostly care
about what happens when we scale to hundreds of vCPUs.

[1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/

>
> > The implementation of this mechanism is certain to change before KVM
> > Userfault could possibly be merged.
> How do you envision resolving faults in userspace? Copying the page in
> (provided that userspace mapping of guest_memfd is supported [3]) and
> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> sufficient to resolve the fault because an attempt to copy the page
> directly in userspace will trigger a fault on its own

This is not true for KVM Userfault, at least for right now. Userspace
accesses to guest memory will not trigger KVM Userfaults. (I know this
name is terrible -- regular old userfaultfd() userfaults will indeed
get triggered, provided you've set things up properly.)

KVM Userfault is merely meant to catch KVM's own accesses to guest
memory (including vCPU accesses). For non-guest_memfd memslots,
userspace can totally just write through the VMA it has made (KVM
Userfault *cannot*, by virtue of being completely divorced from mm,
intercept this access). For guest_memfd, userspace could write to
guest memory through a VMA if that's where guest_memfd is headed, but
perhaps it will rely on exact details of how userspace is meant to
populate guest_memfd memory.

You're totally right that, in essence, we will need some kind of
non-faulting way to interact with guest memory. With traditional
memslots and VMAs, we have that already; guest_memfd memslots and
VMAs, I think we will have that eventually.

> and may lead to a
> deadlock in the case where the original fault was caused by the VMM. An
> interface similar to UFFDIO_COPY is needed that would allocate a page,
> copy the content in and update page tables.

In case it's interesting or useful at all, we actually use
UFFDIO_CONTINUE for our live migration use case. We mmap() memory
twice -- one of them we register with userfaultfd and also give to
KVM. The other one we use to install memory -- our non-faulting view
of guest memory!

>
> [1] Firecracker snapshot restore via UserfaultFD:
> https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2] KVM ioctl check for the address space:
> https://elixir.bootlin.com/linux/v6.10.1/source/virt/kvm/kvm_main.c#L5083
> [3] mmap() of guest_memfd:
> https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
>
> Thanks,
> Nikita

Thanks for the feedback!
Nikita Kalyazin July 29, 2024, 5:17 p.m. UTC | #4
On 26/07/2024 19:00, James Houghton wrote:
> If it would be useful, we could absolutely have a flag to have all
> faults go through the asynchronous mechanism. :) It's meant to just be
> an optimization. For me, it is a necessary optimization.
> 
> Userfaultfd doesn't scale particularly well: we have to grab two locks
> to work with the wait_queues. You could create several userfaultfds,
> but the underlying issue is still there. KVM Userfault, if it uses a
> wait_queue for the async fault mechanism, will have the same
> bottleneck. Anish and I worked on making userfaults more scalable for
> KVM[1], and we ended up with a scheme very similar to what we have in
> this KVM Userfault series.
Yes, I see your motivation. Does this approach support async pagefaults 
[1]? I.e., would all the guest processes on the vCPU need to stall until 
a fault is resolved, or is there a way to let the vCPU run and only 
block the faulting process?

A more general question is, it looks like Userfaultfd's main purpose was 
to support the postcopy use case [2], yet it fails to do that 
efficiently for large VMs. Would it be ideologically better to try to 
improve Userfaultfd's performance (similar to how it was attempted in 
[3]) or is that something you have already looked into and reached a 
dead end as a part of [4]?

[1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
[2] https://lwn.net/Articles/636226/
[3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
[4] 
https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

> My use case already requires using a reasonably complex API for
> interacting with a separate userland process for fetching memory, and
> it's really fast. I've never tried to hook userfaultfd into this other
> process, but I'm quite certain that [1] + this process's interface
> scale better than userfaultfd does. Perhaps userfaultfd, for
> not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> about what happens when we scale to hundreds of vCPUs.
> 
> [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
Do I understand it right that in your setup, when an EPT violation occurs,
  - VMM shares the fault information with the other process via a 
userspace protocol
  - the process fetches the memory, installs it (?) and notifies VMM
  - VMM calls KVM run to resume execution
?
Would you be ok to share an outline of the API you mentioned?

>> How do you envision resolving faults in userspace? Copying the page in
>> (provided that userspace mapping of guest_memfd is supported [3]) and
>> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
>> sufficient to resolve the fault because an attempt to copy the page
>> directly in userspace will trigger a fault on its own
> 
> This is not true for KVM Userfault, at least for right now. Userspace
> accesses to guest memory will not trigger KVM Userfaults. (I know this
> name is terrible -- regular old userfaultfd() userfaults will indeed
> get triggered, provided you've set things up properly.)
> 
> KVM Userfault is merely meant to catch KVM's own accesses to guest
> memory (including vCPU accesses). For non-guest_memfd memslots,
> userspace can totally just write through the VMA it has made (KVM
> Userfault *cannot*, by virtue of being completely divorced from mm,
> intercept this access). For guest_memfd, userspace could write to
> guest memory through a VMA if that's where guest_memfd is headed, but
> perhaps it will rely on exact details of how userspace is meant to
> populate guest_memfd memory.
True, it isn't the case right now. I think I fast-forwarded to a state 
where notifications about VMM-triggered faults to the guest_memfd are 
also sent asynchronously.

> In case it's interesting or useful at all, we actually use
> UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> twice -- one of them we register with userfaultfd and also give to
> KVM. The other one we use to install memory -- our non-faulting view
> of guest memory!
That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy 
(vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the 
same process? What benefits does it bring?
James Houghton July 29, 2024, 9:09 p.m. UTC | #5
On Mon, Jul 29, 2024 at 10:17 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> On 26/07/2024 19:00, James Houghton wrote:
> > If it would be useful, we could absolutely have a flag to have all
> > faults go through the asynchronous mechanism. :) It's meant to just be
> > an optimization. For me, it is a necessary optimization.
> >
> > Userfaultfd doesn't scale particularly well: we have to grab two locks
> > to work with the wait_queues. You could create several userfaultfds,
> > but the underlying issue is still there. KVM Userfault, if it uses a
> > wait_queue for the async fault mechanism, will have the same
> > bottleneck. Anish and I worked on making userfaults more scalable for
> > KVM[1], and we ended up with a scheme very similar to what we have in
> > this KVM Userfault series.
> Yes, I see your motivation. Does this approach support async pagefaults
> [1]? I.e., would all the guest processes on the vCPU need to stall until
> a fault is resolved, or is there a way to let the vCPU run and only
> block the faulting process?

As implemented, it didn't hook into the async page faults stuff. I
think it's technically possible to do that, but we didn't explore it.

> A more general question is, it looks like Userfaultfd's main purpose was
> to support the postcopy use case [2], yet it fails to do that
> efficiently for large VMs. Would it be ideologically better to try to
> improve Userfaultfd's performance (similar to how it was attempted in
> [3]) or is that something you have already looked into and reached a
> dead end as a part of [4]?

My end goal with [4] was to take contention out of the vCPU +
userfault path completely (so, if we are taking a lock exclusively, we
are the only one taking it). I came to the conclusion that the way to
do this that made the most sense was Anish's memory fault exits idea.
I think it's possible to make userfaults scale better themselves, but
it's much more challenging than the memory fault exits approach for
KVM (and I don't have a good way to do it in mind).

> [1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
> [2] https://lwn.net/Articles/636226/
> [3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
> [4]
> https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
>
> > My use case already requires using a reasonably complex API for
> > interacting with a separate userland process for fetching memory, and
> > it's really fast. I've never tried to hook userfaultfd into this other
> > process, but I'm quite certain that [1] + this process's interface
> > scale better than userfaultfd does. Perhaps userfaultfd, for
> > not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> > about what happens when we scale to hundreds of vCPUs.
> >
> > [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/
> Do I understand it right that in your setup, when an EPT violation occurs,
>   - VMM shares the fault information with the other process via a
> userspace protocol
>   - the process fetches the memory, installs it (?) and notifies VMM
>   - VMM calls KVM run to resume execution
> ?

That's right.

> Would you be ok to share an outline of the API you mentioned?

I can share some information. The source (remote) and target (local)
VMMs register guest memory (shared memory) with this network worker
process. On the target during post-copy, the gfn of a fault is
converted into its corresponding local and remote offsets. The API for
then fetching the memory is basically something like
CopyFromRemote(remote_offset, local_offset, length), and the
communication with the process to handle this command is done just
with shared memory. After memory is copied, the faulting thread does a
UFFDIO_CONTINUE (with MODE_DONTWAKE) to map the page, and then we
KVM_RUN to resume. This will make more sense with the description of
UFFDIO_CONTINUE below.

Let me know if you'd like to know more, though I'm not intimately
familiar with all the details of this network worker process.

> >> How do you envision resolving faults in userspace? Copying the page in
> >> (provided that userspace mapping of guest_memfd is supported [3]) and
> >> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> >> sufficient to resolve the fault because an attempt to copy the page
> >> directly in userspace will trigger a fault on its own
> >
> > This is not true for KVM Userfault, at least for right now. Userspace
> > accesses to guest memory will not trigger KVM Userfaults. (I know this
> > name is terrible -- regular old userfaultfd() userfaults will indeed
> > get triggered, provided you've set things up properly.)
> >
> > KVM Userfault is merely meant to catch KVM's own accesses to guest
> > memory (including vCPU accesses). For non-guest_memfd memslots,
> > userspace can totally just write through the VMA it has made (KVM
> > Userfault *cannot*, by virtue of being completely divorced from mm,
> > intercept this access). For guest_memfd, userspace could write to
> > guest memory through a VMA if that's where guest_memfd is headed, but
> > perhaps it will rely on exact details of how userspace is meant to
> > populate guest_memfd memory.
> True, it isn't the case right now. I think I fast-forwarded to a state
> where notifications about VMM-triggered faults to the guest_memfd are
> also sent asynchronously.
>
> > In case it's interesting or useful at all, we actually use
> > UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> > twice -- one of them we register with userfaultfd and also give to
> > KVM. The other one we use to install memory -- our non-faulting view
> > of guest memory!
> That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy
> (vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the
> same process? What benefits does it bring?

The cover letter for the patch series where UFFDIO_CONTINUE was
introduced does a good job at explaining why it's useful for live
migration[5]. But I can summarize it here: when doing pre-copy, we
send many copies of memory to the target. Upon resuming on the target,
we want to get faults on the pages with stale content. It may take a
while to send the final dirty bitmap to the target, and we don't want
to leave the VM paused for that long (i.e., treat everything as
stale). When the dirty bitmap arrives, we want to be able to quickly
(like, without having to copy anything) say "stop getting faults on
these pages, they are in fact clean." Using shared memory (i.e.,
having a page cache) with UFFDIO_CONTINUE (well, really
UFFD_FEATURE_MINOR*) allows us to do this.

It also turns out that it is basically necessary if we want our
network API of choice to be able to directly write into guest memory.

[5]: https://lore.kernel.org/linux-mm/20210225002658.2021807-1-axelrasmussen@google.com/
Peter Xu Aug. 1, 2024, 10:22 p.m. UTC | #6
On Mon, Jul 29, 2024 at 02:09:16PM -0700, James Houghton wrote:
> > A more general question is, it looks like Userfaultfd's main purpose was
> > to support the postcopy use case [2], yet it fails to do that
> > efficiently for large VMs. Would it be ideologically better to try to
> > improve Userfaultfd's performance (similar to how it was attempted in
> > [3]) or is that something you have already looked into and reached a
> > dead end as a part of [4]?
> 
> My end goal with [4] was to take contention out of the vCPU +
> userfault path completely (so, if we are taking a lock exclusively, we
> are the only one taking it). I came to the conclusion that the way to
> do this that made the most sense was Anish's memory fault exits idea.
> I think it's possible to make userfaults scale better themselves, but
> it's much more challenging than the memory fault exits approach for
> KVM (and I don't have a good way to do it in mind).
> 
> > [1] https://lore.kernel.org/lkml/4AEFB823.4040607@redhat.com/T/
> > [2] https://lwn.net/Articles/636226/
> > [3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@redhat.com/
> > [4]
> > https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

Thanks for the link to [3].  Just to mention, at the time I had further
ideas for userfault-generic optimizations on top of that one, like using
more than one queue rather than a single one.  Maybe that could also help,
maybe not.

Even with that, I think it'll be less scalable than vCPU exits for
sure.. but still, I'm not yet convinced that this level of speed is
strictly necessary, because the postcopy overhead should be the page
movements themselves, IMHO.  Maybe there's a scalability issue with the
userfault locks right now, but maybe that's fixable?

I'm not sure whether I'm right, but IMHO the perf here isn't the critical
part.  Rather, it's that guest_memfd is not aligned with how userfault is
defined (which requires a mapping first, absent an fd extension).  Given
that, I think it can indeed make sense, or say, there is a choice to
implement this in KVM if that's easier.  So maybe the other things besides
the perf point matter more here.

Thanks,

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dc12d0a5498b..3b9780d85877 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -734,8 +734,15 @@  struct kvm_memslots {
 	int node_idx;
 };
 
+struct kvm_userfault_list_entry {
+	struct list_head list;
+	gfn_t gfn;
+};
+
 struct kvm_userfault_ctx {
 	struct eventfd_ctx *ev_fd;
+	spinlock_t list_lock;
+	struct list_head gfn_list;
 };
 
 struct kvm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6aa99b4587c6..8cd8e08f11e1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1554,4 +1554,11 @@  struct kvm_create_guest_memfd {
 #define KVM_USERFAULT_ENABLE		(1ULL << 0)
 #define KVM_USERFAULT_DISABLE		(1ULL << 1)
 
+struct kvm_fault {
+	__u64 address;
+	/* TODO: reserved fields */
+};
+
+#define KVM_READ_USERFAULT		_IOR(KVMIO, 0xd5, struct kvm_fault)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4ac018cac704..d2ca16ddcaa1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2678,6 +2678,43 @@  static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEM_READONLY;
 }
 
+static int read_userfault(struct kvm_userfault_ctx *ctx, gfn_t *gfn)
+{
+	struct kvm_userfault_list_entry *entry;
+
+	spin_lock(&ctx->list_lock);
+
+	entry = list_first_entry_or_null(&ctx->gfn_list,
+					 struct kvm_userfault_list_entry,
+					 list);
+	if (entry)
+		list_del(&entry->list);
+
+	spin_unlock(&ctx->list_lock);
+
+	if (!entry)
+		return -ENOENT;
+
+	*gfn = entry->gfn;
+	return 0;
+}
+
+static void signal_userfault(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_userfault_ctx *ctx =
+		srcu_dereference(kvm->userfault_ctx, &kvm->srcu);
+	struct kvm_userfault_list_entry entry;
+
+	entry.gfn = gfn;
+	INIT_LIST_HEAD(&entry.list);
+
+	spin_lock(&ctx->list_lock);
+	list_add(&entry.list, &ctx->gfn_list);
+	spin_unlock(&ctx->list_lock);
+
+	eventfd_signal(ctx->ev_fd);
+}
+
 static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
 				       gfn_t *nr_pages, bool write, bool atomic)
 {
@@ -2687,8 +2724,14 @@  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t
 	if (memslot_is_readonly(slot) && write)
 		return KVM_HVA_ERR_RO_BAD;
 
-	if (gfn_has_userfault(slot->kvm, gfn))
-		return KVM_HVA_ERR_USERFAULT;
+	if (gfn_has_userfault(slot->kvm, gfn)) {
+		if (atomic)
+			return KVM_HVA_ERR_USERFAULT;
+		signal_userfault(slot->kvm, gfn);
+		while (gfn_has_userfault(slot->kvm, gfn))
+			/* TODO: don't busy-wait */
+			cpu_relax();
+	}
 
 	if (nr_pages)
 		*nr_pages = slot->npages - (gfn - slot->base_gfn);
@@ -5009,6 +5052,10 @@  static int kvm_enable_userfault(struct kvm *kvm, int event_fd)
 	}
 
 	ret = 0;
+
+	INIT_LIST_HEAD(&userfault_ctx->gfn_list);
+	spin_lock_init(&userfault_ctx->list_lock);
+
 	userfault_ctx->ev_fd = ev_fd;
 
 	rcu_assign_pointer(kvm->userfault_ctx, userfault_ctx);
@@ -5037,6 +5084,27 @@  static int kvm_vm_ioctl_enable_userfault(struct kvm *kvm, int options,
 	else
 		return kvm_disable_userfault(kvm);
 }
+
+static int kvm_vm_ioctl_read_userfault(struct kvm *kvm, gfn_t *gfn)
+{
+	int ret;
+	int idx;
+	struct kvm_userfault_ctx *ctx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	ctx = srcu_dereference(kvm->userfault_ctx, &kvm->srcu);
+
+	ret = -ENOENT;
+	if (!ctx)
+		goto out;
+
+	ret = read_userfault(ctx, gfn);
+
+out:
+	srcu_read_unlock(&kvm->srcu, idx);
+	return ret;
+}
 #endif
 
 static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
@@ -5403,6 +5471,26 @@  static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_gmem_create(kvm, &guest_memfd);
 		break;
 	}
+#endif
+#ifdef CONFIG_KVM_USERFAULT
+	case KVM_READ_USERFAULT: {
+		struct kvm_fault fault;
+		gfn_t gfn;
+
+		r = kvm_vm_ioctl_read_userfault(kvm, &gfn);
+		if (r)
+			goto out;
+
+		fault.address = gfn;
+
+		/* TODO: if this fails, this gfn is lost. */
+		r = -EFAULT;
+		if (copy_to_user(argp, &fault, sizeof(fault)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 #endif
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);