diff mbox series

[v7,06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings

Message ID 20240215235405.368539-7-amoorthy@google.com (mailing list archive)
State New, archived
Series Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults

Commit Message

Anish Moorthy Feb. 15, 2024, 11:53 p.m. UTC
Allowing KVM to fault in pages during vcpu-context guest memory accesses
can be undesirable: during userfaultfd-based postcopy, it can cause
significant performance issues due to vCPUs contending for
userfaultfd-internal locks.

Add a new memslot flag (KVM_MEM_EXIT_ON_MISSING) through which userspace
can indicate that KVM_RUN should exit instead of faulting in pages
during vcpu-context guest memory accesses. The unfaulted pages are
reported via the accompanying KVM_EXIT_MEMORY_FAULT_INFO, allowing
userspace to determine the cause of the fault and take appropriate
action.

The basic implementation strategy is to check the memslot flag from
within __gfn_to_pfn_memslot() and override the caller-provided arguments
accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
out of this behavior, and do so by passing can_exit_on_missing=false.

No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING or
passes can_exit_on_missing=true to __gfn_to_pfn_memslot().

Suggested-by: James Houghton <jthoughton@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst         | 23 +++++++++++++++++-
 arch/arm64/kvm/mmu.c                   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c                 |  4 ++--
 include/linux/kvm_host.h               | 12 +++++++++-
 include/uapi/linux/kvm.h               |  2 ++
 virt/kvm/Kconfig                       |  3 +++
 virt/kvm/kvm_main.c                    | 32 ++++++++++++++++++++++----
 9 files changed, 70 insertions(+), 12 deletions(-)

Comments

Sean Christopherson March 8, 2024, 10:07 p.m. UTC | #1
On Thu, Feb 15, 2024, Anish Moorthy wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f5d45c49e36..bf7bc21d56ac 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
>    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)

David M.,

Before this gets queued anywhere, a few questions related to the generic KVM
userfault stuff you're working on:

  1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
     should exit to userspace, even for guest_memfd?  Or are you envisioning the
     "data invalid" gfn attribute as being a superset?

     We danced very close to this topic in the PUCK call, but I don't _think_ we
     ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
     effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
     flag.

     I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
     but after re-watching parts of the PUCK recording, e.g. about decoupling
     KVM from userspace page tables, I suspect past me was wrong.

  2. What is your best guess as to when KVM userfault patches will be available,
     even if only in RFC form?

The reason I ask is because Oliver pointed out (off-list) that (a) Google is the
primary user for KVM_MEM_EXIT_ON_MISSING, possibly the _only_ user for the
foreseeable future, and (b) if Google moves on to KVM userfault before ever
ingesting KVM_MEM_EXIT_ON_MISSING from upstream, then we'll have effectively
added dead code to KVM's eternal ABI.
David Matlack March 9, 2024, 12:46 a.m. UTC | #2
On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 9f5d45c49e36..bf7bc21d56ac 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> 
> David M.,
> 
> Before this gets queued anywhere, a few questions related to the generic KVM
> userfault stuff you're working on:
> 
>   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
>      should exit to userspace, even for guest_memfd?  Or are you envisioning the
>      "data invalid" gfn attribute as being a superset?
> 
>      We danced very close to this topic in the PUCK call, but I don't _think_ we
>      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
>      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
>      flag.
> 
>      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
>      but after re-watching parts of the PUCK recording, e.g. about decoupling
>      KVM from userspace page tables, I suspect past me was wrong.

No, I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.

The plan is to introduce a new gfn attribute and exit to userspace based
on that. I do forsee having an on/off switch for the new attribute, but
it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.

> 
>   2. What is your best guess as to when KVM userfault patches will be available,
>      even if only in RFC form?

We're aiming for the end of April for RFC with KVM/ARM support.

> 
> The reason I ask is because Oliver pointed out (off-list) that (a) Google is the
> primary user for KVM_MEM_EXIT_ON_MISSING, possibly the _only_ user for the
> forseeable future, and (b) if Google moves on to KVM userfault before ever
> ingesting KVM_MEM_EXIT_ON_MISSING from upstream, then we'll have effectively
> added dead code to KVM's eternal ABI.
Oliver Upton March 11, 2024, 4:45 a.m. UTC | #3
Hey,

Thanks Sean for bringing this up on the list, didn't have time for a lot
of upstream stuffs :)

On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >    #define KVM_MEM_READONLY	(1UL << 1)
> > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > 
> > David M.,
> > 
> > Before this gets queued anywhere, a few questions related to the generic KVM
> > userfault stuff you're working on:
> > 
> >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> >      "data invalid" gfn attribute as being a superset?
> > 
> >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> >      flag.
> > 
> >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> >      KVM from userspace page tables, I suspect past me was wrong.
> 
> No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> 
> The plan is to introduce a new gfn attribute and exit to userspace based
> on that. I do forsee having an on/off switch for the new attribute, but
> it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.

With that in mind, unless someone else has a usecase for the
KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
take this bit of the series upstream. The "memory fault" UAPI should
still be useful when the KVM userfault stuff comes along.

Anish, apologies, you must have whiplash from all the bikeshedding,
nitpicking, and other fun you've been put through on this series. Thanks
for being patient.

> > 
> >   2. What is your best guess as to when KVM userfault patches will be available,
> >      even if only in RFC form?
> 
> We're aiming for the end of April for RFC with KVM/ARM support.

Just to make sure everyone is read in on what this entails -- is this
the implementation that only worries about vCPUs touching non-present
memory, leaving the question of other UAPIs that consume guest memory
(e.g. GIC/ITS table save/restore) up for further discussion?
David Matlack March 11, 2024, 4:20 p.m. UTC | #4
On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > >   2. What is your best guess as to when KVM userfault patches will be available,
> > >      even if only in RFC form?
> >
> > We're aiming for the end of April for RFC with KVM/ARM support.
>
> Just to make sure everyone is read in on what this entails -- is this
> the implementation that only worries about vCPUs touching non-present
> memory, leaving the question of other UAPIs that consume guest memory
> (e.g. GIC/ITS table save/restore) up for further discussion?

Yes. The initial version will only support returning to userspace on
invalid vCPU accesses with KVM_EXIT_MEMORY_FAULT. Non-vCPU accesses to
invalid pages (e.g. GIC/ITS table save/restore) will trigger an error
return from __gfn_to_hva_many() (which will cause the corresponding
ioctl to fail). It will be userspace's responsibility to clear the
invalid attribute before invoking those ioctls.

For x86 we may need a blocking kernel-to-userspace notification
mechanism for code paths in the emulator, but we'd like to investigate
and discuss if there are any other cleaner alternatives before going
too far down that route.
Sean Christopherson March 11, 2024, 4:36 p.m. UTC | #5
On Sun, Mar 10, 2024, Oliver Upton wrote:
> On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> > On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > > >    #define KVM_MEM_READONLY	(1UL << 1)
> > > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > > 
> > > David M.,
> > > 
> > > Before this gets queued anywhere, a few questions related to the generic KVM
> > > userfault stuff you're working on:
> > > 
> > >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> > >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> > >      "data invalid" gfn attribute as being a superset?
> > > 
> > >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> > >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> > >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> > >      flag.
> > > 
> > >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> > >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> > >      KVM from userspace page tables, I suspect past me was wrong.
> > 
> > No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> > 
> > The plan is to introduce a new gfn attribute and exit to userspace based
> > on that. I do forsee having an on/off switch for the new attribute, but
> > it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.
> 
> With that in mind, unless someone else has a usecase for the
> KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
> take this bit of the series upstream. The "memory fault" UAPI should
> still be useful when the KVM userfault stuff comes along.

+1

Though I'll go a step further and say that even if someone does have a use case,
we should still wait.  The imminent collision with David Stevens' kvm_follow_pfn()
series[*] is going to be a painful rebase no matter what, and once that's out of
the way, rebasing this series onto future kernels shouldn't be crazy difficult.

In other words, _if_ it turns out there's value in KVM_MEM_EXIT_ON_MISSING even
with David M's work, the cost of waiting another cycle (or two) is relatively
small.

Oh, and I'll plan on grabbing patches 1-4 for 6.10.

[*] https://lore.kernel.org/all/20240229025759.1187910-1-stevensd@google.com
Anish Moorthy March 11, 2024, 5:08 p.m. UTC | #6
On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hey,
>
> Thanks Sean for bringing this up on the list, didn't have time for a lot
> of upstream stuffs :)
>
> On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> > On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > > >    #define KVM_MEM_READONLY       (1UL << 1)
> > > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > >
> > > David M.,
> > >
> > > Before this gets queued anywhere, a few questions related to the generic KVM
> > > userfault stuff you're working on:
> > >
> > >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> > >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> > >      "data invalid" gfn attribute as being a superset?
> > >
> > >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> > >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> > >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> > >      flag.
> > >
> > >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> > >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> > >      KVM from userspace page tables, I suspect past me was wrong.
> >
> > No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> >
> > The plan is to introduce a new gfn attribute and exit to userspace based
> > on that. I do forsee having an on/off switch for the new attribute, but
> > it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.
>
> With that in mind, unless someone else has a usecase for the
> KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
> take this bit of the series upstream. The "memory fault" UAPI should
> still be useful when the KVM userfault stuff comes along.
>
> Anish, apologies, you must have whiplash from all the bikeshedding,
> nitpicking, and other fun you've been put through on this series. Thanks
> for being patient.

No worries -- I got a lot of patient (and much-needed) review as well
:). And I understand not wanting to add an eternal feature when
something better is coming down the line.

On Mon, Mar 11, 2024 at 9:36 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Oh, and I'll plan on grabbing patches 1-4 for 6.10.

I think patches 10/11/12 are useful changes to the selftest that make
sense to merge even with KVM_MEM_EXIT_ON_MISSING being mothballed;
they should rebase without any issues. And the annotations on the
stage-2 fault handlers seem like they should still be added, but I
suppose David can do that with his series.
Oliver Upton March 11, 2024, 9:21 p.m. UTC | #7
On Mon, Mar 11, 2024 at 10:08:56AM -0700, Anish Moorthy wrote:
> I think patches 10/11/12 are useful changes to the selftest that make
> sense to merge even with KVM_MEM_EXIT_ON_MISSING being mothballed-
> they should rebase without any issues. And the annotations on the
> stage-2 fault handlers seem like they should still be added, but I
> suppose David can do that with his series.

Yeah, let's fold the vCPU exit portions of the UAPI into the overall KVM
userfault series. In that case there is sufficient context at the time
of the "memory fault" to generate a 'precise' fault context (this GFN
failed since it isn't marked as present).

Compare that to the current implementation, which actually annotates
_any_ __gfn_to_pfn_memslot() failures on the way out to userspace. I
haven't seen anyone saying their userspace wants to use this, and I'd
rather not take a new feature without a user, even if it is comparatively
trivial.
Nikita Kalyazin July 3, 2024, 5:34 p.m. UTC | #8
Hi David,

On 11/03/2024 16:20, David Matlack wrote:
> On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>>>>
>>>>    2. What is your best guess as to when KVM userfault patches will be available,
>>>>       even if only in RFC form?
>>>
>>> We're aiming for the end of April for RFC with KVM/ARM support.
>>
>> Just to make sure everyone is read in on what this entails -- is this
>> the implementation that only worries about vCPUs touching non-present
>> memory, leaving the question of other UAPIs that consume guest memory
>> (e.g. GIC/ITS table save/restore) up for further discussion?
> 
> Yes. The initial version will only support returning to userspace on
> invalid vCPU accesses with KVM_EXIT_MEMORY_FAULT. Non-vCPU accesses to
> invalid pages (e.g. GIC/ITS table save/restore) will trigger an error
> return from __gfn_to_hva_many() (which will cause the corresponding
> ioctl to fail). It will be userspace's responsibility to clear the
> invalid attribute before invoking those ioctls.
> 
> For x86 we may need an blocking kernel-to-userspace notification
> mechanism for code paths in the emulator, but we'd like to investigate
> and discuss if there are any other cleaner alternatives before going
> too far down that route.

I wasn't able to locate any follow-ups on the LKML about this topic.
May I know if you are still working on or planning to work on this?

Thanks,
Nikita
David Matlack July 3, 2024, 8:11 p.m. UTC | #9
On Wed, Jul 3, 2024 at 10:35 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> Hi David,
>
> On 11/03/2024 16:20, David Matlack wrote:
> > On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >>>>
> >>>>    2. What is your best guess as to when KVM userfault patches will be available,
> >>>>       even if only in RFC form?
> >>>
> >>> We're aiming for the end of April for RFC with KVM/ARM support.
> >>
> >> Just to make sure everyone is read in on what this entails -- is this
> >> the implementation that only worries about vCPUs touching non-present
> >> memory, leaving the question of other UAPIs that consume guest memory
> >> (e.g. GIC/ITS table save/restore) up for further discussion?
> >
> > Yes. The initial version will only support returning to userspace on
> > invalid vCPU accesses with KVM_EXIT_MEMORY_FAULT. Non-vCPU accesses to
> > invalid pages (e.g. GIC/ITS table save/restore) will trigger an error
> > return from __gfn_to_hva_many() (which will cause the corresponding
> > ioctl to fail). It will be userspace's responsibility to clear the
> > invalid attribute before invoking those ioctls.
> >
> > For x86 we may need an blocking kernel-to-userspace notification
> > mechanism for code paths in the emulator, but we'd like to investigate
> > and discuss if there are any other cleaner alternatives before going
> > too far down that route.
>
> I wasn't able to locate any follow-ups on the LKML about this topic.
> May I know if you are still working on or planning to work on this?

Yes, James Houghton at Google has been working on this. We decided to
build a more complete RFC (with x86 and ARM support) so that reviewers
can get an idea of the full scope of the feature, which has taken a
bit longer than originally planned. But the RFC is code complete now.
I think James is planning to send the patches next week.
Nikita Kalyazin July 4, 2024, 10:10 a.m. UTC | #10
On 03/07/2024 21:11, David Matlack wrote:
> Yes, James Houghton at Google has been working on this. We decided to
> build a more complete RFC (with x86 and ARM) support, so that
> reviewers can get an idea of the full scope of the feature, so it has
> taken a bit longer than originally planned. But the RFC is code
> complete now. I think James is planning to send the patches next week.

Great to hear, looking forward to seeing it!

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f5d45c49e36..bf7bc21d56ac 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1353,6 +1353,7 @@  yet and must be cleared on entry.
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
+  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1383,7 +1384,7 @@  It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports three flags
+The flags field supports four flags
 
 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
@@ -1393,6 +1394,7 @@  to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
 incompatible with KVM_SET_USER_MEMORY_REGION.
+4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -1408,6 +1410,9 @@  Instead, an abort (data abort if the cause of the page-table update
 was a load or a store, instruction abort if it was an instruction
 fetch) is injected in the guest.
 
+Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
+exclusive.
+
 4.36 KVM_SET_TSS_ADDR
 ---------------------
 
@@ -8044,6 +8049,22 @@  error/annotated fault.
 
 See KVM_EXIT_MEMORY_FAULT for more information.
 
+7.35 KVM_CAP_EXIT_ON_MISSING
+----------------------------
+
+:Architectures: None
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that userspace may set the
+KVM_MEM_EXIT_ON_MISSING flag on memslots. Said flag will cause KVM_RUN to fail
+(-EFAULT) in response to guest-context memory accesses which would require KVM
+to page fault on the userspace mapping.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO. Userspace should take appropriate action.
+This could mean, for instance, checking that the fault is resolvable, faulting
+in the relevant userspace mapping, then retrying KVM_RUN.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d14504821b79..dfe0cbb5937c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1487,7 +1487,7 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmap_read_unlock(current->mm);
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+				   write_fault, &writable, false, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b1f0cdd8c18..31ebfe4fe8e1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -614,7 +614,7 @@  int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	} else {
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, &write_ok, NULL);
+					   writing, &write_ok, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 4a1abb9f7c05..03b0f1c4a0d8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -853,7 +853,7 @@  int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, upgrade_p, NULL);
+					   writing, upgrade_p, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..b89a9518f6de 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4371,7 +4371,7 @@  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4393,7 +4393,7 @@  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 32cbe5c3a9d1..210e07c4c2eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1216,7 +1216,8 @@  kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva);
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
@@ -2394,4 +2395,13 @@  static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+/*
+ * Whether vCPUs should exit upon trying to access memory for which the
+ * userspace mappings are missing.
+ */
+static inline bool kvm_is_slot_exit_on_missing(const struct kvm_memory_slot *slot)
+{
+	return slot && slot->flags & KVM_MEM_EXIT_ON_MISSING;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 36a51b162a71..e9f33ae93dee 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@  struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_EXIT_ON_MISSING	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -920,6 +921,7 @@  struct kvm_enable_cap {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_EXIT_ON_MISSING 236
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 29b73eedfe74..c7bdde127af4 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,6 @@  config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config HAVE_KVM_EXIT_ON_MISSING
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67ca580a18c5..469b99898be8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1600,7 +1600,7 @@  static void kvm_replace_memslot(struct kvm *kvm,
  * only allows these.
  */
 #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
-	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | KVM_MEM_EXIT_ON_MISSING)
 
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
@@ -1618,8 +1618,14 @@  static int check_memory_region_flags(struct kvm *kvm,
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING))
+		valid_flags |= KVM_MEM_EXIT_ON_MISSING;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
+	else if ((mem->flags & KVM_MEM_READONLY) &&
+		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
+		return -EINVAL;
 
 	return 0;
 }
@@ -3024,7 +3030,8 @@  kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -3047,6 +3054,19 @@  kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		writable = NULL;
 	}
 
+	/* When the slot is exit-on-missing (and when we should respect that)
+	 * set atomic=true to prevent GUP from faulting in the userspace
+	 * mappings.
+	 */
+	if (!atomic && can_exit_on_missing &&
+	    kvm_is_slot_exit_on_missing(slot)) {
+		atomic = true;
+		if (async) {
+			*async = false;
+			async = NULL;
+		}
+	}
+
 	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
 			  writable);
 }
@@ -3056,21 +3076,21 @@  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+				    NULL, write_fault, writable, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
@@ -4877,6 +4897,8 @@  static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
+	case KVM_CAP_EXIT_ON_MISSING:
+		return IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING);
 	default:
 		break;
 	}