[v1,00/13] KVM: Introduce KVM Userfault

Message ID 20241204191349.1730936-1-jthoughton@google.com

Message

James Houghton Dec. 4, 2024, 7:13 p.m. UTC
This is a continuation of the original KVM Userfault RFC[1] from July.
It contains the simplifications we talked about at LPC[2].

Please see the RFC[1] for the problem description. In summary,
guest_memfd VMs have no mechanism for doing post-copy live migration.
KVM Userfault provides such a mechanism. Today there is no upstream
mechanism for installing memory into a guest_memfd, but there will
be one soon (e.g. [3]).

There is a second problem that KVM Userfault solves: userfaultfd-based
post-copy doesn't scale very well. KVM Userfault, when used with
userfaultfd, can scale much better in the common case where most
post-copy demand fetches are the result of vCPU access violations. This
is a continuation of the solution Anish was working on[4]. This aspect
of KVM Userfault is important for userfaultfd-based live migration when
scaling up to hundreds of vCPUs with ~30us network latency for a
PAGE_SIZE demand-fetch.

The implementation in this series differs from the RFC[1]. It adds:
 1. a new memslot flag: KVM_MEM_USERFAULT,
 2. a new field, userfault_bitmap, in struct kvm_memory_slot,
 3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
 4. a new KVM capability: KVM_CAP_USERFAULT.
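
As a rough illustration, a hedged sketch of how userspace might opt a
memslot into KVM Userfault is shown below. The userfault_bitmap uapi field
and the "bit set => fault exits to userspace" polarity are assumptions
drawn from this cover letter; the exact uapi layout is whatever the
patches define, and KVM_CAP_USERFAULT should be checked first via
KVM_CHECK_EXTENSION.

    /*
     * Hedged sketch, not taken verbatim from this series: enable
     * KVM_MEM_USERFAULT on an existing memslot and hand KVM a bitmap
     * with one bit per PAGE_SIZE page (assumed: a set bit means faults
     * on that page exit to userspace).
     */
    #include <linux/kvm.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static unsigned long *enable_kvm_userfault(int vm_fd,
                            struct kvm_userspace_memory_region2 *slot)
    {
            size_t npages = slot->memory_size / 4096;
            size_t nlongs = (npages + 63) / 64;
            unsigned long *bitmap = calloc(nlongs, sizeof(*bitmap));

            /* Start with every page marked "userfault". */
            memset(bitmap, 0xff, nlongs * sizeof(*bitmap));

            slot->flags |= KVM_MEM_USERFAULT;
            /* Assumed uapi field name added by this series. */
            slot->userfault_bitmap = (__u64)(unsigned long)bitmap;

            if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, slot) < 0) {
                    free(bitmap);
                    return NULL;
            }
            return bitmap;
    }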

KVM Userfault does not attempt to catch KVM's own accesses to guest
memory. That is left up to userfaultfd.

When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
are zapped, and new faults will check `userfault_bitmap` to see if the
fault should exit to userspace.
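
For reference, here is a minimal, hedged sketch of what the vCPU-thread
side of such an exit could look like. It assumes the new flag is reported
through the existing KVM_EXIT_MEMORY_FAULT exit (as the flag's name
suggests) and that clearing the corresponding bitmap bit lets the retried
fault proceed; install_page() is a hypothetical helper standing in for
however userspace actually fetches and installs the page.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    extern void install_page(__u64 gpa);      /* hypothetical fetch+install */
    extern unsigned long *userfault_bitmap;   /* bitmap given to the memslot */
    extern __u64 slot_base_gpa;

    static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
    {
            for (;;) {
                    ioctl(vcpu_fd, KVM_RUN, 0);

                    if (run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
                        (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT)) {
                            __u64 idx = (run->memory_fault.gpa - slot_base_gpa) / 4096;

                            install_page(run->memory_fault.gpa);
                            /* Assumed polarity: clearing the bit lets KVM
                             * handle the retried fault normally. */
                            __atomic_fetch_and(&userfault_bitmap[idx / 64],
                                               ~(1UL << (idx % 64)),
                                               __ATOMIC_RELEASE);
                            continue;
                    }
                    /* ... handle other exit reasons ... */
            }
    }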

When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
permitted.

When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
(either eagerly or on-demand; the architecture can decide).

KVM Userfault is not compatible with async page faults. Nikita has
proposed a new implementation of async page faults that is more
userspace-driven and that *is* compatible with KVM Userfault[5].

Performance
===========

The takeaways I have are:

1. For cases where lock contention is not a concern, there is a
   discernible win because KVM Userfault saves the trip through the
   userfaultfd poll/read/WAKE cycle.

2. Using a single userfaultfd without KVM Userfault gets very slow as
   the number of vCPUs increases, and it gets even slower when you add
   more reader threads. This is due to contention on the userfaultfd
   wait_queue locks. This is the contention that KVM Userfault avoids.
   Compare this to the multiple-userfaultfd runs; they are much faster
   because the wait_queue locks are sharded perfectly (1 per vCPU).
   Perfect sharding is only possible because the vCPUs are configured to
   touch only their own chunk of memory (a sketch of this per-vCPU setup
   follows below).
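
A minimal sketch of that per-vCPU userfaultfd setup, for illustration only
(the selftest's real plumbing is in userfaultfd_util.c; error handling and
the reader threads are omitted here):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* One userfaultfd per vCPU chunk, so each uffd's wait_queue lock is
     * private to a single vCPU; each fd is serviced by its own reader
     * thread. */
    static int uffd_for_chunk(void *chunk, unsigned long chunk_size)
    {
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = {
                    .api = UFFD_API,
                    .features = UFFD_FEATURE_MINOR_SHMEM, /* for the MINOR runs */
            };
            struct uffdio_register reg = {
                    .range = {
                            .start = (unsigned long)chunk,
                            .len = chunk_size,
                    },
                    .mode = UFFDIO_REGISTER_MODE_MINOR,
            };

            ioctl(uffd, UFFDIO_API, &api);
            ioctl(uffd, UFFDIO_REGISTER, &reg);
            return uffd;
    }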

Config:
 - 64M per vcpu
 - vcpus only touch their 64M (`-b 64M -a`)
 - THPs disabled
 - MGLRU disabled

Each run used the following command:

./demand_paging_test -b 64M -a -v <#vcpus>  \
	-s shmem		     \ # if using shmem
	-r <#readers> -u <uffd_mode> \ # if using userfaultfd
	-k			     \ # if using KVM Userfault
	-m 3			       # when on arm64

note: I patched demand_paging_test so that, when using shmem, the page
      cache will always be preallocated, not only in the `-u MINOR`
      case. Otherwise the comparison would be unfair. I left this patch
      out in the selftest commits, but I am happy to add it if it would
      be useful.

In all of the tables below, higher numbers are better.

== x86 (96 LPUs, 48 cores, TDP MMU enabled) ==

-- Anonymous memory, single userfaultfd

	userfault mode
vcpus				2	8	64
	no userfault		306845	220402	47720
	MISSING (single reader)	90724	26059	3029
	MISSING			86840	37912	1664
	MISSING + KVM UF	143021	104385	34910
	KVM UF			160326	128247	39913

-- shmem (preallocated), single userfaultfd

vcpus				2	8	64
	no userfault		375130	214635	54420
	MINOR (single reader)	102336	31704	3244
	MINOR 			97981	36982	1673
	MINOR + KVM UF		161835	113716	33577
	KVM UF			181972	131204	42914

-- shmem (preallocated), multiple userfaultfds

vcpus				2	8	64
	no userfault		374060	216108	54433
	MINOR			102661	56978	11300
	MINOR + KVM UF		167080	123461	48382
	KVM UF			180439	122310	42539

== arm64 (96 PEs, AmpereOne) ==

-- shmem (preallocated), single userfaultfd

vcpus:				2	8	64
	no userfault		419069	363081	34348
	MINOR (single reader)	87421	36147	3764
	MINOR			84953	43444	1323
	MINOR + KVM UF		164509	139986	12373
	KVM UF			185706	122153	12153

-- shmem (preallocated), multiple userfaultfds

vcpus:				2	8	64
	no userfault		401931	334142	36117
	MINOR			83696	75617	15996
	MINOR + KVM UF		176327	115784	12198
	KVM UF			190074	126966	12084

This series is based on the latest kvm/next.

[1]: https://lore.kernel.org/kvm/20240710234222.2333120-1-jthoughton@google.com/
[2]: https://lpc.events/event/18/contributions/1757/
[3]: https://lore.kernel.org/kvm/20241112073837.22284-1-yan.y.zhao@intel.com/
[4]: https://lore.kernel.org/all/20240215235405.368539-1-amoorthy@google.com/
[5]: https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@amazon.com/#t

James Houghton (13):
  KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
  KVM: arm64: Add support for KVM_MEM_USERFAULT
  KVM: selftests: Fix vm_mem_region_set_flags docstring
  KVM: selftests: Fix prefault_mem logic
  KVM: selftests: Add va_start/end into uffd_desc
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
  KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
  KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
    details

 Documentation/virt/kvm/api.rst                |  33 +++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          |  23 ++-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  27 +++-
 arch/x86/kvm/mmu/mmu_internal.h               |  20 ++-
 arch/x86/kvm/x86.c                            |  36 +++--
 include/linux/kvm_host.h                      |  19 ++-
 include/uapi/linux/kvm.h                      |   6 +-
 .../selftests/kvm/demand_paging_test.c        | 145 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 .../selftests/kvm/include/userfaultfd_util.h  |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  42 ++++-
 .../selftests/kvm/lib/userfaultfd_util.c      |   2 +
 .../selftests/kvm/set_memory_region_test.c    |  33 ++++
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  47 +++++-
 17 files changed, 409 insertions(+), 36 deletions(-)


base-commit: 4d911c7abee56771b0219a9fbf0120d06bdc9c14

Comments

Peter Xu Dec. 24, 2024, 9:07 p.m. UTC | #1
James,

On Wed, Dec 04, 2024 at 07:13:35PM +0000, James Houghton wrote:
> This is a continuation of the original KVM Userfault RFC[1] from July.
> It contains the simplifications we talked about at LPC[2].
> 
> Please see the RFC[1] for the problem description. In summary,
> guest_memfd VMs have no mechanism for doing post-copy live migration.
> KVM Userfault provides such a mechanism. Today there is no upstream
> mechanism for installing memory into a guest_memfd, but there will
> be one soon (e.g. [3]).
> 
> There is a second problem that KVM Userfault solves: userfaultfd-based
> post-copy doesn't scale very well. KVM Userfault when used with
> userfaultfd can scale much better in the common case that most post-copy
> demand fetches are a result of vCPU access violations. This is a
> continuation of the solution Anish was working on[4]. This aspect of
> KVM Userfault is important for userfaultfd-based live migration when
> scaling up to hundreds of vCPUs with ~30us network latency for a
> PAGE_SIZE demand-fetch.

I think it would be clearer to nail down the goal of the feature.  If it's
a perf-oriented feature we don't need to mention gmem, but maybe it's not.

> 
> The implementation in this series is version than the RFC[1]. It adds...
>  1. a new memslot flag is added: KVM_MEM_USERFAULT,
>  2. a new parameter, userfault_bitmap, into struct kvm_memory_slot,
>  3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
>  4. a new KVM capability KVM_CAP_USERFAULT.
> 
> KVM Userfault does not attempt to catch KVM's own accesses to guest
> memory. That is left up to userfaultfd.

I assume it means this is a "perf optimization" feature then?  As it
doesn't work for remote-fault processes like firecracker, or
remote-emulated processes like QEMU's vhost-user?

Even though it could still 100% cover x86_64's setup if it's not as
complicated as above?  I mean, I assumed the above sentence was for archs
like ARM, which I remember having no-vcpu-context accesses, so things like
that are not covered either.  Perhaps x86_64 is the goal?  If so, it would
also be good to mention some details.

> 
> When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
> are zapped, and new faults will check `userfault_bitmap` to see if the
> fault should exit to userspace.
> 
> When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
> permitted.
> 
> When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
> (either eagerly or on-demand; the architecture can decide).
> 
> KVM Userfault is not compatible with async page faults. Nikita has
> proposed a new implementation of async page faults that is more
> userspace-driven that *is* compatible with KVM Userfault[5].
> 
> Performance
> ===========
> 
> The takeaways I have are:
> 
> 1. For cases where lock contention is not a concern, there is a
>    discernable win because KVM Userfault saves the trip through the
>    userfaultfd poll/read/WAKE cycle.
> 
> 2. Using a single userfaultfd without KVM Userfault gets very slow as
>    the number of vCPUs increases, and it gets even slower when you add
>    more reader threads. This is due to contention on the userfaultfd
>    wait_queue locks. This is the contention that KVM Userfault avoids.
>    Compare this to the multiple-userfaultfd runs; they are much faster
>    because the wait_queue locks are sharded perfectly (1 per vCPU).
>    Perfect sharding is only possible because the vCPUs are configured to
>    touch only their own chunk of memory.

I'll try to spend some more time on this perf issue after the holidays,
but that will still come after 1G support for !coco gmem if it works out.
The 1G functionality is still missing in QEMU, so it has higher priority
compared to either perf or downtime (e.g. I'll also need to measure
whether QEMU will need minor faults, or stick with missing as of now).

Maybe I'll also start to explore [g]memfd support in userfaultfd a bit;
I'm not sure whether anyone has started working on a generic solution for
CoCo / gmem postcopy before - we still need a solution for either
Firecracker or OVS/vhost-user.  I feel like we need that sooner or later,
one way or another.  I think I'll start without minor fault support until
it's justified, and that's if I'm able to start on it at all in a few
months next year..

Let me know if there are any comments on the above thoughts.

I guess this feature might be useful to QEMU too, but QEMU always needs
uffd or something similar, so we'd need to measure and justify whether
this one is useful in a real QEMU setup.  For example, we'd need to see
how the page transfer overhead compares with the lock contention when
there are, say, 400 vcpus.  If some speedup on userfault plus the
transfer overhead is close to what we can get with vcpu exits, then QEMU
may still stick with a simple model.  But I'm not sure.

When integrated with this feature, it also means some other overhead, at
least for QEMU.  E.g., trapping / resolving a page fault needs two ops now
(uffd and the bitmap).  Meanwhile, even if the vcpus can get rid of uffd's
one big spinlock, they may contend again in userspace, either on page
resolution or on similar queuing.  I think I mentioned this previously,
but I guess it's nontrivial to justify.  In all cases, I trust that you
have better judgement on this.  It's just that QEMU can at least behave
differently, so I'm not sure how it'll go there.

Happy holidays. :)

Thanks,
James Houghton Jan. 2, 2025, 5:53 p.m. UTC | #2
On Tue, Dec 24, 2024 at 4:07 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,

Hi Peter!

> On Wed, Dec 04, 2024 at 07:13:35PM +0000, James Houghton wrote:
> > This is a continuation of the original KVM Userfault RFC[1] from July.
> > It contains the simplifications we talked about at LPC[2].
> >
> > Please see the RFC[1] for the problem description. In summary,
> > guest_memfd VMs have no mechanism for doing post-copy live migration.
> > KVM Userfault provides such a mechanism. Today there is no upstream
> > mechanism for installing memory into a guest_memfd, but there will
> > be one soon (e.g. [3]).
> >
> > There is a second problem that KVM Userfault solves: userfaultfd-based
> > post-copy doesn't scale very well. KVM Userfault when used with
> > userfaultfd can scale much better in the common case that most post-copy
> > demand fetches are a result of vCPU access violations. This is a
> > continuation of the solution Anish was working on[4]. This aspect of
> > KVM Userfault is important for userfaultfd-based live migration when
> > scaling up to hundreds of vCPUs with ~30us network latency for a
> > PAGE_SIZE demand-fetch.
>
> I think it would be clearer to nail down the goal of the feature.  If it's
> a perf-oriented feature we don't need to mention gmem, but maybe it's not.

In my mind, both the gmem aspect and the performance aspect are
important. I don't think one is more important than the other, though
the performance win aspect of this is more immediately useful.

>
> >
> > The implementation in this series is version than the RFC[1]. It adds...
> >  1. a new memslot flag is added: KVM_MEM_USERFAULT,
> >  2. a new parameter, userfault_bitmap, into struct kvm_memory_slot,
> >  3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
> >  4. a new KVM capability KVM_CAP_USERFAULT.
> >
> > KVM Userfault does not attempt to catch KVM's own accesses to guest
> > memory. That is left up to userfaultfd.
>
> I assume it means this is an "perf optimization" feature then?  As it
> doesn't work for remote-fault processes like firecracker, or
> remote-emulated processes like QEMU's vhost-user?

For the !gmem case, yes KVM Userfault is not a replacement for
userfaultfd; for post-copy to function properly, we need to catch all
attempted accesses to guest memory (including from things like
vhost-net and KVM itself). It is indeed a performance optimization on
top of userfaultfd.

For setups where userfaultfd reader threads are running in a separate
process from the vCPU threads, yes KVM Userfault will make it so that
guest-mode vCPU faults will have to be initially handled by the vCPU
threads themselves. The fault information can always be forwarded to a
separate process afterwards, and the communication mechanism is
totally up to userspace (so they can optimize for their exact
use-case).

> Even though it could still 100% cover x86_64's setup if it's not as
> complicated as above?  I mean, I assumed above sentence was for archs like
> ARM that I remember having no-vcpu-context accesses so things like that is
> not covered too.  Perhaps x86_64 is the goal?  If so, would also be good to
> mention some details.

(In the !gmem case) It can't replace userfaultfd for any architecture
as long as there are kernel components that want to read or write to
guest memory.

The only real reason to make KVM Userfault try to completely replace
userfaultfd is to make post-copy with 1G pages work, but that requires
(1) userspace to catch its own accesses and (2) non-KVM components to
catch their own accesses.

(1) is a pain (IIUC, infeasible for QEMU) and error-prone even if it
can be done, and (2) can essentially never be done upstream.

So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
worth the extra/duplicated complexity. And at LPC, Paolo and Sean
indicated that this direction was indeed wrong. I have another way to
make this work in mind. :)

For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
replacing it. And as of right now anyway, KVM Userfault *does* provide
a complete post-copy system for gmem.

When gmem pages can be mapped into userspace, for post-copy to remain
functional, userspace-mapped gmem will need userfaultfd integration.
Keep in mind that even after this integration happens, userfaultfd
alone will *not* be a complete post-copy solution, as vCPU faults
won't be resolved via the userspace page tables.

> >
> > When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
> > are zapped, and new faults will check `userfault_bitmap` to see if the
> > fault should exit to userspace.
> >
> > When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
> > permitted.
> >
> > When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
> > (either eagerly or on-demand; the architecture can decide).
> >
> > KVM Userfault is not compatible with async page faults. Nikita has
> > proposed a new implementation of async page faults that is more
> > userspace-driven that *is* compatible with KVM Userfault[5].
> >
> > Performance
> > ===========
> >
> > The takeaways I have are:
> >
> > 1. For cases where lock contention is not a concern, there is a
> >    discernable win because KVM Userfault saves the trip through the
> >    userfaultfd poll/read/WAKE cycle.
> >
> > 2. Using a single userfaultfd without KVM Userfault gets very slow as
> >    the number of vCPUs increases, and it gets even slower when you add
> >    more reader threads. This is due to contention on the userfaultfd
> >    wait_queue locks. This is the contention that KVM Userfault avoids.
> >    Compare this to the multiple-userfaultfd runs; they are much faster
> >    because the wait_queue locks are sharded perfectly (1 per vCPU).
> >    Perfect sharding is only possible because the vCPUs are configured to
> >    touch only their own chunk of memory.
>
> I'll try to spend some more time after holidays on this perf issue. But
> will still be after the 1g support on !coco gmem if it would work out. As
> the 1g function is still missing in QEMU, so that one has higher priority
> comparing to either perf or downtime (e.g. I'll also need to measure
> whether QEMU will need minor fault, or stick with missing as of now).
>
> Maybe I'll also start to explore a bit on [g]memfd support on userfault,
> I'm not sure whether anyone started working on some generic solution before
> for CoCo / gmem postcopy - we need to still have a solution for either
> firecrackers or OVS/vhost-user.  I feel like we need that sooner or later,
> one way or another.  I think I'll start without minor faults support until
> justified, and if I'll ever be able to start it at all in a few months next
> year..

Yeah we'll need userfaultfd support for !coco gmem when gmem supports
mapping into userspace to support setups that use OVS/vhost-user.

And feel free to start with MISSING or MINOR, I don't mind. Eventually
I'll probably need MINOR support; I'm happy to work on it when the
time comes (I'm waiting for KVM Userfault to get merged and then for
userspace mapping of gmem to get merged).

FWIW, I think userspace mapping of gmem + userfaultfd support for
userspace-mapped gmem + 1G page support for gmem = good 1G post-copy
for QEMU (i.e., use gmem instead of hugetlbfs after gmem supports 1G
pages).

Remember the feedback I got from LSFMM a while ago? "don't use
hugetlbfs." gmem seems like the natural replacement.

> Let me know if there's any comment on above thoughts.
>
> I guess this feauture might be useful to QEMU too, but QEMU always needs
> uffd or something similar, then we need to measure and justify this one
> useful in a real QEMU setup.  For example, need to see how the page
> transfer overhead compares with lock contentions when there're, say, 400
> vcpus.  If some speedup on userfault + the transfer overhead is close to
> what we can get with vcpu exits, then QEMU may still stick with a simple
> model.  But not sure.

It would be nice to integrate this into QEMU and see if there's a
measurable win. I'll try to find some time to look into this.

For GCE's userspace, there is a measurable win. Anish posted some
results a while ago here[1].

[1]: https://lore.kernel.org/kvm/CAF7b7mqrLP1VYtwB4i0x5HC1eYen9iMvZbKerCKWrCFv7tDg5Q@mail.gmail.com/

> When integrated with this feature, it also means some other overheads at
> least to QEMU.  E.g., trap / resolve page fault needs two ops now (uffd and
> the bitmap).

I think about it like this: instead of UFFDIO_CONTINUE + implicit wake
(or even UFFDIO_CONTINUE + UFFDIO_WAKE; this is how my userspace does
it actually), it is UFFDIO_CONTINUE (no wake) + bitmap set. So it's
kind of still two "ops" either way... :) and updating the bitmap is
really cheap compared to a userfaultfd wake-up.
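
For concreteness, a hedged sketch of that resolution path (UFFDIO_CONTINUE
without a wake-up, then a bitmap update) might look like the following;
how the bitmap is shared with the vCPU threads and the bit polarity are
assumptions on my part, not something the series dictates:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    static int resolve_page(int uffd, unsigned long hva, unsigned long page_size,
                            unsigned long *bitmap, unsigned long page_idx)
    {
            struct uffdio_continue cont = {
                    .range = { .start = hva, .len = page_size },
                    .mode  = UFFDIO_CONTINUE_MODE_DONTWAKE,
            };

            /* Map the now-present page cache page.  No explicit wake-up of
             * vCPU waiters is needed because vCPUs exit to userspace rather
             * than sleeping on the uffd wait_queue. */
            if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0)
                    return -1;

            /* The cheap part: flip the bit (assumed: clear => vCPU faults on
             * this page no longer exit to userspace). */
            __atomic_fetch_and(&bitmap[page_idx / 64],
                               ~(1UL << (page_idx % 64)), __ATOMIC_RELEASE);
            return 0;
    }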

(Having two things that describe something like "we should fault on
this page" is a little bit confusing.)

> Meanwhile even if vcpu can get rid of uffd's one big
> spinlock, it may contend again in userspace, either on page resolution or
> on similar queuing.  I think I mentioned it previously but I guess it's
> nontrivial to justify.  In all cases, I trust that you should have better
> judgement on this.  It's just that QEMU can at least behave differently, so
> not sure how it'll go there.

The contention that you may run into after you take away the uffd
spinlock depends on what userspace is doing, yeah. For example in GCE,
before the TDP MMU, we ran into KVM MMU lock contention, and then
later we ran into eventfd issues in our network request
submission/completion queues.

>
> Happy holidays. :)

You too!

Thanks for the feedback, Peter! Let me know if I've misunderstood any
of the points you were making.

> Thanks,
>
> --
> Peter Xu
>
Peter Xu Jan. 16, 2025, 8:19 p.m. UTC | #3
James,

Sorry for a late reply.

I still do have one or two pure questions, but nothing directly relevant to
your series.

On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> indicated that this direction was indeed wrong. I have another way to
> make this work in mind. :)

Do you still want to share it, more or less? :)

> 
> For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> replacing it. And as of right now anyway, KVM Userfault *does* provide
> a complete post-copy system for gmem.
> 
> When gmem pages can be mapped into userspace, for post-copy to remain
> functional, userspace-mapped gmem will need userfaultfd integration.
> Keep in mind that even after this integration happens, userfaultfd
> alone will *not* be a complete post-copy solution, as vCPU faults
> won't be resolved via the userspace page tables.

Do you know, in the context of CoCo, whether a private page can be
accessed at all outside of KVM?

I'm pretty sure now that a private page can never be mapped to
userspace.  However, can another module like vhost-kernel access it during
postcopy?  My impression is still a yes, but then how about vhost-user?

Here, the "vhost-kernel" part represents a question on whether private
pages can be accessed at all outside KVM, while the "vhost-user" part
represents a question on whether, if the previous vhost-kernel question
is answered with "yes it can", such an access attempt can happen in
another process/task (hence, not only does it lack KVM context, it also
doesn't share the same task context).

Thanks,
Peter Xu Jan. 16, 2025, 8:32 p.m. UTC | #4
On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> James,
> 
> Sorry for a late reply.
> 
> I still do have one or two pure questions, but nothing directly relevant to
> your series.
> 
> On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> > So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> > worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> > indicated that this direction was indeed wrong. I have another way to
> > make this work in mind. :)
> 
> Do you still want to share it, more or less? :)
> 
> > 
> > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > a complete post-copy system for gmem.
> > 
> > When gmem pages can be mapped into userspace, for post-copy to remain
> > functional, userspace-mapped gmem will need userfaultfd integration.
> > Keep in mind that even after this integration happens, userfaultfd
> > alone will *not* be a complete post-copy solution, as vCPU faults
> > won't be resolved via the userspace page tables.
> 
> Do you know in context of CoCo, whether a private page can be accessed at
> all outside of KVM?
> 
> I think I'm pretty sure now a private page can never be mapped to
> userspace.  However, can another module like vhost-kernel access it during
> postcopy?  My impression of that is still a yes, but then how about
> vhost-user?
> 
> Here, the "vhost-kernel" part represents a question on whether private
> pages can be accessed at all outside KVM.  While "vhost-user" part
> represents a question on whether, if the previous vhost-kernel question
> answers as "yes it can", such access attempt can happen in another
> process/task (hence, not only does it lack KVM context, but also not
> sharing the same task context).

Right after I sent it, I recalled that whenever a device needs to access
a page, the page needs to be converted to shared first..

So I suppose the questions were not valid at all!  It is not about the
context, but that the pages will always be shared whenever a device, in
whatever form, accesses them..

Fundamentally I'm thinking about whether userfaultfd must support an (fd,
offset) tuple.  Now I suppose it doesn't, because vCPUs accessing
private/shared pages will all exit to userspace, while all non-vCPU
accessors / devices can access shared pages only.

In that case, it looks like userfaultfd can support CoCo for device
emulation by sticking with virtual-address traps as before, at least from
that specific POV.
Sean Christopherson Jan. 16, 2025, 10:16 p.m. UTC | #5
On Thu, Jan 16, 2025, Peter Xu wrote:
> On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > a complete post-copy system for gmem.
> > > 
> > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > Keep in mind that even after this integration happens, userfaultfd
> > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > won't be resolved via the userspace page tables.
> > 
> > Do you know in context of CoCo, whether a private page can be accessed at
> > all outside of KVM?
> > 
> > I think I'm pretty sure now a private page can never be mapped to
> > userspace.  However, can another module like vhost-kernel access it during
> > postcopy?  My impression of that is still a yes, but then how about
> > vhost-user?
> > 
> > Here, the "vhost-kernel" part represents a question on whether private
> > pages can be accessed at all outside KVM.  While "vhost-user" part
> > represents a question on whether, if the previous vhost-kernel question
> > answers as "yes it can", such access attempt can happen in another
> > process/task (hence, not only does it lack KVM context, but also not
> > sharing the same task context).
> 
> Right after I sent it, I just recalled whenever a device needs to access
> the page, it needs to be converted to shared pages first..

FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
private memory.  The basic gist is that the IOMMU will enforce access to private
memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.

[*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

> So I suppose the questions were not valid at all!  It is not about the
> context but that the pages will be shared always whenever a device in
> whatever form will access it..
> 
> Fundamentally I'm thinking about whether userfaultfd must support (fd,
> offset) tuple.  Now I suppose it's not, because vCPUs accessing
> private/shared will all exit to userspace, while all non-vCPU / devices can
> access shared pages only.
> 
> In that case, looks like userfaultfd can support CoCo on device emulations
> by sticking with virtual-address traps like before, at least from that
> specific POV.
> 
> -- 
> Peter Xu
>
James Houghton Jan. 16, 2025, 10:51 p.m. UTC | #6
On Thu, Jan 16, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > James,
> >
> > Sorry for a late reply.
> >
> > I still do have one or two pure questions, but nothing directly relevant to
> > your series.
> >
> > On Thu, Jan 02, 2025 at 12:53:11PM -0500, James Houghton wrote:
> > > So I'm not pushing for KVM Userfault to replace userfaultfd; it's not
> > > worth the extra/duplicated complexity. And at LPC, Paolo and Sean
> > > indicated that this direction was indeed wrong. I have another way to
> > > make this work in mind. :)
> >
> > Do you still want to share it, more or less? :)

I think I'm referring to how to make 4K demand fetches for 1G-backed
guest memory work, and I kind of said what I was thinking a little
further down:

On Thu, Jan 2, 2025 at 9:53 AM James Houghton <jthoughton@google.com> wrote:
>
> FWIW, I think userspace mapping of gmem + userfaultfd support for
> userspace-mapped gmem + 1G page support for gmem = good 1G post-copy
> for QEMU (i.e., use gmem instead of hugetlbfs after gmem supports 1G
> pages).
>
> Remember the feedback I got from LSFMM a while ago? "don't use
> hugetlbfs." gmem seems like the natural replacement.

I guess this might not work if QEMU *needs* to use HugeTLB for
whatever reason, but Google's hypervisor just needs 1G pages; it
doesn't matter where they come from really.

> > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > a complete post-copy system for gmem.
> > >
> > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > Keep in mind that even after this integration happens, userfaultfd
> > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > won't be resolved via the userspace page tables.
> >
> > Do you know in context of CoCo, whether a private page can be accessed at
> > all outside of KVM?
> >
> > I think I'm pretty sure now a private page can never be mapped to
> > userspace.  However, can another module like vhost-kernel access it during
> > postcopy?  My impression of that is still a yes, but then how about
> > vhost-user?
> >
> > Here, the "vhost-kernel" part represents a question on whether private
> > pages can be accessed at all outside KVM.  While "vhost-user" part
> > represents a question on whether, if the previous vhost-kernel question
> > answers as "yes it can", such access attempt can happen in another
> > process/task (hence, not only does it lack KVM context, but also not
> > sharing the same task context).
>
> Right after I sent it, I just recalled whenever a device needs to access
> the page, it needs to be converted to shared pages first..

Yep! This is my understanding anyway. Devices will need to GUP or use
the userspace page tables to access guest memory, both of which will
go through userfaultfd. And if userspace hasn't told KVM to make some
pages shared, then these GUPs/faults will fail.

Maybe Trusted I/O changes some things here... let me reply to Sean. :)

> So I suppose the questions were not valid at all!  It is not about the
> context but that the pages will be shared always whenever a device in
> whatever form will access it..
>
> Fundamentally I'm thinking about whether userfaultfd must support (fd,
> offset) tuple.  Now I suppose it's not, because vCPUs accessing
> private/shared will all exit to userspace, while all non-vCPU / devices can
> access shared pages only.
>
> In that case, looks like userfaultfd can support CoCo on device emulations
> by sticking with virtual-address traps like before, at least from that
> specific POV.

Yeah, I don't think the userfaultfd API needs to change to support
gmem, because it's going to be using the VMAs / user mappings of gmem.
James Houghton Jan. 16, 2025, 11:04 p.m. UTC | #7
On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Peter Xu wrote:
> > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > a complete post-copy system for gmem.
> > > >
> > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > Keep in mind that even after this integration happens, userfaultfd
> > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > won't be resolved via the userspace page tables.
> > >
> > > Do you know in context of CoCo, whether a private page can be accessed at
> > > all outside of KVM?
> > >
> > > I think I'm pretty sure now a private page can never be mapped to
> > > userspace.  However, can another module like vhost-kernel access it during
> > > postcopy?  My impression of that is still a yes, but then how about
> > > vhost-user?
> > >
> > > Here, the "vhost-kernel" part represents a question on whether private
> > > pages can be accessed at all outside KVM.  While "vhost-user" part
> > > represents a question on whether, if the previous vhost-kernel question
> > > answers as "yes it can", such access attempt can happen in another
> > > process/task (hence, not only does it lack KVM context, but also not
> > > sharing the same task context).
> >
> > Right after I sent it, I just recalled whenever a device needs to access
> > the page, it needs to be converted to shared pages first..
>
> FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> private memory.  The basic gist is that the IOMMU will enforce access to private
> memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
>
> [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

Hi Sean,

Do you know what API the IOMMU driver would use to get the private
pages to map? Normally it'd use GUP, but GUP would/should fail for
guest-private pages, right?
Peter Xu Jan. 16, 2025, 11:17 p.m. UTC | #8
On Thu, Jan 16, 2025 at 03:04:45PM -0800, James Houghton wrote:
> On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Peter Xu wrote:
> > > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > > a complete post-copy system for gmem.
> > > > >
> > > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > > Keep in mind that even after this integration happens, userfaultfd
> > > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > > won't be resolved via the userspace page tables.
> > > >
> > > > Do you know in context of CoCo, whether a private page can be accessed at
> > > > all outside of KVM?
> > > >
> > > > I think I'm pretty sure now a private page can never be mapped to
> > > > userspace.  However, can another module like vhost-kernel access it during
> > > > postcopy?  My impression of that is still a yes, but then how about
> > > > vhost-user?
> > > >
> > > > Here, the "vhost-kernel" part represents a question on whether private
> > > > pages can be accessed at all outside KVM.  While "vhost-user" part
> > > > represents a question on whether, if the previous vhost-kernel question
> > > > answers as "yes it can", such access attempt can happen in another
> > > > process/task (hence, not only does it lack KVM context, but also not
> > > > sharing the same task context).
> > >
> > > Right after I sent it, I just recalled whenever a device needs to access
> > > the page, it needs to be converted to shared pages first..
> >
> > FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> > private memory.  The basic gist is that the IOMMU will enforce access to private
> > memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> > TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
> >
> > [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf

Thanks, Sean.  This is interesting to know..

> 
> Hi Sean,
> 
> Do you know what API the IOMMU driver would use to get the private
> pages to map? Normally it'd use GUP, but GUP would/should fail for
> guest-private pages, right?

James,

I'm still reading the link Sean shared; it looks like there's an answer
in the white paper on this for assigned devices:

        TDIs access memory via either guest virtual address (GVA) space or
        guest physical address (GPA) space.  The I/O Memory Management Unit
        (IOMMU) in the host hardware is responsible for translating the
        provided GVAs or GPAs into system physical addresses
        (SPAs). Because SEV-SNP enforces access control at the time of
        translation, the IOMMU performs RMP entry lookups on translation

So I suppose after the device is attested and trusted, it can directly map
everything if it wants, and DMA directly to the encrypted pages.

OTOH, for my specific question (on vhost-kernel, or vhost-user), I suppose
they cannot be attested since they're still just part of host software..
so I'm guessing they'll still need to stick with shared pages and use a
bounce buffer to do DMAs..
Peter Xu Jan. 16, 2025, 11:31 p.m. UTC | #9
On Thu, Jan 16, 2025 at 02:51:11PM -0800, James Houghton wrote:
> I guess this might not work if QEMU *needs* to use HugeTLB for
> whatever reason, but Google's hypervisor just needs 1G pages; it
> doesn't matter where they come from really.

I see now.  Yes I suppose it works for QEMU too.

[...]

> > In that case, looks like userfaultfd can support CoCo on device emulations
> > by sticking with virtual-address traps like before, at least from that
> > specific POV.
> 
> Yeah, I don't think the userfaultfd API needs to change to support
> gmem, because it's going to be using the VMAs / user mappings of gmem.

There are other things I'm still thinking about regarding how the
notification could happen when CoCo is enabled, especially when there's no
vcpu context.

The first thing is any PV interface, and what's currently on my mind is
kvmclock.  I suppose that could work like untrusted DMAs, so that when the
hypervisor wants to read/update the clock struct, it'll access a shared
page and then the guest can move the data from/to a private page.  Or, if
such information is proven not to be sensitive data, the guest could
directly use a permanent shared page for that purpose (in which case it
should still be part of guest memory, hence access to it can be trapped
just like other shared pages via userfaultfd).

The other thing is that after reading about SEV-TIO, I found it could be
easy to implement page faults for trusted devices now.  For example, the
white paper says the host IOMMU will be responsible for translating
trusted devices' DMA addresses (GVAs/GPAs) into SPAs; I think that means
KVM would somehow share the secondary pgtable with the IOMMU, and probably
when a DMA sees a missing page it can now easily generate a page fault
against the secondary page table.  However, this is a DMA op and it
definitely also doesn't have a vcpu context, so the question is how to
trap it.

So.. maybe (fd, offset) support might still be needed at some point, which
would be more future proof.  But I don't think I have a solid opinion yet.
Sean Christopherson Jan. 16, 2025, 11:46 p.m. UTC | #10
On Thu, Jan 16, 2025, Peter Xu wrote:
> On Thu, Jan 16, 2025 at 03:04:45PM -0800, James Houghton wrote:
> > On Thu, Jan 16, 2025 at 2:16 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Peter Xu wrote:
> > > > On Thu, Jan 16, 2025 at 03:19:49PM -0500, Peter Xu wrote:
> > > > > > For the gmem case, userfaultfd cannot be used, so KVM Userfault isn't
> > > > > > replacing it. And as of right now anyway, KVM Userfault *does* provide
> > > > > > a complete post-copy system for gmem.
> > > > > >
> > > > > > When gmem pages can be mapped into userspace, for post-copy to remain
> > > > > > functional, userspace-mapped gmem will need userfaultfd integration.
> > > > > > Keep in mind that even after this integration happens, userfaultfd
> > > > > > alone will *not* be a complete post-copy solution, as vCPU faults
> > > > > > won't be resolved via the userspace page tables.
> > > > >
> > > > > Do you know in context of CoCo, whether a private page can be accessed at
> > > > > all outside of KVM?
> > > > >
> > > > > I think I'm pretty sure now a private page can never be mapped to
> > > > > userspace.  However, can another module like vhost-kernel access it during
> > > > > postcopy?  My impression of that is still a yes, but then how about
> > > > > vhost-user?
> > > > >
> > > > > Here, the "vhost-kernel" part represents a question on whether private
> > > > > pages can be accessed at all outside KVM.  While "vhost-user" part
> > > > > represents a question on whether, if the previous vhost-kernel question
> > > > > answers as "yes it can", such access attempt can happen in another
> > > > > process/task (hence, not only does it lack KVM context, but also not
> > > > > sharing the same task context).
> > > >
> > > > Right after I sent it, I just recalled whenever a device needs to access
> > > > the page, it needs to be converted to shared pages first..
> > >
> > > FWIW, once Trusted I/O comes along, "trusted" devices will be able to access guest
> > > private memory.  The basic gist is that the IOMMU will enforce access to private
> > > memory, e.g. on AMD the IOMMU will check the RMP[*], and I believe the plan for
> > > TDX is to have the IOMMU share the Secure-EPT tables that are used by the CPU.
> > >
> > > [*] https://www.amd.com/content/dam/amd/en/documents/developer/sev-tio-whitepaper.pdf
> 
> Thanks, Sean.  This is interesting to know..
> 
> > 
> > Hi Sean,
> > 
> > Do you know what API the IOMMU driver would use to get the private
> > pages to map? Normally it'd use GUP, but GUP would/should fail for
> > guest-private pages, right?
> 
> James,
> 
> I'm still reading the link Sean shared, looks like there's answer in the
> white paper on this on assigned devices:
> 
>         TDIs access memory via either guest virtual address (GVA) space or
>         guest physical address (GPA) space.  The I/O Memory Management Unit
>         (IOMMU) in the host hardware is responsible for translating the
>         provided GVAs or GPAs into system physical addresses
>         (SPAs). Because SEV-SNP enforces access control at the time of
>         translation, the IOMMU performs RMP entry lookups on translation
> 
> So I suppose after the device is attested and trusted, it can directly map
> everything if wanted, and DMA directly to the encrypted pages.

But as James called out, the kernel still needs to actually map guest_memfd
memory (all other memory is shared), and guest_memfd does not and will not ever
support GUP/mmap() of *private* memory.

There's an RFC that's under heavy discussion that I assume will handle some/all?
of this (I have largely ignored the thread).

https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com

> OTOH, for my specific question (on vhost-kernel, or vhost-user), I suppose
> they cannot be attested but still be part of host software.. so I'm
> guessing they'll need to still stick with shared pages, and use a bounce
> buffer to do DMAs..

Yep.  There's no sane way to attest software that runs in "regular" mode on the
CPU, and so things like device emulation and vhost will always be restricted to
shared memory.