
[RFC,00/18] KVM: Post-copy live migration for guest_memfd

Message ID 20240710234222.2333120-1-jthoughton@google.com

Message

James Houghton July 10, 2024, 11:42 p.m. UTC
This patch series implements the KVM-based demand paging system that was
first introduced back in November[1] by David Matlack.

The working name for this new system is KVM Userfault, but that name is
very confusing so it will not be the final name.

Problem: post-copy with guest_memfd
===================================

Post-copy live migration makes it possible to migrate VMs from one host
to another no matter how fast they are writing to memory while keeping
the VM paused for a minimal amount of time. For post-copy to work, we
need:
 1. to be able to prevent KVM from accessing particular pages of guest
    memory until we have populated them,
 2. for userspace to know when KVM is trying to access a particular
    page, and
 3. a way to allow the access to proceed.

Traditionally, post-copy live migration is implemented using
userfaultfd, which hooks into the main mm fault path. KVM hits this path
when it is doing HVA -> PFN translations (with GUP) or when it itself
attempts to access guest memory. Userfaultfd sends a page fault
notification to userspace, and KVM goes to sleep.

Userfaultfd works well, as it is not specific to KVM; everyone who
attempts to access guest memory will block the same way.

However, with guest_memfd, we do not use GUP to translate from GFN to
HPA (nor is there an intermediate HVA).

So userfaultfd in its current form cannot be used to support post-copy
live migration with guest_memfd-backed VMs.

Solution: hook into the gfn -> pfn translation
==============================================

The only way to implement post-copy with a non-KVM-specific
userfaultfd-like system would be to introduce the concept of a
file-userfault[2] to intercept faults on a guest_memfd.

Instead, we take the simpler approach of adding a KVM-specific API, and
we hook into the GFN -> HVA or GFN -> PFN translation steps (for
traditional memslots and for guest_memfd respectively).

I have intentionally added support for traditional memslots: the
complexity it adds is minimal, and it is useful for some VMMs because it
can be used to fully implement post-copy live migration.

Implementation Details
======================

Let's break down how KVM implements each of the three core requirements
for implementing post-copy as laid out above:

--- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---

The most straightforward way to inform KVM of userfault-enabled pages is
to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.

There is already infrastructure in place for modifying and checking
memory attributes. Using this interface is slightly challenging, as there
is no UAPI for setting/clearing particular attributes; we must set the
exact attributes we want.
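For concreteness, here is a minimal sketch of how userspace would mark a
GFN range as userfault-enabled using the existing set-exact-attributes
ioctl. KVM_SET_MEMORY_ATTRIBUTES and struct kvm_memory_attributes are
existing UAPI; KVM_MEMORY_ATTRIBUTE_USERFAULT is the attribute proposed by
this series, and the bit value shown for it here is only an assumption:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Illustrative only: KVM_MEMORY_ATTRIBUTE_USERFAULT is introduced by this
 * series; the bit chosen here is an assumption, not final UAPI.
 */
#ifndef KVM_MEMORY_ATTRIBUTE_USERFAULT
#define KVM_MEMORY_ATTRIBUTE_USERFAULT  (1ULL << 4)
#endif

/* Mark [gpa, gpa + size) as userfault-enabled on the given VM fd. */
static int set_userfault(int vm_fd, uint64_t gpa, uint64_t size)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = size,
                /*
                 * The existing UAPI sets the exact attribute word, so any
                 * other attributes already set on this range (e.g. PRIVATE)
                 * must be re-specified here or they will be cleared.
                 */
                .attributes = KVM_MEMORY_ATTRIBUTE_USERFAULT,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}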

The synchronization that is in place for updating memory attributes is
not suitable for post-copy live migration either, which will require
updating memory attributes (from userfault to no-userfault) very
frequently.

Another potential interface could be to use something akin to a dirty
bitmap, where a bitmap describes which pages within a memslot (or VM)
should trigger userfaults. This way, it is straightforward to make
updates to the userfault status of a page cheap.
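To make that alternative concrete, here is a purely hypothetical sketch
(no such shared bitmap exists in this series): if the per-memslot bitmap
were shared between KVM and userspace, clearing a page's userfault state
on the demand-fetch critical path would reduce to a single atomic bit
operation, with no ioctl, no locks, and no allocations:

#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical: bit N of the shared bitmap corresponds to gfn
 * (slot_base_gfn + N); 1 means userfault-enabled.
 */
static inline void clear_userfault_bit(_Atomic uint64_t *bitmap,
                                       uint64_t slot_base_gfn, uint64_t gfn)
{
        uint64_t idx = gfn - slot_base_gfn;

        /* Release ordering: page contents become visible before the bit clears. */
        atomic_fetch_and_explicit(&bitmap[idx / 64], ~(1ULL << (idx % 64)),
                                  memory_order_release);
}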

When KVM Userfault is enabled, we need to be careful not to map a
userfault page in response to a fault on a non-userfault page. In this
RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.

--- Page fault notifications ---

For page faults generated by vCPUs running in guest mode, if the page
the vCPU is trying to access is a userfault-enabled page, we use
KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
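A rough sketch of the vCPU-thread side, assuming the existing
KVM_EXIT_MEMORY_FAULT layout (flags/gpa/size); the USERFAULT flag value
and the demand_fetch_and_clear_userfault() helper are placeholders for
this illustration, not real interfaces:

#include <stdint.h>
#include <linux/kvm.h>

#ifndef KVM_MEMORY_EXIT_FLAG_USERFAULT
#define KVM_MEMORY_EXIT_FLAG_USERFAULT  (1ULL << 4)     /* assumed value */
#endif

/* VMM-specific placeholder: fetch the range and clear its USERFAULT state. */
extern void demand_fetch_and_clear_userfault(uint64_t gpa, uint64_t size);

static void handle_vcpu_exit(struct kvm_run *run)
{
        if (run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
            (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT)) {
                /*
                 * The vCPU thread itself fetches the page(s) and clears
                 * the userfault state before calling KVM_RUN again.
                 */
                demand_fetch_and_clear_userfault(run->memory_fault.gpa,
                                                 run->memory_fault.size);
        }
}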

For arm64, I believe this is actually all we need, provided we handle
steal_time properly.

For x86, returning to userspace from deep within the instruction
emulator (or other non-trivial execution paths) is infeasible, so we need
to be able to pause vCPU execution while userspace fetches the page, just
as userfaultfd would do. Let's call these "asynchronous userfaults."

A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
userfaults, and an eventfd is used to signal that new faults are
available for reading.
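A sketch of a dedicated reader thread is below. KVM_READ_USERFAULT and
the eventfd registration come from this series' UAPI patch; the
single-gfn argument structure shown here is an assumption for
illustration, and demand_fetch_and_clear_userfault() is the same
hypothetical helper as in the earlier sketch:

#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Assumed argument layout for KVM_READ_USERFAULT; see the actual patch. */
struct kvm_userfault_read {
        uint64_t gfn;
};

extern void demand_fetch_and_clear_userfault(uint64_t gpa, uint64_t size);

static void userfault_reader(int vm_fd, int userfault_eventfd)
{
        struct pollfd pfd = { .fd = userfault_eventfd, .events = POLLIN };

        for (;;) {
                struct kvm_userfault_read fault;
                uint64_t count;

                /* Wait for KVM to signal that new faults are queued. */
                poll(&pfd, 1, -1);
                read(userfault_eventfd, &count, sizeof(count));

                /* Drain all pending asynchronous userfaults. */
                while (ioctl(vm_fd, KVM_READ_USERFAULT, &fault) == 0)
                        demand_fetch_and_clear_userfault(fault.gfn << 12,
                                                         1UL << 12);
        }
}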

Today, we busy-wait for a gfn to have userfault disabled. This will
change in the future.

--- Fault resolution ---

Resolving userfaults today is as simple as removing the USERFAULT memory
attribute on the faulting gfn. This will change if we do not end up
using memory attributes for KVM Userfault. Having a range-based wake-up
like userfaultfd (see UFFDIO_WAKE) might also be helpful for
performance.
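Putting the pieces together, here is a minimal sketch of resolving one
4KiB fault today: install the page contents, then clear the attribute by
setting the exact attribute word back to 0. hva_of_gpa() and the fetched
data buffer are VMM-specific placeholders, and note that the exact-set
semantics would also wipe any other attributes on the range (one of the
problems listed below):

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* VMM-specific placeholder: HVA backing a traditional memslot's gpa. */
extern void *hva_of_gpa(uint64_t gpa);

static int resolve_userfault(int vm_fd, uint64_t gpa, const void *data)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = 4096,
                .attributes = 0,        /* exact set: USERFAULT is now clear */
        };

        /* Populate the page before allowing the vCPU to map it. */
        memcpy(hva_of_gpa(gpa), data, 4096);

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}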

Problems with this series
=========================
- This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
- Memory attribute modification doesn't scale well.
- We busy-wait for pages to not be userfault-enabled.
- gfn_to_hva and gfn_to_pfn caches are not invalidated.
- Page tables are not collapsed when KVM Userfault is disabled.
- There is no self-test for asynchronous userfaults.
- Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
- Supports only x86 and arm64.
- Probably many more!

Thanks!

[1]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/
[2]: https://lore.kernel.org/kvm/CADrL8HVwBjLpWDM9i9Co1puFWmJshZOKVu727fMPJUAbD+XX5g@mail.gmail.com/

James Houghton (18):
  KVM: Add KVM_USERFAULT build option
  KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT
  KVM: Put struct kvm pointer in memslot
  KVM: Fail __gfn_to_hva_many for userfault gfns.
  KVM: Add KVM_PFN_ERR_USERFAULT
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Provide attributes to kvm_arch_pre_set_memory_attributes
  KVM: x86: Add KVM Userfault support
  KVM: x86: Add vCPU fault fast-path for Userfault
  KVM: arm64: Add KVM Userfault support
  KVM: arm64: Add vCPU memory fault fast-path for Userfault
  KVM: arm64: Add userfault support for steal-time
  KVM: Add atomic parameter to __gfn_to_hva_many
  KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
  KVM: guest_memfd: Add KVM Userfault support
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Remove restriction in vm_set_memory_attributes

 Documentation/virt/kvm/api.rst                |  23 ++
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   8 +-
 arch/arm64/kvm/mmu.c                          |  45 +++-
 arch/arm64/kvm/pvtime.c                       |  11 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  67 +++++-
 arch/x86/kvm/mmu/mmu_internal.h               |   3 +-
 include/linux/kvm_host.h                      |  41 +++-
 include/uapi/linux/kvm.h                      |  13 ++
 .../selftests/kvm/demand_paging_test.c        |  46 +++-
 .../testing/selftests/kvm/include/kvm_util.h  |   7 -
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        |  16 +-
 virt/kvm/kvm_main.c                           | 213 +++++++++++++++++-
 16 files changed, 457 insertions(+), 44 deletions(-)


base-commit: 02b0d3b9d4dd1ef76b3e8c63175f1ae9ff392313

Comments

James Houghton July 10, 2024, 11:48 p.m. UTC | #1
Ah, I put the wrong email for Peter! I'm so sorry!
James Houghton July 11, 2024, 5:54 p.m. UTC | #2
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
> Solution: hook into the gfn -> pfn translation
> ==============================================
>
> The only way to implement post-copy with a non-KVM-specific
> userfaultfd-like system would be to introduce the concept of a
> file-userfault[2] to intercept faults on a guest_memfd.
>
> Instead, we take the simpler approach of adding a KVM-specific API, and
> we hook into the GFN -> HVA or GFN -> PFN translation steps (for
> traditional memslots and for guest_memfd respectively).
>
> I have intentionally added support for traditional memslots, as the
> complexity that it adds is minimal, and it is useful for some VMMs, as
> it can be used to fully implement post-copy live migration.

I want to clarify this sentence a little.

Today, because guest_memfd is only accessed by vCPUs (and is only ever
used for guest-private memory), the concept of "asynchronous
userfaults" isn't exactly necessary. However, when guest_memfd
supports shared memory and KVM is itself able to access it,
asynchronous userfaults become useful in the same way that they are
useful for the non-guest_memfd case.

In a world where guest_memfd requires asynchronous userfaults, adding
support for traditional memslots on top of that is quite simple, and
it somewhat simplifies the UAPI.

As for why it is useful for userspace to be able to use KVM Userfault
to implement post-copy live migration, David covered this in his
initial RFC[1].

[1]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/#t
David Matlack July 11, 2024, 11:37 p.m. UTC | #3
On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
>
> --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
>
> The most straightforward way to inform KVM of userfault-enabled pages is
> to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
>
> There is already infrastructure in place for modifying and checking
> memory attributes. Using this interface is slightly challenging, as there
> is no UAPI for setting/clearing particular attributes; we must set the
> exact attributes we want.

The thing we'll want to optimize specifically is clearing
ATTRIBUTE_USERFAULT. During post-copy migration, there will be
potentially hundreds of vCPUs in a single VM concurrently
demand-fetching memory. Clearing ATTRIBUTE_USERFAULT for each page
fetched is on the critical path of getting the vCPU back into
guest-mode.

Clearing ATTRIBUTE_USERFAULT just needs to clear the attribute. It
doesn't need to modify page tables or update any data structures other
than the attribute itself. But the existing UAPI takes both mmu_lock
and slots_lock IIRC.

I'm also concerned that the existing UAPI could lead to userspace
accidentally clearing ATTRIBUTE_USERFAULT when it goes to set
ATTRIBUTE_PRIVATE (or any other potential future attribute). Sure that
could be solved but that means centrally tracking attributes in
userspace and issuing one ioctl per contiguous region of guest memory
with matching attributes. Imagine a scenario where ~every other page
of guest memory has ATTRIBUTE_USERFAULT and then userspace wants to set
a different attribute on a large region of memory. That's going to
take a _lot_ of ioctls.

Having a UAPI to set (attributes |= delta) and clear (attributes &=
~delta) attributes on a range of GFNs would solve both these problems.
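Purely as an illustration of that idea (nothing here exists in the
current UAPI), a delta-based variant could look like a flag on the
existing ioctl, so that only the named attribute is touched; this reuses
the assumed KVM_MEMORY_ATTRIBUTE_USERFAULT placeholder from the cover
letter's sketch:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEMORY_ATTRIBUTE_USERFAULT
#define KVM_MEMORY_ATTRIBUTE_USERFAULT          (1ULL << 4)     /* assumed */
#endif

/* Hypothetical flags; the current UAPI requires .flags == 0. */
#define KVM_MEMORY_ATTRIBUTES_FLAG_SET          (1ULL << 0)     /* attrs |= delta  */
#define KVM_MEMORY_ATTRIBUTES_FLAG_CLEAR        (1ULL << 1)     /* attrs &= ~delta */

static int clear_userfault_only(int vm_fd, uint64_t gpa, uint64_t size)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_USERFAULT,
                .flags = KVM_MEMORY_ATTRIBUTES_FLAG_CLEAR,
        };

        /* Only USERFAULT would be cleared; PRIVATE and friends stay untouched. */
        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}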

>
> The synchronization that is in place for updating memory attributes is
> not suitable for post-copy live migration either, which will require
> updating memory attributes (from userfault to no-userfault) very
> frequently.

There is also the xarray. I imagine that will trigger a lot of dynamic
memory allocations during post-copy, which will slowly increase the
total time a vCPU is paused due to a USERFAULT page.

Is it feasible to convert attributes to a bitmap?

>
> Another potential interface could be to use something akin to a dirty
> bitmap, where a bitmap describes which pages within a memslot (or VM)
> should trigger userfaults. This way, it is straightforward to make
> updates to the userfault status of a page cheap.

Taking a similar approach to dirty logging is attractive for several reasons.

1. The infrastructure to manage per-memslot bitmaps already exists for
dirty logging.
2. It avoids the performance problems with xarrays by using a bitmap.
3. It avoids the performance problems with setting all attributes at once.

However, it will require new, dedicated UAPIs to set/clear the bitmap.
Then again, it's probably possible to optimize attributes to meet our
needs, and those changes would benefit all attributes.

>
> When KVM Userfault is enabled, we need to be careful not to map a
> userfault page in response to a fault on a non-userfault page. In this
> RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
>
> --- Page fault notifications ---
>
> For page faults generated by vCPUs running in guest mode, if the page
> the vCPU is trying to access is a userfault-enabled page, we use
> KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
>
> For arm64, I believe this is actually all we need, provided we handle
> steal_time properly.

There's steal time, and also the GIC pages. Steal time can use
KVM_EXIT_MEMORY_FAULT, but that requires special casing in the ARM
code. Alternatively, both can use the async mechanism to avoid
special handling in the ARM code.

>
> For x86, where returning from deep within the instruction emulator (or
> other non-trivial execution paths) is infeasible, being able to pause
> execution while userspace fetches the page, just as userfaultfd would
> do, is necessary. Let's call these "asynchronous userfaults."
>
> A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
> userfaults, and an eventfd is used to signal that new faults are
> available for reading.
>
> Today, we busy-wait for a gfn to have userfault disabled. This will
> change in the future.
>
> --- Fault resolution ---
>
> Resolving userfaults today is as simple as removing the USERFAULT memory
> attribute on the faulting gfn. This will change if we do not end up
> using memory attributes for KVM Userfault. Having a range-based wake-up
> like userfaultfd (see UFFDIO_WAKE) might also be helpful for
> performance.
>
> Problems with this series
> =========================
> - This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
> - Memory attribute modification doesn't scale well.
> - We busy-wait for pages to not be userfault-enabled.

Async faults are a slow path so I think a wait queue would suffice.

> - gfn_to_hva and gfn_to_pfn caches are not invalidated.
> - Page tables are not collapsed when KVM Userfault is disabled.
> - There is no self-test for asynchronous userfaults.
> - Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.

Userspace would probably treat this as fatal anyway, right?
Wang, Wei W July 15, 2024, 3:25 p.m. UTC | #4
On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> This patch series implements the KVM-based demand paging system that was
> first introduced back in November[1] by David Matlack.
> 
> The working name for this new system is KVM Userfault, but that name is very
> confusing so it will not be the final name.
> 
Hi James,
I had implemented a similar approach for TDX post-copy migration, though
there are quite a few differences. I've got some questions about your design below.

> Problem: post-copy with guest_memfd
> ===================================
> 
> Post-copy live migration makes it possible to migrate VMs from one host to
> another no matter how fast they are writing to memory while keeping the VM
> paused for a minimal amount of time. For post-copy to work, we
> need:
>  1. to be able to prevent KVM from being able to access particular pages
>     of guest memory until we have populated it
>  2. for userspace to know when KVM is trying to access a particular
>     page.
>  3. a way to allow the access to proceed.
> 
> Traditionally, post-copy live migration is implemented using userfaultfd, which
> hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> PFN translations (with GUP) or when it itself attempts to access guest memory.
> Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> 
> Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> access guest memory will block the same way.
> 
> However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> (nor is there an intermediate HVA).
> 
> So userfaultfd in its current form cannot be used to support post-copy live
> migration with guest_memfd-backed VMs.
> 
> Solution: hook into the gfn -> pfn translation
> ==============================================
> 
> The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> system would be to introduce the concept of a file-userfault[2] to intercept
> faults on a guest_memfd.
> 
> Instead, we take the simpler approach of adding a KVM-specific API, and we
> hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> memslots and for guest_memfd respectively).


Why take KVM_EXIT_MEMORY_FAULT exits for the traditional shared
pages (i.e. GFN -> HVA)?
It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
shared pages to go through the existing userfaultfd mechanism:
- The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
- The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
  faults exiting to userspace for postcopy might not be necessary, because all pages on the
  destination side are initially “shared,” and the guest’s first access will always cause an
  exit to userspace for shared->private conversion. So the VMM is able to leverage that
  exit to fetch the page data from the source (the VMM can know whether a page's data
  has been fetched from the source or not).

> 
> I have intentionally added support for traditional memslots, as the complexity
> that it adds is minimal, and it is useful for some VMMs, as it can be used to
> fully implement post-copy live migration.
> 
> Implementation Details
> ======================
> 
> Let's break down how KVM implements each of the three core requirements
> for implementing post-copy as laid out above:
> 
> --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> 
> The most straightforward way to inform KVM of userfault-enabled pages is to
> use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> 
> There is already infrastructure in place for modifying and checking memory
> attributes. Using this interface is slightly challenging, as there is no UAPI for
> setting/clearing particular attributes; we must set the exact attributes we want.
> 
> The synchronization that is in place for updating memory attributes is not
> suitable for post-copy live migration either, which will require updating
> memory attributes (from userfault to no-userfault) very frequently.
> 
> Another potential interface could be to use something akin to a dirty bitmap,
> where a bitmap describes which pages within a memslot (or VM) should trigger
> userfaults. This way, it is straightforward to make updates to the userfault
> status of a page cheap.
> 
> When KVM Userfault is enabled, we need to be careful not to map a userfault
> page in response to a fault on a non-userfault page. In this RFC, I've taken the
> simplest approach: force new PTEs to be PAGE_SIZE.
> 
> --- Page fault notifications ---
> 
> For page faults generated by vCPUs running in guest mode, if the page the
> vCPU is trying to access is a userfault-enabled page, we use

Why is it necessary to add the per-page control (with uAPIs for the VMM to set/clear)?
Are there any functional issues if we just have all page faults exit to userspace during
the post-copy period?
- As also mentioned above, userspace can easily know if a page needs to be
  fetched from the source or not, so upon a fault exit to userspace, the VMM can
  decide to block the faulting vCPU thread or return to KVM immediately.
- If an improvement is really needed (this would need profiling first) to reduce the number
  of exits to userspace, a KVM-internal state (bitmap or xarray) seems sufficient.
  Each page only needs to exit to userspace once for the purpose of fetching its data
  from the source in postcopy. It doesn't seem to need userspace to enable the exit
  again for the page (via a new uAPI), right?
James Houghton July 16, 2024, 5:10 p.m. UTC | #5
On Mon, Jul 15, 2024 at 8:28 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > This patch series implements the KVM-based demand paging system that was
> > first introduced back in November[1] by David Matlack.
> >
> > The working name for this new system is KVM Userfault, but that name is very
> > confusing so it will not be the final name.
> >
> Hi James,
> I had implemented a similar approach for TDX post-copy migration, there are quite
> some differences though. Got some questions about your design below.

Thanks for the feedback!!

>
> > Problem: post-copy with guest_memfd
> > ===================================
> >
> > Post-copy live migration makes it possible to migrate VMs from one host to
> > another no matter how fast they are writing to memory while keeping the VM
> > paused for a minimal amount of time. For post-copy to work, we
> > need:
> >  1. to be able to prevent KVM from being able to access particular pages
> >     of guest memory until we have populated it
> >  2. for userspace to know when KVM is trying to access a particular
> >     page.
> >  3. a way to allow the access to proceed.
> >
> > Traditionally, post-copy live migration is implemented using userfaultfd, which
> > hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> > PFN translations (with GUP) or when it itself attempts to access guest memory.
> > Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> >
> > Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> > access guest memory will block the same way.
> >
> > However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> > (nor is there an intermediate HVA).
> >
> > So userfaultfd in its current form cannot be used to support post-copy live
> > migration with guest_memfd-backed VMs.
> >
> > Solution: hook into the gfn -> pfn translation
> > ==============================================
> >
> > The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> > system would be to introduce the concept of a file-userfault[2] to intercept
> > faults on a guest_memfd.
> >
> > Instead, we take the simpler approach of adding a KVM-specific API, and we
> > hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> > memslots and for guest_memfd respectively).
>
>
> Why taking KVM_EXIT_MEMORY_FAULT faults for the traditional shared
> pages (i.e. GFN -> HVA)?
> It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
> shared pages to go through the existing userfaultfd mechanism:
> - The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
> - The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
>   faults exiting to userspace for postcopy might not be necessary, because all pages on the
>   destination side are initially “shared,” and the guest’s first access will always cause an
>   exit to userspace for shared->private conversion. So VMM is able to leverage the exit to
>   fetch the page data from the source (VMM can know if a page data has been fetched
>   from the source or not).

You're right that, today, including support for guest-private memory
*only* indeed simplifies things (no async userfaults). I think your
strategy for implementing post-copy would work (so, shared->private
conversion faults for vCPU accesses to private memory, and userfaultfd
for everything else).

I'm not 100% sure what should happen in the case of a non-vCPU access
to should-be-private memory; today it seems like KVM just provides the
shared version of the page, so conventional use of userfaultfd
shouldn't break anything.

But eventually guest_memfd itself will support "shared" memory, and
(IIUC) it won't use VMAs, so userfaultfd won't be usable (without
changes anyway). For a non-confidential VM, all memory will be
"shared", so shared->private conversions can't help us there either.
Starting everything as private almost works (so using private->shared
conversions as a notification mechanism), but if the first time KVM
attempts to use a page is not from a vCPU (and is from a place where
we cannot easily return to userspace), the need for "async userfaults"
comes back.

For this use case, it seems cleaner to have a new interface. (And, as
far as I can tell, we would at least need some kind of "async
userfault"-like mechanism.)

Another reason why, today, KVM Userfault is helpful is that
userfaultfd has a couple drawbacks. Userfaultfd migration with
HugeTLB-1G is basically unusable, as HugeTLB pages cannot be mapped at
PAGE_SIZE. Some discussion here[1][2].

Moving the implementation of post-copy to KVM means that, throughout
post-copy, we can avoid changes to the main mm page tables, and we
only need to modify the second stage page tables. This saves the
memory needed to store the extra set of shattered page tables, and we
save the performance overhead of the page table modifications and
accounting that mm does.

There's some more discussion about these points in David's RFC[3].

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/ZdcKwK7CXgEsm-Co@x1n/
[3]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/

>
> >

> > I have intentionally added support for traditional memslots, as the complexity
> > that it adds is minimal, and it is useful for some VMMs, as it can be used to
> > fully implement post-copy live migration.
> >
> > Implementation Details
> > ======================
> >
> > Let's break down how KVM implements each of the three core requirements
> > for implementing post-copy as laid out above:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is to
> > use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking memory
> > attributes. Using this interface is slightly challenging, as there is no UAPI for
> > setting/clearing particular attributes; we must set the exact attributes we want.
> >
> > The synchronization that is in place for updating memory attributes is not
> > suitable for post-copy live migration either, which will require updating
> > memory attributes (from userfault to no-userfault) very frequently.
> >
> > Another potential interface could be to use something akin to a dirty bitmap,
> > where a bitmap describes which pages within a memslot (or VM) should trigger
> > userfaults. This way, it is straightforward to make updates to the userfault
> > status of a page cheap.
> >
> > When KVM Userfault is enabled, we need to be careful not to map a userfault
> > page in response to a fault on a non-userfault page. In this RFC, I've taken the
> > simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page the
> > vCPU is trying to access is a userfault-enabled page, we use
>
> Why is it necessary to add the per-page control (with uAPIs for VMM to set/clear)?
> Any functional issues if we just have all the page faults exit to userspace during the
> post-copy period?
> - As also mentioned above, userspace can easily know if a page needs to be
>   fetched from the source or not, so upon a fault exit to userspace, VMM can
>   decide to block the faulting vcpu thread or return back to KVM immediately.
> - If improvement is really needed (would need profiling first) to reduce number
>   of exits to userspace, a  KVM internal status (bitmap or xarray) seems sufficient.
>   Each page only needs to exit to userspace once for the purpose of fetching its data
>   from the source in postcopy. It doesn't seem to need userspace to enable the exit
>   again for the page (via a new uAPI), right?

We don't necessarily need a way to go from no-fault -> fault for a
page, that's right[4]. But we do need a way for KVM to be able to
allow the access to proceed (i.e., go from fault -> no-fault). IOW, if
we get a fault and come out to userspace, we need a way to tell KVM
not to do that again. In the case of shared->private conversions, that
mechanism is toggling the memory attributes for a gfn. For
conventional userfaultfd, that's using UFFDIO_COPY/CONTINUE/POISON.
Maybe I'm misunderstanding your question.

[4]: It is helpful for poison emulation for HugeTLB-backed VMs today,
but this is not important.
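For comparison, the userfaultfd fault -> no-fault transition mentioned
above is a single UFFDIO_COPY (existing UAPI), which installs the page
contents at the faulting HVA and wakes blocked faulters in one step; a
minimal sketch:

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int uffd_resolve(int uffd, void *dst_hva, const void *src, size_t len)
{
        struct uffdio_copy copy = {
                .dst = (uintptr_t)dst_hva,
                .src = (uintptr_t)src,
                .len = len,     /* must be a multiple of the page size */
                .mode = 0,      /* 0 => also wake any blocked faulting threads */
        };

        return ioctl(uffd, UFFDIO_COPY, &copy);
}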
Wang, Wei W July 17, 2024, 3:03 p.m. UTC | #6
On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> You're right that, today, including support for guest-private memory
> *only* indeed simplifies things (no async userfaults). I think your strategy for
> implementing post-copy would work (so, shared->private conversion faults for
> vCPU accesses to private memory, and userfaultfd for everything else).

Yes, it works and has been used for our internal tests.

> 
> I'm not 100% sure what should happen in the case of a non-vCPU access to
> should-be-private memory; today it seems like KVM just provides the shared
> version of the page, so conventional use of userfaultfd shouldn't break
> anything.

This seems to be the trusted I/O usage (I'm not aware of other usages; emulated device
backends, such as vhost, work with shared pages). Migration support for trusted device
passthrough doesn't seem to be architecturally ready yet. For post-copy especially,
AFAIK even the legacy VM case lacks support for device passthrough (not sure if
you've implemented it internally). So it seems too early to discuss this in detail.


> 
> But eventually guest_memfd itself will support "shared" memory, 

OK, I had thought of this. I'm not sure how feasible it would be to extend gmem for
shared memory. I think questions like the ones below need to be investigated:
#1 What are the tangible benefits of gmem-based shared memory, compared to the
     legacy shared memory that we have now?
#2 There would be some gaps in making gmem usable for shared pages. For
      example, would it allow userspace to map it (without security concerns)?
#3 If gmem gets extended to be something like hugetlb (e.g. 1GB), would it result
     in the same issue as hugetlb?

The support of using gmem for shared memory isn't in place yet, and this seems
to be a dependency for the support being added here.

> and
> (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> anyway). For a non-confidential VM, all memory will be "shared", so shared-
> >private conversions can't help us there either.
> Starting everything as private almost works (so using private->shared
> conversions as a notification mechanism), but if the first time KVM attempts to
> use a page is not from a vCPU (and is from a place where we cannot easily
> return to userspace), the need for "async userfaults"
> comes back.

Yeah, this needs to be resolved for KVM userfaults. If gmem is used for private
pages only, this wouldn't be an issue (it will be covered by userfaultfd).


> 
> For this use case, it seems cleaner to have a new interface. (And, as far as I can
> tell, we would at least need some kind of "async userfault"-like mechanism.)
> 
> Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> here[1][2].
> 
> Moving the implementation of post-copy to KVM means that, throughout
> post-copy, we can avoid changes to the main mm page tables, and we only
> need to modify the second stage page tables. This saves the memory needed
> to store the extra set of shattered page tables, and we save the performance
> overhead of the page table modifications and accounting that mm does.

It would be nice to see some data comparing KVM faults and userfaultfd,
e.g., the end-to-end latency of handling a page fault by fetching its data from the source.
(I didn't find such data in the link you shared. Please correct me if I missed it.)


> We don't necessarily need a way to go from no-fault -> fault for a page, that's
> right[4]. But we do need a way for KVM to be able to allow the access to
> proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> userspace, we need a way to tell KVM not to do that again.
> In the case of shared->private conversions, that mechanism is toggling the memory
> attributes for a gfn.  For conventional userfaultfd, that's using
> UFFDIO_COPY/CONTINUE/POISON.
> Maybe I'm misunderstanding your question.

We can come back to this after the dependency discussion above is done. (If gmem is only
used for private pages, the support for postcopy, including changes required for VMMs, would
be simpler)
James Houghton July 18, 2024, 1:09 a.m. UTC | #7
On Wed, Jul 17, 2024 at 8:03 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>
> On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> > You're right that, today, including support for guest-private memory
> > *only* indeed simplifies things (no async userfaults). I think your strategy for
> > implementing post-copy would work (so, shared->private conversion faults for
> > vCPU accesses to private memory, and userfaultfd for everything else).
>
> Yes, it works and has been used for our internal tests.
>
> >
> > I'm not 100% sure what should happen in the case of a non-vCPU access to
> > should-be-private memory; today it seems like KVM just provides the shared
> > version of the page, so conventional use of userfaultfd shouldn't break
> > anything.
>
> This seems to be the trusted IO usage (not aware of other usages, emulated device
> backends, such as vhost, work with shared pages). Migration support for trusted device
> passthrough doesn't seem to be architecturally ready yet. Especially for postcopy,
> AFAIK, even the legacy VM case lacks the support for device passthrough (not sure if
> you've made it internally). So it seems too early to discuss this in detail.

We don't migrate VMs with passthrough devices.

I still think the way KVM handles non-vCPU accesses to private memory
is wrong: surely it is an error, yet we simply provide the shared
version of the page. *shrug*

>
> >
> > But eventually guest_memfd itself will support "shared" memory,
>
> OK, I thought of this. Not sure how feasible it would be to extend gmem for
> shared memory. I think questions like below need to be investigated:

An RFC for it got posted recently[1]. :)

> #1 what are the tangible benefits of gmem based shared memory, compared to the
>      legacy shared memory that we have now?

For [1], unmapping guest memory from the direct map.

> #2 There would be some gaps to make gmem usable for shared pages. For
>       example, would it support userspace to map (without security concerns)?

At least in [1], userspace would be able to mmap it, but KVM would
still not be able to GUP it (instead going through the normal
guest_memfd path).

> #3 if gmem gets extended to be something like hugetlb (e.g. 1GB), would it result
>      in the same issue as hugetlb?

Good question. At the end of the day, the problem is that GUP relies
on host mm page table mappings, and HugeTLB can't map things with
PAGE_SIZE PTEs.

At least as of [1], given that KVM doesn't GUP guest_memfd memory, we
don't rely on the host mm page table layout, so we don't have the same
problem.

For VMMs that want to catch userspace (or non-GUP kernel) accesses via
a guest_memfd VMA, then it's possible it has the same issue. But for
VMMs that don't care to catch these kinds of accesses (the kind of
user that would use KVM Userfault to implement post-copy), it doesn't
matter.

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/

>
> The support of using gmem for shared memory isn't in place yet, and this seems
> to be a dependency for the support being added here.

Perhaps I've been slightly preemptive. :) I still think there's useful
discussion here.

> > and
> > (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> > anyway). For a non-confidential VM, all memory will be "shared", so shared-
> > >private conversions can't help us there either.
> > Starting everything as private almost works (so using private->shared
> > conversions as a notification mechanism), but if the first time KVM attempts to
> > use a page is not from a vCPU (and is from a place where we cannot easily
> > return to userspace), the need for "async userfaults"
> > comes back.
>
> Yeah, this needs to be resolved for KVM userfaults. If gmem is used for private
> pages only, this wouldn't be an issue (it will be covered by userfaultfd).

We're on the same page here.

>
>
> >
> > For this use case, it seems cleaner to have a new interface. (And, as far as I can
> > tell, we would at least need some kind of "async userfault"-like mechanism.)
> >
> > Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> > couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> > unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> > here[1][2].
> >
> > Moving the implementation of post-copy to KVM means that, throughout
> > post-copy, we can avoid changes to the main mm page tables, and we only
> > need to modify the second stage page tables. This saves the memory needed
> > to store the extra set of shattered page tables, and we save the performance
> > overhead of the page table modifications and accounting that mm does.
>
> It would be nice to see some data for comparisons between kvm faults and userfaultfd
> e.g., end to end latency of handling a page fault via getting data from the source.
> (I didn't find data from the link you shared. Please correct me if I missed it)

I don't have an A/B comparison for kernel end-to-end fault latency. :(
But I can tell you that with 32us or so network latency, it's not a
huge difference (assuming Anish's series[2]).

The real performance issue comes when we are collapsing the page
tables at the end. We basically have to do ~2x of everything (TLB
flushes, etc.), plus additional accounting that HugeTLB/THP does
(adjusting refcount/mapcount), etc. And one must optimize how the
unmap MMU notifiers are called so as to not stall vCPUs unnecessarily.

[2]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@google.com/

>
>
> > We don't necessarily need a way to go from no-fault -> fault for a page, that's
> > right[4]. But we do need a way for KVM to be able to allow the access to
> > proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> > userspace, we need a way to tell KVM not to do that again.
> > In the case of shared->private conversions, that mechanism is toggling the memory
> > attributes for a gfn.  For conventional userfaultfd, that's using
> > UFFDIO_COPY/CONTINUE/POISON.
> > Maybe I'm misunderstanding your question.
>
> We can come back to this after the dependency discussion above is done. (If gmem is only
> used for private pages, the support for postcopy, including changes required for VMMs, would
> be simpler)
James Houghton July 18, 2024, 1:59 a.m. UTC | #8
On Thu, Jul 11, 2024 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Jul 10, 2024 at 4:42 PM James Houghton <jthoughton@google.com> wrote:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is
> > to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking
> > memory attributes. Using this interface is slightly challenging, as there
> > is no UAPI for setting/clearing particular attributes; we must set the
> > exact attributes we want.
>
> The thing we'll want to optimize specifically is clearing
> ATTRIBUTE_USERFAULT. During post-copy migration, there will be
> potentially hundreds of vCPUs in a single VM concurrently
> demand-fetching memory. Clearing ATTRIBUTE_USERFAULT for each page
> fetched is on the critical path of getting the vCPU back into
> guest-mode.
>
> Clearing ATTRIBUTE_USERFAULT just needs to clear the attribute. It
> doesn't need to modify page tables or update any data structures other
> than the attribute itself. But the existing UAPI takes both mmu_lock
> and slots_lock IIRC.
>
> I'm also concerned that the existing UAPI could lead to userspace
> accidentally clearing ATTRIBUTE_USERFAULT when it goes to set
> ATTRIBUTE_PRIVATE (or any other potential future attribute). Sure that
> could be solved but that means centrally tracking attributes in
> userspace and issuing one ioctl per contiguous region of guest memory
> with matching attributes. Imagine a scenario where ~every other page
> of guest memory has ATTRIBUTE_USERFAULT and then userspace wants to set
> a different attribute on a large region of memory. That's going to
> take a _lot_ of ioctls.
>
> Having a UAPI to set (attributes |= delta) and clear (attributes &=
> ~delta) attributes on a range of GFNs would solve both these problems.

Hi David, sorry for the delay getting back to you.

I agree with all of these points.

>
> >
> > The synchronization that is in place for updating memory attributes is
> > not suitable for post-copy live migration either, which will require
> > updating memory attributes (from userfault to no-userfault) very
> > frequently.
>
> There is also the xarray. I imagine that will trigger a lot of dynamic
> memory allocations during post-copy which will slowly increase the total
> time a vCPU is paused due to a USERFAULT page.
>
> Is it feasible to convert attributes to a bitmap?

I don't see any reason why we couldn't convert attributes to be a
bitmap (or to have some attributes be stored in bitmaps and others be
stored in the xarray).

>
> >
> > Another potential interface could be to use something akin to a dirty
> > bitmap, where a bitmap describes which pages within a memslot (or VM)
> > should trigger userfaults. This way, it is straightforward to make
> > updates to the userfault status of a page cheap.
>
> Taking a similar approach to dirty logging is attractive for several reasons.
>
> 1. The infrastructure to manage per-memslot bitmaps already exists for
> dirty logging.
> 2. It avoids the performance problems with xarrays by using a bitmap.
> 3. It avoids the performance problems with setting all attributes at once.
>
> However it will require new specific UAPIs to set/clear. And it's
> probably possible to optimize attributes to meet our needs, and those
> changes will benefit all attributes.

Ok so the three options in my head are:
1. Add an attribute diff UAPI and track the USERFAULT attribute in the xarray.
2. Add an attribute diff UAPI and track the USERFAULT attribute with a bitmap.
3. Add a new UAPI to enable KVM userfaults on gfns according to a
particular bitmap, similar to dirty logging.

(1) is problematic because it is valid to have every page (or, say,
every other page) have ATTRIBUTE_USERFAULT.

(2) seems ok to me.

(3) would be great, but maybe the much more complicated UAPI is not
worth it. (We get the ability to mark many different regions as
USERFAULT in one syscall, and KVM has a lot of code for handling
bitmap arguments.)

I'm hoping others will weigh in here.

> >
> > When KVM Userfault is enabled, we need to be careful not to map a
> > userfault page in response to a fault on a non-userfault page. In this
> > RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page
> > the vCPU is trying to access is a userfault-enabled page, we use
> > KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
> >
> > For arm64, I believe this is actually all we need, provided we handle
> > steal_time properly.
>
> There's steal time, and also the GIC pages. Steal time can use
> KVM_EXIT_MEMORY_FAULT, but that requires special casing in the ARM
> code. Alternatively, both can use the async mechanism to avoid
> special handling in the ARM code.

Oh, of course, I forgot about the GIC. Thanks. And yes, if the async
userfault mechanism is acceptable, using that would be better than
adding the special cases.

>
> >
> > For x86, where returning from deep within the instruction emulator (or
> > other non-trivial execution paths) is infeasible, being able to pause
> > execution while userspace fetches the page, just as userfaultfd would
> > do, is necessary. Let's call these "asynchronous userfaults."
> >
> > A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
> > userfaults, and an eventfd is used to signal that new faults are
> > available for reading.
> >
> > Today, we busy-wait for a gfn to have userfault disabled. This will
> > change in the future.
> >
> > --- Fault resolution ---
> >
> > Resolving userfaults today is as simple as removing the USERFAULT memory
> > attribute on the faulting gfn. This will change if we do not end up
> > using memory attributes for KVM Userfault. Having a range-based wake-up
> > like userfaultfd (see UFFDIO_WAKE) might also be helpful for
> > performance.
> >
> > Problems with this series
> > =========================
> > - This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
> > - Memory attribute modification doesn't scale well.
> > - We busy-wait for pages to not be userfault-enabled.
>
> Async faults are a slow path so I think a wait queue would suffice.

I think a wait queue seems like a good fit too. (It's what userfaultfd uses.)

>
> > - gfn_to_hva and gfn_to_pfn caches are not invalidated.
> > - Page tables are not collapsed when KVM Userfault is disabled.
> > - There is no self-test for asynchronous userfaults.
> > - Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
>
> Userspace would probably treat this as fatal anyway right?

Yes, but I still think dropping the gfn isn't great. I'll fix this
when I change from using the hacky list-based thing to something more
sophisticated (like a wait_queue).
Wang, Wei W July 19, 2024, 2:47 p.m. UTC | #9
On Thursday, July 18, 2024 9:09 AM, James Houghton wrote:
> On Wed, Jul 17, 2024 at 8:03 AM Wang, Wei W <wei.w.wang@intel.com>
> wrote:
> >
> > On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> > > You're right that, today, including support for guest-private memory
> > > *only* indeed simplifies things (no async userfaults). I think your
> > > strategy for implementing post-copy would work (so, shared->private
> > > conversion faults for vCPU accesses to private memory, and userfaultfd for
> everything else).
> >
> > Yes, it works and has been used for our internal tests.
> >
> > >
> > > I'm not 100% sure what should happen in the case of a non-vCPU
> > > access to should-be-private memory; today it seems like KVM just
> > > provides the shared version of the page, so conventional use of
> > > userfaultfd shouldn't break anything.
> >
> > This seems to be the trusted IO usage (not aware of other usages,
> > emulated device backends, such as vhost, work with shared pages).
> > Migration support for trusted device passthrough doesn't seem to be
> > architecturally ready yet. Especially for postcopy, AFAIK, even the
> > legacy VM case lacks the support for device passthrough (not sure if you've
> made it internally). So it seems too early to discuss this in detail.
> 
> We don't migrate VMs with passthrough devices.
> 
> I still think the way KVM handles non-vCPU accesses to private memory is
> wrong: surely it is an error, yet we simply provide the shared version of the
> page. *shrug*
> 
> >
> > >
> > > But eventually guest_memfd itself will support "shared" memory,
> >
> > OK, I thought of this. Not sure how feasible it would be to extend
> > gmem for shared memory. I think questions like below need to be
> investigated:
> 
> An RFC for it got posted recently[1]. :)
> 
> > #1 what are the tangible benefits of gmem based shared memory, compared
> to the
> >      legacy shared memory that we have now?
> 
> For [1], unmapping guest memory from the direct map.
> 
> > #2 There would be some gaps to make gmem usable for shared pages. For
> >       example, would it support userspace to map (without security concerns)?
> 
> At least in [1], userspace would be able to mmap it, but KVM would still not be
> able to GUP it (instead going through the normal guest_memfd path).
> 
> > #3 if gmem gets extended to be something like hugetlb (e.g. 1GB), would it
> result
> >      in the same issue as hugetlb?
> 
> Good question. At the end of the day, the problem is that GUP relies on host
> mm page table mappings, and HugeTLB can't map things with PAGE_SIZE PTEs.
> 
> At least as of [1], given that KVM doesn't GUP guest_memfd memory, we don't
> rely on the host mm page table layout, so we don't have the same problem.
> 
> For VMMs that want to catch userspace (or non-GUP kernel) accesses via a
> guest_memfd VMA, then it's possible it has the same issue. But for VMMs that
> don't care to catch these kinds of accesses (the kind of user that would use
> KVM Userfault to implement post-copy), it doesn't matter.
> 
> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-
> roypat@amazon.co.uk/

Ah, I overlooked this series, thanks for the reminder.
Let me check the details first.
Peter Xu Aug. 1, 2024, 10:12 p.m. UTC | #10
On Wed, Jul 10, 2024 at 04:48:36PM -0700, James Houghton wrote:
> Ah, I put the wrong email for Peter! I'm so sorry!

So I have a pure (and even stupid) question to ask before the rest of the
details... it's a pure question because I know little about guest_memfd,
especially about the future plans.

So... Is there any chance guest_memfd can in the future provide 1G normal
(!CoCo) pages?  If yes, are these pages GUP-able, and mappable?

Thanks,