[RFC,0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE

Message ID 20210107190453.3051110-1-axelrasmussen@google.com (mailing list archive)

Axel Rasmussen Jan. 7, 2021, 7:04 p.m. UTC
Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.
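
For illustration, the two-mapping setup could look roughly like the sketch
below. It is not taken from the patches: it uses a plain memfd instead of
hugetlbfs so it runs without reserved huge pages, and the names dual_map,
dual_map_create and dual_map_destroy are hypothetical. In the real scenario
the second view would additionally be registered with userfaultfd.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two views of one shared-memory object. In the scenario described above,
 * uffd_view would be the userfaultfd-registered mapping and plain_view the
 * unregistered one used to populate pages. All names are hypothetical. */
struct dual_map {
	int fd;
	size_t len;
	char *plain_view;
	char *uffd_view;
};

/* Returns 0 on success, -1 on failure. Assumption: a regular memfd stands
 * in for a hugetlbfs file. */
static int dual_map_create(struct dual_map *m, size_t len)
{
	m->fd = memfd_create("guest-mem", 0);
	if (m->fd < 0)
		return -1;
	if (ftruncate(m->fd, (off_t)len) < 0)
		goto fail;

	m->plain_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			     m->fd, 0);
	if (m->plain_view == MAP_FAILED)
		goto fail;

	m->uffd_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			    m->fd, 0);
	if (m->uffd_view == MAP_FAILED) {
		munmap(m->plain_view, len);
		goto fail;
	}

	m->len = len;
	return 0;
fail:
	close(m->fd);
	return -1;
}

/* Unmaps both views and closes the backing file descriptor. */
static void dual_map_destroy(struct dual_map *m)
{
	munmap(m->plain_view, m->len);
	munmap(m->uffd_view, m->len);
	close(m->fd);
}
```

Anything written through plain_view is visible through uffd_view, since both
are MAP_SHARED mappings of the same file; that is what lets userspace fix up
page contents without touching the registered mapping.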

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like
RDMA). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".
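
As a rough sketch of what issuing the ioctl from userspace might look like:
the struct layout used here (a uffdio_range plus a mode field) is the one in
the mainline UAPI header and is an assumption relative to this RFC, whose
definition may differ. The wrapper name uffd_continue is hypothetical, and
the #ifdef only exists so the sketch still builds against older headers.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Tell the kernel that the page contents in [start, start + len) are
 * correct and that it may finish setting up the mapping.
 * Returns 0 on success, -1 with errno set on failure. */
static int uffd_continue(int uffd, uint64_t start, uint64_t len)
{
#ifdef UFFDIO_CONTINUE
	struct uffdio_continue cont;

	memset(&cont, 0, sizeof(cont));
	cont.range.start = start;
	cont.range.len = len;
	cont.mode = 0;	/* default: wake the faulting thread */

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
#else
	/* Headers predate the ioctl; report it as unsupported. */
	(void)uffd;
	(void)start;
	(void)len;
	errno = ENOSYS;
	return -1;
#endif
}
```

The caller would pass the userfaultfd file descriptor and the page-aligned
range covering the faulting address taken from the fault message.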

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
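
The per-fault decision in step 4 can be sketched as below. This is only an
illustration, not code from the series: the dirty bitmap layout, the
migration_state struct, the fixed 4K PAGE_SZ, and the in-memory "source"
buffer standing in for a network fetch are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u	/* illustrative; hugetlbfs pages would be larger */

/* Hypothetical bookkeeping for the target side of a migration. */
struct migration_state {
	uint8_t *plain_view;	/* non-UFFD mapping of guest memory */
	const uint8_t *source;	/* stand-in for "fetch from source machine" */
	uint64_t *dirty;	/* one bit per page; set = stale on target */
	size_t npages;
};

static bool page_is_dirty(const struct migration_state *s, size_t idx)
{
	return (s->dirty[idx / 64] >> (idx % 64)) & 1u;
}

/* Returns 1 if the page was refreshed, 0 if it was already up to date,
 * -1 if idx is out of range. After a 0 or 1 result the caller would issue
 * UFFDIO_CONTINUE for the page. */
static int resolve_minor_fault(struct migration_state *s, size_t idx)
{
	if (idx >= s->npages)
		return -1;
	if (!page_is_dirty(s, idx))
		return 0;

	/* Update the contents through the non-UFFD mapping. */
	memcpy(s->plain_view + idx * PAGE_SZ, s->source + idx * PAGE_SZ,
	       PAGE_SZ);
	s->dirty[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
	return 1;
}
```

The same function could be driven by a background thread walking the dirty
map, which corresponds to the non-on-demand updates mentioned above.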

Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
resolve faults once they are received, I spent some time thinking through how
the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, thus there isn't much reason to want to use these in any case (just
memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).
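
The mismatch cases above can be restated as a small lookup, which may be
handy as a checklist when writing selftests. This is just the list above in
code form, not kernel code; the enums and the function name are
hypothetical, and UFFDIO_WRITEPROTECT is left out because its result depends
on shared vs. private memory rather than on the fault kind.

```c
#include <errno.h>

enum uffd_op { OP_CONTINUE, OP_COPY, OP_ZEROPAGE };
enum fault_kind { FAULT_MISSING, FAULT_MINOR };
enum mem_kind { MEM_ANON_OR_SHMEM, MEM_HUGETLB };

/* Returns the negative errno listed above for a mismatched ioctl, or 0 when
 * the ioctl is the intended way to resolve that kind of fault. A 0 result
 * does not promise that the real ioctl succeeds. */
static int expected_mismatch_errno(enum uffd_op op, enum fault_kind fault,
				   enum mem_kind mem)
{
	switch (op) {
	case OP_CONTINUE:
		if (fault == FAULT_MINOR)
			return 0;
		return mem == MEM_HUGETLB ? -EFAULT : -EINVAL;
	case OP_COPY:
		return fault == FAULT_MINOR ? -EEXIST : 0;
	case OP_ZEROPAGE:
		/* Unsupported for hugetlb regardless of the fault kind. */
		if (mem == MEM_HUGETLB)
			return -EINVAL;
		return fault == FAULT_MINOR ? -EEXIST : 0;
	}
	return -EINVAL;
}
```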

Remaining Work
==============

This patchset doesn't include updates to userfaultfd's documentation or
selftests. These will be added before I send a non-RFC version of this series
(I want to find out if there are strong objections to the API surface before
spending the time to document it).

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (2):
  userfaultfd: add minor fault registration mode
  userfaultfd: add UFFDIO_CONTINUE ioctl

 fs/proc/task_mmu.c               |   1 +
 fs/userfaultfd.c                 | 143 ++++++++++++++++++++++++-------
 include/linux/mm.h               |   1 +
 include/linux/userfaultfd_k.h    |  14 ++-
 include/trace/events/mmflags.h   |   1 +
 include/uapi/linux/userfaultfd.h |  36 +++++++-
 mm/hugetlb.c                     |  42 +++++++--
 mm/userfaultfd.c                 |  86 ++++++++++++++-----
 8 files changed, 261 insertions(+), 63 deletions(-)

--
2.29.2.729.g45daf8777d-goog

Comments

Dr. David Alan Gilbert Jan. 11, 2021, 11:43 a.m. UTC | #1
* Axel Rasmussen (axelrasmussen@google.com) wrote:
> Overview
> ========
> 
> This series adds a new userfaultfd registration mode,
> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> By "minor" fault, I mean the following situation:
> 
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> One of the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have already been
> allocated & filled with some contents. The UFFD mapping has not yet been
> faulted in; when it is touched for the first time, this results in what I'm
> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> have huge_pte_none(), but find_lock_page() finds an existing page.
>
> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> userspace resolves the fault by either a) doing nothing if the contents are
> already correct, or b) updating the underlying contents using the second,
> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> "I have ensured the page contents are correct, carry on setting up the mapping".
> 
> Use Case
> ========
> 
> Consider the use case of VM live migration (e.g. under QEMU/KVM):
> 
> 1. While a VM is still running, we copy the contents of its memory to a
>    target machine. The pages are populated on the target by writing to the
>    non-UFFD mapping, using the setup described above. The VM is still running
>    (and therefore its memory is likely changing), so this may be repeated
>    several times, until we decide the target is "up to date enough".
> 
> 2. We pause the VM on the source, and start executing on the target machine.
>    During this gap, the VM's user(s) will *see* a pause, so it is desirable to
>    minimize this window.
> 
> 3. Between the last time any page was copied from the source to the target, and
>    when the VM was paused, the contents of that page may have changed - and
>    therefore the copy we have on the target machine is out of date. Although we
>    can keep track of which pages are out of date, for VMs with large amounts of
>    memory, it is "slow" to transfer this information to the target machine. We
>    want to resume execution before such a transfer would complete.
> 
> 4. So, the guest begins executing on the target machine. The first time it
>    touches its memory (via the UFFD-registered mapping), userspace wants to
>    intercept this fault. Userspace checks whether or not the page is up to date,
>    and if not, copies the updated page from the source machine, via the non-UFFD
>    mapping. Finally, whether a copy was performed or not, userspace issues a
>    UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
>    are correct, carry on setting up the mapping".
> 
> We don't have to do all of the final updates on-demand. The userfaultfd manager
> can, in the background, also copy over updated pages once it receives the map of
> which pages are up-to-date or not.

Yes, this would make the handover during postcopy of large VMs a heck of
a lot faster; and probably simpler; the cleanup code that tidies up the
re-dirty pages is pretty messy.

Dave

Mike Kravetz Jan. 11, 2021, 10:42 p.m. UTC | #2
On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> Overview
> ========
> 
> This series adds a new userfaultfd registration mode,
> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> By "minor" fault, I mean the following situation:
> 
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> One of the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have already been
> allocated & filled with some contents. The UFFD mapping has not yet been
> faulted in; when it is touched for the first time, this results in what I'm
> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> have huge_pte_none(), but find_lock_page() finds an existing page.
> 
> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> userspace resolves the fault by either a) doing nothing if the contents are
> already correct, or b) updating the underlying contents using the second,
> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> "I have ensured the page contents are correct, carry on setting up the mapping".
> 

One quick thought.

This is not going to work as expected with hugetlbfs pmd sharing.  If you
are not familiar with hugetlbfs pmd sharing, you are not alone. :)

pmd sharing is enabled for x86 and arm64 architectures.  If there are multiple
shared mappings of the same underlying hugetlbfs file or shared memory segment
that are 'suitably aligned', then the PMD pages associated with those regions
are shared by all the mappings.  Suitably aligned means 'on a 1GB boundary'
and 1GB in size.

When pmds are shared, your mappings will never see a 'minor fault'.  This
is because the PMD (page table entries) is shared.
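
To make the "suitably aligned" condition concrete, here is a minimal sketch
of just the alignment part: does an address range contain at least one
1GB-aligned, fully covered 1GB region? The function name and PUD_SZ constant
are hypothetical, 1GB assumes x86-64 with 2MB huge pages, and the kernel's
real eligibility check also looks at other properties of the mapping (such
as it being shared), which this ignores.

```c
#include <stdbool.h>
#include <stdint.h>

#define PUD_SZ (UINT64_C(1) << 30)	/* 1GB; assumption for x86-64 */

/* True if [start, end) fully covers at least one 1GB-aligned 1GB region,
 * i.e. the part of the range where pmd sharing could apply. */
static bool range_has_shareable_gb(uint64_t start, uint64_t end)
{
	/* Round start up to the next 1GB boundary. */
	uint64_t first = (start + PUD_SZ - 1) & ~(PUD_SZ - 1);

	if (first < start)	/* rounding wrapped around */
		return false;
	return first < end && end - first >= PUD_SZ;
}
```

A mapping that fails this test can never share pmds, so minor faults would
still be seen there; one that passes is where the concern above applies.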
Peter Xu Jan. 11, 2021, 11:08 p.m. UTC | #3
On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > Overview
> > ========
> > 
> > This series adds a new userfaultfd registration mode,
> > UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> > By "minor" fault, I mean the following situation:
> > 
> > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> > One of the mappings is registered with userfaultfd (in minor mode), and the
> > other is not. Via the non-UFFD mapping, the underlying pages have already been
> > allocated & filled with some contents. The UFFD mapping has not yet been
> > faulted in; when it is touched for the first time, this results in what I'm
> > calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> > have huge_pte_none(), but find_lock_page() finds an existing page.
> > 
> > We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> > userspace resolves the fault by either a) doing nothing if the contents are
> > already correct, or b) updating the underlying contents using the second,
> > non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> > or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> > "I have ensured the page contents are correct, carry on setting up the mapping".
> > 
> 
> One quick thought.
> 
> This is not going to work as expected with hugetlbfs pmd sharing.  If you
> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
> 
> pmd sharing is enabled for x86 and arm64 architectures.  If there are multiple
> shared mappings of the same underlying hugetlbfs file or shared memory segment
> that are 'suitably aligned', then the PMD pages associated with those regions
> are shared by all the mappings.  Suitably aligned means 'on a 1GB boundary'
> and 1GB in size.
> 
> When pmds are shared, your mappings will never see a 'minor fault'.  This
> is because the PMD (page table entries) is shared.

Thanks for raising this, Mike.

I've got a few patches that plan to disable huge pmd sharing for uffd in
general, e.g.:

https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31

I believe we don't want pmd sharing for missing mode either, but it's just not
extremely important for missing mode yet, because in missing mode we normally
monitor all the processes that will be using the registered mm range.  For
example, in QEMU postcopy migration with vhost-user hugetlbfs files as
backends, we'll monitor both the QEMU process and the DPDK program, so that
either of the programs will trigger a missing fault even if the pmd is shared
between them.  However, again, I think it's not ideal since uffd (even in
missing mode) is pgtable-based, so sharing could always be too tricky.

They haven't been posted publicly yet, since they're part of uffd-wp support
for hugetlbfs (along with shmem).  So I'm just raising this to avoid potential
duplicated work before I post the patchset.

(Will read into details soon; probably too many things piled up...)

Thanks,
Mike Kravetz Jan. 12, 2021, 12:13 a.m. UTC | #4
On 1/11/21 3:08 PM, Peter Xu wrote:
> On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
>> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
>>> Overview
>>> ========
>>>
>>> This series adds a new userfaultfd registration mode,
>>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
>>> By "minor" fault, I mean the following situation:
>>>
>>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
>>> One of the mappings is registered with userfaultfd (in minor mode), and the
>>> other is not. Via the non-UFFD mapping, the underlying pages have already been
>>> allocated & filled with some contents. The UFFD mapping has not yet been
>>> faulted in; when it is touched for the first time, this results in what I'm
>>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
>>> have huge_pte_none(), but find_lock_page() finds an existing page.
>>>
>>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
>>> userspace resolves the fault by either a) doing nothing if the contents are
>>> already correct, or b) updating the underlying contents using the second,
>>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
>>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
>>> "I have ensured the page contents are correct, carry on setting up the mapping".
>>>
>>
>> One quick thought.
>>
>> This is not going to work as expected with hugetlbfs pmd sharing.  If you
>> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
>>
>> pmd sharing is enabled for x86 and arm64 architectures.  If there are multiple
>> shared mappings of the same underlying hugetlbfs file or shared memory segment
>> that are 'suitably aligned', then the PMD pages associated with those regions
>> are shared by all the mappings.  Suitably aligned means 'on a 1GB boundary'
>> and 1GB in size.
>>
>> When pmds are shared, your mappings will never see a 'minor fault'.  This
>> is because the PMD (page table entries) is shared.
> 
> Thanks for raising this, Mike.
> 
> I've got a few patches that plan to disable huge pmd sharing for uffd in
> general, e.g.:
> 
> https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> 
> I believe we don't want that for missing mode too, but it's just not extremely
> important for missing mode yet, because in missing mode we normally monitor all
> the processes that will be using the registered mm range.  For example, in QEMU
> postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> both the QEMU process and the DPDK program, so that either of the programs will
> trigger a missing fault even if pmd shared between them.  However again I think
> it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> could always be too tricky.
> 
> They're not yet posted to public yet since that's part of uffd-wp support for
> hugetlbfs (along with shmem).  So just raise this up to avoid potential
> duplicated work before I post the patchset.
> 
> (Will read into details soon; probably too many things piled up...)

Thanks for the heads up about this, Peter.

I know Oracle DB really wants shared pmds -and- UFFD.  I need to get details
of their exact usage model.  I know they primarily use SIGBUS, but use
MISSING_HUGETLBFS as well.  We may need to be more selective in when to
disable.
Peter Xu Jan. 12, 2021, 1:49 a.m. UTC | #5
On Mon, Jan 11, 2021 at 04:13:41PM -0800, Mike Kravetz wrote:
> On 1/11/21 3:08 PM, Peter Xu wrote:
> > On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> >> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> >>> Overview
> >>> ========
> >>>
> >>> This series adds a new userfaultfd registration mode,
> >>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> >>> By "minor" fault, I mean the following situation:
> >>>
> >>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> >>> One of the mappings is registered with userfaultfd (in minor mode), and the
> >>> other is not. Via the non-UFFD mapping, the underlying pages have already been
> >>> allocated & filled with some contents. The UFFD mapping has not yet been
> >>> faulted in; when it is touched for the first time, this results in what I'm
> >>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> >>> have huge_pte_none(), but find_lock_page() finds an existing page.
> >>>
> >>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> >>> userspace resolves the fault by either a) doing nothing if the contents are
> >>> already correct, or b) updating the underlying contents using the second,
> >>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> >>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> >>> "I have ensured the page contents are correct, carry on setting up the mapping".
> >>>
> >>
> >> One quick thought.
> >>
> >> This is not going to work as expected with hugetlbfs pmd sharing.  If you
> >> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
> >>
> >> pmd sharing is enabled for x86 and arm64 architectures.  If there are multiple
> >> shared mappings of the same underlying hugetlbfs file or shared memory segment
> >> that are 'suitably aligned', then the PMD pages associated with those regions
> >> are shared by all the mappings.  Suitably aligned means 'on a 1GB boundary'
> >> and 1GB in size.
> >>
> >> When pmds are shared, your mappings will never see a 'minor fault'.  This
> >> is because the PMD (page table entries) is shared.
> > 
> > Thanks for raising this, Mike.
> > 
> > I've got a few patches that plan to disable huge pmd sharing for uffd in
> > general, e.g.:
> > 
> > https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> > https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> > 
> > I believe we don't want that for missing mode too, but it's just not extremely
> > important for missing mode yet, because in missing mode we normally monitor all
> > the processes that will be using the registered mm range.  For example, in QEMU
> > postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> > both the QEMU process and the DPDK program, so that either of the programs will
> > trigger a missing fault even if pmd shared between them.  However again I think
> > it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> > could always be too tricky.
> > 
> > They're not yet posted to public yet since that's part of uffd-wp support for
> > hugetlbfs (along with shmem).  So just raise this up to avoid potential
> > duplicated work before I post the patchset.
> > 
> > (Will read into details soon; probably too many things piled up...)
> 
> Thanks for the heads up about this Peter.
> 
> I know Oracle DB really wants shared pmds -and- UFFD.  I need to get details
> of their exact usage model.  I know they primarily use SIGBUS, but use
> MISSING_HUGETLBFS as well.  We may need to be more selective in when to
> disable.

On second thought, it is indeed possible to use it that way with pmd
sharing.  We don't actually need to generate a fault for every page if all
we want to do is "initialize the pages using some data" on the registered
ranges.  That should also be the case for qemu+dpdk: if e.g. qemu has faulted
in a page, it's nicer if dpdk can avoid faulting it in again (so with huge pmd
sharing enabled we can even avoid the page fault that would install the pte,
as long as the page cache page already exists).  It should be similarly
beneficial if the other process is not faulting in but proactively filling the
holes using UFFDIO_COPY, either for the current process or for itself; that
sounds like a valid scenario for Google too when a VM migrates.

I've modified my local tree to only disable pmd sharing for uffd-wp but keep
missing mode as-is [1].  A new helper uffd_disable_huge_pmd_share() is
introduced in patch "hugetlb/userfaultfd: Forbid huge pmd sharing when uffd
enabled", so should be easier if we would like to add minor mode too.

Thanks!

[1] https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs
Axel Rasmussen Jan. 12, 2021, 5:37 p.m. UTC | #6
On Mon, Jan 11, 2021 at 5:49 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jan 11, 2021 at 04:13:41PM -0800, Mike Kravetz wrote:
> > On 1/11/21 3:08 PM, Peter Xu wrote:
> > > On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> > >> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > >>> Overview
> > >>> ========
> > >>>
> > >>> This series adds a new userfaultfd registration mode,
> > >>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> > >>> By "minor" fault, I mean the following situation:
> > >>>
> > >>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> > >>> One of the mappings is registered with userfaultfd (in minor mode), and the
> > >>> other is not. Via the non-UFFD mapping, the underlying pages have already been
> > >>> allocated & filled with some contents. The UFFD mapping has not yet been
> > >>> faulted in; when it is touched for the first time, this results in what I'm
> > >>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> > >>> have huge_pte_none(), but find_lock_page() finds an existing page.
> > >>>
> > >>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> > >>> userspace resolves the fault by either a) doing nothing if the contents are
> > >>> already correct, or b) updating the underlying contents using the second,
> > >>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> > >>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> > >>> "I have ensured the page contents are correct, carry on setting up the mapping".
> > >>>
> > >>
> > >> One quick thought.
> > >>
> > >> This is not going to work as expected with hugetlbfs pmd sharing.  If you
> > >> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
> > >>
> > >> pmd sharing is enabled for x86 and arm64 architectures.  If there are multiple
> > >> shared mappings of the same underlying hugetlbfs file or shared memory segment
> > >> that are 'suitably aligned', then the PMD pages associated with those regions
> > >> are shared by all the mappings.  Suitably aligned means 'on a 1GB boundary'
> > >> and 1GB in size.
> > >>
> > >> When pmds are shared, your mappings will never see a 'minor fault'.  This
> > >> is because the PMD (page table entries) is shared.
> > >
> > > Thanks for raising this, Mike.
> > >
> > > I've got a few patches that plan to disable huge pmd sharing for uffd in
> > > general, e.g.:
> > >
> > > https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> > > https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> > >
> > > I believe we don't want that for missing mode too, but it's just not extremely
> > > important for missing mode yet, because in missing mode we normally monitor all
> > > the processes that will be using the registered mm range.  For example, in QEMU
> > > postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> > > both the QEMU process and the DPDK program, so that either of the programs will
> > > trigger a missing fault even if pmd shared between them.  However again I think
> > > it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> > > could always be too tricky.
> > >
> > > They're not yet posted to public yet since that's part of uffd-wp support for
> > > hugetlbfs (along with shmem).  So just raise this up to avoid potential
> > > duplicated work before I post the patchset.
> > >
> > > (Will read into details soon; probably too many things piled up...)
> >
> > Thanks for the heads up about this Peter.
> >
> > I know Oracle DB really wants shared pmds -and- UFFD.  I need to get details
> > of their exact usage model.  I know they primarily use SIGBUS, but use
> > MISSING_HUGETLBFS as well.  We may need to be more selective in when to
> > disable.
>
> After a second thought, indeed it's possible to use it that way with pmd
> sharing.  Actually we don't need to generate the fault for every page, if what
> we want to do is simply "initializing the pages using some data" on the
> registered ranges.  Should also be the case even for qemu+dpdk, because if
> e.g. qemu faulted in a page, then it'll be nicer if dpdk can avoid faulting in
> again (so when huge pmd sharing enabled we can even avoid the PF irq to install
> the pte if at last page cache existed).  It should be similarly beneficial if
> the other process is not faulting in but proactively filling the holes using
> UFFDIO_COPY either for the current process or for itself; sounds like a valid
> scenario for Google too when VM migrates.

Exactly right, but I'm a little unsure how to get it to work. There
are two different cases:

- Allocate + populate a page in the background (not on demand) during
  postcopy (i.e., after the VM has started executing on the migration
  target). In this case, we can be certain that the page contents are up
  to date, since execution on the source was already paused. In this
  case PMD sharing would actually be nice, because it would mean the VM
  would never fault on this page going forward.

- Allocate + populate a page during precopy (i.e., while the VM is
  still executing on the migration source). In this case, we *don't*
  want PMD sharing, because we need to intercept the first time this
  page is touched, verify it's up to date, and copy over the updated
  data if not.

Another related situation to consider is, at some point on the target
machine, we'll receive the "dirty map" indicating which pages are out
of date or not. My original thinking was, when the VM faults on any of
these pages, from this point forward we'd just look at the map and
then UFFDIO_CONTINUE if things were up to date. But you're right that
a possible optimization is, once we receive the map, just immediately
"enable PMD sharing" on these pages, so the VM will never fault on
them.

But, this is all kind of speculative. I don't know of any existing API
for *userspace* to take an existing shared memory mapping without PMD
sharing, and "turn on" PMD sharing for particular page(s).

For now, I'll plan on disabling PMD sharing for MINOR registered
ranges. Thanks, Peter and Mike!

