[RFC,0/5] KVM: guest_memfd: support for uffd missing

Message ID: 20250303133011.44095-1-kalyazin@amazon.com

Nikita Kalyazin March 3, 2025, 1:30 p.m. UTC

This series is built on top of the v3 write syscall support [1].

With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace.  However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables.  In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.

This series proposes limited support for userfaultfd in guest_memfd:
 - userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
   (as is fault support in general)
 - Only the `page missing` event is currently supported
 - Userspace is supposed to respond to the event with the `write`
   syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the faulting
   process.  Note that we can't use `UFFDIO_COPY` here because
   userfaultfd code does not know how to prepare guest_memfd pages, e.g.
   remove them from the direct map [4].
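
To illustrate the workflow in the list above, here is a minimal userspace
sketch; it is not code from the series.  It assumes the guest_memfd accepts
`pwrite()` (that comes from the write syscall series in [1], not from
upstream) and that the memory is mmap()ed at `map_base` with file offset
`map_pgoff`.  The helper names `page_floor`, `fault_to_offset` and
`resolve_missing` are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Round an address down to the start of its page (page_size: power of two). */
uint64_t page_floor(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~(page_size - 1);
}

/*
 * Translate a faulting address into an offset in the guest_memfd.
 * map_base is the address returned by mmap(), map_pgoff the file offset
 * that was passed to mmap().  Both are assumptions of this sketch.
 */
off_t fault_to_offset(uint64_t fault_addr, uint64_t map_base,
		      uint64_t map_pgoff, uint64_t page_size)
{
	return (off_t)(page_floor(fault_addr, page_size) - map_base + map_pgoff);
}

/*
 * Resolve one `page missing` event: fill the page through the file
 * descriptor, then ask userfaultfd to install the PTE and wake the
 * faulting thread.  Returns 0 on success or a negative errno value.
 */
int resolve_missing(int uffd, int gmem_fd, uint64_t fault_addr,
		    uint64_t map_base, uint64_t map_pgoff,
		    const void *src, uint64_t page_size)
{
	off_t off = fault_to_offset(fault_addr, map_base, map_pgoff, page_size);
	ssize_t n = pwrite(gmem_fd, src, page_size, off);

	if (n < 0)
		return -errno;
	if ((uint64_t)n != page_size)
		return -EIO;	/* short write: page only partially filled */

	struct uffdio_continue cont = {
		.range = {
			.start = page_floor(fault_addr, page_size),
			.len = page_size,
		},
		.mode = 0,
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0)
		return -errno;
	return 0;
}
```

A handler thread would call `resolve_missing()` once per event read from
the userfaultfd, passing the faulting address taken from the event.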

Not included in this series:
 - Proper interface for userfaultfd to recognise guest_memfd mappings
 - Proper handling of truncation cases after locking the page

Request for comments:
 - Is it a sensible workflow for guest_memfd to resolve a userfault
   `page missing` event with `write` syscall + `UFFDIO_CONTINUE`?  One
   of the alternatives is teaching `UFFDIO_COPY` how to deal with
   guest_memfd pages.
 - What is a way forward to make userfaultfd code aware of guest_memfd?
   I saw that Patrick hit a somewhat similar problem in [5] when trying
   to use direct map manipulation functions in KVM and was pointed by
   David to Elliot's guestmem library [6], which might include a shim for
   that.  Would the library be the right place to expose required
   interfaces like `vma_is_gmem`?

Nikita

[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[5] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#ma130b29c130dbdc894aa08d8d56c16ec383f36dd
[6] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@quicinc.com/T/

Nikita Kalyazin (5):
  KVM: guest_memfd: add kvm_gmem_vma_is_gmem
  KVM: guest_memfd: add support for uffd missing
  mm: userfaultfd: allow to register userfaultfd for guest_memfd
  mm: userfaultfd: support continue for guest_memfd
  KVM: selftests: add uffd missing test for guest_memfd

 include/linux/userfaultfd_k.h                 |  9 ++
 mm/userfaultfd.c                              | 23 ++++-
 .../testing/selftests/kvm/guest_memfd_test.c  | 88 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 17 +++-
 virt/kvm/kvm_mm.h                             |  1 +
 5 files changed, 136 insertions(+), 2 deletions(-)


base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223

Peter Xu March 3, 2025, 9:29 p.m. UTC | #1

On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> This series is built on top of the v3 write syscall support [1].
> 
> With James's KVM userfault [2], it is possible to handle stage-2 faults
> in guest_memfd in userspace.  However, KVM itself also triggers faults
> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> PV EOI and page table walking code when fetching the MMIO instruction on
> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> that KVM would be accessing those pages via userspace page tables.  In
> order for such faults to be handled in userspace, guest_memfd needs to
> support userfaultfd.
> 
> This series proposes a limited support for userfaultfd in guest_memfd:
>  - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
>    (as is fault support in general)
>  - Only `page missing` event is currently supported
>  - Userspace is supposed to respond to the event with the `write`
>    syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
>    process.   Note that we can't use `UFFDIO_COPY` here because
>    userfaultfd code does not know how to prepare guest_memfd pages, e.g.
>    remove them from direct map [4].
> 
> Not included in this series:
>  - Proper interface for userfaultfd to recognise guest_memfd mappings
>  - Proper handling of truncation cases after locking the page
> 
> Request for comments:
>  - Is it a sensible workflow for guest_memfd to resolve a userfault
>    `page missing` event with `write` syscall + `UFFDIO_CONTINUE`?  One
>    of the alternatives is teaching `UFFDIO_COPY` how to deal with
>    guest_memfd pages.

Probably not..  I don't see what prevents a thread from faulting
concurrently while write() is in progress and seeing partial data.  Since
you only check the page cache, the fault will pass, and the partially
written page will be faulted in.

I think we may need to either go with full MISSING or full MINOR traps.

One thing to mention is we probably need MINOR sooner or later to support
gmem huge pages.  For huge folios in gmem we can't rely on the page being
missing in the page cache, as we always need to allocate in hugetlb sizes.

>  - What is a way forward to make userfaultfd code aware of guest_memfd?
>    I saw that Patrick hit a somewhat similar problem in [5] when trying
>    to use direct map manipulation functions in KVM and was pointed by
>    David at Elliot's guestmem library [6] that might include a shim for that.
>    Would the library be the right place to expose required interfaces like
>    `vma_is_gmem`?

Not sure what's best to do, but IIUC the approach this series uses may not
work, as it references a kvm symbol from core mm..

One trick I've used so far is leveraging vm_ops and providing a hook
function to report specialties when it's gmem.  In general, I did not yet
dare to overload vm_area_struct, but I'm thinking vm_ops may be more likely
to be accepted.  E.g. something like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e742738240c..b068bb79fdbc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -653,8 +653,26 @@ struct vm_operations_struct {
         */
        struct page *(*find_special_page)(struct vm_area_struct *vma,
                                          unsigned long addr);
+       /*
+        * When set, return the allowed orders bitmask in faults of mmap()
+        * ranges (e.g. for follow up huge_fault() processing).  Drivers
+        * can use this to bypass THP setups for specific types of VMAs.
+        */
+       unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
 };
 
+static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
+{
+       return vma->vm_ops && vma->vm_ops->get_supported_orders;
+}
+
+static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
+{
+       if (!vma_has_supported_orders(vma))
+               return 0;
+       return vma->vm_ops->get_supported_orders(vma);
+}
+

In my case I used that to allow gmem to report huge page support on faults.

That said, the above only exists in my own tree so far, so I also don't know
whether something like that could be accepted (even if it works for you).
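
For illustration, here is a self-contained userspace mock of how a VMA
owner could fill in the optional hook from the diff above.  The two structs
are toy stand-ins with only the fields used here, and
`gmem_get_supported_orders` with its order values (0 and 9) is hypothetical,
not taken from any tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct vm_area_struct;

struct vm_operations_struct {
	/* Optional hook: bitmask of folio orders allowed in faults. */
	unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
};

struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
};

bool vma_has_supported_orders(struct vm_area_struct *vma)
{
	assert(vma != NULL);
	return vma->vm_ops && vma->vm_ops->get_supported_orders;
}

/* Returns 0 when the VMA owner does not implement the hook. */
unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
{
	if (!vma_has_supported_orders(vma))
		return 0;
	return vma->vm_ops->get_supported_orders(vma);
}

/* Hypothetical gmem implementation: allow order 0 and order 9 folios. */
unsigned long gmem_get_supported_orders(struct vm_area_struct *vma)
{
	(void)vma;
	return (1UL << 0) | (1UL << 9);
}

const struct vm_operations_struct gmem_vm_ops = {
	.get_supported_orders = gmem_get_supported_orders,
};
```

The point of the pattern is that core code only ever calls through
`vm_ops`, so it needs no symbol from the module that owns the VMA.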

Thanks,

James Houghton March 5, 2025, 7:35 p.m. UTC | #2

On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace.  However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables.  In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes a limited support for userfaultfd in guest_memfd:
> >  - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
> >    (as is fault support in general)
> >  - Only `page missing` event is currently supported
> >  - Userspace is supposed to respond to the event with the `write`
> >    syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
> >    process.   Note that we can't use `UFFDIO_COPY` here because
> >    userfaulfd code does not know how to prepare guest_memfd pages, e.g.
> >    remove them from direct map [4].
> >
> > Not included in this series:
> >  - Proper interface for userfaultfd to recognise guest_memfd mappings
> >  - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> >  - Is it a sensible workflow for guest_memfd to resolve a userfault
> >    `page missing` event with `write` syscall + `UFFDIO_CONTINUE`?  One
> >    of the alternatives is teaching `UFFDIO_COPY` how to deal with
> >    guest_memfd pages.
>
> Probably not..  I don't see what protects a thread fault concurrently
> during write() happening, seeing partial data.  Since you check the page
> cache it'll let it pass, but the partial page will be faulted in there.

+1 here.

I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check[1]. It would
pair with kvm_gmem_mark_prepared() in the write() path[2].

I'm not sure if that's the "right" way, but I think it would prevent
threads from reading data while it is being written.
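
To make the race concrete, here is a toy model in plain C (not kernel
code).  It assumes a page-cache slot goes through three states while
write() fills it, and compares a presence-only check with one that also
requires the folio to be up to date.  All names are invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page-cache slot. */
enum slot_state {
	SLOT_ABSENT,	/* no folio in the page cache */
	SLOT_ALLOCATED,	/* folio present, write() still copying data */
	SLOT_UPTODATE,	/* write() finished and marked the folio */
};

/* Presence-only check: trap to userspace only when the folio is absent. */
bool should_trap_presence_only(enum slot_state s)
{
	return s == SLOT_ABSENT;
}

/* Presence + uptodate check: also trap while the folio is partial. */
bool should_trap_with_uptodate(enum slot_state s)
{
	return s != SLOT_UPTODATE;
}

/*
 * A fault that is not trapped maps whatever the folio holds, so it
 * exposes partial data exactly when the folio is allocated but not yet
 * up to date.
 */
bool fault_exposes_partial_data(enum slot_state s,
				bool (*should_trap)(enum slot_state))
{
	assert(should_trap != NULL);
	return !should_trap(s) && s == SLOT_ALLOCATED;
}
```

In this model only the presence-only check lets a concurrent fault through
while the slot is in the `SLOT_ALLOCATED` state.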

[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/

> I think we may need to either go with full MISSING or full MINOR traps.

I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of
UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really
doing much; when the page cache has a page, the fault handler will
populate the PTE for you.

I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.

> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages.  The thing is for huge folios in gmem we can't rely on
> missing in page cache, as we always need to allocate in hugetlb sizes.
>
> >  - What is a way forward to make userfaultfd code aware of guest_memfd?
> >    I saw that Patrick hit a somewhat similar problem in [5] when trying
> >    to use direct map manipulation functions in KVM and was pointed by
> >    David at Elliot's guestmem library [6] that might include a shim for that.
> >    Would the library be the right place to expose required interfaces like
> >    `vma_is_gmem`?
>
> Not sure what's the best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a kvm symbol from core mm..
>
> One trick I used so far is leveraging vm_ops and provide hook function to
> report specialties when it's gmem.  In general, I did not yet dare to
> overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to
> be accepted.  E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
>          */
>         struct page *(*find_special_page)(struct vm_area_struct *vma,
>                                           unsigned long addr);
> +       /*
> +        * When set, return the allowed orders bitmask in faults of mmap()
> +        * ranges (e.g. for follow up huge_fault() processing).  Drivers
> +        * can use this to bypass THP setups for specific types of VMAs.
> +        */
> +       unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
>  };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> +       return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> +       if (!vma_has_supported_orders(vma))
> +               return 0;
> +       return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to allow gmem report huge page supports on faults.
>
> Said that, above only existed in my own tree so far, so I also don't know
> whether something like that could be accepted (even if it'll work for you).

I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine if the "MINOR" fault is
actually a MISSING fault, but at least for my userspace, the
distinction isn't important. :)) So the question becomes: what should
UFFDIO_CONTINUE look like?

And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.

I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...

Peter Xu March 5, 2025, 8:29 p.m. UTC | #3

On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> I think it might be useful to implement an fs-generic MINOR mode. The
> fault handler is already easy enough to do generically (though it
> would become more difficult to determine if the "MINOR" fault is
> actually a MISSING fault, but at least for my userspace, the
> distinction isn't important. :)) So the question becomes: what should
> UFFDIO_CONTINUE look like?
> 
> And I think it would be nice if UFFDIO_CONTINUE just called
> vm_ops->fault() to get the page we want to map and then mapped it,
> instead of having shmem-specific and hugetlb-specific versions (though
> maybe we need to keep the hugetlb specialization...). That would avoid
> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> 
> I've actually wanted to do this for a while but haven't had a good
> reason to pursue it. I wonder if it can be done in a
> backwards-compatible fashion...

Yes I also thought about that. :)

When Axel added minor fault, it wasn't a major concern, as shmem was the
only fs that would consume the feature in the do_fault() path anyway -
hugetlbfs has its own path to take care of it.. even now.

And there are some valid points too if someone would argue to keep it there,
especially on the folio lock - doing that in shmem.c avoids taking the folio
lock when generating the minor fault message.  It might make some difference
when the faults are heavy and when the folio lock is frequently taken
elsewhere too.

It might boil down to how many more FSes would support minor fault, and
whether we would care about such a difference to shmem users in the end. If
gmem is the only one after the existing ones, IIUC there's still the option
to implement it in gmem code.  After all, I expect the change to be well
contained (<20 LOCs?)..

Nikita Kalyazin March 10, 2025, 6:12 p.m. UTC | #4

On 05/03/2025 20:29, Peter Xu wrote:
> On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
>> I think it might be useful to implement an fs-generic MINOR mode. The
>> fault handler is already easy enough to do generically (though it
>> would become more difficult to determine if the "MINOR" fault is
>> actually a MISSING fault, but at least for my userspace, the
>> distinction isn't important. :)) So the question becomes: what should
>> UFFDIO_CONTINUE look like?
>>
>> And I think it would be nice if UFFDIO_CONTINUE just called
>> vm_ops->fault() to get the page we want to map and then mapped it,
>> instead of having shmem-specific and hugetlb-specific versions (though
>> maybe we need to keep the hugetlb specialization...). That would avoid
>> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
>>
>> I've actually wanted to do this for a while but haven't had a good
>> reason to pursue it. I wonder if it can be done in a
>> backwards-compatible fashion...
> 
> Yes I also thought about that. :)

Hi Peter, hi James.  Thanks for pointing at the race condition!

I did some experimentation and it indeed looks possible to call 
vm_ops->fault() from userfault_continue() to make it generic and 
decouple from KVM, at least for non-hugetlb cases.  One thing is we'd 
need to prevent a recursive handle_userfault() invocation, which I 
believe can be solved by adding a new VMF flag to ignore the userfault 
path when the fault handler is called from userfault_continue().  I'm 
open to a more elegant solution though.
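
A toy model of that idea in plain C, to show the control flow only.  The
flag name `FAULT_FLAG_NO_USERFAULT` and the `toy_*` names are hypothetical;
the real fault handler and continue path are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAULT_FLAG_NO_USERFAULT 0x1u	/* hypothetical flag name */

enum fault_result { FAULT_MAPPED, FAULT_TRAPPED, FAULT_NOPAGE };

struct toy_vmf {
	unsigned int flags;
	bool page_in_cache;
	bool uffd_registered;
	int traps;	/* counts userfault notifications sent */
};

/* Models a ->fault() handler that traps to userfaultfd unless told not to. */
enum fault_result toy_fault(struct toy_vmf *vmf)
{
	assert(vmf != NULL);
	if (vmf->uffd_registered && !(vmf->flags & FAULT_FLAG_NO_USERFAULT)) {
		vmf->traps++;
		return FAULT_TRAPPED;
	}
	return vmf->page_in_cache ? FAULT_MAPPED : FAULT_NOPAGE;
}

/*
 * Models the continue path reusing ->fault() to look up the page.  The
 * flag is set only for the duration of the call, so the handler cannot
 * recurse into the userfault path.
 */
enum fault_result toy_continue(struct toy_vmf *vmf)
{
	unsigned int saved = vmf->flags;
	enum fault_result r;

	vmf->flags |= FAULT_FLAG_NO_USERFAULT;
	r = toy_fault(vmf);
	vmf->flags = saved;
	return r;
}
```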

Regarding usage of the MINOR notification, in what case do you recommend 
sending it?  If following the logic implemented in shmem and hugetlb, ie 
if the page is _present_ in the pagecache, I can't see how it is going 
to work with the write syscall, as we'd like to know when the page is 
_missing_ in order to respond with the population via the write.  If 
going against shmem/hugetlb logic, and sending the MINOR event when the 
page is missing from the pagecache, how would it solve the race 
condition problem?

Also, where would the check for the folio_test_uptodate() mentioned by 
James fit into here?  Would it only be used for fortifying the MINOR 
(present) against the race?

> When Axel added minor fault, it's not a major concern as it's the only fs
> that will consume the feature anyway in the do_fault() path - hugetlbfs has
> its own path to take care of.. even until now.
> 
> And there's some valid points too if someone would argue to put it there
> especially on folio lock - do that in shmem.c can avoid taking folio lock
> when generating minor fault message.  It might make some difference when
> the faults are heavy and when folio lock is frequently taken elsewhere too.

Peter, could you expand on this?  Are you referring to the following 
(shmem_get_folio_gfp)?

	if (folio) {
		folio_lock(folio);

		/* Has the folio been truncated or swapped out? */
		if (unlikely(folio->mapping != inode->i_mapping)) {
			folio_unlock(folio);
			folio_put(folio);
			goto repeat;
		}
		if (sgp == SGP_WRITE)
			folio_mark_accessed(folio);
		if (folio_test_uptodate(folio))
			goto out;
		/* fallocated folio */
		if (sgp != SGP_READ)
			goto clear;
		folio_unlock(folio);
		folio_put(folio);
	}

Could you explain in what case the lock can be avoided?  AFAICS, the 
function is called by both the shmem fault handler and userfault_continue().

> It might boil down to how many more FSes would support minor fault, and
> whether we would care about such difference at last to shmem users. If gmem
> is the only one after existing ones, IIUC there's still option we implement
> it in gmem code.  After all, I expect the change should be very under
> control (<20 LOCs?)..
> 
> --
> Peter Xu
>

Peter Xu March 10, 2025, 7:57 p.m. UTC | #5

On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
> 
> 
> On 05/03/2025 20:29, Peter Xu wrote:
> > On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> > > I think it might be useful to implement an fs-generic MINOR mode. The
> > > fault handler is already easy enough to do generically (though it
> > > would become more difficult to determine if the "MINOR" fault is
> > > actually a MISSING fault, but at least for my userspace, the
> > > distinction isn't important. :)) So the question becomes: what should
> > > UFFDIO_CONTINUE look like?
> > > 
> > > And I think it would be nice if UFFDIO_CONTINUE just called
> > > vm_ops->fault() to get the page we want to map and then mapped it,
> > > instead of having shmem-specific and hugetlb-specific versions (though
> > > maybe we need to keep the hugetlb specialization...). That would avoid
> > > putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> > > 
> > > I've actually wanted to do this for a while but haven't had a good
> > > reason to pursue it. I wonder if it can be done in a
> > > backwards-compatible fashion...
> > 
> > Yes I also thought about that. :)
> 
> Hi Peter, hi James.  Thanks for pointing at the race condition!
> 
> I did some experimentation and it indeed looks possible to call
> vm_ops->fault() from userfault_continue() to make it generic and decouple
> from KVM, at least for non-hugetlb cases.  One thing is we'd need to prevent
> a recursive handle_userfault() invocation, which I believe can be solved by
> adding a new VMF flag to ignore the userfault path when the fault handler is
> called from userfault_continue().  I'm open to a more elegant solution
> though.

That sounds workable to me.  Adding a fault flag can also be seen as part
of extending the vm_operations_struct ops.  So we could indeed consider
reusing the fault() API.

> 
> Regarding usage of the MINOR notification, in what case do you recommend
> sending it?  If following the logic implemented in shmem and hugetlb, ie if
> the page is _present_ in the pagecache, I can't see how it is going to work

It could be confusing when reading that chunk of code, because it looks
like it notifies minor fault when cache hit. But the critical part here is
that we rely on the pgtable missing causing the fault() to trigger first.
So it's more like "cache hit && pgtable missing" for minor fault.
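
Spelled out as a small decision function (a simplified model of the shmem
behaviour described here, not the kernel code; the names are made up):

```c
#include <assert.h>
#include <stdbool.h>

#define REG_MISSING 0x1u	/* models UFFDIO_REGISTER_MODE_MISSING */
#define REG_MINOR   0x2u	/* models UFFDIO_REGISTER_MODE_MINOR */

enum uffd_event { EV_NONE, EV_MISSING, EV_MINOR };

/*
 * Which userfault event, if any, does an access generate?
 * EV_NONE means userspace is not notified: either no fault happens at
 * all, or the kernel resolves the fault on its own.
 */
enum uffd_event classify_fault(bool pte_present, bool in_page_cache,
			       unsigned int reg_modes)
{
	assert((reg_modes & ~(REG_MISSING | REG_MINOR)) == 0);

	if (pte_present)
		return EV_NONE;		/* no fault is raised */
	if (in_page_cache)		/* cache hit && pgtable missing */
		return (reg_modes & REG_MINOR) ? EV_MINOR : EV_NONE;
	return (reg_modes & REG_MISSING) ? EV_MISSING : EV_NONE;
}
```

Note that in this model a cache miss with only MINOR registered produces
no event at all.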

> with the write syscall, as we'd like to know when the page is _missing_ in
> order to respond with the population via the write.  If going against
> shmem/hugetlb logic, and sending the MINOR event when the page is missing
> from the pagecache, how would it solve the race condition problem?

It should be easier if we stick with mmap() rather than write().  E.g. for
the shmem case in the current code base:

	if (folio && vma && userfaultfd_minor(vma)) {
		if (!xa_is_value(folio))
			folio_put(folio);
		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
		return 0;
	}

vma is only available if vmf != NULL, aka in fault context.  With that, a
write() to a shmem inode will not generate a message, because minor fault
so far is only about the pgtable missing.  The file needs to be mmap()ed
first, and it has nothing to do with write() syscalls yet.

> 
> Also, where would the check for the folio_test_uptodate() mentioned by James
> fit into here?  Would it only be used for fortifying the MINOR (present)
> against the race?
> 
> > When Axel added minor fault, it's not a major concern as it's the only fs
> > that will consume the feature anyway in the do_fault() path - hugetlbfs has
> > its own path to take care of.. even until now.
> > 
> > And there's some valid points too if someone would argue to put it there
> > especially on folio lock - do that in shmem.c can avoid taking folio lock
> > when generating minor fault message.  It might make some difference when
> > the faults are heavy and when folio lock is frequently taken elsewhere too.
> 
> Peter, could you expand on this?  Are you referring to the following
> (shmem_get_folio_gfp)?
> 
> 	if (folio) {
> 		folio_lock(folio);
> 
> 		/* Has the folio been truncated or swapped out? */
> 		if (unlikely(folio->mapping != inode->i_mapping)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			goto repeat;
> 		}
> 		if (sgp == SGP_WRITE)
> 			folio_mark_accessed(folio);
> 		if (folio_test_uptodate(folio))
> 			goto out;
> 		/* fallocated folio */
> 		if (sgp != SGP_READ)
> 			goto clear;
> 		folio_unlock(folio);
> 		folio_put(folio);
> 	}
> 
> Could you explain in what case the lock can be avoided?  AFAIC, the function
> is called by both the shmem fault handler and userfault_continue().

I think you meant the UFFDIO_CONTINUE side of things.  I agree with you, we
always need the folio lock.

What I was saying is about the trapping side, where the minor fault message
can currently be generated without the folio lock in the shmem case.  It's
about whether we could generalize the trapping side, so handle_mm_fault()
can generate the minor fault message instead of shmem.c.

If the only concern is "referring to a module symbol from core mm", then
indeed the trapping side should be less of a concern anyway, because the
trapping side (when in module code) should always be able to reference
mm functions.

Actually.. if we have the fault() flag introduced above, maybe we can
generalize the trap side altogether without the folio lock overhead.  When
the flag is set, if we can always return the folio unlocked (as long as
a refcount is held), then the UFFDIO_CONTINUE ioctl can lock it.

> 
> > It might boil down to how many more FSes would support minor fault, and
> > whether we would care about such difference at last to shmem users. If gmem
> > is the only one after existing ones, IIUC there's still option we implement
> > it in gmem code.  After all, I expect the change should be very under
> > control (<20 LOCs?)..
> > 
> > --
> > Peter Xu
> > 
>

Nikita Kalyazin March 11, 2025, 4:56 p.m. UTC | #6

On 10/03/2025 19:57, Peter Xu wrote:
> On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
>>
>>
>> On 05/03/2025 20:29, Peter Xu wrote:
>>> On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
>>>> I think it might be useful to implement an fs-generic MINOR mode. The
>>>> fault handler is already easy enough to do generically (though it
>>>> would become more difficult to determine if the "MINOR" fault is
>>>> actually a MISSING fault, but at least for my userspace, the
>>>> distinction isn't important. :)) So the question becomes: what should
>>>> UFFDIO_CONTINUE look like?
>>>>
>>>> And I think it would be nice if UFFDIO_CONTINUE just called
>>>> vm_ops->fault() to get the page we want to map and then mapped it,
>>>> instead of having shmem-specific and hugetlb-specific versions (though
>>>> maybe we need to keep the hugetlb specialization...). That would avoid
>>>> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
>>>>
>>>> I've actually wanted to do this for a while but haven't had a good
>>>> reason to pursue it. I wonder if it can be done in a
>>>> backwards-compatible fashion...
>>>
>>> Yes I also thought about that. :)
>>
>> Hi Peter, hi James.  Thanks for pointing at the race condition!
>>
>> I did some experimentation and it indeed looks possible to call
>> vm_ops->fault() from userfault_continue() to make it generic and decouple
>> from KVM, at least for non-hugetlb cases.  One thing is we'd need to prevent
>> a recursive handle_userfault() invocation, which I believe can be solved by
>> adding a new VMF flag to ignore the userfault path when the fault handler is
>> called from userfault_continue().  I'm open to a more elegant solution
>> though.
> 
> It sounds working to me.  Adding fault flag can also be seen as part of
> extension of vm_operations_struct ops.  So we could consider reusing
> fault() API indeed.

Great!

>>
>> Regarding usage of the MINOR notification, in what case do you recommend
>> sending it?  If following the logic implemented in shmem and hugetlb, ie if
>> the page is _present_ in the pagecache, I can't see how it is going to work
> 
> It could be confusing when reading that chunk of code, because it looks
> like it notifies minor fault when cache hit. But the critical part here is
> that we rely on the pgtable missing causing the fault() to trigger first.
> So it's more like "cache hit && pgtable missing" for minor fault.

Right, but the cache hit still looks like a precondition for the minor 
fault event?

>> with the write syscall, as we'd like to know when the page is _missing_ in
>> order to respond with the population via the write.  If going against
>> shmem/hugetlb logic, and sending the MINOR event when the page is missing
>> from the pagecache, how would it solve the race condition problem?
> 
> Should be easier we stick with mmap() rather than write().  E.g. for shmem
> case of current code base:
> 
>          if (folio && vma && userfaultfd_minor(vma)) {
>                  if (!xa_is_value(folio))
>                          folio_put(folio);
>                  *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
>                  return 0;
>          }
> 
> vma is only available if vmf!=NULL, aka in fault context.  With that, in
> write() to shmem inodes, nothing will generate a message, because minor
> fault so far is only about pgtable missing.  It needs to be mmap()ed first,
> and has nothing yet to do with write() syscalls.

Yes, that's true that write() itself isn't going to generate a message. 
My idea was to _respond_ to a message generated by the fault handler 
(vmf != NULL) with a write().  I didn't mean to generate it from write().

What I wanted to achieve was to send a message on fault + cache miss and 
respond to the message with a write() to fill the cache followed by a 
UFFDIO_CONTINUE to set up pagetables.  I understand that a MINOR trap 
(MINOR + UFFDIO_CONTINUE) is preferable, but how does it fit into this 
model?  What/how will guarantee a cache hit that would trigger the MINOR 
message?

To clarify, I would like to be able to populate pages _on-demand_, not 
only proactively (like in the original UFFDIO_CONTINUE cover letter 
[1]).  Do you think the MINOR trap could still be applicable or would it 
necessarily require the MISSING trap?

[1] 
https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/

>>
>> Also, where would the check for the folio_test_uptodate() mentioned by James
>> fit into here?  Would it only be used for fortifying the MINOR (present)
>> against the race?
>>
>>> When Axel added minor fault, it's not a major concern as it's the only fs
>>> that will consume the feature anyway in the do_fault() path - hugetlbfs has
>>> its own path to take care of.. even until now.
>>>
>>> And there's some valid points too if someone would argue to put it there
>>> especially on folio lock - do that in shmem.c can avoid taking folio lock
>>> when generating minor fault message.  It might make some difference when
>>> the faults are heavy and when folio lock is frequently taken elsewhere too.
>>
>> Peter, could you expand on this?  Are you referring to the following
>> (shmem_get_folio_gfp)?
>>
>>        if (folio) {
>>                folio_lock(folio);
>>
>>                /* Has the folio been truncated or swapped out? */
>>                if (unlikely(folio->mapping != inode->i_mapping)) {
>>                        folio_unlock(folio);
>>                        folio_put(folio);
>>                        goto repeat;
>>                }
>>                if (sgp == SGP_WRITE)
>>                        folio_mark_accessed(folio);
>>                if (folio_test_uptodate(folio))
>>                        goto out;
>>                /* fallocated folio */
>>                if (sgp != SGP_READ)
>>                        goto clear;
>>                folio_unlock(folio);
>>                folio_put(folio);
>>        }
>>
>> Could you explain in what case the lock can be avoided?  AFAIC, the function
>> is called by both the shmem fault handler and userfault_continue().
> 
> I think you meant the UFFDIO_CONTINUE side of things.  I agree with you, we
> always need the folio lock.
> 
> What I was saying is the trapping side, where the minor fault message can
> be generated without the folio lock now in case of shmem.  It's about
> whether we could generalize the trapping side, so handle_mm_fault() can
> generate the minor fault message instead of by shmem.c.
> 
> If the only concern is "referring to a module symbol from core mm", then
> indeed the trapping side should be less of a concern anyway, because the
> trapping side (when in the module codes) should always be able to reference
> mm functions.
> 
> Actually.. if we have a fault() flag introduced above, maybe we can
> generalize the trap side altogether without the folio lock overhead.  When
> the flag set, if we can always return the folio unlocked (as long as
> refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.

Where does this locking happen exactly during trapping?  I was thinking 
it was only done when the page was allocated.  The trapping part (quoted 
by you above) only looks up the page in the cache and calls 
handle_userfault().  Am I missing something?

>>
>>> It might boil down to how many more FSes would support minor fault, and
>>> whether we would care about such difference at last to shmem users. If gmem
>>> is the only one after existing ones, IIUC there's still option we implement
>>> it in gmem code.  After all, I expect the change should be very under
>>> control (<20 LOCs?)..
>>>
>>> --
>>> Peter Xu
>>>
>>
> 
> --
> Peter Xu
>

Peter Xu March 12, 2025, 3:45 p.m. UTC | #7

On Tue, Mar 11, 2025 at 04:56:47PM +0000, Nikita Kalyazin wrote:
> 
> 
> On 10/03/2025 19:57, Peter Xu wrote:
> > On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
> > > 
> > > 
> > > On 05/03/2025 20:29, Peter Xu wrote:
> > > > On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> > > > > I think it might be useful to implement an fs-generic MINOR mode. The
> > > > > fault handler is already easy enough to do generically (though it
> > > > > would become more difficult to determine if the "MINOR" fault is
> > > > > actually a MISSING fault, but at least for my userspace, the
> > > > > distinction isn't important. :)) So the question becomes: what should
> > > > > UFFDIO_CONTINUE look like?
> > > > > 
> > > > > And I think it would be nice if UFFDIO_CONTINUE just called
> > > > > vm_ops->fault() to get the page we want to map and then mapped it,
> > > > > instead of having shmem-specific and hugetlb-specific versions (though
> > > > > maybe we need to keep the hugetlb specialization...). That would avoid
> > > > > putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> > > > > 
> > > > > I've actually wanted to do this for a while but haven't had a good
> > > > > reason to pursue it. I wonder if it can be done in a
> > > > > backwards-compatible fashion...
> > > > 
> > > > Yes I also thought about that. :)
> > > 
> > > Hi Peter, hi James.  Thanks for pointing at the race condition!
> > > 
> > > I did some experimentation and it indeed looks possible to call
> > > vm_ops->fault() from userfault_continue() to make it generic and decouple
> > > from KVM, at least for non-hugetlb cases.  One thing is we'd need to prevent
> > > a recursive handle_userfault() invocation, which I believe can be solved by
> > > adding a new VMF flag to ignore the userfault path when the fault handler is
> > > called from userfault_continue().  I'm open to a more elegant solution
> > > though.
> > 
> > It sounds working to me.  Adding fault flag can also be seen as part of
> > extension of vm_operations_struct ops.  So we could consider reusing
> > fault() API indeed.
> 
> Great!
> 
> > > 
> > > Regarding usage of the MINOR notification, in what case do you recommend
> > > sending it?  If following the logic implemented in shmem and hugetlb, ie if
> > > the page is _present_ in the pagecache, I can't see how it is going to work
> > 
> > It could be confusing when reading that chunk of code, because it looks
> > like it notifies minor fault when cache hit. But the critical part here is
> > that we rely on the pgtable missing causing the fault() to trigger first.
> > So it's more like "cache hit && pgtable missing" for minor fault.
> 
> Right, but the cache hit still looks like a precondition for the minor fault
> event?

Yes.

> 
> > > with the write syscall, as we'd like to know when the page is _missing_ in
> > > order to respond with the population via the write.  If going against
> > > shmem/hugetlb logic, and sending the MINOR event when the page is missing
> > > from the pagecache, how would it solve the race condition problem?
> > 
> > Should be easier if we stick with mmap() rather than write().  E.g. for shmem
> > case of current code base:
> > 
> >          if (folio && vma && userfaultfd_minor(vma)) {
> >                  if (!xa_is_value(folio))
> >                          folio_put(folio);
> >                  *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> >                  return 0;
> >          }
> > 
> > vma is only available if vmf!=NULL, aka in fault context.  With that, in
> > write() to shmem inodes, nothing will generate a message, because minor
> > fault so far is only about pgtable missing.  It needs to be mmap()ed first,
> > and has nothing yet to do with write() syscalls.
> 
> Yes, it's true that write() itself isn't going to generate a message. My
> idea was to _respond_ to a message generated by the fault handler (vmf !=
> NULL) with a write().  I didn't mean to generate it from write().
> 
> What I wanted to achieve was to send a message on fault + cache miss and
> respond to the message with a write() to fill the cache followed by a
> UFFDIO_CONTINUE to set up pagetables.  I understand that a MINOR trap (MINOR
> + UFFDIO_CONTINUE) is preferable, but how does it fit into this model?
> What/how will guarantee a cache hit that would trigger the MINOR message?
> 
> To clarify, I would like to be able to populate pages _on-demand_, not only
> proactively (like in the original UFFDIO_CONTINUE cover letter [1]).  Do you
> think the MINOR trap could still be applicable or would it necessarily
> require the MISSING trap?

I think MINOR can also achieve similar things.  MINOR traps the pgtable
missing event (let's imagine page cache is already populated, or at least
when MISSING mode is not registered, it'll auto-populate on 1st access).  So
as long as the content can only be accessed via the pgtable (either via
mmap() or GUP on top of it), then afaiu it could work similarly to
MISSING faults, because anything trying to access it will be trapped.

That said, we can also choose to implement MISSING first.  In that case
write() is definitely not enough, because MISSING is, at least so far, based
on whether the page is present in the page cache, and write() won't be
atomic when updating a page.  We need to implement UFFDIO_COPY for gmemfd
MISSING.

Either way looks ok to me.
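
To make the two options concrete, here is a toy model (plain C, not kernel
code) of which message a fault would generate under the shmem-style rules
discussed in this thread.  The enum names and classify_fault() are
hypothetical and the flag values are not the kernel's VM_UFFD_* bits; this
is a sketch of the decision only.

```c
#include <stdbool.h>

/* Modes a VMA may be registered with (toy flags, not VM_UFFD_*). */
enum { REG_MISSING = 1 << 0, REG_MINOR = 1 << 1 };

enum uffd_event { EV_NONE, EV_MISSING, EV_MINOR };

/*
 * Decide which userfaultfd message, if any, an access would generate.
 * A fault is only taken when the page table entry is absent, so a
 * present PTE never produces a message, whatever the page cache state.
 */
static inline enum uffd_event classify_fault(bool pte_present, bool cache_hit,
                                             unsigned int reg)
{
    if (pte_present)
        return EV_NONE;                 /* no fault at all */
    if (cache_hit)                      /* "cache hit && pgtable missing" */
        return (reg & REG_MINOR) ? EV_MINOR : EV_NONE;
    /* Cache miss: trapped only if MISSING is registered, else auto-populated. */
    return (reg & REG_MISSING) ? EV_MISSING : EV_NONE;
}
```

With only REG_MINOR registered, classify_fault(false, false, REG_MINOR)
returns EV_NONE, which is the "auto-populate on 1st access" case above.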

> 
> [1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/
> 
> > > 
> > > Also, where would the check for the folio_test_uptodate() mentioned by James
> > > fit into here?  Would it only be used for fortifying the MINOR (present)
> > > against the race?
> > > 
> > > > When Axel added minor fault, it's not a major concern as it's the only fs
> > > > that will consume the feature anyway in the do_fault() path - hugetlbfs has
> > > > its own path to take care of.. even until now.
> > > > 
> > > > And there's some valid points too if someone would argue to put it there
> > > > especially on folio lock - do that in shmem.c can avoid taking folio lock
> > > > when generating minor fault message.  It might make some difference when
> > > > the faults are heavy and when folio lock is frequently taken elsewhere too.
> > > 
> > > Peter, could you expand on this?  Are you referring to the following
> > > (shmem_get_folio_gfp)?
> > > 
> > >        if (folio) {
> > >                folio_lock(folio);
> > > 
> > >                /* Has the folio been truncated or swapped out? */
> > >                if (unlikely(folio->mapping != inode->i_mapping)) {
> > >                        folio_unlock(folio);
> > >                        folio_put(folio);
> > >                        goto repeat;
> > >                }
> > >                if (sgp == SGP_WRITE)
> > >                        folio_mark_accessed(folio);
> > >                if (folio_test_uptodate(folio))
> > >                        goto out;
> > >                /* fallocated folio */
> > >                if (sgp != SGP_READ)
> > >                        goto clear;
> > >                folio_unlock(folio);
> > >                folio_put(folio);
> > >        }

[1]

> > > 
> > > Could you explain in what case the lock can be avoided?  AFAIC, the function
> > > is called by both the shmem fault handler and userfault_continue().
> > 
> > I think you meant the UFFDIO_CONTINUE side of things.  I agree with you, we
> > always need the folio lock.
> > 
> > What I was saying is the trapping side, where the minor fault message can
> > be generated without the folio lock now in case of shmem.  It's about
> > whether we could generalize the trapping side, so handle_mm_fault() can
> > generate the minor fault message instead of by shmem.c.
> > 
> > If the only concern is "referring to a module symbol from core mm", then
> > indeed the trapping side should be less of a concern anyway, because the
> > trapping side (when in the module codes) should always be able to reference
> > mm functions.
> > 
> > Actually.. if we have a fault() flag introduced above, maybe we can
> > generalize the trap side altogether without the folio lock overhead.  When
> > the flag set, if we can always return the folio unlocked (as long as
> > refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
> 
> Where does this locking happen exactly during trapping?  I was thinking it
> was only done when the page was allocated.  The trapping part (quoted by you
> above) only looks up the page in the cache and calls handle_userfault().  Am
> I missing something?

That's only my worry if we want to reuse fault() to generalize the trap
code in core mm, because fault() by default takes the folio lock, at least
for shmem.  I agree the folio doesn't need locking when trapping the fault
and sending the message.

Thanks,

> 
> > > 
> > > > It might boil down to how many more FSes would support minor fault, and
> > > > whether we would care about such difference at last to shmem users. If gmem
> > > > is the only one after existing ones, IIUC there's still option we implement
> > > > it in gmem code.  After all, I expect the change should be very under
> > > > control (<20 LOCs?)..
> > > > 
> > > > --
> > > > Peter Xu
> > > > 
> > > 
> > 
> > --
> > Peter Xu
> > 
>
