
[v2] mm/khugepaged: skip shmem with userfaultfd

Message ID 20230206112856.1802547-1-stevensd@google.com (mailing list archive)
State New

Commit Message

David Stevens Feb. 6, 2023, 11:28 a.m. UTC
From: David Stevens <stevensd@chromium.org>

Collapsing memory will result in any empty pages in the target range
being filled by the new THP. If userspace has a userfaultfd registered
with MODE_MISSING, for any page which it knows to be missing after
registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
Taking these two facts together, khugepaged needs to take care when
collapsing pages in shmem to make sure it doesn't break the userfaultfd
API.

This change first makes sure that the intermediate page cache state
during collapse is not visible by moving when gaps are filled to after
the page cache lock is acquired for the final time. This is necessary
because the synchronization provided by locking hpage is insufficient
for functions which operate on the page cache without actually locking
individual pages to examine their content (e.g. shmem_mfill_atomic_pte).

This refactoring allows us to iterate over i_mmap to check for any VMAs
with userfaultfds and then finalize the collapse if no such VMAs exist,
all while holding the page cache lock. Since no mm locks are held, it is
necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
vm_flags are visible to khugepaged. However, no further locking of
userfaultfd state is necessary. Although new userfaultfds can be
registered concurrently with finalizing the collapse, any missing pages
that are being replaced can no longer be observed by userspace, so there
is no data race.

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. The fact that the intermediate page cache state before
the rollback of a failed collapse can no longer be observed is
technically a userspace-visible change (via at least SEEK_DATA and
SEEK_END), but it is exceedingly unlikely that anything relies on being
able to observe that transient state.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 fs/userfaultfd.c |  2 ++
 mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 10 deletions(-)

Comments

Matthew Wilcox Feb. 6, 2023, 2:25 p.m. UTC | #1
On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
>  	int nr_none = 0, result = SCAN_SUCCEED;
>  	bool is_shmem = shmem_file(file);
> +	bool i_mmap_locked = false;

you don't need this ...

> +	 * While iterating, we may drop the page cache lock multiple times. It
> +	 * is safe to replace pages in the page cache with hpage while doing so
> +	 * because nobody is able to map or otherwise access the content of
> +	 * hpage until we unlock it. However, we cannot insert hpage into empty
> +	 * indicies until we know we won't have to drop the page cache lock

"indices".

> @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		put_page(page);
>  		goto xa_unlocked;
>  	}
> +
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		xas_unlock_irq(&xas);
> +		i_mmap_lock_read(mapping);
> +		i_mmap_locked = true;
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas))
> +				nr_none_check++;
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;

... you can unlock the i_mmap_lock here before the goto.
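Something like this, say (just a sketch against the hunk above, which
would let the i_mmap_locked flag go away entirely):

	if (nr_none != nr_none_check) {
		i_mmap_unlock_read(mapping);
		result = SCAN_PAGE_FILLED;
		goto xa_locked;
	}

	/* ... the userfaultfd_missing() loop, likewise doing
	 * i_mmap_unlock_read() before its goto ... */

	i_mmap_unlock_read(mapping);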


I think you need to add a case in madvise_collapse_errno().  It should
probably return -EBUSY, I would think?
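For illustration, the new case might look something like this (a sketch
only; the surrounding cases are abbreviated, not the full list in
madvise_collapse_errno()):

	static int madvise_collapse_errno(enum scan_result r)
	{
		switch (r) {
		case SCAN_SUCCEED:
			return 0;
		/* A hole was refilled while the page cache lock was
		 * dropped; the caller may simply retry. */
		case SCAN_PAGE_FILLED:
			return -EBUSY;
		default:
			return -EINVAL;
		}
	}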
Matthew Wilcox Feb. 6, 2023, 7:01 p.m. UTC | #2
On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> This change first makes sure that the intermediate page cache state
> during collapse is not visible by moving when gaps are filled to after
> the page cache lock is acquired for the final time. This is necessary
> because the synchronization provided by locking hpage is insufficient
> for functions which operate on the page cache without actually locking
> individual pages to examine their content (e.g. shmem_mfill_atomic_pte).

I've been a little scared of touching khugepaged because, well, look at
that function.  But if we are going to touch it, how about this patch
first?  It does _part_ of what you need by not filling in the holes,
but obviously not the part that looks at uffd.  

It leaves the old pages in-place and frozen.  I think this should be
safe, but I haven't booted it (not entirely sure what test I'd run
to prove that it's not broken)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eb38bd1b1b2f..cfd33dff7253 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1845,15 +1845,14 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
+ *    + freeze the old pages
  *    + keep old pages around in case rollback is required;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
  *    + unlock huge page;
  *  - if replacing failed;
- *    + put all pages back and unfreeze them;
- *    + restore gaps in the page cache;
+ *    + unfreeze old pages;
  *    + unlock and free huge page;
  */
 static int collapse_file(struct mm_struct *mm, unsigned long addr,
@@ -1930,7 +1929,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -2081,8 +2079,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		 */
 		list_add_tail(&page->lru, &pagelist);
 
-		/* Finally, replace with the new page. */
-		xas_store(&xas, hpage);
 		continue;
 out_unlock:
 		unlock_page(page);
@@ -2195,32 +2191,17 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 			shmem_uncharge(mapping->host, nr_none);
 		}
 
-		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
+		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+			list_del(&page->lru);
 			page = list_first_entry_or_null(&pagelist,
 					struct page, lru);
-			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
-				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
-				continue;
-			}
-
-			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 
 			/* Unfreeze the page. */
 			list_del(&page->lru);
 			page_ref_unfreeze(page, 2);
-			xas_store(&xas, page);
-			xas_pause(&xas);
-			xas_unlock_irq(&xas);
 			unlock_page(page);
 			putback_lru_page(page);
-			xas_lock_irq(&xas);
 		}
-		VM_BUG_ON(nr_none);
 		/*
 		 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
 		 * This undo is not needed unless failure is due to SCAN_COPY_MC.
Peter Xu Feb. 6, 2023, 8:52 p.m. UTC | #3
On Mon, Feb 06, 2023 at 07:01:39PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > This change first makes sure that the intermediate page cache state
> > during collapse is not visible by moving when gaps are filled to after
> > the page cache lock is acquired for the final time. This is necessary
> > because the synchronization provided by locking hpage is insufficient
> > for functions which operate on the page cache without actually locking
> > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> 
> I've been a little scared of touching khugepaged because, well, look at
> that function.  But if we are going to touch it, how about this patch
> first?  It does _part_ of what you need by not filling in the holes,
> but obviously not the part that looks at uffd.  
> 
> It leaves the old pages in-place and frozen.  I think this should be
> safe, but I haven't booted it (not entirely sure what test I'd run
> to prove that it's not broken)

That logic existed since Kirill's original commit to add shmem thp support
on khugepaged, so Kirill should be the best to tell.. but so far it seems
reasonable to me to have that extra operation.

The problem is khugepaged will release pgtable lock during collapsing, so
AFAICT there can be a race where some other thread tries to insert pages
into page cache in parallel with khugepaged right after khugepaged released
the page cache lock.

For example, it seems to me new page cache can be inserted when khugepaged
is copying small page content to the new hpage.
Peter Xu Feb. 6, 2023, 9:02 p.m. UTC | #4
On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Collapsing memory will result in any empty pages in the target range
> being filled by the new THP. If userspace has a userfaultfd registered
> with MODE_MISSING, for any page which it knows to be missing after
> registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
> Taking these two facts together, khugepaged needs to take care when
> collapsing pages in shmem to make sure it doesn't break the userfaultfd
> API.
> 
> This change first makes sure that the intermediate page cache state
> during collapse is not visible by moving when gaps are filled to after
> the page cache lock is acquired for the final time. This is necessary
> because the synchronization provided by locking hpage is insufficient
> for functions which operate on the page cache without actually locking
> individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> 
> This refactoring allows us to iterate over i_mmap to check for any VMAs
> with userfaultfds and then finalize the collapse if no such VMAs exist,
> all while holding the page cache lock. Since no mm locks are held, it is
> necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
> vm_flags are visible to khugepaged. However, no further locking of
> userfaultfd state is necessary. Although new userfaultfds can be
> registered concurrently with finalizing the collapse, any missing pages
> that are being replaced can no longer be observed by userspace, so there
> is no data race.
> 
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. The fact that the intermediate page cache state before
> the rollback of a failed collapse can no longer be observed is
> technically a userspace-visible change (via at least SEEK_DATA and
> SEEK_END), but it is exceedingly unlikely that anything relies on being
> able to observe that transient state.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  fs/userfaultfd.c |  2 ++
>  mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 59 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index cc694846617a..6ddfcff11920 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
>  	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
>  
>  	vma->vm_flags = flags;
> +	/* Pairs with smp_rmb() in khugepaged's collapse_file() */
> +	smp_wmb();

Could you help to explain why this is needed?  What's the operation to
serialize against updating vm_flags?

>  	/*
>  	 * For shared mappings, we want to enable writenotify while
>  	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 79be13133322..97435c226b18 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_PAGE_FILLED,
>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *  - allocate and lock a new huge page;
>   *  - scan page cache replacing old pages with the new one
>   *    + swap/gup in pages if necessary;
> - *    + fill in gaps;
>   *    + keep old pages around in case rollback is required;
> + *  - finalize updates to the page cache;
>   *  - if replacing succeeds:
>   *    + copy data over;
>   *    + free old pages;
> @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
>  	int nr_none = 0, result = SCAN_SUCCEED;
>  	bool is_shmem = shmem_file(file);
> +	bool i_mmap_locked = false;
>  	int nr = 0;
>  
>  	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  
>  	/*
>  	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> +	 *
> +	 * While iterating, we may drop the page cache lock multiple times. It
> +	 * is safe to replace pages in the page cache with hpage while doing so
> +	 * because nobody is able to map or otherwise access the content of
> +	 * hpage until we unlock it. However, we cannot insert hpage into empty
> +	 * indicies until we know we won't have to drop the page cache lock
> +	 * again, as doing so would let things which only check the presence
> +	 * of pages in the page cache see a state that may yet be rolled back.
>  	 */
>  
>  	xas_set(&xas, start);
> @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  						result = SCAN_TRUNCATED;
>  						goto xa_locked;
>  					}
> -					xas_set(&xas, index);
> +					xas_set(&xas, index + 1);

I failed to figure out why this index needs a shift here... It seems to me
it'll ignore the initial empty page cache, is that what the patch wanted?

>  				}
>  				if (!shmem_charge(mapping->host, 1)) {
>  					result = SCAN_FAIL;
>  					goto xa_locked;
>  				}
> -				xas_store(&xas, hpage);

[I raised a question in the other thread on whether it's legal to not
 populate page cache holes at all.  We can keep the discussion there]

>  				nr_none++;
>  				continue;
>  			}
> @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		put_page(page);
>  		goto xa_unlocked;
>  	}
> +
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		xas_unlock_irq(&xas);
> +		i_mmap_lock_read(mapping);
> +		i_mmap_locked = true;
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas))
> +				nr_none_check++;
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;
> +			goto xa_locked;
> +		}
> +
> +		/*
> +		 * If userspace observed a missing page in a VMA with an armed
> +		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> +		 * that page, so we need to roll back to avoid suppressing such
> +		 * an event. Any userfaultfds armed after this point will not be
> +		 * able to observe any missing pages, since the page cache is
> +		 * locked until after the collapse is completed.
> +		 *
> +		 * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> +		 */
> +		smp_rmb();
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				goto xa_locked;
> +			}
> +		}
> +	}

Thanks for writing the patch, but I am still confused why this can avoid
race conditions of uffd missing mode.

I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
succeeded on a specific shmem VMA, any page fault due to a missing page
cache entry on this vma should generate a userfault message.  If not,
it's violating the
userfaultfd semantics.

Here, I don't see what stops a new vma from registering MISSING mode right
away in parallel with the collapse.

When registration completes, it means we should report uffd missing
messages on the holes that collapse_file() scanned previously.  However
they'll very possibly be filled later with the THP, which means the messages
can be lost.  Then the issue can still happen, while this patch only makes
it very unlikely to happen, or am I wrong?

Thanks,

> +
>  	nr = thp_nr_pages(hpage);
>  
>  	if (is_shmem)
> @@ -2000,6 +2047,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	xas_store(&xas, hpage);
>  xa_locked:
>  	xas_unlock_irq(&xas);
> +	if (i_mmap_locked)
> +		i_mmap_unlock_read(mapping);
>  xa_unlocked:
>  
>  	/*
> @@ -2065,15 +2114,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		xas_set(&xas, start);
> -		xas_for_each(&xas, page, end - 1) {
> +		end = index;
> +		for (index = start; index < end; index++) {
> +			xas_next(&xas);
>  			page = list_first_entry_or_null(&pagelist,
>  					struct page, lru);
>  			if (!page || xas.xa_index < page->index) {
> -				if (!nr_none)
> -					break;
>  				nr_none--;
> -				/* Put holes back where they were */
> -				xas_store(&xas, NULL);
>  				continue;
>  			}
>  
> -- 
> 2.39.1.519.gcb327c4b5f-goog
>
Matthew Wilcox Feb. 6, 2023, 9:50 p.m. UTC | #5
On Mon, Feb 06, 2023 at 03:52:19PM -0500, Peter Xu wrote:
> On Mon, Feb 06, 2023 at 07:01:39PM +0000, Matthew Wilcox wrote:
> > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > This change first makes sure that the intermediate page cache state
> > > during collapse is not visible by moving when gaps are filled to after
> > > the page cache lock is acquired for the final time. This is necessary
> > > because the synchronization provided by locking hpage is insufficient
> > > for functions which operate on the page cache without actually locking
> > > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> > 
> > I've been a little scared of touching khugepaged because, well, look at
> > that function.  But if we are going to touch it, how about this patch
> > first?  It does _part_ of what you need by not filling in the holes,
> > but obviously not the part that looks at uffd.  
> > 
> > It leaves the old pages in-place and frozen.  I think this should be
> > safe, but I haven't booted it (not entirely sure what test I'd run
> > to prove that it's not broken)
> 
> That logic existed since Kirill's original commit to add shmem thp support
> on khugepaged, so Kirill should be the best to tell.. but so far it seems
> reasonable to me to have that extra operation.
> 
> The problem is khugepaged will release pgtable lock during collapsing, so
> AFAICT there can be a race where some other thread tries to insert pages
> into page cache in parallel with khugepaged right after khugepaged released
> the page cache lock.
> 
> For example, it seems to me new page cache can be inserted when khugepaged
> is copying small page content to the new hpage.

Mmm, yes, we need to have _something_ in the page cache to block new
pages from being added.  It can be either the new or the old pages,
but it can't be NULL.  It could even be a RETRY entry, since that'll
have the same effect as a frozen page.

But both David's patch and mine are wrong.  Not sure what to do for
David's problem -- maybe it's OK to have the holes temporarily filled
with frozen / RETRY entries until we get to the point where we check
for an uffd marker?
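Roughly (a sketch only, and it would need auditing that every page cache
lookup copes with finding XA_RETRY_ENTRY under the xa_lock):

	/* on finding a hole during the scan */
	xas_store(&xas, XA_RETRY_ENTRY);
	nr_none++;

	/* on rollback, put the holes back */
	xas_store(&xas, NULL);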
David Stevens Feb. 7, 2023, 1:37 a.m. UTC | #6
On Tue, Feb 7, 2023 at 6:50 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 06, 2023 at 03:52:19PM -0500, Peter Xu wrote:
> > On Mon, Feb 06, 2023 at 07:01:39PM +0000, Matthew Wilcox wrote:
> > > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > > This change first makes sure that the intermediate page cache state
> > > > during collapse is not visible by moving when gaps are filled to after
> > > > the page cache lock is acquired for the final time. This is necessary
> > > > because the synchronization provided by locking hpage is insufficient
> > > > for functions which operate on the page cache without actually locking
> > > > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> > >
> > > I've been a little scared of touching khugepaged because, well, look at
> > > that function.  But if we are going to touch it, how about this patch
> > > first?  It does _part_ of what you need by not filling in the holes,
> > > but obviously not the part that looks at uffd.
> > >
> > > It leaves the old pages in-place and frozen.  I think this should be
> > > safe, but I haven't booted it (not entirely sure what test I'd run
> > > to prove that it's not broken)
> >
> > That logic existed since Kirill's original commit to add shmem thp support
> > on khugepaged, so Kirill should be the best to tell.. but so far it seems
> > reasonable to me to have that extra operation.
> >
> > The problem is khugepaged will release pgtable lock during collapsing, so
> > AFAICT there can be a race where some other thread tries to insert pages
> > into page cache in parallel with khugepaged right after khugepaged released
> > the page cache lock.
> >
> > For example, it seems to me new page cache can be inserted when khugepaged
> > is copying small page content to the new hpage.

This particular race can't happen with either patch, since the missing
page cache entries are filled when we create the multi-index entry for
hpage.

> Mmm, yes, we need to have _something_ in the page cache to block new
> pages from being added.  It can be either the new or the old pages,
> but it can't be NULL.  It could even be a RETRY entry, since that'll
> have the same effect as a frozen page.
>
> But both David's patch and mine are wrong.  Not sure what to do for
> David's problem -- maybe it's OK to have the holes temporarily filled
> with frozen / RETRY entries until we get to the point where we check
> for an uffd marker?

My patch re-counts the holes after acquiring the page cache lock for
the final time, right before creating the final hpage multi-index
entry. Since we lock present pages while iterating over the target
range, they can't have been truncated before our re-validation of
nr_none. So if the number of missing pages is still equal to nr_none,
then we know that nothing has come along and filled in a missing page.
Compared to adding some sort of marker for missing pages, this does
add another failure path for collapse, but I don't think there is any
race.

-David
Matthew Wilcox Feb. 7, 2023, 2:29 a.m. UTC | #7
On Tue, Feb 07, 2023 at 10:37:06AM +0900, David Stevens wrote:
> On Tue, Feb 7, 2023 at 6:50 AM Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Feb 06, 2023 at 03:52:19PM -0500, Peter Xu wrote:
> > > The problem is khugepaged will release pgtable lock during collapsing, so
> > > AFAICT there can be a race where some other thread tries to insert pages
> > > into page cache in parallel with khugepaged right after khugepaged released
> > > the page cache lock.
> > >
> > > For example, it seems to me new page cache can be inserted when khugepaged
> > > is copying small page content to the new hpage.
> 
> This particular race can't happen with either patch, since the missing
> page cache entries are filled when we create the multi-index entry for
> hpage.

Can too.

        for (index = start; index < end; index++) {
...
                        if (xa_is_value(page) || !PageUptodate(page)) {
                                xas_unlock_irq(&xas);
                                /* swap in or instantiate fallocated page */
                                if (shmem_get_folio(mapping->host, index,
                                                &folio, SGP_NOALLOC)) {
                                        result = SCAN_FAIL;
                                        goto xa_unlocked;
                                }
...

So we start the iteration, and then a page fault happens in one of
the indices we've already examined, but we don't have the page on
the list.  It's a nice wide race too because we're bringing the
page in from swap.
David Stevens Feb. 7, 2023, 3:56 a.m. UTC | #8
On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Collapsing memory will result in any empty pages in the target range
> > being filled by the new THP. If userspace has a userfaultfd registered
> > with MODE_MISSING, for any page which it knows to be missing after
> > registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
> > Taking these two facts together, khugepaged needs to take care when
> > collapsing pages in shmem to make sure it doesn't break the userfaultfd
> > API.
> >
> > This change first makes sure that the intermediate page cache state
> > during collapse is not visible by moving when gaps are filled to after
> > the page cache lock is acquired for the final time. This is necessary
> > because the synchronization provided by locking hpage is insufficient
> > for functions which operate on the page cache without actually locking
> > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> >
> > This refactoring allows us to iterate over i_mmap to check for any VMAs
> > with userfaultfds and then finalize the collapse if no such VMAs exist,
> > all while holding the page cache lock. Since no mm locks are held, it is
> > necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
> > vm_flags are visible to khugepaged. However, no further locking of
> > userfaultfd state is necessary. Although new userfaultfds can be
> > registered concurrently with finalizing the collapse, any missing pages
> > that are being replaced can no longer be observed by userspace, so there
> > is no data race.
> >
> > This fix is targeted at khugepaged, but the change also applies to
> > MADV_COLLAPSE. The fact that the intermediate page cache state before
> > the rollback of a failed collapse can no longer be observed is
> > technically a userspace-visible change (via at least SEEK_DATA and
> > SEEK_END), but it is exceedingly unlikely that anything relies on being
> > able to observe that transient state.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  fs/userfaultfd.c |  2 ++
> >  mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
> >  2 files changed, 59 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index cc694846617a..6ddfcff11920 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> >       const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
> >
> >       vma->vm_flags = flags;
> > +     /* Pairs with smp_rmb() in khugepaged's collapse_file() */
> > +     smp_wmb();
>
> Could you help to explain why this is needed?  What's the operation to
> serialize against updating vm_flags?

We need to ensure that any updates to VM_UFFD_MISSING from
UFFDIO_REGISTER are visible to khugepaged before the ioctl returns to
userspace. There aren't any locks that obviously provide that
synchronization, so I added the smp_wmb/smp_rmb pair. It's possible
that page cache lock indirectly provides sufficient synchronization,
though.
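Schematically, the pairing being relied on is (restating the patch, not
new code):

	/* writer: userfaultfd_set_vm_flags() (UFFDIO_REGISTER path) */
	vma->vm_flags = flags;		/* publish VM_UFFD_MISSING */
	smp_wmb();

	/* reader: collapse_file(), before walking mapping->i_mmap */
	smp_rmb();
	if (userfaultfd_missing(vma))	/* must see the new vm_flags */
		result = SCAN_EXCEED_NONE_PTE;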

> >       /*
> >        * For shared mappings, we want to enable writenotify while
> >        * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 79be13133322..97435c226b18 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -55,6 +55,7 @@ enum scan_result {
> >       SCAN_CGROUP_CHARGE_FAIL,
> >       SCAN_TRUNCATED,
> >       SCAN_PAGE_HAS_PRIVATE,
> > +     SCAN_PAGE_FILLED,
> >  };
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >   *  - allocate and lock a new huge page;
> >   *  - scan page cache replacing old pages with the new one
> >   *    + swap/gup in pages if necessary;
> > - *    + fill in gaps;
> >   *    + keep old pages around in case rollback is required;
> > + *  - finalize updates to the page cache;
> >   *  - if replacing succeeds:
> >   *    + copy data over;
> >   *    + free old pages;
> > @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> >       int nr_none = 0, result = SCAN_SUCCEED;
> >       bool is_shmem = shmem_file(file);
> > +     bool i_mmap_locked = false;
> >       int nr = 0;
> >
> >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >
> >       /*
> >        * At this point the hpage is locked and not up-to-date.
> > -      * It's safe to insert it into the page cache, because nobody would
> > -      * be able to map it or use it in another way until we unlock it.
> > +      *
> > +      * While iterating, we may drop the page cache lock multiple times. It
> > +      * is safe to replace pages in the page cache with hpage while doing so
> > +      * because nobody is able to map or otherwise access the content of
> > +      * hpage until we unlock it. However, we cannot insert hpage into empty
> > +      * indicies until we know we won't have to drop the page cache lock
> > +      * again, as doing so would let things which only check the presence
> > +      * of pages in the page cache see a state that may yet be rolled back.
> >        */
> >
> >       xas_set(&xas, start);
> > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                                               result = SCAN_TRUNCATED;
> >                                               goto xa_locked;
> >                                       }
> > -                                     xas_set(&xas, index);
> > +                                     xas_set(&xas, index + 1);
>
> I failed to figure out why this index needs a shift here... It seems to me
> it'll ignore the initial empty page cache, is that what the patch wanted?

This chunk is preceded by a call to xas_next_entry. Previously,
xas_store(hpage) would pair with xas_set(index) to walk the xarray and
prepare xas for the xas_next at the start of the next loop iteration.
Now that we're no longer calling xas_store, xas_set(index+1) is
necessary to prepare xas for the next iteration of the loop.
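A simplified skeleton of the loop, for illustration (per the explanation
above, not the full code):

	xas_set(&xas, start);
	for (index = start; index < end; index++) {
		struct page *page = xas_next(&xas);	/* lands on 'index' */

		if (!page) {				/* hole */
			if (is_shmem) {
				/* the cursor may have been moved while the
				 * lock was dropped; reposition it so the
				 * xas_next() above lands on index + 1 in
				 * the next iteration */
				xas_set(&xas, index + 1);
			}
			nr_none++;	/* counted, but not filled yet */
			continue;
		}
		/* ... swap in, lock and isolate present pages ... */
	}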

> >                               }
> >                               if (!shmem_charge(mapping->host, 1)) {
> >                                       result = SCAN_FAIL;
> >                                       goto xa_locked;
> >                               }
> > -                             xas_store(&xas, hpage);
>
> [I raised a question in the other thread on whether it's legal to not
>  populate page cache holes at all.  We can keep the discussion there]
>
> >                               nr_none++;
> >                               continue;
> >                       }
> > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >               put_page(page);
> >               goto xa_unlocked;
> >       }
> > +
> > +     if (nr_none) {
> > +             struct vm_area_struct *vma;
> > +             int nr_none_check = 0;
> > +
> > +             xas_unlock_irq(&xas);
> > +             i_mmap_lock_read(mapping);
> > +             i_mmap_locked = true;
> > +             xas_lock_irq(&xas);
> > +
> > +             xas_set(&xas, start);
> > +             for (index = start; index < end; index++) {
> > +                     if (!xas_next(&xas))
> > +                             nr_none_check++;
> > +             }
> > +
> > +             if (nr_none != nr_none_check) {
> > +                     result = SCAN_PAGE_FILLED;
> > +                     goto xa_locked;
> > +             }
> > +
> > +             /*
> > +              * If userspace observed a missing page in a VMA with an armed
> > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > +              * that page, so we need to roll back to avoid suppressing such
> > +              * an event. Any userfaultfds armed after this point will not be
> > +              * able to observe any missing pages, since the page cache is
> > +              * locked until after the collapse is completed.
> > +              *
> > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > +              */
> > +             smp_rmb();
> > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > +                     if (userfaultfd_missing(vma)) {
> > +                             result = SCAN_EXCEED_NONE_PTE;
> > +                             goto xa_locked;
> > +                     }
> > +             }
> > +     }
>
> Thanks for writting the patch, but I am still confused why this can avoid
> race conditions of uffd missing mode.
>
> I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> succeeded on a specific shmem VMA, any page faults due to missing page
> cache on this vma should generate a page fault.  If not, it's violating the
> userfaultfd semantics.
>
> Here, I don't see what stops a new vma from registering MISSING mode right
> away in parallel of collapsing.
>
> When registration completes, it means we should report uffd missing
> messages on the holes that collapse_file() scanned previously.  However
> they'll be filled very possibly later with the thp which means the messages
> can be lost.  Then the issue can still happen, while this patch only makes
> it very unlikely to happen, or am I wrong?

This race can still happen, but I don't think it's actually a problem.
In particular, I think the semantics you've given for UFFDIO_REGISTER
are overly restrictive on the kernel in a way that provides no
meaningful benefit to userspace.

To simplify things a little bit, we only need to consider going from
zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
is already registered isn't particularly interesting, since khugepaged
would have seen the existing userfaultfd and given up on the collapse.

When userspace registers the first userfaultfd, it cannot rely on any
knowledge about the presence or absence of pages from before when
UFFDIO_REGISTER completes. This is because, as far as userspace is
concerned, khugepaged could have come along and collapsed every single
page in the shmem right after UFFDIO_REGISTER entered the kernel. So
to properly use the userfaultfd API, userspace must re-examine the
state of memory after UFFDIO_REGISTER completes. Because of that,
having UFFDIO_REGISTER be a complete barrier with regards to
khugepaged is essentially meaningless from the perspective of
userspace. The guarantee might slightly reduce the window for some
races in userspace programs, but there are no userspace races that it
can completely resolve.

Given that, the kernel just needs to guarantee that no missing page
which userspace might have observed after registering a userfaultfd is
filled by khugepaged. This patch takes the page cache lock, checks
that there are no registered userfaultfds, fills the missing pages,
and then releases the page cache lock. That's sufficient to guarantee
that any UFFDIO_REGISTER which occurs after the userfaultfd check but
before we finish the collapse can't actually observe any missing
pages.
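As a concrete (hypothetical) sketch of that usage pattern, a MISSING-mode
client would do something like:

	#define _GNU_SOURCE
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	/* Register [addr, addr + len) for MISSING mode, then rescan which
	 * pages are present.  uffd, shmem_fd and the mapping are assumed
	 * to be set up already. */
	static int register_and_rescan(int uffd, int shmem_fd,
				       void *addr, size_t len)
	{
		struct uffdio_register reg = {
			.range = {
				.start = (unsigned long)addr,
				.len = len,
			},
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		if (ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;

		/*
		 * Only now is presence/absence meaningful: khugepaged may
		 * have collapsed (and so filled) any hole before the ioctl
		 * returned.  Re-derive the hole list, e.g. via
		 * lseek(SEEK_HOLE)/lseek(SEEK_DATA), instead of trusting
		 * anything observed earlier.
		 */
		return lseek(shmem_fd, 0, SEEK_HOLE) < 0 ? -1 : 0;
	}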

-David

> Thanks,
>
> > +
> >       nr = thp_nr_pages(hpage);
> >
> >       if (is_shmem)
> > @@ -2000,6 +2047,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       xas_store(&xas, hpage);
> >  xa_locked:
> >       xas_unlock_irq(&xas);
> > +     if (i_mmap_locked)
> > +             i_mmap_unlock_read(mapping);
> >  xa_unlocked:
> >
> >       /*
> > @@ -2065,15 +2114,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >               }
> >
> >               xas_set(&xas, start);
> > -             xas_for_each(&xas, page, end - 1) {
> > +             end = index;
> > +             for (index = start; index < end; index++) {
> > +                     xas_next(&xas);
> >                       page = list_first_entry_or_null(&pagelist,
> >                                       struct page, lru);
> >                       if (!page || xas.xa_index < page->index) {
> > -                             if (!nr_none)
> > -                                     break;
> >                               nr_none--;
> > -                             /* Put holes back where they were */
> > -                             xas_store(&xas, NULL);
> >                               continue;
> >                       }
> >
> > --
> > 2.39.1.519.gcb327c4b5f-goog
> >
>
> --
> Peter Xu
>
David Stevens Feb. 7, 2023, 4:14 a.m. UTC | #9
On Tue, Feb 7, 2023 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 07, 2023 at 10:37:06AM +0900, David Stevens wrote:
> > On Tue, Feb 7, 2023 at 6:50 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Mon, Feb 06, 2023 at 03:52:19PM -0500, Peter Xu wrote:
> > > > The problem is khugepaged will release pgtable lock during collapsing, so
> > > > AFAICT there can be a race where some other thread tries to insert pages
> > > > into page cache in parallel with khugepaged right after khugepaged released
> > > > the page cache lock.
> > > >
> > > > For example, it seems to me new page cache can be inserted when khugepaged
> > > > is copying small page content to the new hpage.
> >
> > This particular race can't happen with either patch, since the missing
> > page cache entries are filled when we create the multi-index entry for
> > hpage.
>
> Can too.
>
>         for (index = start; index < end; index++) {
> ...
>                         if (xa_is_value(page) || !PageUptodate(page)) {
>                                 xas_unlock_irq(&xas);
>                                 /* swap in or instantiate fallocated page */
>                                 if (shmem_get_folio(mapping->host, index,
>                                                 &folio, SGP_NOALLOC)) {
>                                         result = SCAN_FAIL;
>                                         goto xa_unlocked;
>                                 }
> ...
>
> So we start the iteration, and then a page fault happens in one of
> the indices we've already examined, but we don't have the page on
> the list.  It's a nice wide race too because we're bringing the
> page in from swap.
>

Ah, I misunderstood your objection as referring specifically to the
copy_highpage loop at the end of collapse_file, rather than to the entire
process of the collapse.

Yes, there can definitely be faults that fill in pages during the
overall process of collapsing. However, my patch re-checks nr_none
after acquiring the page cache lock for the final time, right before
creating the hpage multi-index entry. Since present pages are locked,
they can't be truncated after we've looked at them. That means that if
nr_none still matches, then there weren't any page faults that
populated missing pages. If we do detect any filled in pages, we can
just roll back the collapse to avoid any problems.

-David
Peter Xu Feb. 7, 2023, 4:34 p.m. UTC | #10
On Tue, Feb 07, 2023 at 12:56:56PM +0900, David Stevens wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Collapsing memory will result in any empty pages in the target range
> > > being filled by the new THP. If userspace has a userfaultfd registered
> > > with MODE_MISSING, for any page which it knows to be missing after
> > > registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
> > > Taking these two facts together, khugepaged needs to take care when
> > > collapsing pages in shmem to make sure it doesn't break the userfaultfd
> > > API.
> > >
> > > This change first makes sure that the intermediate page cache state
> > > during collapse is not visible by moving when gaps are filled to after
> > > the page cache lock is acquired for the final time. This is necessary
> > > because the synchronization provided by locking hpage is insufficient
> > > for functions which operate on the page cache without actually locking
> > > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> > >
> > > This refactoring allows us to iterate over i_mmap to check for any VMAs
> > > with userfaultfds and then finalize the collapse if no such VMAs exist,
> > > all while holding the page cache lock. Since no mm locks are held, it is
> > > necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
> > > vm_flags are visible to khugepaged. However, no further locking of
> > > userfaultfd state is necessary. Although new userfaultfds can be
> > > registered concurrently with finalizing the collapse, any missing pages
> > > that are being replaced can no longer be observed by userspace, so there
> > > is no data race.
> > >
> > > This fix is targeted at khugepaged, but the change also applies to
> > > MADV_COLLAPSE. The fact that the intermediate page cache state before
> > > the rollback of a failed collapse can no longer be observed is
> > > technically a userspace-visible change (via at least SEEK_DATA and
> > > SEEK_END), but it is exceedingly unlikely that anything relies on being
> > > able to observe that transient state.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  fs/userfaultfd.c |  2 ++
> > >  mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
> > >  2 files changed, 59 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index cc694846617a..6ddfcff11920 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> > >       const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
> > >
> > >       vma->vm_flags = flags;
> > > +     /* Pairs with smp_rmb() in khugepaged's collapse_file() */
> > > +     smp_wmb();
> >
> > Could you help to explain why this is needed?  What's the operation to
> > serialize against updating vm_flags?
> 
> We need to ensure that any updates to VM_UFFD_MISSING from
> UFFDIO_REGISTER are visible to khugepaged before the ioctl returns to
> userspace. There aren't any locks that obviously provide that
> synchronization, so I added the smp_wmb/smp_rmb pair. It's possible
> that page cache lock indirectly provides sufficient synchronization,
> though.

If so I don't think it's needed.  If vm_flags updates cannot be flushed to
memory and made visible to other CPUs before the syscall returns, then it's
a fatal problem already before this patch.  I bet there are quite a few
barriers protecting it, while the world switch should be the last guardrail.

One example: vm_flags updates need mmap write lock, then: mmap_write_unlock
-> up_write -> __up_write -> preempt_disable() -> barrier().

> 
> > >       /*
> > >        * For shared mappings, we want to enable writenotify while
> > >        * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 79be13133322..97435c226b18 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -55,6 +55,7 @@ enum scan_result {
> > >       SCAN_CGROUP_CHARGE_FAIL,
> > >       SCAN_TRUNCATED,
> > >       SCAN_PAGE_HAS_PRIVATE,
> > > +     SCAN_PAGE_FILLED,
> > >  };
> > >
> > >  #define CREATE_TRACE_POINTS
> > > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > >   *  - allocate and lock a new huge page;
> > >   *  - scan page cache replacing old pages with the new one
> > >   *    + swap/gup in pages if necessary;
> > > - *    + fill in gaps;
> > >   *    + keep old pages around in case rollback is required;
> > > + *  - finalize updates to the page cache;
> > >   *  - if replacing succeeds:
> > >   *    + copy data over;
> > >   *    + free old pages;
> > > @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >       XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > >       int nr_none = 0, result = SCAN_SUCCEED;
> > >       bool is_shmem = shmem_file(file);
> > > +     bool i_mmap_locked = false;
> > >       int nr = 0;
> > >
> > >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >
> > >       /*
> > >        * At this point the hpage is locked and not up-to-date.
> > > -      * It's safe to insert it into the page cache, because nobody would
> > > -      * be able to map it or use it in another way until we unlock it.
> > > +      *
> > > +      * While iterating, we may drop the page cache lock multiple times. It
> > > +      * is safe to replace pages in the page cache with hpage while doing so
> > > +      * because nobody is able to map or otherwise access the content of
> > > +      * hpage until we unlock it. However, we cannot insert hpage into empty
> > > +      * indicies until we know we won't have to drop the page cache lock
> > > +      * again, as doing so would let things which only check the presence
> > > +      * of pages in the page cache see a state that may yet be rolled back.
> > >        */
> > >
> > >       xas_set(&xas, start);
> > > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >                                               result = SCAN_TRUNCATED;
> > >                                               goto xa_locked;
> > >                                       }
> > > -                                     xas_set(&xas, index);
> > > +                                     xas_set(&xas, index + 1);
> >
> > I failed to figure out why this index needs a shift here... It seems to me
> > it'll ignore the initial empty page cache, is that what the patch wanted?
> 
> This chunk is preceded by a call to xas_next_entry. Previously,
> xas_store(hpage) would pair with xas_set(index) to walk the xarray and
> prepare xas for the xas_next at the start of the next loop iteration.
> Now that we're no longer calling xas_store, xas_set(index+1) is
> necessary to prepare xas for the next iteration of the loop.

If I'm not wrong, xas_store() doesn't move the needle at all.  It's the
xas_next() at the entry of the loop that does.

> 
> > >                               }
> > >                               if (!shmem_charge(mapping->host, 1)) {
> > >                                       result = SCAN_FAIL;
> > >                                       goto xa_locked;
> > >                               }
> > > -                             xas_store(&xas, hpage);
> >
> > [I raised a question in the other thread on whether it's legal to not
> >  populate page cache holes at all.  We can keep the discussion there]
> >
> > >                               nr_none++;
> > >                               continue;
> > >                       }
> > > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >               put_page(page);
> > >               goto xa_unlocked;
> > >       }
> > > +
> > > +     if (nr_none) {
> > > +             struct vm_area_struct *vma;
> > > +             int nr_none_check = 0;
> > > +
> > > +             xas_unlock_irq(&xas);
> > > +             i_mmap_lock_read(mapping);
> > > +             i_mmap_locked = true;
> > > +             xas_lock_irq(&xas);
> > > +
> > > +             xas_set(&xas, start);
> > > +             for (index = start; index < end; index++) {
> > > +                     if (!xas_next(&xas))
> > > +                             nr_none_check++;
> > > +             }
> > > +
> > > +             if (nr_none != nr_none_check) {
> > > +                     result = SCAN_PAGE_FILLED;
> > > +                     goto xa_locked;
> > > +             }
> > > +
> > > +             /*
> > > +              * If userspace observed a missing page in a VMA with an armed
> > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > +              * that page, so we need to roll back to avoid suppressing such
> > > +              * an event. Any userfaultfds armed after this point will not be
> > > +              * able to observe any missing pages, since the page cache is
> > > +              * locked until after the collapse is completed.
> > > +              *
> > > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > > +              */
> > > +             smp_rmb();
> > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > +                     if (userfaultfd_missing(vma)) {
> > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > +                             goto xa_locked;
> > > +                     }
> > > +             }
> > > +     }
> >
> > Thanks for writing the patch, but I am still confused why this can avoid
> > race conditions of uffd missing mode.
> >
> > I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> > succeeded on a specific shmem VMA, any page fault due to a missing page
> > cache entry on this vma should generate a userfault message.  If not,
> > it's violating the
> > userfaultfd semantics.
> >
> > Here, I don't see what stops a new vma from registering MISSING mode right
> > away in parallel with the collapse.
> >
> > When registration completes, it means we should report uffd missing
> > messages on the holes that collapse_file() scanned previously.  However
> > they'll very possibly be filled later with the THP, which means the messages
> > can be lost.  Then the issue can still happen, while this patch only makes
> > it very unlikely to happen, or am I wrong?
> 
> This race can still happen, but I don't think it's actually a problem.
> In particular, I think the semantics you've given for UFFDIO_REGISTER
> are overly restrictive on the kernel in a way that provides no
> meaningful benefit to userspace.
> 
> To simplify things a little bit, we only need to consider going from
> zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
> is already registered isn't particularly interesting, since khugepaged
> would have seen the existing userfaultfd and given up on the collapse.

Could you point me to which part of the code you're referencing here?

Note that unlike anonymous thp collapsing scan (hpage_collapse_scan_pmd),
hpage_collapse_scan_file works on page caches only.

> 
> When userspace registers the first userfaultfd, it cannot rely on any
> knowledge about the presence or absence of pages from before when
> UFFDIO_REGISTER completes. This is because, as far as userspace is
> concerned, khugepaged could have come along and collapsed every single
> page in the shmem right after UFFDIO_REGISTER entered the kernel. So
> to properly use the userfaultfd API, userspace must re-examine the
> state of memory after UFFDIO_REGISTER completes. Because of that,
> having UFFDIO_REGISTER be a complete barrier with regards to
> khugepaged is essentially meaningless from the perspective of
> userspace. The guarantee might slightly reduce the window for some
> races in userspace programs, but there are no userspace races that it
> can completely resolve.
> 
> Given that, the kernel just needs to guarantee that no missing page
> which userspace might have observed after registering a userfaultfd is
> filled by khugepaged. This patch takes the page cache lock, checks
> that there are no registered userfaultfds, fills the missing pages,
> and then releases the page cache lock. That's sufficient to guarantee
> that any UFFDIO_REGISTER which occurs after the userfaultfd check but
> before we finish the collapse can't actually observe any missing
> pages.

I don't think your current patch guarantees that?

Now I'm confused why you removed population of the small pages using hpage,
because that seems to support what you're doing here.  Note that your
current code is added _before_ the replacement of the huge page here:

	/* Join all the small entries into a single multi-index entry */
	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
	xas_store(&xas, hpage);

So either the patch needs to keep the original logic to fill up the holes
with hpage, or the vma checks need to be done later, to achieve what you
proposed to happen, IIUC.

Besides, please imagine an uffd program that does this:

  (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
  (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
      noticed that it returns DATA, means this is not a hole.  So it's
      expected that there will be no further message generated to this page.
  (3) After a while, it received a missing fault on page P1

My question is whether the above is a problem or not?  Whether it violates
the userfaultfd missing mode semantics?

AFAICT, the above can happen if some program accidentally registers UFFD
missing mode during the collapse of the shmem THP containing P1, with the
current logic applied.

Thanks,
David Stevens Feb. 8, 2023, 2:42 a.m. UTC | #11
On Wed, Feb 8, 2023 at 1:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 07, 2023 at 12:56:56PM +0900, David Stevens wrote:
> > On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > > From: David Stevens <stevensd@chromium.org>
> > > >
> > > > Collapsing memory will result in any empty pages in the target range
> > > > being filled by the new THP. If userspace has a userfaultfd registered
> > > > with MODE_MISSING, for any page which it knows to be missing after
> > > > registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
> > > > Taking these two facts together, khugepaged needs to take care when
> > > > collapsing pages in shmem to make sure it doesn't break the userfaultfd
> > > > API.
> > > >
> > > > This change first makes sure that the intermediate page cache state
> > > > during collapse is not visible by moving when gaps are filled to after
> > > > the page cache lock is acquired for the final time. This is necessary
> > > > because the synchronization provided by locking hpage is insufficient
> > > > for functions which operate on the page cache without actually locking
> > > > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> > > >
> > > > This refactoring allows us to iterate over i_mmap to check for any VMAs
> > > > with userfaultfds and then finalize the collapse if no such VMAs exist,
> > > > all while holding the page cache lock. Since no mm locks are held, it is
> > > > necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
> > > > vm_flags are visible to khugepaged. However, no further locking of
> > > > userfaultfd state is necessary. Although new userfaultfds can be
> > > > registered concurrently with finalizing the collapse, any missing pages
> > > > that are being replaced can no longer be observed by userspace, so there
> > > > is no data race.
> > > >
> > > > This fix is targeted at khugepaged, but the change also applies to
> > > > MADV_COLLAPSE. The fact that the intermediate page cache state before
> > > > the rollback of a failed collapse can no longer be observed is
> > > > technically a userspace-visible change (via at least SEEK_DATA and
> > > > SEEK_END), but it is exceedingly unlikely that anything relies on being
> > > > able to observe that transient state.
> > > >
> > > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > > ---
> > > >  fs/userfaultfd.c |  2 ++
> > > >  mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
> > > >  2 files changed, 59 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index cc694846617a..6ddfcff11920 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> > > >       const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
> > > >
> > > >       vma->vm_flags = flags;
> > > > +     /* Pairs with smp_rmb() in khugepaged's collapse_file() */
> > > > +     smp_wmb();
> > >
> > > Could you help to explain why this is needed?  What's the operation to
> > > serialize against updating vm_flags?
> >
> > We need to ensure that any updates to VM_UFFD_MISSING from
> > UFFDIO_REGISTER are visible to khugepaged before the ioctl returns to
> > userspace. There aren't any locks that obviously provide that
> > synchronization, so I added the smp_wmb/smp_rmb pair. It's possible
> > that page cache lock indirectly provides sufficient synchronization,
> > though.
>
> If so I don't think it's needed.  If vm_flags cannot be flushed to memory
> and present to the rest before the syscall returns, then it's a fatal
> problem already before this patch.  I bet there're quite a few barriers to
> protect it, while the world switch should be the last guardline.
>
> One example: vm_flags updates need mmap write lock, then: mmap_write_unlock
> -> up_write -> __up_write -> preempt_disable() -> barrier().
>
> >
> > > >       /*
> > > >        * For shared mappings, we want to enable writenotify while
> > > >        * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 79be13133322..97435c226b18 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -55,6 +55,7 @@ enum scan_result {
> > > >       SCAN_CGROUP_CHARGE_FAIL,
> > > >       SCAN_TRUNCATED,
> > > >       SCAN_PAGE_HAS_PRIVATE,
> > > > +     SCAN_PAGE_FILLED,
> > > >  };
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > >   *  - allocate and lock a new huge page;
> > > >   *  - scan page cache replacing old pages with the new one
> > > >   *    + swap/gup in pages if necessary;
> > > > - *    + fill in gaps;
> > > >   *    + keep old pages around in case rollback is required;
> > > > + *  - finalize updates to the page cache;
> > > >   *  - if replacing succeeds:
> > > >   *    + copy data over;
> > > >   *    + free old pages;
> > > > @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >       XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > > >       int nr_none = 0, result = SCAN_SUCCEED;
> > > >       bool is_shmem = shmem_file(file);
> > > > +     bool i_mmap_locked = false;
> > > >       int nr = 0;
> > > >
> > > >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >
> > > >       /*
> > > >        * At this point the hpage is locked and not up-to-date.
> > > > -      * It's safe to insert it into the page cache, because nobody would
> > > > -      * be able to map it or use it in another way until we unlock it.
> > > > +      *
> > > > +      * While iterating, we may drop the page cache lock multiple times. It
> > > > +      * is safe to replace pages in the page cache with hpage while doing so
> > > > +      * because nobody is able to map or otherwise access the content of
> > > > +      * hpage until we unlock it. However, we cannot insert hpage into empty
> > > > +      * indices until we know we won't have to drop the page cache lock
> > > > +      * again, as doing so would let things which only check the presence
> > > > +      * of pages in the page cache see a state that may yet be rolled back.
> > > >        */
> > > >
> > > >       xas_set(&xas, start);
> > > > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >                                               result = SCAN_TRUNCATED;
> > > >                                               goto xa_locked;
> > > >                                       }
> > > > -                                     xas_set(&xas, index);
> > > > +                                     xas_set(&xas, index + 1);
> > >
> > > I failed to figure out why this index needs a shift here... It seems to me
> > > it'll ignore the initial empty page cache, is that what the patch wanted?
> >
> > This chunk is preceded by a call to xas_next_entry. Previously,
> > xas_store(hpage) would pair with xas_set(index) to walk the xarray and
> > prepare xas for the xas_next at the start of the next loop iteration.
> > Now that we're no longer calling xas_store, xas_set(index+1) is
> > necessary to prepare xas for the next iteration of the loop.
>
> If I'm not wrong, xas_store() doesn't move the needle at all.  It's the
> xas_next() at the entry of the loop that does.

xas_set(n) resets the cursor to the top of the tree. If the cursor is
at the top of the tree, then xas_next() walks to n itself instead of
advancing to n+1. xas_store() on the other hand behaves the same
regardless of whether or not the cursor is at the top of the tree (it
starts by calling xas_create or xas_load, which will walk the tree if
necessary).

Looking at it another way, if we're here, then start == index. So if
we were to call xas_set(&xas, index), xas at the start of the next
iteration would be in an identical state to when we entered the loop
the first time. So xas would be one value behind index, and we would
never actually examine the last entry in the range being collapsed.
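
To make the cursor behavior concrete, here is a minimal sketch
(illustrative only, using the standard <linux/xarray.h> API; this is not
code from the patch):

XA_STATE(xas, &mapping->i_pages, start);
void *entry;

xas_lock_irq(&xas);
entry = xas_next(&xas);    /* cursor was XAS_RESTART: walks to 'start' */
/* ... xas_next_entry() then moves the cursor somewhere past 'start' ... */
xas_set(&xas, start + 1);  /* back to XAS_RESTART, aimed at 'start + 1' */
entry = xas_next(&xas);    /* walks to 'start + 1', not 'start + 2' */
xas_unlock_irq(&xas);

Had that xas_set() used 'start', the following xas_next() would have
returned the entry at 'start' again while the loop counter had already
advanced, which is exactly the off-by-one described above.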

> >
> > > >                               }
> > > >                               if (!shmem_charge(mapping->host, 1)) {
> > > >                                       result = SCAN_FAIL;
> > > >                                       goto xa_locked;
> > > >                               }
> > > > -                             xas_store(&xas, hpage);
> > >
> > > [I raised a question in the other thread on whether it's legal to not
> > >  populate page cache holes at all.  We can keep the discussion there]
> > >
> > > >                               nr_none++;
> > > >                               continue;
> > > >                       }
> > > > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >               put_page(page);
> > > >               goto xa_unlocked;
> > > >       }
> > > > +
> > > > +     if (nr_none) {
> > > > +             struct vm_area_struct *vma;
> > > > +             int nr_none_check = 0;
> > > > +
> > > > +             xas_unlock_irq(&xas);
> > > > +             i_mmap_lock_read(mapping);
> > > > +             i_mmap_locked = true;
> > > > +             xas_lock_irq(&xas);
> > > > +
> > > > +             xas_set(&xas, start);
> > > > +             for (index = start; index < end; index++) {
> > > > +                     if (!xas_next(&xas))
> > > > +                             nr_none_check++;
> > > > +             }
> > > > +
> > > > +             if (nr_none != nr_none_check) {
> > > > +                     result = SCAN_PAGE_FILLED;
> > > > +                     goto xa_locked;
> > > > +             }
> > > > +
> > > > +             /*
> > > > +              * If userspace observed a missing page in a VMA with an armed
> > > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > > +              * that page, so we need to roll back to avoid suppressing such
> > > > +              * an event. Any userfaultfds armed after this point will not be
> > > > +              * able to observe any missing pages, since the page cache is
> > > > +              * locked until after the collapse is completed.
> > > > +              *
> > > > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > > > +              */
> > > > +             smp_rmb();
> > > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > > +                     if (userfaultfd_missing(vma)) {
> > > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > > +                             goto xa_locked;
> > > > +                     }
> > > > +             }
> > > > +     }
> > >
> > > Thanks for writing the patch, but I am still confused why this can avoid
> > > race conditions of uffd missing mode.
> > >
> > > I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> > > succeeded on a specific shmem VMA, any page faults due to missing page
> > > cache on this vma should generate a page fault.  If not, it's violating the
> > > userfaultfd semantics.
> > >
> > > Here, I don't see what stops a new vma from registering MISSING mode right
> > > away in parallel of collapsing.
> > >
> > > When registration completes, it means we should report uffd missing
> > > messages on the holes that collapse_file() scanned previously.  However
> > > they'll be filled very possibly later with the thp which means the messages
> > > can be lost.  Then the issue can still happen, while this patch only makes
> > > it very unlikely to happen, or am I wrong?
> >
> > This race can still happen, but I don't think it's actually a problem.
> > In particular, I think the semantics you've given for UFFDIO_REGISTER
> > are overly restrictive on the kernel in a way that provides no
> > meaningful benefit to userspace.
> >
> > To simplify things a little bit, we only need to consider going from
> > zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
> > is already registered isn't particularly interesting, since khugepaged
> > would have seen the existing userfaultfd and given up on the collapse.
>
> Could you point me which part of the code you're referencing here?

I'm referencing the behavior with this patch. With my patch, all
khugepaged cares about is whether there exists an armed userfaultfd on
a given VMA. If there already exists an armed userfaultfd on a given
VMA, then with my patch, khugepaged will give up collapsing. So the
case where a second/third/etc userfaultfd is being armed isn't
interesting, because khugepaged will already not be doing anything, so
there cannot be any race condition.

> Note that unlike anonymous thp collapsing scan (hpage_collapse_scan_pmd),
> hpage_collapse_scan_file works on page caches only.
>
> >
> > When userspace registers the first userfaultfd, it cannot rely on any
> > knowledge about the presence or absence of pages from before when
> > UFFDIO_REGISTER completes. This is because, as far as userspace is
> > concerned, khugepaged could have come along and collapsed every single
> > page in the shmem right after UFFDIO_REGISTER entered the kernel. So
> > to properly use the userfaultfd API, userspace must re-examine the
> > state of memory after UFFDIO_REGISTER completes. Because of that,
> > having UFFDIO_REGISTER be a complete barrier with regards to
> > khugepaged is essentially meaningless from the perspective of
> > userspace. The guarantee might slightly reduce the window for some
> > races in userspace programs, but there are no userspace races that it
> > can completely resolve.
> >
> > Given that, the kernel just needs to guarantee that no missing page
> > which userspace might have observed after registering a userfaultfd is
> > filled by khugepaged. This patch takes the page cache lock, checks
> > that there are no registered userfaultfds, fills the missing pages,
> > and then releases the page cache lock. That's sufficient to guarantee
> > that any UFFDIO_REGISTER which occurs after the userfaultfd check but
> > before we finish the collapse can't actually observe any missing
> > pages.
>
> I don't think your current patch guaranteed that?
>
> Now I'm confused why you removed population of the small pages using hpage,
> because that seems to support what you're doing here.  Note that your
> current code is added _before_ the replacement of the huge page here:
>
>         /* Join all the small entries into a single multi-index entry */
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>         xas_store(&xas, hpage);
>
> So either the patch needs to keep the original logic to fill up the holes
> with hpage, or the vma checks needs to be done later, to achieve what you
> proposed to happen, IIUC.

Maybe I'm fundamentally misunderstanding how locking for shmem works?
With my patch, we effectively have:

xas_lock_irq(&xas);
<check that userfaultfd_missing(VMA) == false for all VMAs>
xas_set_order(&xas, start, HPAGE_PMD_ORDER);
xas_store(&xas, hpage);
xas_unlock_irq(&xas);

Here's my understanding. If there are no armed userfaultfds during the
check, then userspace could not have made any observations of the
shmem with an armed userfaultfd before collapse_file locked xas. The
xas_store() is where missing pages become present pages. Since this
happens while xas is still locked, we know that no additional
observations of shmem could have occurred since locking xas. So even if
registration of a new userfaultfd raced with the check, we know that
new userfaultfd couldn't have made any observations of shmem.

Are you saying that it's possible for something to come in and observe
a missing page from the shmem between when we check the VMAs and when
we store hpage, even though xas is locked? Or that userspace can
observe shmem completely independently of the locking done by
xas_lock_irq(&xas)?

> Besides, please imagine an uffd program that does this:
>
>   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
>   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
>       noticed that it returns DATA, means this is not a hole.  So it's
>       expected that there will be no further message generated to this page.
>   (3) After a while, it received a missing fault on page P1
>
> My question is whether the above is a problem or not, i.e. whether it
> violates the userfaultfd missing mode semantics.
>
> AFAICT, the above can happen if some program accidentally registers UFFD
> missing mode while the shmem THP containing P1 is being collapsed, with
> the current logic applied.

That's a bug that can occur with khugepaged as it exists before my
patch, if collapse_file fills a missing page but then needs to roll
back the collapse. With my patch, khugepaged doesn't fill any missing
pages until after it knows for sure the collapse will succeed, so I
don't see any way for collapse_file to cause a formerly present page
to become absent.

-David
Peter Xu Feb. 8, 2023, 5:24 p.m. UTC | #12
On Wed, Feb 08, 2023 at 11:42:22AM +0900, David Stevens wrote:
> On Wed, Feb 8, 2023 at 1:34 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 07, 2023 at 12:56:56PM +0900, David Stevens wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > > > From: David Stevens <stevensd@chromium.org>
> > > > >
> > > > > Collapsing memory will result in any empty pages in the target range
> > > > > being filled by the new THP. If userspace has a userfaultfd registered
> > > > > with MODE_MISSING, for any page which it knows to be missing after
> > > > > registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT.
> > > > > Taking these two facts together, khugepaged needs to take care when
> > > > > collapsing pages in shmem to make sure it doesn't break the userfaultfd
> > > > > API.
> > > > >
> > > > > This change first makes sure that the intermediate page cache state
> > > > > during collapse is not visible by moving when gaps are filled to after
> > > > > the page cache lock is acquired for the final time. This is necessary
> > > > > because the synchronization provided by locking hpage is insufficient
> > > > > for functions which operate on the page cache without actually locking
> > > > > individual pages to examine their content (e.g. shmem_mfill_atomic_pte).
> > > > >
> > > > > This refactoring allows us to iterate over i_mmap to check for any VMAs
> > > > > with userfaultfds and then finalize the collapse if no such VMAs exist,
> > > > > all while holding the page cache lock. Since no mm locks are held, it is
> > > > > necessary to add smp_rmb/smp_wmb to ensure that userfaultfd updates to
> > > > > vm_flags are visible to khugepaged. However, no further locking of
> > > > > userfaultfd state is necessary. Although new userfaultfds can be
> > > > > registered concurrently with finalizing the collapse, any missing pages
> > > > > that are being replaced can no longer be observed by userspace, so there
> > > > > is no data race.
> > > > >
> > > > > This fix is targeted at khugepaged, but the change also applies to
> > > > > MADV_COLLAPSE. The fact that the intermediate page cache state before
> > > > > the rollback of a failed collapse can no longer be observed is
> > > > > technically a userspace-visible change (via at least SEEK_DATA and
> > > > > SEEK_END), but it is exceedingly unlikely that anything relies on being
> > > > > able to observe that transient state.
> > > > >
> > > > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > > > ---
> > > > >  fs/userfaultfd.c |  2 ++
> > > > >  mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
> > > > >  2 files changed, 59 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index cc694846617a..6ddfcff11920 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> > > > >       const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
> > > > >
> > > > >       vma->vm_flags = flags;
> > > > > +     /* Pairs with smp_rmb() in khugepaged's collapse_file() */
> > > > > +     smp_wmb();
> > > >
> > > > Could you help to explain why this is needed?  What's the operation to
> > > > serialize against updating vm_flags?
> > >
> > > We need to ensure that any updates to VM_UFFD_MISSING from
> > > UFFDIO_REGISTER are visible to khugepaged before the ioctl returns to
> > > userspace. There aren't any locks that obviously provide that
> > > synchronization, so I added the smp_wmb/smp_rmb pair. It's possible
> > > that page cache lock indirectly provides sufficient synchronization,
> > > though.
> >
> > If so I don't think it's needed.  If vm_flags cannot be flushed to memory
> > and present to the rest before the syscall returns, then it's a fatal
> > problem already before this patch.  I bet there're quite a few barriers to
> > protect it, while the world switch should be the last guardline.
> >
> > One example: vm_flags updates need mmap write lock, then: mmap_write_unlock
> > -> up_write -> __up_write -> preempt_disable() -> barrier().
> >
> > >
> > > > >       /*
> > > > >        * For shared mappings, we want to enable writenotify while
> > > > >        * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 79be13133322..97435c226b18 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -55,6 +55,7 @@ enum scan_result {
> > > > >       SCAN_CGROUP_CHARGE_FAIL,
> > > > >       SCAN_TRUNCATED,
> > > > >       SCAN_PAGE_HAS_PRIVATE,
> > > > > +     SCAN_PAGE_FILLED,
> > > > >  };
> > > > >
> > > > >  #define CREATE_TRACE_POINTS
> > > > > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > > >   *  - allocate and lock a new huge page;
> > > > >   *  - scan page cache replacing old pages with the new one
> > > > >   *    + swap/gup in pages if necessary;
> > > > > - *    + fill in gaps;
> > > > >   *    + keep old pages around in case rollback is required;
> > > > > + *  - finalize updates to the page cache;
> > > > >   *  - if replacing succeeds:
> > > > >   *    + copy data over;
> > > > >   *    + free old pages;
> > > > > @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >       XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > > > >       int nr_none = 0, result = SCAN_SUCCEED;
> > > > >       bool is_shmem = shmem_file(file);
> > > > > +     bool i_mmap_locked = false;
> > > > >       int nr = 0;
> > > > >
> > > > >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > > @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >
> > > > >       /*
> > > > >        * At this point the hpage is locked and not up-to-date.
> > > > > -      * It's safe to insert it into the page cache, because nobody would
> > > > > -      * be able to map it or use it in another way until we unlock it.
> > > > > +      *
> > > > > +      * While iterating, we may drop the page cache lock multiple times. It
> > > > > +      * is safe to replace pages in the page cache with hpage while doing so
> > > > > +      * because nobody is able to map or otherwise access the content of
> > > > > +      * hpage until we unlock it. However, we cannot insert hpage into empty
> > > > > +      * indices until we know we won't have to drop the page cache lock
> > > > > +      * again, as doing so would let things which only check the presence
> > > > > +      * of pages in the page cache see a state that may yet be rolled back.
> > > > >        */
> > > > >
> > > > >       xas_set(&xas, start);
> > > > > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >                                               result = SCAN_TRUNCATED;
> > > > >                                               goto xa_locked;
> > > > >                                       }
> > > > > -                                     xas_set(&xas, index);
> > > > > +                                     xas_set(&xas, index + 1);
> > > >
> > > > I failed to figure out why this index needs a shift here... It seems to me
> > > > it'll ignore the initial empty page cache, is that what the patch wanted?
> > >
> > > This chunk is preceded by a call to xas_next_entry. Previously,
> > > xas_store(hpage) would pair with xas_set(index) to walk the xarray and
> > > prepare xas for the xas_next at the start of the next loop iteration.
> > > Now that we're no longer calling xas_store, xas_set(index+1) is
> > > necessary to prepare xas for the next iteration of the loop.
> >
> > If I'm not wrong, xas_store() doesn't move the needle at all.  It's the
> > xas_next() at the entry of the loop that does.
> 
> xas_set(n) resets the cursor to the top of the tree. If the cursor is
> at the top of the tree, then xas_next() walks to n instead of
> advancing n to n+1. xas_store() on the other hand behaves the same
> regardless of whether or not the cursor is at the top of the tree (it
> starts by calling xas_create or xas_load, which will walk the tree if
> necessary).
> 
> Looking at it another way, if we're here, then start == index. So if
> we were to call xas_set(&xas, index), xas at the start of the next
> iteration would be in an identical state to when we entered the loop
> the first time. So xas would be one value behind index, and we would
> never actually examine the last entry in the range being collapsed.

Are you suggesting that this is a bug in the existing code?

My read on this whole chunk is: if this is the first page we're scanning
and it's missing, let's have a look to make sure the whole THP isn't
truncated.  If it is, we return SCAN_TRUNCATED.  Otherwise, we fill up the
holes _starting from this index 0_.

AFAIU your change here will make that last sentence start from index 1.

> 
> > >
> > > > >                               }
> > > > >                               if (!shmem_charge(mapping->host, 1)) {
> > > > >                                       result = SCAN_FAIL;
> > > > >                                       goto xa_locked;
> > > > >                               }
> > > > > -                             xas_store(&xas, hpage);
> > > >
> > > > [I raised a question in the other thread on whether it's legal to not
> > > >  populate page cache holes at all.  We can keep the discussion there]
> > > >
> > > > >                               nr_none++;
> > > > >                               continue;
> > > > >                       }
> > > > > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >               put_page(page);
> > > > >               goto xa_unlocked;
> > > > >       }
> > > > > +
> > > > > +     if (nr_none) {
> > > > > +             struct vm_area_struct *vma;
> > > > > +             int nr_none_check = 0;
> > > > > +
> > > > > +             xas_unlock_irq(&xas);
> > > > > +             i_mmap_lock_read(mapping);
> > > > > +             i_mmap_locked = true;
> > > > > +             xas_lock_irq(&xas);
> > > > > +
> > > > > +             xas_set(&xas, start);
> > > > > +             for (index = start; index < end; index++) {
> > > > > +                     if (!xas_next(&xas))
> > > > > +                             nr_none_check++;
> > > > > +             }
> > > > > +
> > > > > +             if (nr_none != nr_none_check) {
> > > > > +                     result = SCAN_PAGE_FILLED;
> > > > > +                     goto xa_locked;
> > > > > +             }
> > > > > +
> > > > > +             /*
> > > > > +              * If userspace observed a missing page in a VMA with an armed
> > > > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > > > +              * that page, so we need to roll back to avoid suppressing such
> > > > > +              * an event. Any userfaultfds armed after this point will not be
> > > > > +              * able to observe any missing pages, since the page cache is
> > > > > +              * locked until after the collapse is completed.
> > > > > +              *
> > > > > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > > > > +              */
> > > > > +             smp_rmb();
> > > > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > > > +                     if (userfaultfd_missing(vma)) {
> > > > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > > > +                             goto xa_locked;
> > > > > +                     }
> > > > > +             }
> > > > > +     }
> > > >
> > > > Thanks for writing the patch, but I am still confused why this can avoid
> > > > race conditions of uffd missing mode.
> > > >
> > > > I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> > > > succeeded on a specific shmem VMA, any page faults due to missing page
> > > > cache on this vma should generate a page fault.  If not, it's violating the
> > > > userfaultfd semantics.
> > > >
> > > > Here, I don't see what stops a new vma from registering MISSING mode right
> > > > away in parallel of collapsing.
> > > >
> > > > When registration completes, it means we should report uffd missing
> > > > messages on the holes that collapse_file() scanned previously.  However
> > > > they'll be filled very possibly later with the thp which means the messages
> > > > can be lost.  Then the issue can still happen, while this patch only makes
> > > > it very unlikely to happen, or am I wrong?
> > >
> > > This race can still happen, but I don't think it's actually a problem.
> > > In particular, I think the semantics you've given for UFFDIO_REGISTER
> > > are overly restrictive on the kernel in a way that provides no
> > > meaningful benefit to userspace.
> > >
> > > To simplify things a little bit, we only need to consider going from
> > > zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
> > > is already registered isn't particularly interesting, since khugepaged
> > > would have seen the existing userfaultfd and given up on the collapse.
> >
> > Could you point me which part of the code you're referencing here?
> 
> I'm referencing the behavior with this patch. With my patch, all
> khugepaged cares about is whether there exists an armed userfaultfd on
> a given VMA. If there already exists an armed userfaultfd on a given
> VMA, then with my patch, khugepaged will give up collapsing. So the
> case where a second/third/etc userfaultfd is being armed isn't
> interesting, because khugepaged will already not be doing anything, so
> there cannot be any race condition.
> 
> > Note that unlike anonymous thp collapsing scan (hpage_collapse_scan_pmd),
> > hpage_collapse_scan_file works on page caches only.
> >
> > >
> > > When userspace registers the first userfaultfd, it cannot rely on any
> > > knowledge about the presence or absence of pages from before when
> > > UFFDIO_REGISTER completes. This is because, as far as userspace is
> > > concerned, khugepaged could have come along and collapsed every single
> > > page in the shmem right after UFFDIO_REGISTER entered the kernel. So
> > > to properly use the userfaultfd API, userspace must re-examine the
> > > state of memory after UFFDIO_REGISTER completes. Because of that,
> > > having UFFDIO_REGISTER be a complete barrier with regards to
> > > khugepaged is essentially meaningless from the perspective of
> > > userspace. The guarantee might slightly reduce the window for some
> > > races in userspace programs, but there are no userspace races that it
> > > can completely resolve.
> > >
> > > Given that, the kernel just needs to guarantee that no missing page
> > > which userspace might have observed after registering a userfaultfd is
> > > filled by khugepaged. This patch takes the page cache lock, checks
> > > that there are no registered userfaultfds, fills the missing pages,
> > > and then releases the page cache lock. That's sufficient to guarantee
> > > that any UFFDIO_REGISTER which occurs after the userfaultfd check but
> > > before we finish the collapse can't actually observe any missing
> > > pages.
> >
> > I don't think your current patch guaranteed that?
> >
> > Now I'm confused why you removed population of the small pages using hpage,
> > because that seems to support what you're doing here.  Note that your
> > current code is added _before_ the replacement of the huge page here:
> >
> >         /* Join all the small entries into a single multi-index entry */
> >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> >         xas_store(&xas, hpage);
> >
> > So either the patch needs to keep the original logic to fill up the holes
> > with hpage, or the vma checks needs to be done later, to achieve what you
> > proposed to happen, IIUC.
> 
> Maybe I'm fundamentally misunderstanding how locking for shmem works?
> With my patch, we effectively have:
> 
> xas_lock_irq(&xas);
> <check that userfaultfd_missing(VMA) == false for all VMAs>
> xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> xas_store(&xas, hpage);
> xas_unlock_irq(&xas);
> 
> Here's my understanding. If there are no armed userfaultfds during the
> check, then userspace could not have made any observations of the
> shmem with an armed userfaultfd before collapse_file locked xas. The
> xas_store() is where missing pages become present pages. Since this
> happens while xas is still locked, we know that no additional
> > observations of shmem could have occurred since locking xas. So even if
> registration of a new userfaultfd raced with the check, we know that
> new userfaultfd couldn't have made any observations of shmem.
> 
> Are you saying that it's possible for something to come in and observe
> a missing page from the shmem between when we check the VMAs and when
> we store hpage, even though xas is locked? Or that userspace can
> observe shmem completely independently of the locking done by
> xas_lock_irq(&xas)?

AFAIU a pure lookup is done without the xas lock, under just the RCU read
lock.  More below.

> 
> > Besides, please imagine an uffd program that does this:
> >
> >   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
> >   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
> >       noticed that it returns DATA, means this is not a hole.  So it's
> >       expected that there will be no further message generated to this page.
> >   (3) After a while, it received a missing fault on page P1
> >
> > My question is whether the above is a problem or not, i.e. whether it
> > violates the userfaultfd missing mode semantics.
> >
> > AFAICT, the above can happen if some program accidentally registers UFFD
> > missing mode while the shmem THP containing P1 is being collapsed, with
> > the current logic applied.
> 
> That's a bug that can occur with khugepaged as it exists before my
> patch, if collapse_file fills a missing page but then needs to roll
> back the collapse. With my patch, khugepaged doesn't fill any missing
> pages until after it knows for sure the collapse will succeed, so I
> don't see any way for collapse_file to cause a formerly present page
> to become absent.

Right, I should have reversed the conditions, sorry:

  (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
  (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
      noticed that it returns HOLE.
  (3) The HOLE got collapsed into a THP

I think it can still happen after the patch is applied.  lseek has:

  shmem_file_llseek -> mapping_seek_hole_data -> find_get_entry

So I think it can happen concurrently: one thread can do (1+2) after the
vma checks but before your patch collapses the thp in the page cache.
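
To spell out steps (1) and (2) in code, a racing program could look
roughly like the sketch below (hypothetical; the helper name and error
handling are invented, but the ioctls and the lseek semantics are the
standard ones):

#define _GNU_SOURCE	/* for SEEK_DATA */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Returns nonzero if the page at 'offset' in shmem_fd looks like a HOLE. */
static int probe_hole_with_uffd(int shmem_fd, void *addr, size_t len,
				off_t offset)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)	/* step (1) */
		return -1;

	/* step (2): SEEK_DATA skipping past 'offset' means P1 is a hole */
	return lseek(shmem_fd, offset, SEEK_DATA) != offset;
}

Since the lookup in step (2) runs under only the RCU read lock, it can
observe the hole even while collapse_file() holds the xas lock, and the
hole can then be filled without any UFFD_EVENT_PAGEFAULT being delivered.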

What I'm sincerely not sure about so far is whether we should care about
cases like lseek (or mincore, which can do something similar if swap
isn't considered).

The inode counter I proposed earlier should be able to avoid that by
serializing the THP collapse operation against step (1) with a per-inode
lock.  So IIUC that should even protect things like lseek().  The problem
is that the change is slightly more involved (the counter enlarges every
shmem inode struct whenever CONFIG_USERFAULTFD is enabled), and the
impact is on the whole inode even if only one page is registered with
uffd.
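
For concreteness, one possible shape of that counter (the field name,
type and hook placement are all invented here, not from any posted
patch):

/* hypothetical field in struct shmem_inode_info (CONFIG_USERFAULTFD only) */
atomic_t uffd_registrations;

/* UFFDIO_REGISTER path, once per shmem vma being armed: */
atomic_inc(&SHMEM_I(file_inode(vma->vm_file))->uffd_registrations);

/* collapse_file(): sample the counter before scanning ... */
int uffd_seq = atomic_read(&SHMEM_I(mapping->host)->uffd_registrations);

/* ... and re-check it under the page cache lock before filling holes: */
if (atomic_read(&SHMEM_I(mapping->host)->uffd_registrations) != uffd_seq) {
	result = SCAN_FAIL;	/* a registration raced in; roll back */
	goto xa_locked;
}

A real implementation would still need the registration side ordered
against the re-check (the per-inode lock mentioned above); the sketch
only shows where the counter would live and be consulted.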

Thanks,
David Stevens Feb. 9, 2023, 5:10 a.m. UTC | #13
On Thu, Feb 9, 2023 at 2:24 AM Peter Xu <peterx@redhat.com> wrote:
> On Wed, Feb 08, 2023 at 11:42:22AM +0900, David Stevens wrote:
> > On Wed, Feb 8, 2023 at 1:34 AM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, Feb 07, 2023 at 12:56:56PM +0900, David Stevens wrote:
> > > > On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > > > > From: David Stevens <stevensd@chromium.org>
> > > > > > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > >                                               result = SCAN_TRUNCATED;
> > > > > >                                               goto xa_locked;
> > > > > >                                       }
> > > > > > -                                     xas_set(&xas, index);
> > > > > > +                                     xas_set(&xas, index + 1);
> > > > >
> > > > > I failed to figure out why this index needs a shift here... It seems to me
> > > > > it'll ignore the initial empty page cache, is that what the patch wanted?
> > > >
> > > > This chunk is preceded by a call to xas_next_entry. Previously,
> > > > xas_store(hpage) would pair with xas_set(index) to walk the xarray and
> > > > prepare xas for the xas_next at the start of the next loop iteration.
> > > > Now that we're no longer calling xas_store, xas_set(index+1) is
> > > > necessary to prepare xas for the next iteration of the loop.
> > >
> > > If I'm not wrong, xas_store() doesn't move the needle at all.  It's the
> > > xas_next() at the entry of the loop that does.
> >
> > xas_set(n) resets the cursor to the top of the tree. If the cursor is
> > at the top of the tree, then xas_next() walks to n instead of
> > advancing n to n+1. xas_store() on the other hand behaves the same
> > regardless of whether or not the cursor is at the top of the tree (it
> > starts by calling xas_create or xas_load, which will walk the tree if
> > necessary).
> >
> > Looking at it another way, if we're here, then start == index. So if
> > we were to call xas_set(&xas, index), xas at the start of the next
> > iteration would be in an identical state to when we entered the loop
> > the first time. So xas would be one value behind index, and we would
> > never actually examine the last entry in the range being collapsed.
>
> Are you suggesting that this is a bug in the existing code?
>
> My read on this whole chunk is: if this is the first page we're scanning
> and it's missing, let's have a look to make sure the whole THP isn't
> truncated.  If it is, we return SCAN_TRUNCATED.  Otherwise, we fill up the
> holes _starting from this index 0_.

This overlooks the fact that we've already processed index 0 during
this iteration. The last sentence should be along the lines of:
Otherwise, we continue examining subsequent pages in the chunk.

> AFAIU your change here will make that last sentence start from index 1.

The original code is correct. The important thing to keep in mind is
that xas_next() behaves differently depending on whether or not
xas_not_node(xa_node) is true. Stepping through the relevant lines of
the code and showing the relevant state after each xas call executes:

xas_set(&xas, start);
  (index == start, xa_index == start, xa_node == XAS_RESTART)
xas_next(&xas); -> returns entry at start
  (index == start, xa_index == start, xa_node == start)
xas_next_entry(&xas, end - 1);
  (index == start, xa_index == x, xa_node == x, where x > start)

From here, the original code has:

xas_set(&xas, start);
  (index == start, xa_index == start, xa_node == XAS_RESTART)
xas_store(&xas, hpage);
  (index == start, xa_index == start, xa_node == start)
xas_next(&xas); -> returns entry at start + 1
  (index == start + 1, xa_index == start + 1, xa_node == start + 1)

With my patch, we have:

xas_set(&xas, start + 1);
  (index == start, xa_index == start + 1, xa_node == XAS_RESTART)
xas_next(&xas); -> returns entry at start + 1
  (index == start + 1, xa_index == start + 1, xa_node == start + 1)

If instead we were to do as you suggest, we would have:

xas_set(&xas, start);
  (index == start, xa_index == start, xa_node == XAS_RESTART)
xas_next(&xas); -> returns entry at start
  (index == start + 1, xa_index == start, xa_node == start)

In this version, there is a mismatch between index and xa_index.
Because of this, we will end up breaking out of the loop before
xa_index is able to reach start + HPAGE_PMD_NR - 1, so we will
miss the last entry in the chunk.
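
Putting the same walk into code form, a stripped-down sketch of the loop
being discussed (simplified from collapse_file(); error paths and the
handling of present pages are omitted):

xas_set(&xas, start);
for (index = start; index < end; index++) {
	struct page *page = xas_next(&xas);

	if (!page) {			/* hole at this index */
		if (index == start) {
			/* The SCAN_TRUNCATED check used xas_next_entry()
			 * and moved the cursor, so re-arm it here: the
			 * next xas_next() must land on start + 1. */
			xas_set(&xas, index + 1);
		}
		nr_none++;
		continue;		/* hole stays until finalization */
	}
	/* ... swap in, lock and replace the present page ... */
}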

> > > > > > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > >               put_page(page);
> > > > > >               goto xa_unlocked;
> > > > > >       }
> > > > > > +
> > > > > > +     if (nr_none) {
> > > > > > +             struct vm_area_struct *vma;
> > > > > > +             int nr_none_check = 0;
> > > > > > +
> > > > > > +             xas_unlock_irq(&xas);
> > > > > > +             i_mmap_lock_read(mapping);
> > > > > > +             i_mmap_locked = true;
> > > > > > +             xas_lock_irq(&xas);
> > > > > > +
> > > > > > +             xas_set(&xas, start);
> > > > > > +             for (index = start; index < end; index++) {
> > > > > > +                     if (!xas_next(&xas))
> > > > > > +                             nr_none_check++;
> > > > > > +             }
> > > > > > +
> > > > > > +             if (nr_none != nr_none_check) {
> > > > > > +                     result = SCAN_PAGE_FILLED;
> > > > > > +                     goto xa_locked;
> > > > > > +             }
> > > > > > +
> > > > > > +             /*
> > > > > > +              * If userspace observed a missing page in a VMA with an armed
> > > > > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > > > > +              * that page, so we need to roll back to avoid suppressing such
> > > > > > +              * an event. Any userfaultfds armed after this point will not be
> > > > > > +              * able to observe any missing pages, since the page cache is
> > > > > > +              * locked until after the collapse is completed.
> > > > > > +              *
> > > > > > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > > > > > +              */
> > > > > > +             smp_rmb();
> > > > > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > > > > +                     if (userfaultfd_missing(vma)) {
> > > > > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > > > > +                             goto xa_locked;
> > > > > > +                     }
> > > > > > +             }
> > > > > > +     }
> > > > >
> > > > > > Thanks for writing the patch, but I am still confused why this can avoid
> > > > > race conditions of uffd missing mode.
> > > > >
> > > > > I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> > > > > succeeded on a specific shmem VMA, any page faults due to missing page
> > > > > cache on this vma should generate a page fault.  If not, it's violating the
> > > > > userfaultfd semantics.
> > > > >
> > > > > Here, I don't see what stops a new vma from registering MISSING mode right
> > > > > away in parallel of collapsing.
> > > > >
> > > > > When registration completes, it means we should report uffd missing
> > > > > messages on the holes that collapse_file() scanned previously.  However
> > > > > they'll be filled very possibly later with the thp which means the messages
> > > > > can be lost.  Then the issue can still happen, while this patch only makes
> > > > > it very unlikely to happen, or am I wrong?
> > > >
> > > > This race can still happen, but I don't think it's actually a problem.
> > > > In particular, I think the semantics you've given for UFFDIO_REGISTER
> > > > are overly restrictive on the kernel in a way that provides no
> > > > meaningful benefit to userspace.
> > > >
> > > > To simplify things a little bit, we only need to consider going from
> > > > zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
> > > > is already registered isn't particularly interesting, since khugepaged
> > > > would have seen the existing userfaultfd and given up on the collapse.
> > >
> > > Could you point me which part of the code you're referencing here?
> >
> > I'm referencing the behavior with this patch. With my patch, all
> > khugepaged cares about is whether there exists an armed userfaultfd on
> > a given VMA. If there already exists an armed userfaultfd on a given
> > VMA, then with my patch, khugepaged will give up collapsing. So the
> > case where a second/third/etc userfaultfd is being armed isn't
> > interesting, because khugepaged will already not be doing anything, so
> > there cannot be any race condition.
> >
> > > Note that unlike anonymous thp collapsing scan (hpage_collapse_scan_pmd),
> > > hpage_collapse_scan_file works on page caches only.
> > >
> > > >
> > > > When userspace registers the first userfaultfd, it cannot rely on any
> > > > knowledge about the presence or absence of pages from before when
> > > > UFFDIO_REGISTER completes. This is because, as far as userspace is
> > > > concerned, khugepaged could have come along and collapsed every single
> > > > page in the shmem right after UFFDIO_REGISTER entered the kernel. So
> > > > to properly use the userfaultfd API, userspace must re-examine the
> > > > state of memory after UFFDIO_REGISTER completes. Because of that,
> > > > having UFFDIO_REGISTER be a complete barrier with regards to
> > > > khugepaged is essentially meaningless from the perspective of
> > > > userspace. The guarantee might slightly reduce the window for some
> > > > races in userspace programs, but there are no userspace races that it
> > > > can completely resolve.
> > > >
> > > > Given that, the kernel just needs to guarantee that no missing page
> > > > which userspace might have observed after registering a userfaultfd is
> > > > filled by khugepaged. This patch takes the page cache lock, checks
> > > > that there are no registered userfaultfds, fills the missing pages,
> > > > and then releases the page cache lock. That's sufficient to guarantee
> > > > that any UFFDIO_REGISTER which occurs after the userfaultfd check but
> > > > before we finish the collapse can't actually observe any missing
> > > > pages.
> > >
> > > I don't think your current patch guaranteed that?
> > >
> > > Now I'm confused why you removed population of the small pages using hpage,
> > > because that seems to support what you're doing here.  Note that your
> > > current code is added _before_ the replacement of the huge page here:
> > >
> > >         /* Join all the small entries into a single multi-index entry */
> > >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > >         xas_store(&xas, hpage);
> > >
> > > So either the patch needs to keep the original logic to fill up the holes
> > > with hpage, or the vma checks needs to be done later, to achieve what you
> > > proposed to happen, IIUC.
> >
> > Maybe I'm fundamentally misunderstanding how locking for shmem works?
> > With my patch, we effectively have:
> >
> > xas_lock_irq(&xas);
> > <check that userfaultfd_missing(VMA) == false for all VMAs>
> > xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > xas_store(&xas, hpage);
> > xas_unlock_irq(&xas);
> >
> > Here's my understanding. If there are no armed userfaultfds during the
> > check, then userspace could not have made any observations of the
> > shmem with an armed userfaultfd before collapse_file locked xas. The
> > xas_store() is where missing pages become present pages. Since this
> > happens while xas is still locked, we know that no additional
> > > observations of shmem could have occurred since locking xas. So even if
> > registration of a new userfaultfd raced with the check, we know that
> > new userfaultfd couldn't have made any observations of shmem.
> >
> > Are you saying that it's possible for something to come in and observe
> > a missing page from the shmem between when we check the VMAs and when
> > we store hpage, even though xas is locked? Or that userspace can
> > observe shmem completely independently of the locking done by
> > xas_lock_irq(&xas)?
>
> AFAIU a pure lookup is done without the xas lock, under just the RCU read
> lock.  More below.
>
> >
> > > Besides, please imagine an uffd program that does this:
> > >
> > >   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
> > >   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
> > >       noticed that it returns DATA, means this is not a hole.  So it's
> > >       expected that there will be no further message generated to this page.
> > >   (3) After a while, it received a missing fault on page P1
> > >
> > > My question is whether the above is a problem or not, i.e. whether it
> > > violates the userfaultfd missing mode semantics.
> > >
> > > AFAICT, the above can happen if some program accidentally registers UFFD
> > > missing mode while the shmem THP containing P1 is being collapsed, with
> > > the current logic applied.
> >
> > That's a bug that can occur with khugepaged as it exists before my
> > patch, if collapse_file fills a missing page but then needs to roll
> > back the collapse. With my patch, khugepaged doesn't fill any missing
> > pages until after it knows for sure the collapse will succeed, so I
> > don't see any way for collapse_file to cause a formerly present page
> > to become absent.
>
> Right, I should have reversed the conditions, sorry:
>
>   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
>   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
>       noticed that it returns HOLE.
>   (3) The HOLE got collapsed into a THP
>
> I think it can still happen after the patch is applied.  lseek has:
>
>   shmem_file_llseek -> mapping_seek_hole_data -> find_get_entry
>
> So I think it can happen concurrently: one thread can do (1+2) after the
> vma checks but before your patch collapses the thp in the page cache.

Ah, I see what you're saying. I didn't fully understand the locking in
the page cache.

I think Matthew's suggestion earlier of using RETRY entries might
work. We would still need to consolidate the mark/check/collapse
sequence into a single xas_lock critical section, though, since things
will end up spinning on the RETRY within rcu critical sections.

That said, is using XA_RETRY_ENTRY a proper use of the xarray API? I
don't see anything in the kernel external to xarray using it. I would
think XA_ZERO_ENTRY would be more appropriate (i.e. what xa_reserve
uses), but the handling for that is inconsistent. XA_ZERO_ENTRY is
sufficient to cause mapping_get_entry to retry, but it looks like
everywhere else in filemap/shmem that directly iterates over the page
cache does if (xas_retry()) continue, which would end up not retrying
in the XA_ZERO_ENTRY case.
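
Concretely, the lookup pattern I mean is something like the following
(a sketch of the iteration style in filemap/shmem, not any specific
function):

XA_STATE(xas, &mapping->i_pages, start);
void *entry;

rcu_read_lock();
xas_for_each(&xas, entry, end - 1) {
	/* The check the existing iterators do: it resets the state and
	 * re-walks for XA_RETRY_ENTRY, but returns false for
	 * XA_ZERO_ENTRY. */
	if (xas_retry(&xas, entry))
		continue;
	/* A zero entry falls through to here, so an iterator that wanted
	 * to honor XA_ZERO_ENTRY would need an extra check: */
	if (xa_is_zero(entry))
		continue;
	/* ... inspect the page/folio ... */
}
rcu_read_unlock();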

> What I'm sincerely not sure about so far is whether we should care about
> cases like lseek (or mincore, which can do something similar if swap isn't
> considered).

It would be nice to at least get lseek to work. We want to use lseek
to determine what pages of guest memory are actually allocated, to
reduce how much data needs to be processed. I guess the occasional
race with khugepaged causing pages to spuriously be present isn't
really a huge problem for us, though.
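
For reference, the scan we have in mind is roughly the following sketch
(assuming the fd supports SEEK_DATA/SEEK_HOLE, which tmpfs does; the
helper is invented for illustration):

#define _GNU_SOURCE	/* for SEEK_DATA/SEEK_HOLE */
#include <stdio.h>
#include <unistd.h>

/* Print the allocated (data) extents of fd as [start, end) byte ranges. */
static void dump_allocated(int fd, off_t size)
{
	off_t data = 0, hole;

	while (data < size && (data = lseek(fd, data, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);  /* end of this extent */
		if (hole < 0)
			break;
		printf("data: [%lld, %lld)\n", (long long)data,
		       (long long)hole);
		data = hole;
	}
}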

-David
Peter Xu Feb. 9, 2023, 6:50 p.m. UTC | #14
On Thu, Feb 09, 2023 at 02:10:08PM +0900, David Stevens wrote:
>  On Thu, Feb 9, 2023 at 2:24 AM Peter Xu <peterx@redhat.com> wrote:
> > On Wed, Feb 08, 2023 at 11:42:22AM +0900, David Stevens wrote:
> > > On Wed, Feb 8, 2023 at 1:34 AM Peter Xu <peterx@redhat.com> wrote:
> > > > On Tue, Feb 07, 2023 at 12:56:56PM +0900, David Stevens wrote:
> > > > > On Tue, Feb 7, 2023 at 6:02 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote:
> > > > > > > From: David Stevens <stevensd@chromium.org>
> > > > > > > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > > >                                               result = SCAN_TRUNCATED;
> > > > > > >                                               goto xa_locked;
> > > > > > >                                       }
> > > > > > > -                                     xas_set(&xas, index);
> > > > > > > +                                     xas_set(&xas, index + 1);
> > > > > >
> > > > > > I failed to figure out why this index needs a shift here... It seems to me
> > > > > > it'll ignore the initial empty page cache, is that what the patch wanted?
> > > > >
> > > > > This chunk is preceded by a call to xas_next_entry. Previously,
> > > > > xas_store(hpage) would pair with xas_set(index) to walk the xarray and
> > > > > prepare xas for the xas_next at the start of the next loop iteration.
> > > > > Now that we're no longer calling xas_store, xas_set(index+1) is
> > > > > necessary to prepare xas for the next iteration of the loop.
> > > >
> > > > If I'm not wrong, xas_store() doesn't move the needle at all.  It's the
> > > > xas_next() at the entry of the loop that does.
> > >
> > > xas_set(n) resets the cursor to the top of the tree. If the cursor is
> > > at the top of the tree, then xas_next() walks to n instead of
> > > advancing n to n+1. xas_store() on the other hand behaves the same
> > > regardless of whether or not the cursor is at the top of the tree (it
> > > starts by calling xas_create or xas_load, which will walk the tree if
> > > necessary).
> > >
> > > Looking at it another way, if we're here, then start == index. So if
> > > we were to call xas_set(&xas, index), xas at the start of the next
> > > iteration would be in an identical state to when we entered the loop
> > > the first time. So xas would be one value behind index, and we would
> > > never actually examine the last entry in the range being collapsed.
> >
> > Are you suggesting that this is a bug in the existing code?
> >
> > My read on this whole chunk is: if this is the first page we're scanning
> > and it's missing, let's have a look to make sure the whole THP isn't
> > truncated.  If it is, we return SCAN_TRUNCATED.  Otherwise, we fill up the
> > holes _starting from this index 0_.
> 
> This overlooks the fact that we've already processed index 0 during
> this iteration. The last sentence should be along the lines of:
> Otherwise, we continue examining subsequent pages in the chunk.
> 
> > AFAIU your change here will make that last sentence start from index 1.
> 
> The original code is correct. The important thing to keep in mind is
> that xas_next() behaves differently depending on whether or not
> xas_not_node(xa_node) is true. Stepping through the relevant lines of
> the code and showing the relevant state after each xas call executes:
> 
> xas_set(&xas, start);
>   (index == start, xa_index == start, xa_node == XAS_RESTART)
> xas_next(&xas); -> returns entry at start
>   (index == start, xa_index == start, xa_node == start)
> xas_next_entry(&xas, end - 1);
>   (index == start, xa_index == x, xa_node == x, where x > start)
> 
> From here, the original code has:
> 
> xas_set(&xas, start);
>   (index == start, xa_index == start, xa_node == XAS_RESTART)
> xas_store(&xas, hpage);
>   (index == start, xa_index == start, xa_node == start)
> xas_next(&xas); -> returns entry at start + 1
>   (index == start + 1, xa_index == start + 1, xa_node == start + 1)
> 
> With my patch, we have:
> 
> xas_set(&xas, start + 1);
>   (index == start, xa_index == start + 1, xa_node == XAS_RESTART)
> xas_next(&xas); -> returns entry at start + 1
>   (index == start + 1, xa_index == start + 1, xa_node == start + 1)
> 
> If instead we were to do as you suggest, we would have:
> 
> xas_set(&xas, start);
>   (index == start, xa_index == start, xa_node == XAS_RESTART)
> xas_next(&xas); -> returns entry at start
>   (index == start + 1, xa_index == start, xa_node == start)
> 
> In this version, there is a mismatch between index and xa_index.
> Because of this, we will end up breaking out of the loop before
> xa_index is able to reach start + HPAGE_PMD_NR - 1, so we will
> miss the last entry in the chunk.

I see what I missed now, thanks.

> 
> > > > > > > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > > >               put_page(page);
> > > > > > >               goto xa_unlocked;
> > > > > > >       }
> > > > > > > +
> > > > > > > +     if (nr_none) {
> > > > > > > +             struct vm_area_struct *vma;
> > > > > > > +             int nr_none_check = 0;
> > > > > > > +
> > > > > > > +             xas_unlock_irq(&xas);
> > > > > > > +             i_mmap_lock_read(mapping);
> > > > > > > +             i_mmap_locked = true;
> > > > > > > +             xas_lock_irq(&xas);
> > > > > > > +
> > > > > > > +             xas_set(&xas, start);
> > > > > > > +             for (index = start; index < end; index++) {
> > > > > > > +                     if (!xas_next(&xas))
> > > > > > > +                             nr_none_check++;
> > > > > > > +             }
> > > > > > > +
> > > > > > > +             if (nr_none != nr_none_check) {
> > > > > > > +                     result = SCAN_PAGE_FILLED;
> > > > > > > +                     goto xa_locked;
> > > > > > > +             }
> > > > > > > +
> > > > > > > +             /*
> > > > > > > +              * If userspace observed a missing page in a VMA with an armed
> > > > > > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > > > > > +              * that page, so we need to roll back to avoid suppressing such
> > > > > > > +              * an event. Any userfaultfds armed after this point will not be
> > > > > > > +              * able to observe any missing pages, since the page cache is
> > > > > > > +              * locked until after the collapse is completed.
> > > > > > > +              *
> > > > > > > +              * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
> > > > > > > +              */
> > > > > > > +             smp_rmb();
> > > > > > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > > > > > +                     if (userfaultfd_missing(vma)) {
> > > > > > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > > > > > +                             goto xa_locked;
> > > > > > > +                     }
> > > > > > > +             }
> > > > > > > +     }
> > > > > >
> > > > > > Thanks for writing the patch, but I am still confused why this can avoid
> > > > > > race conditions of uffd missing mode.
> > > > > >
> > > > > > I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl
> > > > > > succeeded on a specific shmem VMA, any page faults due to missing page
> > > > > > cache on this vma should generate a page fault.  If not, it's violating the
> > > > > > userfaultfd semantics.
> > > > > >
> > > > > > Here, I don't see what stops a new vma from registering MISSING mode right
> > > > > > away in parallel of collapsing.
> > > > > >
> > > > > > When registration completes, it means we should report uffd missing
> > > > > > messages on the holes that collapse_file() scanned previously.  However,
> > > > > > they may very well be filled later with the thp, which means the messages
> > > > > > can be lost.  Then the issue can still happen, while this patch only makes
> > > > > > it very unlikely to happen, or am I wrong?
> > > > >
> > > > > This race can still happen, but I don't think it's actually a problem.
> > > > > In particular, I think the semantics you've given for UFFDIO_REGISTER
> > > > > are overly restrictive on the kernel in a way that provides no
> > > > > meaningful benefit to userspace.
> > > > >
> > > > > To simplify things a little bit, we only need to consider going from
> > > > > zero to non-zero (i.e. one) userfaultfd. The case where a userfaultfd
> > > > > is already registered isn't particularly interesting, since khugepaged
> > > > > would have seen the existing userfaultfd and given up on the collapse.
> > > >
> > > > Could you point me which part of the code you're referencing here?
> > >
> > > I'm referencing the behavior with this patch. With my patch, all
> > > khugepaged cares about is whether there exists an armed userfaultfd on
> > > a given VMA. If there already exists an armed userfaultfd on a given
> > > VMA, then with my patch, khugepaged will give up collapsing. So the
> > > case where a second/third/etc userfaultfd is being armed isn't
> > > interesting, because khugepaged will already not be doing anything, so
> > > there cannot be any race condition.
> > >
> > > > Note that unlike anonymous thp collapsing scan (hpage_collapse_scan_pmd),
> > > > hpage_collapse_scan_file works on page caches only.
> > > >
> > > > >
> > > > > When userspace registers the first userfaultfd, it cannot rely on any
> > > > > knowledge about the presence or absence of pages from before when
> > > > > UFFDIO_REGISTER completes. This is because, as far as userspace is
> > > > > concerned, khugepaged could have come along and collapsed every single
> > > > > page in the shmem right after UFFDIO_REGISTER entered the kernel. So
> > > > > to properly use the userfaultfd API, userspace must re-examine the
> > > > > state of memory after UFFDIO_REGISTER completes. Because of that,
> > > > > having UFFDIO_REGISTER be a complete barrier with regards to
> > > > > khugepaged is essentially meaningless from the perspective of
> > > > > userspace. The guarantee might slightly reduce the window for some
> > > > > races in userspace programs, but there are no userspace races that it
> > > > > can completely resolve.
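> > > > >
> > > > > As a sketch of what correct usage looks like (hypothetical userspace
> > > > > code; uffd, addr and len are assumed to be set up elsewhere):
> > > > >
> > > > > struct uffdio_register reg = {
> > > > > 	.range = { .start = (unsigned long)addr, .len = len },
> > > > > 	.mode = UFFDIO_REGISTER_MODE_MISSING,
> > > > > };
> > > > >
> > > > > if (ioctl(uffd, UFFDIO_REGISTER, &reg))
> > > > > 	err(1, "UFFDIO_REGISTER");
> > > > > /* Anything observed before this point may already have been
> > > > >  * collapsed; only observations made from here on, e.g. re-probing
> > > > >  * holes with lseek(SEEK_HOLE), can be trusted. */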
> > > > >
> > > > > Given that, the kernel just needs to guarantee that no missing page
> > > > > which userspace might have observed after registering a userfaultfd is
> > > > > filled by khugepaged. This patch takes the page cache lock, checks
> > > > > that there are no registered userfaultfds, fills the missing pages,
> > > > > and then releases the page cache lock. That's sufficient to guarantee
> > > > > that any UFFDIO_REGISTER which occurs after the userfaultfd check but
> > > > > before we finish the collapse can't actually observe any missing
> > > > > pages.
> > > >
> > > > I don't think your current patch guarantees that?
> > > >
> > > > Now I'm confused why you removed population of the small pages using hpage,
> > > > because that seems to support what you're doing here.  Note that your
> > > > current code is added _before_ the replacement of the huge page here:
> > > >
> > > >         /* Join all the small entries into a single multi-index entry */
> > > >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > > >         xas_store(&xas, hpage);
> > > >
> > > > So either the patch needs to keep the original logic to fill up the holes
> > > > with hpage, or the vma checks need to be done later, to achieve what you
> > > > proposed to happen, IIUC.
> > >
> > > Maybe I'm fundamentally misunderstanding how locking for shmem works?
> > > With my patch, we effectively have:
> > >
> > > xas_lock_irq(&xas);
> > > <check that userfaultfd_missing(VMA) == false for all VMAs>
> > > xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > > xas_store(&xas, hpage);
> > > xas_unlock_irq(&xas);
> > >
> > > Here's my understanding. If there are no armed userfaultfds during the
> > > check, then userspace could not have made any observations of the
> > > shmem with an armed userfaultfd before collapse_file locked xas. The
> > > xas_store() is where missing pages become present pages. Since this
> > > happens while xas is still locked, we know that no additional
> > > observations of shmem could have occurred since locking xas. So even if
> > > registration of a new userfaultfd raced with the check, we know that
> > > new userfaultfd couldn't have made any observations of shmem.
> > >
> > > Are you saying that it's possible for something to come in and observe
> > > a missing page from the shmem between when we check the VMAs and when
> > > we store hpage, even though xas is locked? Or that userspace can
> > > observe shmem completely independently of the locking done by
> > > xas_lock_irq(&xas)?
> >
> > AFAIU a pure lookup is done without the xas lock, only under the rcu
> > read lock.  More below.
> >
> > >
> > > > Besides, please imagine an uffd program that does this:
> > > >
> > > >   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
> > > >   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1); it
> > > >       noticed that it returns DATA, meaning this is not a hole.  So it's
> > > >       expected that no further message will be generated for this page.
> > > >   (3) After a while, it received a missing fault on page P1
> > > >
> > > > My question is whether the above is a problem or not?  Whether it violates the
> > > > userfaultfd missing mode semantics?
> > > >
> > > > AFAICT, the above can happen if some program accidentally registers UFFD
> > > > missing mode during collapsing of the shmem thp containing P1, with the
> > > > current logic applied.
> > >
> > > That's a bug that can occur with khugepaged as it exists before my
> > > patch, if collapse_file fills a missing page but then needs to roll
> > > back the collapse. With my patch, khugepaged doesn't fill any missing
> > > pages until after it knows for sure the collapse will succeed, so I
> > > don't see any way for collapse_file to cause a formerly present page
> > > to become absent.
> >
> > Right, I should have reversed the conditions, sorry:
> >
> >   (1) ioctl(UFFDIO_REGISTER) upon the shmem vma
> >   (2) lseek(SEEK_DATA) on one page P1 (which is part of THP range T1), it
> >       noticed that it returns HOLE.
> >   (3) The HOLE got collapsed into a THP
> >
> > I think it can still happen after the patch is applied.  lseek has:
> >
> >   shmem_file_llseek -> mapping_seek_hole_data -> find_get_entry
> >
> > So I think it can happen concurrently: one thread does (1+2) after the
> > vma checks but before your patch collapses the thp in the page cache.
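> >
> > I.e. the lookup side is roughly (simplified from find_get_entry(), from
> > memory, so double-check the details):
> >
> > XA_STATE(xas, &mapping->i_pages, index);
> > void *entry;
> >
> > rcu_read_lock();
> > entry = xas_find(&xas, ULONG_MAX); /* no xa_lock taken */
> > rcu_read_unlock();
> >
> > so it can run while collapse_file() has dropped the xa_lock, and RCU
> > readers don't take the lock even while we hold it.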
> 
> Ah, I see what you're saying. I didn't fully understand the locking in
> the page cache.
> 
> I think Matthew's suggestion earlier of using RETRY entries might
> work. We would still need to consolidate the mark/check/collapse
> sequence into a single xas_lock critical section, though, since things
> will end up spinning on the RETRY within rcu critical sections.
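> 
> Something like the following, as a hypothetical and untested sketch
> (ignoring xas_nomem()/allocation handling and the unwind path):
> 
> xas_lock_irq(&xas);
> /* Park retry entries in the holes so that lockless lookups spin
>  * instead of observing a transient state. */
> xas_set(&xas, start);
> for (index = start; index < end; index++) {
> 	if (!xas_next(&xas))
> 		xas_store(&xas, XA_RETRY_ENTRY);
> }
> /* ... check i_mmap for armed userfaultfds ... */
> /* ... install the multi-index hpage entry ... */
> xas_unlock_irq(&xas);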

That seems to be challenging, though.

I don't see what we can do about special paths like swapin, which Matthew
quoted in the other email, where the xa lock needs to be released for now.

OTOH, if we want to keep the xas lock held across the whole critical
section, I think it also means the i_mmap_rwsem needs to be taken
beforehand and not released until after the collapse has happened.  That
seems to be a major change too.

> 
> That said, is using XA_RETRY_ENTRY a proper use of the xarray API? I
> don't see anything in the kernel external to xarray using it. I would
> think XA_ZERO_ENTRY would be more appropriate (i.e. what xa_reserve
> uses), but the handling for that is inconsistent. XA_ZERO_ENTRY is
> sufficient to cause mapping_get_entry to retry, but it looks like
> everywhere else in filemap/shmem that directly iterates over the page
> cache does if (xas_retry()) continue, which would end up not retrying
> in the XA_ZERO_ENTRY case.

I don't have much insight into xarray, but having read some more on it
today: xas_retry() covers both ZERO and RETRY entries; both return true.
The only difference is an extra xas_reset() for RETRY entries.

It seems to me both entries will work with find_get_entry() which shmem
uses.  I'm not sure whether that means it's okay in this specific case.
Even if it works, I have no idea whether there's a graceful way we can
maintain the critical section as you proposed above.
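
For reference, xas_retry() in include/linux/xarray.h reads (quoting from
memory, so please double-check):

static inline bool xas_retry(struct xa_state *xas, const void *entry)
{
	if (xa_is_zero(entry))
		return true;
	if (!xa_is_retry(entry))
		return false;
	xas_reset(xas);
	return true;
}

So an "if (xas_retry(&xas, entry)) continue;" loop skips past a ZERO entry
but re-walks from the root for a RETRY entry, which is exactly the
difference described above.
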
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cc694846617a..6ddfcff11920 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -114,6 +114,8 @@  static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
 
 	vma->vm_flags = flags;
+	/* Pairs with smp_rmb() in khugepaged's collapse_file() */
+	smp_wmb();
 	/*
 	 * For shared mappings, we want to enable writenotify while
 	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 79be13133322..97435c226b18 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -55,6 +55,7 @@  enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_PAGE_FILLED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1725,8 +1726,8 @@  static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
+ *  - finalize updates to the page cache;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
@@ -1747,6 +1748,7 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
 	bool is_shmem = shmem_file(file);
+	bool i_mmap_locked = false;
 	int nr = 0;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
@@ -1780,8 +1782,14 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 	/*
 	 * At this point the hpage is locked and not up-to-date.
-	 * It's safe to insert it into the page cache, because nobody would
-	 * be able to map it or use it in another way until we unlock it.
+	 *
+	 * While iterating, we may drop the page cache lock multiple times. It
+	 * is safe to replace pages in the page cache with hpage while doing so
+	 * because nobody is able to map or otherwise access the content of
+	 * hpage until we unlock it. However, we cannot insert hpage into empty
+	 * indices until we know we won't have to drop the page cache lock
+	 * again, as doing so would let things which only check the presence
+	 * of pages in the page cache see a state that may yet be rolled back.
 	 */
 
 	xas_set(&xas, start);
@@ -1802,13 +1810,12 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index);
+					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -1967,6 +1974,46 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		put_page(page);
 		goto xa_unlocked;
 	}
+
+	if (nr_none) {
+		struct vm_area_struct *vma;
+		int nr_none_check = 0;
+
+		xas_unlock_irq(&xas);
+		i_mmap_lock_read(mapping);
+		i_mmap_locked = true;
+		xas_lock_irq(&xas);
+
+		xas_set(&xas, start);
+		for (index = start; index < end; index++) {
+			if (!xas_next(&xas))
+				nr_none_check++;
+		}
+
+		if (nr_none != nr_none_check) {
+			result = SCAN_PAGE_FILLED;
+			goto xa_locked;
+		}
+
+		/*
+		 * If userspace observed a missing page in a VMA with an armed
+		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
+		 * that page, so we need to roll back to avoid suppressing such
+		 * an event. Any userfaultfds armed after this point will not be
+		 * able to observe any missing pages, since the page cache is
+		 * locked until after the collapse is completed.
+		 *
+		 * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
+		 */
+		smp_rmb();
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
+			if (userfaultfd_missing(vma)) {
+				result = SCAN_EXCEED_NONE_PTE;
+				goto xa_locked;
+			}
+		}
+	}
+
 	nr = thp_nr_pages(hpage);
 
 	if (is_shmem)
@@ -2000,6 +2047,8 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	xas_store(&xas, hpage);
 xa_locked:
 	xas_unlock_irq(&xas);
+	if (i_mmap_locked)
+		i_mmap_unlock_read(mapping);
 xa_unlocked:
 
 	/*
@@ -2065,15 +2114,13 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
+		end = index;
+		for (index = start; index < end; index++) {
+			xas_next(&xas);
 			page = list_first_entry_or_null(&pagelist,
 					struct page, lru);
 			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
 				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
 				continue;
 			}