[05/17] gup: Add try_get_folio()

Message ID 20220102215729.2943705-6-willy@infradead.org (mailing list archive)
State New
Series Convert GUP to folios

Commit Message

Matthew Wilcox Jan. 2, 2022, 9:57 p.m. UTC
This replaces try_get_compound_head().  It includes a small optimisation
for the race where a folio is split between being looked up from its
tail page and the reference count being obtained.  Before, it returned
NULL, which presumably triggered a retry under the mmap_lock, whereas
now it will retry without the lock.  Finding a frozen page will still
return NULL.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/gup.c | 69 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 36 insertions(+), 33 deletions(-)

Comments

Christoph Hellwig Jan. 4, 2022, 8:18 a.m. UTC | #1
On Sun, Jan 02, 2022 at 09:57:17PM +0000, Matthew Wilcox (Oracle) wrote:
> This replaces try_get_compound_head().  It includes a small optimisation
> for the race where a folio is split between being looked up from its
> tail page and the reference count being obtained.  Before, it returned
> NULL, which presumably triggered a retry under the mmap_lock, whereas
> now it will retry without the lock.  Finding a frozen page will still
> return NULL.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although I think splitting the retry optimization out into a separate
patch would be nicer, both to document the change and to ease bisection
should the need arise.
John Hubbard Jan. 5, 2022, 1:25 a.m. UTC | #2
On 1/2/22 13:57, Matthew Wilcox (Oracle) wrote:
> This replaces try_get_compound_head().  It includes a small optimisation
> for the race where a folio is split between being looked up from its
> tail page and the reference count being obtained.  Before, it returned
> NULL, which presumably triggered a retry under the mmap_lock, whereas
> now it will retry without the lock.  Finding a frozen page will still
> return NULL.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   mm/gup.c | 69 +++++++++++++++++++++++++++++---------------------------
>   1 file changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 2c51e9748a6a..58e5cfaaa676 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,12 +29,11 @@ struct follow_page_context {
>   	unsigned int page_mask;
>   };
>   
> -static void hpage_pincount_add(struct page *page, int refs)
> +static void folio_pincount_add(struct folio *folio, int refs)
>   {
> -	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> -	VM_BUG_ON_PAGE(page != compound_head(page), page);
> +	VM_BUG_ON_FOLIO(!folio_pincount_available(folio), folio);
>   
> -	atomic_add(refs, compound_pincount_ptr(page));
> +	atomic_add(refs, folio_pincount_ptr(folio));
>   }
>   
>   static void hpage_pincount_sub(struct page *page, int refs)
> @@ -63,33 +62,35 @@ static void put_page_refs(struct page *page, int refs)
>   }
>   
>   /*
> - * Return the compound head page with ref appropriately incremented,
> + * Return the folio with ref appropriately incremented,
>    * or NULL if that failed.
>    */
> -static inline struct page *try_get_compound_head(struct page *page, int refs)
> +static inline struct folio *try_get_folio(struct page *page, int refs)
>   {
> -	struct page *head = compound_head(page);
> +	struct folio *folio;
>   
> -	if (WARN_ON_ONCE(page_ref_count(head) < 0))
> +retry:

Yes, this new retry looks like a solid improvement. Retrying at this low
level makes a lot of sense, given that it is racing with a very
transient sort of behavior.
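
To make the window concrete, the interleaving being handled here is
roughly this (a sketch of the race, not real code):

	CPU 0: try_get_folio()               CPU 1: split_huge_page()
	folio = page_folio(page);
	                                     split finishes; "page" now
	                                     belongs to a different folio
	folio_ref_try_add_rcu(folio, refs);  /* succeeds, but on the old folio */
	page_folio(page) != folio            /* recheck detects the split */
	folio_put_refs(folio, refs);
	goto retry;                          /* rather than returning NULL */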


> +	folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
>   		return NULL;
> -	if (unlikely(!page_cache_add_speculative(head, refs)))
> +	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))

I'm a little lost about the meaning and intended use of the _rcu aspects
of folio_ref_try_add_rcu() here. For example, try_get_folio() does not
require that callers are in an rcu read section, right? This is probably
just a documentation question, sorry if it's obvious--I wasn't able to
work it out on my own.

>   		return NULL;
>   
>   	/*
> -	 * At this point we have a stable reference to the head page; but it
> -	 * could be that between the compound_head() lookup and the refcount
> -	 * increment, the compound page was split, in which case we'd end up
> -	 * holding a reference on a page that has nothing to do with the page
> +	 * At this point we have a stable reference to the folio; but it
> +	 * could be that between calling page_folio() and the refcount
> +	 * increment, the folio was split, in which case we'd end up
> +	 * holding a reference on a folio that has nothing to do with the page
>   	 * we were given anymore.
> -	 * So now that the head page is stable, recheck that the pages still
> -	 * belong together.
> +	 * So now that the folio is stable, recheck that the page still
> +	 * belongs to this folio.
>   	 */
> -	if (unlikely(compound_head(page) != head)) {
> -		put_page_refs(head, refs);
> -		return NULL;
> +	if (unlikely(page_folio(page) != folio)) {
> +		folio_put_refs(folio, refs);
> +		goto retry;
>   	}
>   
> -	return head;
> +	return folio;
>   }
>   
>   /**
> @@ -128,8 +129,10 @@ struct page *try_grab_compound_head(struct page *page,
>   				    int refs, unsigned int flags)
>   {
>   	if (flags & FOLL_GET)
> -		return try_get_compound_head(page, refs);
> +		return &try_get_folio(page, refs)->page;


Did you want to use folio_page() here, instead?


>   	else if (flags & FOLL_PIN) {
> +		struct folio *folio;
> +
>   		/*
>   		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>   		 * right zone, so fail and let the caller fall back to the slow
> @@ -143,29 +146,29 @@ struct page *try_grab_compound_head(struct page *page,
>   		 * CAUTION: Don't use compound_head() on the page before this
>   		 * point, the result won't be stable.
>   		 */
> -		page = try_get_compound_head(page, refs);
> -		if (!page)
> +		folio = try_get_folio(page, refs);
> +		if (!folio)
>   			return NULL;
>   
>   		/*
> -		 * When pinning a compound page of order > 1 (which is what
> +		 * When pinning a folio of order > 1 (which is what
>   		 * hpage_pincount_available() checks for), use an exact count to
> -		 * track it, via hpage_pincount_add/_sub().
> +		 * track it, via folio_pincount_add/_sub().
>   		 *
> -		 * However, be sure to *also* increment the normal page refcount
> -		 * field at least once, so that the page really is pinned.
> +		 * However, be sure to *also* increment the normal folio refcount
> +		 * field at least once, so that the folio really is pinned.
>   		 * That's why the refcount from the earlier
> -		 * try_get_compound_head() is left intact.
> +		 * try_get_folio() is left intact.
>   		 */
> -		if (hpage_pincount_available(page))
> -			hpage_pincount_add(page, refs);
> +		if (folio_pincount_available(folio))
> +			folio_pincount_add(folio, refs);
>   		else
> -			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
> +			folio_ref_add(folio,
> +					refs * (GUP_PIN_COUNTING_BIAS - 1));
>   
> -		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
> -				    refs);
> +		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
>   
> -		return page;
> +		return &folio->page;
>   	}
>   
>   	WARN_ON_ONCE(1);

Just minor questions above. This all looks solid though.

thanks,
John Hubbard Jan. 5, 2022, 7 a.m. UTC | #3
On 1/4/22 17:25, John Hubbard wrote:
>> @@ -128,8 +129,10 @@ struct page *try_grab_compound_head(struct page *page,
>>                       int refs, unsigned int flags)
>>   {
>>       if (flags & FOLL_GET)
>> -        return try_get_compound_head(page, refs);
>> +        return &try_get_folio(page, refs)->page;
> 
> 
> Did you want to use folio_page() here, instead?
> 
...never mind about that, because I see that patch 08 gets rid of the
"->page", anyway.


thanks,
Jason Gunthorpe Jan. 7, 2022, 6:23 p.m. UTC | #4
On Tue, Jan 04, 2022 at 05:25:07PM -0800, John Hubbard wrote:

> > +	folio = page_folio(page);
> > +	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
> >   		return NULL;
> > -	if (unlikely(!page_cache_add_speculative(head, refs)))
> > +	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> 
> I'm a little lost about the meaning and intended use of the _rcu aspects
> of folio_ref_try_add_rcu() here. For example, try_get_folio() does not
> require that callers are in an rcu read section, right? This is probably
> just a documentation question, sorry if it's obvious--I wasn't able to
> work it out on my own.

Normally this is called under gup_fast, so it is operating under the
quasi-RCU protection that gup_fast provides. I have no idea whether, on
preempt kernels, this should be marked with RCU or whether the
local_irq_save() is good enough...

The exceptional case is get_gate_page(), which I suppose relies on
the mmap lock to ensure that the special gate page cannot be
concurrently freed.

Then there are the cases under slow GUP (e.g. follow_page_pte()) which
don't need "try" at all, because they hold the PTLs, so the refcount
is guaranteed to be non-zero AFAIK.

It wouldn't be a bad idea to have a non-try version of this just for
clarity; could it perform a bit better?

		/*
		 * try_grab_page() should always succeed here, because: a) we
		 * hold the pmd (ptl) lock, and b) we've just checked that the
		 * huge pmd (head) page is present in the page tables. The ptl
		 * prevents the head page and tail pages from being rearranged
		 * in any way. So this page must be available at this point,
		 * unless the page refcount overflowed:
		 */
		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
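
A hypothetical non-try variant under that PTL assumption could be a
plain reference bump (a sketch only; get_folio() is a made-up name):

	static inline struct folio *get_folio(struct page *page, int refs)
	{
		struct folio *folio = page_folio(page);

		/* Caller holds the PTL: the folio can't be split or freed here. */
		folio_ref_add(folio, refs);
		return folio;
	}

folio_ref_add() avoids the add-unless-zero cmpxchg loop that the try
version needs, so it might be marginally cheaper on some architectures.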

Jason
Matthew Wilcox Jan. 8, 2022, 1:37 a.m. UTC | #5
On Tue, Jan 04, 2022 at 05:25:07PM -0800, John Hubbard wrote:
> On 1/2/22 13:57, Matthew Wilcox (Oracle) wrote:
> > +	folio = page_folio(page);
> > +	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
> >   		return NULL;
> > -	if (unlikely(!page_cache_add_speculative(head, refs)))
> > +	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> 
> I'm a little lost about the meaning and intended use of the _rcu aspects
> of folio_ref_try_add_rcu() here. For example, try_get_folio() does not
> require that callers are in an rcu read section, right? This is probably
> just a documentation question, sorry if it's obvious--I wasn't able to
> work it out on my own.

page_cache_add_speculative() always assumed that you were at least as
protected as RCU.  Quoting v5.4's pagemap.h:

 * speculatively take a reference to a page.
 * If the page is free (_refcount == 0), then _refcount is untouched, and 0
 * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
 *
 * This function must be called inside the same rcu_read_lock() section as has
 * been used to lookup the page in the pagecache radix-tree (or page table):
 * this allows allocators to use a synchronize_rcu() to stabilize _refcount.

... so is it the addition of "rcu" in the name that's scared you?
If so, that seems like a good thing, because previously you were using
page_cache_add_speculative() without realising that RCU protection
was required.
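
Under the hood, the "try" is exactly the speculative part: it is
essentially an add-unless-zero. A simplified sketch of the common case
(not the exact source):

	static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
	{
		/* Fails iff _refcount is zero, i.e. the folio may be being freed. */
		return folio_ref_add_unless(folio, count, 0);
	}

The RCU (or IRQ-disabled) requirement is what lets the allocator use
synchronize_rcu() to stabilise _refcount, as the old comment says.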

I think there is sufficient protection; either we have interrupts disabled
in gup-fast (which prevents RCU from finishing a grace period), or we
have the mmap_sem held in gup-slow (which prevents pages from being
removed from the page tables).
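
For reference, the gup-fast side looks roughly like this (a sketch of
lockless_pages_from_mm(), not the exact source):

	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

Disabling interrupts holds off an RCU grace period on this CPU (and
blocks TLB-shootdown IPIs), which is the quasi-RCU protection Jason
mentioned.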

But maybe I've missed something?
John Hubbard Jan. 8, 2022, 2:36 a.m. UTC | #6
On 1/7/22 17:37, Matthew Wilcox wrote:
> On Tue, Jan 04, 2022 at 05:25:07PM -0800, John Hubbard wrote:
>> On 1/2/22 13:57, Matthew Wilcox (Oracle) wrote:
>>> +	folio = page_folio(page);
>>> +	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
>>>    		return NULL;
>>> -	if (unlikely(!page_cache_add_speculative(head, refs)))
>>> +	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
>>
>> I'm a little lost about the meaning and intended use of the _rcu aspects
>> of folio_ref_try_add_rcu() here. For example, try_get_folio() does not
>> require that callers are in an rcu read section, right? This is probably
>> just a documentation question, sorry if it's obvious--I wasn't able to
>> work it out on my own.
> 
> page_cache_add_speculative() always assumed that you were at least as
> protected as RCU.  Quoting v5.4's pagemap.h:
> 
>   * speculatively take a reference to a page.
>   * If the page is free (_refcount == 0), then _refcount is untouched, and 0
>   * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
>   *
>   * This function must be called inside the same rcu_read_lock() section as has
>   * been used to lookup the page in the pagecache radix-tree (or page table):
>   * this allows allocators to use a synchronize_rcu() to stabilize _refcount.

Thanks for the v5.4 documentation reference. I did read and understand that a
while back, but it didn't show up at all when I was looking through the code
during this review, so I missed a chance to re-read it. I'd forgotten about the
RCU qualifier there.

> 
> ... so is it the addition of "rcu" in the name that's scared you?
> If so, that seems like a good thing, because previously you were using
> page_cache_add_speculative() without realising that RCU protection
> was required.
> 
> I think there is sufficient protection; either we have interrupts disabled
> in gup-fast (which prevents RCU from finishing a grace period), or we
> have the mmap_sem held in gup-slow (which prevents pages from being
> removed from the page tables).
> 
> But maybe I've missed something?

Clearly, I'm not the right person to answer that question, at this point. :)

FWIW, I went back and looked again, and it seems like everything is covered.

On the documentation front, I do miss the helpful v5.4 comment block
above __page_cache_add_speculative(); it would be nice to see an updated
version of it revived, but of course that's a separate point from this
patchset.
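
Something like this, updated for folios, is what I have in mind (just a
rough draft, adapted from the v5.4 text):

	/*
	 * folio_ref_try_add_rcu - speculatively take references on a folio.
	 *
	 * If the folio is free (_refcount == 0), _refcount is left untouched
	 * and false is returned.  Otherwise _refcount is incremented by
	 * @count and true is returned.
	 *
	 * This must be called inside the same rcu_read_lock() section (or
	 * with interrupts disabled, as in gup-fast) as was used to look the
	 * folio up, so that the allocator can use synchronize_rcu() to
	 * stabilize _refcount.
	 */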

thanks,
Jason Gunthorpe Jan. 10, 2022, 3:01 p.m. UTC | #7
On Sat, Jan 08, 2022 at 01:37:09AM +0000, Matthew Wilcox wrote:

> I think there is sufficient protection; either we have interrupts
> disabled in gup-fast (which prevents RCU from finishing a grace
> period), or we have the mmap_sem held in gup-slow (which prevents
> pages from being removed from the page tables).

mmap sem doesn't prevent pages from being removed; for instance,
unmap_mapping_pages() doesn't hold it.

It is safe because gup slow holds the PTLs when it calls this.
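
E.g. the shape of follow_page_pte() is roughly (a much-simplified
sketch):

	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
	page = vm_normal_page(vma, address, *ptep);
	if (page && !try_grab_page(page, flags))
		page = ERR_PTR(-ENOMEM);
	pte_unmap_unlock(ptep, ptl);

Nothing can clear the PTE, and so drop the last page-table reference,
while the PTL is held.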

No idea how get_gate_page() works, though.

Jason

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..58e5cfaaa676 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,12 +29,11 @@  struct follow_page_context {
 	unsigned int page_mask;
 };
 
-static void hpage_pincount_add(struct page *page, int refs)
+static void folio_pincount_add(struct folio *folio, int refs)
 {
-	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
-	VM_BUG_ON_PAGE(page != compound_head(page), page);
+	VM_BUG_ON_FOLIO(!folio_pincount_available(folio), folio);
 
-	atomic_add(refs, compound_pincount_ptr(page));
+	atomic_add(refs, folio_pincount_ptr(folio));
 }
 
 static void hpage_pincount_sub(struct page *page, int refs)
@@ -63,33 +62,35 @@  static void put_page_refs(struct page *page, int refs)
 }
 
 /*
- * Return the compound head page with ref appropriately incremented,
+ * Return the folio with ref appropriately incremented,
  * or NULL if that failed.
  */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
+static inline struct folio *try_get_folio(struct page *page, int refs)
 {
-	struct page *head = compound_head(page);
+	struct folio *folio;
 
-	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+retry:
+	folio = page_folio(page);
+	if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
 		return NULL;
-	if (unlikely(!page_cache_add_speculative(head, refs)))
+	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
 		return NULL;
 
 	/*
-	 * At this point we have a stable reference to the head page; but it
-	 * could be that between the compound_head() lookup and the refcount
-	 * increment, the compound page was split, in which case we'd end up
-	 * holding a reference on a page that has nothing to do with the page
+	 * At this point we have a stable reference to the folio; but it
+	 * could be that between calling page_folio() and the refcount
+	 * increment, the folio was split, in which case we'd end up
+	 * holding a reference on a folio that has nothing to do with the page
 	 * we were given anymore.
-	 * So now that the head page is stable, recheck that the pages still
-	 * belong together.
+	 * So now that the folio is stable, recheck that the page still
+	 * belongs to this folio.
 	 */
-	if (unlikely(compound_head(page) != head)) {
-		put_page_refs(head, refs);
-		return NULL;
+	if (unlikely(page_folio(page) != folio)) {
+		folio_put_refs(folio, refs);
+		goto retry;
 	}
 
-	return head;
+	return folio;
 }
 
 /**
@@ -128,8 +129,10 @@  struct page *try_grab_compound_head(struct page *page,
 				    int refs, unsigned int flags)
 {
 	if (flags & FOLL_GET)
-		return try_get_compound_head(page, refs);
+		return &try_get_folio(page, refs)->page;
 	else if (flags & FOLL_PIN) {
+		struct folio *folio;
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -143,29 +146,29 @@  struct page *try_grab_compound_head(struct page *page,
 		 * CAUTION: Don't use compound_head() on the page before this
 		 * point, the result won't be stable.
 		 */
-		page = try_get_compound_head(page, refs);
-		if (!page)
+		folio = try_get_folio(page, refs);
+		if (!folio)
 			return NULL;
 
 		/*
-		 * When pinning a compound page of order > 1 (which is what
+		 * When pinning a folio of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
-		 * track it, via hpage_pincount_add/_sub().
+		 * track it, via folio_pincount_add/_sub().
 		 *
-		 * However, be sure to *also* increment the normal page refcount
-		 * field at least once, so that the page really is pinned.
+		 * However, be sure to *also* increment the normal folio refcount
+		 * field at least once, so that the folio really is pinned.
 		 * That's why the refcount from the earlier
-		 * try_get_compound_head() is left intact.
+		 * try_get_folio() is left intact.
 		 */
-		if (hpage_pincount_available(page))
-			hpage_pincount_add(page, refs);
+		if (folio_pincount_available(folio))
+			folio_pincount_add(folio, refs);
 		else
-			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
+			folio_ref_add(folio,
+					refs * (GUP_PIN_COUNTING_BIAS - 1));
 
-		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
-				    refs);
+		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 
-		return page;
+		return &folio->page;
 	}
 
 	WARN_ON_ONCE(1);