
[v10,08/33] mm: Add folio_try_get_rcu

Message ID: 20210511214735.1836149-9-willy@infradead.org
State: New, archived
Series: Memory folios

Commit Message

Matthew Wilcox May 11, 2021, 9:47 p.m. UTC
This is the equivalent of page_cache_get_speculative().  Also add
folio_ref_try_add_rcu() (the equivalent of page_cache_add_speculative())
and folio_try_get() (the equivalent of get_page_unless_zero()).

The new kernel-doc attempts to explain from the user's point of view
when to use folio_try_get_rcu() and when to use folio_try_get(),
because there seems to be some confusion currently between the users of
page_cache_get_speculative() and get_page_unless_zero().

Reimplement page_cache_add_speculative() and page_cache_get_speculative()
as wrappers around the folio equivalents, but leave get_page_unless_zero()
alone for now.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page_ref.h | 72 ++++++++++++++++++++++++++++++++--
 include/linux/pagemap.h  | 84 ++--------------------------------------
 mm/filemap.c             | 20 ++++++++++
 3 files changed, 93 insertions(+), 83 deletions(-)
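
As a rough illustration of the lookup-side pattern (a minimal sketch, not
the kernel's implementation: lookup_folio() is a made-up name, retry and
shadow entries are glossed over, and folio_put() is assumed as the release
function from earlier in the series), a caller of folio_try_get_rcu()
might look like:

static struct folio *lookup_folio(struct address_space *mapping,
		pgoff_t index)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);			/* 1. load from i_pages */
	if (!folio || xa_is_value(folio))
		goto out;
	if (!folio_try_get_rcu(folio))		/* 2. refcount was zero */
		goto repeat;
	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);		/* 3. raced with removal */
		goto repeat;
	}
out:
	rcu_read_unlock();
	return folio;
}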

Comments

Vlastimil Babka May 14, 2021, 12:11 p.m. UTC | #1
On 5/11/21 11:47 PM, Matthew Wilcox (Oracle) wrote:
> This is the equivalent of page_cache_get_speculative().  Also add
> folio_ref_try_add_rcu() (the equivalent of page_cache_add_speculative())
> and folio_try_get() (the equivalent of get_page_unless_zero()).
> 
> The new kernel-doc attempts to explain from the user's point of view
> when to use folio_try_get_rcu() and when to use folio_try_get(),
> because there seems to be some confusion currently between the users of
> page_cache_get_speculative() and get_page_unless_zero().
> 
> Reimplement page_cache_add_speculative() and page_cache_get_speculative()
> as wrappers around the folio equivalents, but leave get_page_unless_zero()
> alone for now.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Christoph Hellwig May 27, 2021, 8:16 a.m. UTC | #2
On Tue, May 11, 2021 at 10:47:10PM +0100, Matthew Wilcox (Oracle) wrote:
> -static inline int page_ref_add_unless(struct page *page, int nr, int u)
> +static inline bool page_ref_add_unless(struct page *page, int nr, int u)
>  {
> -	int ret = atomic_add_unless(&page->_refcount, nr, u);
> +	bool ret = atomic_add_unless(&page->_refcount, nr, u);
>  
>  	if (page_ref_tracepoint_active(page_ref_mod_unless))
>  		__page_ref_mod_unless(page, nr, ret);
>  	return ret;
>  }

Unrelated but neat cleanup.

>  
> -static inline int folio_ref_add_unless(struct folio *folio, int nr, int u)
> +static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
>  {
>  	return page_ref_add_unless(&folio->page, nr, u);
>  }

This should probably go into the patch adding folio_ref_add_unless.

> +static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)

Should this have a __ prefix and/or a "do not use directly" comment?

> +{
> +#ifdef CONFIG_TINY_RCU
> +	/*
> +	 * The caller guarantees the folio will not be freed from interrupt
> +	 * context, so (on !SMP) we only need preemption to be disabled
> +	 * and TINY_RCU does that for us.
> +	 */
> +# ifdef CONFIG_PREEMPT_COUNT
> +	VM_BUG_ON(!in_atomic() && !irqs_disabled());
> +# endif

	VM_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
		  !in_atomic() && !irqs_disabled());

?

> +	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> +	folio_ref_add(folio, count);
> +#else
> +	if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> +		/* Either the folio has been freed, or will be freed. */
> +		return false;
> +	}
> +#endif
> +	return true;

But is this TINY_RCU optimization really worth it?  I guess we're just
preserving it from the existing code and not rocking the boat...

> @@ -1746,6 +1746,26 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(page_cache_prev_miss);
>  
> +/*
> + * Lockless page cache protocol:
> + * On the lookup side:
> + * 1. Load the folio from i_pages
> + * 2. Increment the refcount if it's not zero
> + * 3. If the folio is not found by xas_reload(), put the refcount and retry
> + *
> + * On the removal side:
> + * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
> + * B. Remove the page from i_pages
> + * C. Return the page to the page allocator
> + *
> + * This means that any page may have its reference count temporarily
> + * increased by a speculative page cache (or fast GUP) lookup as it can
> + * be allocated by another user before the RCU grace period expires.
> + * Because the refcount temporarily acquired here may end up being the
> + * last refcount on the page, any page allocation must be freeable by
> + * put_folio().
> + */
> +
>  /*
>   * mapping_get_entry - Get a page cache entry.
>   * @mapping: the address_space to search

Is this really a good place for the comment?  I'd expect it either near
a relevant function or at the top of a file.
Matthew Wilcox June 5, 2021, 4:26 a.m. UTC | #3
On Thu, May 27, 2021 at 09:16:42AM +0100, Christoph Hellwig wrote:
> On Tue, May 11, 2021 at 10:47:10PM +0100, Matthew Wilcox (Oracle) wrote:
> > +static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> 
> Should this have a __ prefix and/or a "do not use directly" comment?

I think it will get used directly ... its page counterpart is:

mm/gup.c:       if (unlikely(!page_cache_add_speculative(head, refs)))

I deliberately left kernel-doc off this function so it's not described,
but described folio_try_get_rcu() in excruciating detail.  I hope that's
enough.  There's no comment on page_cache_add_speculative() today, so
again, we're preserving the status quo.

> > +{
> > +#ifdef CONFIG_TINY_RCU
> > +	/*
> > +	 * The caller guarantees the folio will not be freed from interrupt
> > +	 * context, so (on !SMP) we only need preemption to be disabled
> > +	 * and TINY_RCU does that for us.
> > +	 */
> > +# ifdef CONFIG_PREEMPT_COUNT
> > +	VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > +# endif
> 
> 	VM_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
> 		  !in_atomic() && !irqs_disabled());
> 
> ?

I'm just moving it over, and honestly, I think it's slightly clearer
this way.  We can't check it if PREEMPT_COUNT isn't enabled, and I
think that's expressed better by the ifdef than the IS_ENABLED().

> > +	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > +	folio_ref_add(folio, count);
> > +#else
> > +	if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> > +		/* Either the folio has been freed, or will be freed. */
> > +		return false;
> > +	}
> > +#endif
> > +	return true;
> 
> But is this TINY_RCU optimization really worth it?  I guess we're just
> preserving it from the existing code and not rocking the boat...

I wondered that myself.  It's been there since Nick introduced it in
2008 with commit e286781d5f2e.  We certainly cared about small systems
more then, but apparently we still care about UP enough to maintain
CONFIG_TINY_RCU, so maybe this optimisation is still relevant.

> > @@ -1746,6 +1746,26 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> >  }
> >  EXPORT_SYMBOL(page_cache_prev_miss);
> >  
> > +/*
> > + * Lockless page cache protocol:
> > + * On the lookup side:
> > + * 1. Load the folio from i_pages
> > + * 2. Increment the refcount if it's not zero
> > + * 3. If the folio is not found by xas_reload(), put the refcount and retry
> > + *
> > + * On the removal side:
> > + * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
> > + * B. Remove the page from i_pages
> > + * C. Return the page to the page allocator
> > + *
> > + * This means that any page may have its reference count temporarily
> > + * increased by a speculative page cache (or fast GUP) lookup as it can
> > + * be allocated by another user before the RCU grace period expires.
> > + * Because the refcount temporarily acquired here may end up being the
> > + * last refcount on the page, any page allocation must be freeable by
> > + * put_folio().
> > + */
> > +
> >  /*
> >   * mapping_get_entry - Get a page cache entry.
> >   * @mapping: the address_space to search
> 
> Is this really a good place for the comment?  I'd expect it either near
> a relevant function or at the top of a file.

It's right before mapping_get_entry(), which is the main lookup function
for the page cache, so I think it meets your first criterion?
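
To make the mm/gup.c direct-use case mentioned above concrete, here is a
hedged sketch of a gup-fast-style caller taking several references at once
(the helper name try_grab_folio_refs() is hypothetical, and page_folio()
is assumed to be available from earlier in the series):

static struct folio *try_grab_folio_refs(struct page *page, int refs)
{
	struct folio *folio = page_folio(page);

	/* Speculative: fails if the folio is (being) freed or frozen. */
	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
		return NULL;	/* caller retries or bails out */

	return folio;
}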
Christoph Hellwig June 6, 2021, 2:13 p.m. UTC | #4
On Sat, Jun 05, 2021 at 05:26:59AM +0100, Matthew Wilcox wrote:
> On Thu, May 27, 2021 at 09:16:42AM +0100, Christoph Hellwig wrote:
> > On Tue, May 11, 2021 at 10:47:10PM +0100, Matthew Wilcox (Oracle) wrote:
> > > +static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> > 
> > Should this have a __ prefix and/or a "do not use directly" comment?
> 
> I think it will get used directly ... its page counterpart is:
> 
> mm/gup.c:       if (unlikely(!page_cache_add_speculative(head, refs)))
> 
> I deliberately left kernel-doc off this function so it's not described,
> but described folio_try_get_rcu() in excruciating detail.  I hope that's
> enough.  There's no comment on page_cache_add_speculative() today, so
> again, we're preserving the status quo.

Ok.  Seems a little weird, but so does much in this area.

> > Is this really a good place for the comment?  I'd expect it either near
> > a relevant function or at the top of a file.
> 
> It's right before mapping_get_entry(), which is the main lookup function
> for the page cache, so I think it meets your first criterion?

I guess it does, yes.
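
For completeness, a minimal sketch of the removal side of the protocol
discussed above (remove_folio() is a hypothetical name, the caller is
assumed to hold the i_pages lock, and folio_ref_freeze() is assumed as
the folio analogue of page_ref_freeze() shown in the patch below):

static bool remove_folio(struct address_space *mapping, struct folio *folio,
		int expected)
{
	XA_STATE(xas, &mapping->i_pages, folio->index);

	/* A. Freeze: refcount drops to 0 iff nobody else holds a reference. */
	if (!folio_ref_freeze(folio, expected))
		return false;	/* a speculative lookup got there first */

	/* B. Unlink from i_pages while frozen. */
	xas_store(&xas, NULL);

	/* C. The caller now returns the folio to the page allocator. */
	return true;
}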

Patch

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 85816b2c0496..2e677e6ad09f 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -233,20 +233,86 @@  static inline int folio_ref_dec_return(struct folio *folio)
 	return page_ref_dec_return(&folio->page);
 }
 
-static inline int page_ref_add_unless(struct page *page, int nr, int u)
+static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-	int ret = atomic_add_unless(&page->_refcount, nr, u);
+	bool ret = atomic_add_unless(&page->_refcount, nr, u);
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
 	return ret;
 }
 
-static inline int folio_ref_add_unless(struct folio *folio, int nr, int u)
+static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
 {
 	return page_ref_add_unless(&folio->page, nr, u);
 }
 
+/**
+ * folio_try_get - Attempt to increase the refcount on a folio.
+ * @folio: The folio.
+ *
+ * If you do not already have a reference to a folio, you can attempt to
+ * get one using this function.  It may fail if, for example, the folio
+ * has been freed since you found a pointer to it, or it is frozen for
+ * the purposes of splitting or migration.
+ *
+ * Return: True if the reference count was successfully incremented.
+ */
+static inline bool folio_try_get(struct folio *folio)
+{
+	return folio_ref_add_unless(folio, 1, 0);
+}
+
+static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
+{
+#ifdef CONFIG_TINY_RCU
+	/*
+	 * The caller guarantees the folio will not be freed from interrupt
+	 * context, so (on !SMP) we only need preemption to be disabled
+	 * and TINY_RCU does that for us.
+	 */
+# ifdef CONFIG_PREEMPT_COUNT
+	VM_BUG_ON(!in_atomic() && !irqs_disabled());
+# endif
+	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+	folio_ref_add(folio, count);
+#else
+	if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
+		/* Either the folio has been freed, or will be freed. */
+		return false;
+	}
+#endif
+	return true;
+}
+
+/**
+ * folio_try_get_rcu - Attempt to increase the refcount on a folio.
+ * @folio: The folio.
+ *
+ * This is a version of folio_try_get() optimised for non-SMP kernels.
+ * If you are still holding the rcu_read_lock() after looking up the
+ * page and know that the page cannot have its refcount decreased to
+ * zero in interrupt context, you can use this instead of folio_try_get().
+ *
+ * Example users include get_user_pages_fast() (as pages are not unmapped
+ * from interrupt context) and the page cache lookups (as pages are not
+ * truncated from interrupt context).  We also know that pages are not
+ * frozen in interrupt context for the purposes of splitting or migration.
+ *
+ * You can also use this function if you're holding a lock that prevents
+ * pages being frozen & removed; eg the i_pages lock for the page cache
+ * or the mmap_sem or page table lock for page tables.  In this case,
+ * it will always succeed, and you could have used a plain folio_get(),
+ * but it's sometimes more convenient to have a common function called
+ * from both locked and RCU-protected contexts.
+ *
+ * Return: True if the reference count was successfully incremented.
+ */
+static inline bool folio_try_get_rcu(struct folio *folio)
+{
+	return folio_ref_try_add_rcu(folio, 1);
+}
+
 static inline int page_ref_freeze(struct page *page, int count)
 {
 	int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a4bd41128bf3..4900e64c880d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -172,91 +172,15 @@  static inline struct address_space *page_mapping_file(struct page *page)
 	return page_mapping(page);
 }
 
-/*
- * speculatively take a reference to a page.
- * If the page is free (_refcount == 0), then _refcount is untouched, and 0
- * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
- *
- * This function must be called inside the same rcu_read_lock() section as has
- * been used to lookup the page in the pagecache radix-tree (or page table):
- * this allows allocators to use a synchronize_rcu() to stabilize _refcount.
- *
- * Unless an RCU grace period has passed, the count of all pages coming out
- * of the allocator must be considered unstable. page_count may return higher
- * than expected, and put_page must be able to do the right thing when the
- * page has been finished with, no matter what it is subsequently allocated
- * for (because put_page is what is used here to drop an invalid speculative
- * reference).
- *
- * This is the interesting part of the lockless pagecache (and lockless
- * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
- * has the following pattern:
- * 1. find page in radix tree
- * 2. conditionally increment refcount
- * 3. check the page is still in pagecache (if no, goto 1)
- *
- * Remove-side that cares about stability of _refcount (eg. reclaim) has the
- * following (with the i_pages lock held):
- * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
- * B. remove page from pagecache
- * C. free the page
- *
- * There are 2 critical interleavings that matter:
- * - 2 runs before A: in this case, A sees elevated refcount and bails out
- * - A runs before 2: in this case, 2 sees zero refcount and retries;
- *   subsequently, B will complete and 1 will find no page, causing the
- *   lookup to return NULL.
- *
- * It is possible that between 1 and 2, the page is removed then the exact same
- * page is inserted into the same position in pagecache. That's OK: the
- * old find_get_page using a lock could equally have run before or after
- * such a re-insertion, depending on order that locks are granted.
- *
- * Lookups racing against pagecache insertion isn't a big problem: either 1
- * will find the page or it will not. Likewise, the old find_get_page could run
- * either before the insertion or afterwards, depending on timing.
- */
-static inline int __page_cache_add_speculative(struct page *page, int count)
+static inline bool page_cache_add_speculative(struct page *page, int count)
 {
-#ifdef CONFIG_TINY_RCU
-# ifdef CONFIG_PREEMPT_COUNT
-	VM_BUG_ON(!in_atomic() && !irqs_disabled());
-# endif
-	/*
-	 * Preempt must be disabled here - we rely on rcu_read_lock doing
-	 * this for us.
-	 *
-	 * Pagecache won't be truncated from interrupt context, so if we have
-	 * found a page in the radix tree here, we have pinned its refcount by
-	 * disabling preempt, and hence no need for the "speculative get" that
-	 * SMP requires.
-	 */
-	VM_BUG_ON_PAGE(page_count(page) == 0, page);
-	page_ref_add(page, count);
-
-#else
-	if (unlikely(!page_ref_add_unless(page, count, 0))) {
-		/*
-		 * Either the page has been freed, or will be freed.
-		 * In either case, retry here and the caller should
-		 * do the right thing (see comments above).
-		 */
-		return 0;
-	}
-#endif
 	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	return 1;
-}
-
-static inline int page_cache_get_speculative(struct page *page)
-{
-	return __page_cache_add_speculative(page, 1);
+	return folio_ref_try_add_rcu((struct folio *)page, count);
 }
 
-static inline int page_cache_add_speculative(struct page *page, int count)
+static inline bool page_cache_get_speculative(struct page *page)
 {
-	return __page_cache_add_speculative(page, count);
+	return page_cache_add_speculative(page, 1);
 }
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 66f7e9fdfbc4..817a47059bd0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1746,6 +1746,26 @@  pgoff_t page_cache_prev_miss(struct address_space *mapping,
 }
 EXPORT_SYMBOL(page_cache_prev_miss);
 
+/*
+ * Lockless page cache protocol:
+ * On the lookup side:
+ * 1. Load the folio from i_pages
+ * 2. Increment the refcount if it's not zero
+ * 3. If the folio is not found by xas_reload(), put the refcount and retry
+ *
+ * On the removal side:
+ * A. Freeze the page (by zeroing the refcount if nobody else has a reference)
+ * B. Remove the page from i_pages
+ * C. Return the page to the page allocator
+ *
+ * This means that any page may have its reference count temporarily
+ * increased by a speculative page cache (or fast GUP) lookup as it can
+ * be allocated by another user before the RCU grace period expires.
+ * Because the refcount temporarily acquired here may end up being the
+ * last refcount on the page, any page allocation must be freeable by
+ * put_folio().
+ */
+
 /*
  * mapping_get_entry - Get a page cache entry.
  * @mapping: the address_space to search