mm: fix missing wake-up event for FSDAX pages

Message ID	20220704074054.32310-1-songmuchun@bytedance.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Muchun Song <songmuchun@bytedance.com> To: akpm@linux-foundation.org, willy@infradead.org, jgg@ziepe.ca, jhubbard@nvidia.com, william.kucharski@oracle.com, dan.j.williams@intel.com, jack@suse.cz Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, Muchun Song <songmuchun@bytedance.com> Subject: [PATCH] mm: fix missing wake-up event for FSDAX pages Date: Mon, 4 Jul 2022 15:40:54 +0800 Message-Id: <20220704074054.32310-1-songmuchun@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: fix missing wake-up event for FSDAX pages \| expand mm: fix missing wake-up event for FSDAX pages

Muchun Song July 4, 2022, 7:40 a.m. UTC

FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
1, then the page is freed.  The FSDAX pages can be pinned through GUP,
then they will be unpinned via unpin_user_page() using a folio variant
to put the page, however, folio variants did not consider this special
case, the result will be to miss a wakeup event (like the user of
__fuse_dax_break_layouts()).

Fixes: d8ddc099c6b3 ("mm/gup: Add gup_put_folio()")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

Matthew Wilcox July 4, 2022, 10:38 a.m. UTC | #1

On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> then they will be unpinned via unpin_user_page() using a folio variant
> to put the page, however, folio variants did not consider this special
> case, the result will be to miss a wakeup event (like the user of
> __fuse_dax_break_layouts()).

Argh, no.  The 1-based refcounts are a blight on the entire kernel.
They need to go away, not be pushed into folios as well.  I think
we're close to having that fixed, but until then, this should do
the trick?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc98ab012a9b..4cef5e0f78b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1129,18 +1129,18 @@ static inline bool is_zone_movable_page(const struct page *page)
 #if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-bool __put_devmap_managed_page(struct page *page);
-static inline bool put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page, int refs);
+static inline bool put_devmap_managed_page(struct page *page, int refs)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
 	if (!is_zone_device_page(page))
 		return false;
-	return __put_devmap_managed_page(page);
+	return __put_devmap_managed_page(page, refs);
 }
 
 #else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page, int refs)
 {
 	return false;
 }
@@ -1246,7 +1246,7 @@ static inline void put_page(struct page *page)
 	 * For some devmap managed pages we need to catch refcount transition
 	 * from 2 to 1:
 	 */
-	if (put_devmap_managed_page(&folio->page))
+	if (put_devmap_managed_page(&folio->page, 1))
 		return;
 	folio_put(folio);
 }
diff --git a/mm/gup.c b/mm/gup.c
index d1132b39aa8f..28df02121c78 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -88,7 +88,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	 * belongs to this folio.
 	 */
 	if (unlikely(page_folio(page) != folio)) {
-		folio_put_refs(folio, refs);
+		if (!put_devmap_managed_page(&folio->page, refs))
+			folio_put_refs(folio, refs);
 		goto retry;
 	}
 
@@ -177,6 +178,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
+	if (put_devmap_managed_page(&folio->page, refs))
+		return;
 	folio_put_refs(folio, refs);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index b870a659eee6..b25e40e3a11e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -499,7 +499,7 @@ void free_zone_device_page(struct page *page)
 }
 
 #ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page, int refs)
 {
 	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
 		return false;
@@ -509,7 +509,7 @@ bool __put_devmap_managed_page(struct page *page)
 	 * refcount is 1, then the page is free and the refcount is
 	 * stable because nobody holds a reference on the page.
 	 */
-	if (page_ref_dec_return(page) == 1)
+	if (page_ref_sub_return(page, refs) == 1)
 		wake_up_var(&page->_refcount);
 	return true;
 }
diff --git a/mm/swap.c b/mm/swap.c
index c6194cfa2af6..94e42a9bab92 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -960,7 +960,7 @@ void release_pages(struct page **pages, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_page(&folio->page))
+			if (put_devmap_managed_page(&folio->page, 1))
 				continue;
 			if (folio_put_testzero(folio))
 				free_zone_device_page(&folio->page);

Muchun Song July 4, 2022, 10:56 a.m. UTC | #2

On Mon, Jul 04, 2022 at 11:38:16AM +0100, Matthew Wilcox wrote:
> On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > then they will be unpinned via unpin_user_page() using a folio variant
> > to put the page, however, folio variants did not consider this special
> > case, the result will be to miss a wakeup event (like the user of
> > __fuse_dax_break_layouts()).
> 
> Argh, no.  The 1-based refcounts are a blight on the entire kernel.
> They need to go away, not be pushed into folios as well.  I think

I would be happy if this could go away.

> we're close to having that fixed, but until then, this should do
> the trick?
> 

The following fix looks good to me since it lowers the overhead as
much as possible

Thanks.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc98ab012a9b..4cef5e0f78b6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1129,18 +1129,18 @@ static inline bool is_zone_movable_page(const struct page *page)
>  #if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
>  
> -bool __put_devmap_managed_page(struct page *page);
> -static inline bool put_devmap_managed_page(struct page *page)
> +bool __put_devmap_managed_page(struct page *page, int refs);
> +static inline bool put_devmap_managed_page(struct page *page, int refs)
>  {
>  	if (!static_branch_unlikely(&devmap_managed_key))
>  		return false;
>  	if (!is_zone_device_page(page))
>  		return false;
> -	return __put_devmap_managed_page(page);
> +	return __put_devmap_managed_page(page, refs);
>  }
>  
>  #else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> -static inline bool put_devmap_managed_page(struct page *page)
> +static inline bool put_devmap_managed_page(struct page *page, int refs)
>  {
>  	return false;
>  }
> @@ -1246,7 +1246,7 @@ static inline void put_page(struct page *page)
>  	 * For some devmap managed pages we need to catch refcount transition
>  	 * from 2 to 1:
>  	 */
> -	if (put_devmap_managed_page(&folio->page))
> +	if (put_devmap_managed_page(&folio->page, 1))
>  		return;
>  	folio_put(folio);
>  }
> diff --git a/mm/gup.c b/mm/gup.c
> index d1132b39aa8f..28df02121c78 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -88,7 +88,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>  	 * belongs to this folio.
>  	 */
>  	if (unlikely(page_folio(page) != folio)) {
> -		folio_put_refs(folio, refs);
> +		if (!put_devmap_managed_page(&folio->page, refs))
> +			folio_put_refs(folio, refs);
>  		goto retry;
>  	}
>  
> @@ -177,6 +178,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  			refs *= GUP_PIN_COUNTING_BIAS;
>  	}
>  
> +	if (put_devmap_managed_page(&folio->page, refs))
> +		return;
>  	folio_put_refs(folio, refs);
>  }
>  
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b870a659eee6..b25e40e3a11e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -499,7 +499,7 @@ void free_zone_device_page(struct page *page)
>  }
>  
>  #ifdef CONFIG_FS_DAX
> -bool __put_devmap_managed_page(struct page *page)
> +bool __put_devmap_managed_page(struct page *page, int refs)
>  {
>  	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
>  		return false;
> @@ -509,7 +509,7 @@ bool __put_devmap_managed_page(struct page *page)
>  	 * refcount is 1, then the page is free and the refcount is
>  	 * stable because nobody holds a reference on the page.
>  	 */
> -	if (page_ref_dec_return(page) == 1)
> +	if (page_ref_sub_return(page, refs) == 1)
>  		wake_up_var(&page->_refcount);
>  	return true;
>  }
> diff --git a/mm/swap.c b/mm/swap.c
> index c6194cfa2af6..94e42a9bab92 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -960,7 +960,7 @@ void release_pages(struct page **pages, int nr)
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			if (put_devmap_managed_page(&folio->page))
> +			if (put_devmap_managed_page(&folio->page, 1))
>  				continue;
>  			if (folio_put_testzero(folio))
>  				free_zone_device_page(&folio->page);
>

Dan Williams July 12, 2022, 2:39 a.m. UTC | #3

Muchun Song wrote:
> On Mon, Jul 04, 2022 at 11:38:16AM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> > > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > > then they will be unpinned via unpin_user_page() using a folio variant
> > > to put the page, however, folio variants did not consider this special
> > > case, the result will be to miss a wakeup event (like the user of
> > > __fuse_dax_break_layouts()).
> > 
> > Argh, no.  The 1-based refcounts are a blight on the entire kernel.
> > They need to go away, not be pushed into folios as well.  I think
> 
> I would be happy if this could go away.

Continue to agree that this blight needs to end.

One of the pre-requisites to getting back to normal accounting of FSDAX
page pin counts was to first drop the usage of get_dev_pagemap() in the
GUP path:

https://lore.kernel.org/linux-mm/161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com/

That work stalled on notifying mappers of surprise removal events of FSDAX pfns.
However, Ruan has been spending some time on that problem recently:

https://lore.kernel.org/all/20220410171623.3788004-1-ruansy.fnst@fujitsu.com/

So, once I dig out from a bit of CXL backlog and review that effort the
next step that I see will be convert the FSDAX path to take typical
references vmf_insert() time. Unless I am missing a shorter path to get
this fixed up?

> > we're close to having that fixed, but until then, this should do
> > the trick?
> > 
> 
> The following fix looks good to me since it lowers the overhead as
> much as possible

This change looks good to me as well.

> 
> Thanks.
> 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index cc98ab012a9b..4cef5e0f78b6 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1129,18 +1129,18 @@ static inline bool is_zone_movable_page(const struct page *page)
> >  #if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
> >  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> >  
> > -bool __put_devmap_managed_page(struct page *page);
> > -static inline bool put_devmap_managed_page(struct page *page)
> > +bool __put_devmap_managed_page(struct page *page, int refs);
> > +static inline bool put_devmap_managed_page(struct page *page, int refs)
> >  {
> >  	if (!static_branch_unlikely(&devmap_managed_key))
> >  		return false;
> >  	if (!is_zone_device_page(page))
> >  		return false;
> > -	return __put_devmap_managed_page(page);
> > +	return __put_devmap_managed_page(page, refs);
> >  }
> >  
> >  #else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> > -static inline bool put_devmap_managed_page(struct page *page)
> > +static inline bool put_devmap_managed_page(struct page *page, int refs)
> >  {
> >  	return false;
> >  }
> > @@ -1246,7 +1246,7 @@ static inline void put_page(struct page *page)
> >  	 * For some devmap managed pages we need to catch refcount transition
> >  	 * from 2 to 1:
> >  	 */
> > -	if (put_devmap_managed_page(&folio->page))
> > +	if (put_devmap_managed_page(&folio->page, 1))
> >  		return;
> >  	folio_put(folio);
> >  }
> > diff --git a/mm/gup.c b/mm/gup.c
> > index d1132b39aa8f..28df02121c78 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -88,7 +88,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	 * belongs to this folio.
> >  	 */
> >  	if (unlikely(page_folio(page) != folio)) {
> > -		folio_put_refs(folio, refs);
> > +		if (!put_devmap_managed_page(&folio->page, refs))
> > +			folio_put_refs(folio, refs);
> >  		goto retry;
> >  	}
> >  
> > @@ -177,6 +178,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> >  			refs *= GUP_PIN_COUNTING_BIAS;
> >  	}
> >  
> > +	if (put_devmap_managed_page(&folio->page, refs))
> > +		return;
> >  	folio_put_refs(folio, refs);
> >  }
> >  
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index b870a659eee6..b25e40e3a11e 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -499,7 +499,7 @@ void free_zone_device_page(struct page *page)
> >  }
> >  
> >  #ifdef CONFIG_FS_DAX
> > -bool __put_devmap_managed_page(struct page *page)
> > +bool __put_devmap_managed_page(struct page *page, int refs)
> >  {
> >  	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >  		return false;
> > @@ -509,7 +509,7 @@ bool __put_devmap_managed_page(struct page *page)
> >  	 * refcount is 1, then the page is free and the refcount is
> >  	 * stable because nobody holds a reference on the page.
> >  	 */
> > -	if (page_ref_dec_return(page) == 1)
> > +	if (page_ref_sub_return(page, refs) == 1)
> >  		wake_up_var(&page->_refcount);
> >  	return true;
> >  }
> > diff --git a/mm/swap.c b/mm/swap.c
> > index c6194cfa2af6..94e42a9bab92 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -960,7 +960,7 @@ void release_pages(struct page **pages, int nr)
> >  				unlock_page_lruvec_irqrestore(lruvec, flags);
> >  				lruvec = NULL;
> >  			}
> > -			if (put_devmap_managed_page(&folio->page))
> > +			if (put_devmap_managed_page(&folio->page, 1))
> >  				continue;
> >  			if (folio_put_testzero(folio))
> >  				free_zone_device_page(&folio->page);
> >

Jason Gunthorpe July 21, 2022, 6:51 p.m. UTC | #4

On Mon, Jul 11, 2022 at 07:39:17PM -0700, Dan Williams wrote:
> Muchun Song wrote:
> > On Mon, Jul 04, 2022 at 11:38:16AM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> > > > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > > > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > > > then they will be unpinned via unpin_user_page() using a folio variant
> > > > to put the page, however, folio variants did not consider this special
> > > > case, the result will be to miss a wakeup event (like the user of
> > > > __fuse_dax_break_layouts()).
> > > 
> > > Argh, no.  The 1-based refcounts are a blight on the entire kernel.
> > > They need to go away, not be pushed into folios as well.  I think
> > 
> > I would be happy if this could go away.
> 
> Continue to agree that this blight needs to end.
> 
> One of the pre-requisites to getting back to normal accounting of FSDAX
> page pin counts was to first drop the usage of get_dev_pagemap() in the
> GUP path:
> 
> https://lore.kernel.org/linux-mm/161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> That work stalled on notifying mappers of surprise removal events of FSDAX pfns.

We already talked about this - once we have proper refcounting the
above is protected naturally by the proper refcounting. The reason it
is there is only because the refcount goes to 1 and the page is
re-used so the natural protection in GUP doesn't work.

We don't need surprise removal events to fix this, we need the FS side
to hold a reference when it puts the pages into the PTEs..

> So, once I dig out from a bit of CXL backlog and review that effort the
> next step that I see will be convert the FSDAX path to take typical
> references vmf_insert() time. Unless I am missing a shorter path to get
> this fixed up?

Yeah, just do this. IIRC Christoph already did all the infrastructure now,
just take the correct references and remove the special cases that
turn off the new infrastructure for fsdax.

When I looked at it a long while ago it seemd to require some
understanding of the fsdax code and exactly what the lifecycle model
was supposed to be there.

Jason

Dan Williams July 22, 2022, 12:27 a.m. UTC | #5

Jason Gunthorpe wrote:
> On Mon, Jul 11, 2022 at 07:39:17PM -0700, Dan Williams wrote:
> > Muchun Song wrote:
> > > On Mon, Jul 04, 2022 at 11:38:16AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> > > > > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > > > > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > > > > then they will be unpinned via unpin_user_page() using a folio variant
> > > > > to put the page, however, folio variants did not consider this special
> > > > > case, the result will be to miss a wakeup event (like the user of
> > > > > __fuse_dax_break_layouts()).
> > > > 
> > > > Argh, no.  The 1-based refcounts are a blight on the entire kernel.
> > > > They need to go away, not be pushed into folios as well.  I think
> > > 
> > > I would be happy if this could go away.
> > 
> > Continue to agree that this blight needs to end.
> > 
> > One of the pre-requisites to getting back to normal accounting of FSDAX
> > page pin counts was to first drop the usage of get_dev_pagemap() in the
> > GUP path:
> > 
> > https://lore.kernel.org/linux-mm/161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com/
> > 
> > That work stalled on notifying mappers of surprise removal events of FSDAX pfns.
> 
> We already talked about this - once we have proper refcounting the
> above is protected naturally by the proper refcounting. The reason it
> is there is only because the refcount goes to 1 and the page is
> re-used so the natural protection in GUP doesn't work.
> 
> We don't need surprise removal events to fix this, we need the FS side
> to hold a reference when it puts the pages into the PTEs..

Ah, true. Once the FS reference can make device removal hang on the open
references then that is good enough for fixing up the dax-page reference
count problem.

The notification to force the FS to drop its references is just a
behaviour improvment at that point.

> 
> > So, once I dig out from a bit of CXL backlog and review that effort the
> > next step that I see will be convert the FSDAX path to take typical
> > references vmf_insert() time. Unless I am missing a shorter path to get
> > this fixed up?
> 
> Yeah, just do this. IIRC Christoph already did all the infrastructure now,
> just take the correct references and remove the special cases that
> turn off the new infrastructure for fsdax.
> 
> When I looked at it a long while ago it seemd to require some
> understanding of the fsdax code and exactly what the lifecycle model
> was supposed to be there.

CXL development has reached a good break point for me to hop over and
take a look at this now.

Speaking of CXL, if you have any heartburn on that rework of
devm_request_free_mem_region(), let me know:

https://lore.kernel.org/all/62d97a89d66a1_17f3e82949e@dwillia2-xfh.jf.intel.com.notmuch/

mm: fix missing wake-up event for FSDAX pages

Commit Message

Comments

Patch