
[v3,10/25] fsdax: Introduce pgmap_request_folios()

Message ID 166579187573.2236710.10151157417629496558.stgit@dwillia2-xfh.jf.intel.com (mailing list archive)
State New, archived
Series Fix the DAX-gup mistake

Commit Message

Dan Williams Oct. 14, 2022, 11:57 p.m. UTC
The next step in sanitizing DAX page and pgmap lifetime is to take page
references when a pgmap user maps a page or otherwise puts it into use.
Unlike the page allocator, which picks the page/folio, ZONE_DEVICE
users know in advance which folio they want to access.  Additionally,
ZONE_DEVICE implementations know when the pgmap is alive. Introduce
pgmap_request_folios() that pins @nr_folios folios at a time provided
they are contiguous and of the same folio_order().

Some WARN assertions are added to document expectations and catch bugs
in future kernel work, like a potential conversion of fsdax to use
multi-page folios, but they otherwise are not expected to fire.

Note that the paired pgmap_release_folios() implementation temporarily,
in this patch, takes an @pgmap argument to drop pgmap references. A
follow-on patch arranges for free_zone_device_page() to drop pgmap
references in all cases. In other words, the intent is that only
folio_put() (on each folio requested via pgmap_request_folios()) is
needed to undo pgmap_request_folios().

The intent is that this also replaces zone_device_page_init(), but that
too requires some more preparatory reworks to unify the various
MEMORY_DEVICE_* types.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   32 ++++++++++++++++-----
 include/linux/memremap.h |   17 +++++++++++
 mm/memremap.c            |   70 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 8 deletions(-)
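
For orientation, a minimal sketch of the intended calling convention,
mirroring the dax_associate_entry() / dax_zap_pages() hunks in the patch
below; the helper names are illustrative and not part of the series:

/* Illustrative only: how a pgmap user pairs request and release for one folio. */
static vm_fault_t dax_map_one_folio(struct dev_pagemap *pgmap,
				    struct folio *folio,
				    struct address_space *mapping,
				    pgoff_t index)
{
	/*
	 * Caller guarantees @pgmap is live (e.g. holds a lock that keeps it
	 * from entering its dying state) and that nothing races this folio.
	 */
	if (!pgmap_request_folios(pgmap, folio, 1))
		return VM_FAULT_SIGBUS;

	folio->mapping = mapping;
	folio->index = index;
	return 0;
}

static void dax_unmap_one_folio(struct dev_pagemap *pgmap, struct folio *folio)
{
	folio->mapping = NULL;
	/*
	 * Drops the reference taken in dax_map_one_folio(); the pgmap
	 * reference is dropped once the folio goes idle.
	 */
	pgmap_release_folios(pgmap, folio, 1);
}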

Comments

Alistair Popple Oct. 17, 2022, 6:31 a.m. UTC | #1
Dan Williams <dan.j.williams@intel.com> writes:

[...]

> +/**
> + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> + * @pgmap: host page map for the folio array
> + * @folio: start of the folio list, all subsequent folios have same folio_size()
> + *
> + * Caller is responsible for @pgmap remaining live for the duration of
> + * this call. Caller is also responsible for not racing requests for the
> + * same folios.
> + */
> +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> +			  int nr_folios)

All callers of pgmap_request_folios() I could see call this with
nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
redundant arguments and simplify it to
pgmap_request_folios(struct folio *folio)?

> +{
> +	struct folio *iter;
> +	int i;
> +
> +	/*
> +	 * All of the WARNs below are for catching bugs in future
> +	 * development that changes the assumptions of:
> +	 * 1/ uniform folios in @pgmap
> +	 * 2/ @pgmap death does not race this routine.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +
> +	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> +		return false;
> +
> +	for (iter = folio_next(folio), i = 1; i < nr_folios;
> +	     iter = folio_next(iter), i++)
> +		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> +			return false;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		folio_ref_inc(iter);
> +		if (folio_ref_count(iter) == 1)
> +			percpu_ref_tryget(&pgmap->ref);
> +	}
> +
> +	return true;
> +}
> +
> +void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		if (!put_devmap_managed_page(&iter->page))
> +			folio_put(iter);
> +		if (!folio_ref_count(iter))
> +			put_dev_pagemap(pgmap);
> +	}
> +}
> +
>  #ifdef CONFIG_FS_DAX
>  bool __put_devmap_managed_page_refs(struct page *page, int refs)
>  {
Jason Gunthorpe Oct. 17, 2022, 7:41 p.m. UTC | #2
On Fri, Oct 14, 2022 at 04:57:55PM -0700, Dan Williams wrote:

> +/**
> + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> + * @pgmap: host page map for the folio array
> + * @folio: start of the folio list, all subsequent folios have same folio_size()
> + *
> + * Caller is responsible for @pgmap remaining live for the duration of
> + * this call. Caller is also responsible for not racing requests for the
> + * same folios.
> + */
> +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> +			  int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	/*
> +	 * All of the WARNs below are for catching bugs in future
> +	 * development that changes the assumptions of:
> +	 * 1/ uniform folios in @pgmap
> +	 * 2/ @pgmap death does not race this routine.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +
> +	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> +		return false;
> +
> +	for (iter = folio_next(folio), i = 1; i < nr_folios;
> +	     iter = folio_next(iter), i++)
> +		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> +			return false;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		folio_ref_inc(iter);
> +		if (folio_ref_count(iter) == 1)

This seems strange, I would say if the folio doesn't have 0 ref at
this point it is a caller bug. The caller should only be able to
"request" folios once.

Ie the caller should not attempt to request a folio until it has been
released back via free page.

From a FS viewpoint this makes sense: if we establish a new inode
then those pages under the inode had better be currently exclusive to
the inode, or we have some rogue entity touching them.

As is, this looks racy because if you assume transient refs are now
allowed then there could be a 2->1 transition immediately after
ref_inc.

Having a release operation seems unworkable for a refcounted structure:

> +void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		if (!put_devmap_managed_page(&iter->page))
> +			folio_put(iter);
> +		if (!folio_ref_count(iter))
> +			put_dev_pagemap(pgmap);

Because if we don't have a 0 refcount here (e.g. due to a transient
external page ref) we have now lost the put_dev_pagemap()

Jason
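
To make the objection concrete, a hypothetical stricter variant (not what
the posted patch does): a non-idle folio at request time is treated as a
caller bug, and every requested folio pins the pgmap unconditionally so
the release side stays balanced even if transient references show up
later.

/* Hypothetical per-folio request, per the review comments above. */
static bool __pgmap_request_folio(struct dev_pagemap *pgmap, struct folio *folio)
{
	/* The folio must be free: requesting a live folio is a caller bug. */
	if (WARN_ON_ONCE(folio_ref_count(folio)))
		return false;

	/* One pgmap reference per requested folio, dropped on final free. */
	if (!percpu_ref_tryget_live(&pgmap->ref))
		return false;

	folio_ref_inc(folio);
	return true;
}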
Dan Williams Oct. 17, 2022, 8:06 p.m. UTC | #3
Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> [...]
> 
> > +/**
> > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > + * @pgmap: host page map for the folio array
> > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > + *
> > + * Caller is responsible for @pgmap remaining live for the duration of
> > + * this call. Caller is also responsible for not racing requests for the
> > + * same folios.
> > + */
> > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > +			  int nr_folios)
> 
> All callers of pgmap_request_folios() I could see call this with
> nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> redundant arguments and simplify it to
> pgmap_request_folios(struct folio *folio)?

The rationale for the @pgmap argument being separate is to make clear
that the caller must assert that pgmap is alive before passing it to
pgmap_request_folios(), and that @pgmap is constant for the span of
folios requested.

However, I see your point that these arguments are redundant so I would
not be opposed to a helper like:

pgmap_request_folio(struct folio *folio)

...that documents that it pulls the pgmap from the folio and because of
that does not support requesting more than one folio.
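
A hypothetical sketch of that single-folio helper, layered on the v3
signature (not part of the posted series):

/* Hypothetical convenience wrapper: derives the pgmap from the folio itself. */
static inline bool pgmap_request_folio(struct folio *folio)
{
	struct dev_pagemap *pgmap = folio_page(folio, 0)->pgmap;

	/* Caller must still guarantee @pgmap stays live for the duration. */
	return pgmap_request_folios(pgmap, folio, 1);
}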
Jason Gunthorpe Oct. 17, 2022, 8:11 p.m. UTC | #4
On Mon, Oct 17, 2022 at 01:06:25PM -0700, Dan Williams wrote:
> Alistair Popple wrote:
> > 
> > Dan Williams <dan.j.williams@intel.com> writes:
> > 
> > [...]
> > 
> > > +/**
> > > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > > + * @pgmap: host page map for the folio array
> > > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > > + *
> > > + * Caller is responsible for @pgmap remaining live for the duration of
> > > + * this call. Caller is also responsible for not racing requests for the
> > > + * same folios.
> > > + */
> > > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > > +			  int nr_folios)
> > 
> > All callers of pgmap_request_folios() I could see call this with
> > nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> > redundant arguments and simplify it to
> > pgmap_request_folios(struct folio *folio)?
> 
> The rationale for the @pgmap argument being separate is to make clear
> that the caller must assert that pgmap is alive before passing it to
> pgmap_request_folios(), and that @pgmap is constant for the span of
> folios requested.

The API is kind of weird to take in a folio - it should take in the
offset in bytes from the start of the pgmap and "create" usable
non-free folios.

A good design is that nothing has a struct folio * unless it can also
prove it has a non-zero refcount on it, and that is simply impossible
for any caller at this point.

Jason
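
For illustration, one possible shape of the offset-based interface
described above; hypothetical, and it assumes a single-range pgmap with a
free, page-aligned span:

/*
 * Hypothetical: turn a byte offset into @pgmap into a referenced folio so
 * that no caller ever holds a struct folio * without a reference on it.
 */
struct folio *pgmap_request_folio_at(struct dev_pagemap *pgmap,
				     unsigned long offset)
{
	struct page *page;
	struct folio *folio;

	/* Fail the request if the pgmap is already shutting down. */
	if (!percpu_ref_tryget_live(&pgmap->ref))
		return NULL;

	page = pfn_to_page(PHYS_PFN(pgmap->range.start + offset));
	folio = page_folio(page);

	/* The folio must be free; a live reference here is a caller bug. */
	if (WARN_ON_ONCE(folio_ref_count(folio))) {
		percpu_ref_put(&pgmap->ref);
		return NULL;
	}

	folio_ref_inc(folio);
	return folio;
}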
Dan Williams Oct. 17, 2022, 8:51 p.m. UTC | #5
Jason Gunthorpe wrote:
> On Mon, Oct 17, 2022 at 01:06:25PM -0700, Dan Williams wrote:
> > Alistair Popple wrote:
> > > 
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > > 
> > > [...]
> > > 
> > > > +/**
> > > > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > > > + * @pgmap: host page map for the folio array
> > > > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > > > + *
> > > > + * Caller is responsible for @pgmap remaining live for the duration of
> > > > + * this call. Caller is also responsible for not racing requests for the
> > > > + * same folios.
> > > > + */
> > > > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > > > +			  int nr_folios)
> > > 
> > > All callers of pgmap_request_folios() I could see call this with
> > > nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> > > redundant arguments and simplify it to
> > > pgmap_request_folios(struct folio *folio)?
> > 
> > The rationale for the @pgmap argument being separate is to make clear
> > that the caller must assert that pgmap is alive before passing it to
> > pgmap_request_folios(), and that @pgmap is constant for the span of
> > folios requested.
> 
> The api is kind of weird to take in a folio - it should take in the
> offset in bytes from the start of the pgmap and "create" usable
> non-free folios.

Plumbing that is non-trivial. Recall that in the DAX case the filesystem
can be sitting on top of a stack of device-mapper-dax devices routing to
more than one pgmap, and the pgmap can be made up of discontiguous
ranges. The filesystem knows the offset into the dax-device and then
asks that device to translate that to a pfn. From pfn the optional (see
CONFIG_FS_DAX_LIMITED) pgmap can be retrieved. Once DAX knows what pfn
it wants, while holding a lock that prevents the pgmap from entering its
dying state, it can request to pin that pfn.

I arrived at the interface being folio based because the order of the
pin(s) also needs to be conveyed and the order should be consistent.

> A good design is that nothing has a struct folio * unless it can also
> prove it has a non-zero refcount on it, and that is simply impossible
> for any caller at this point.

I agree that's a good design, and I think it is a bug to make this
request on anything but a zero-refcount folio, but plumbing pgmap
offsets to replace pfn_to_page() would require dax_direct_access() to
shift from a pfn lookup to a pgmap + offset lookup.

CONFIG_FS_DAX_LIMITED is something that needs to be deleted, but for now
it supports DAX without a pgmap, where "limited" is just meant for XIP
(eXecute-In-Place), no GUP support. Once that goes then
dax_direct_access() could evolve into an interface that assumes a pgmap
is present. Until then I think pgmap_request_folios() at least puts out
the immediate fire of making people contend with oddly refcounted DAX
pages.
Jason Gunthorpe Oct. 17, 2022, 11:57 p.m. UTC | #6
On Mon, Oct 17, 2022 at 01:51:18PM -0700, Dan Williams wrote:

> > A good design is that nothing has a struct folio * unless it can also
> > prove it has a non-zero refcount on it, and that is simply impossible
> > for any caller at this point.
> 
> I agree that's a good design, and I think it is a bug to make this
> request on anything but a zero-refcount folio, but plumbing pgmap
> offsets to replace pfn_to_page() would require dax_direct_access() to
> shift from a pfn lookup to a pgmap + offset lookup.

That is fair for this series, but maybe the other pgmap users shouldn't be
converted to use this style of API?

Jason
Dan Williams Oct. 18, 2022, 12:19 a.m. UTC | #7
Jason Gunthorpe wrote:
> On Mon, Oct 17, 2022 at 01:51:18PM -0700, Dan Williams wrote:
> 
> > > A good design is that nothing has a struct folio * unless it can also
> > > prove it has a non-zero refcount on it, and that is simply impossible
> > > for any caller at this point.
> > 
> > I agree that's a good design, and I think it is a bug to make this
> > request on anything but a zero-refcount folio, but plumbing pgmap
> > offsets to replace pfn_to_page() would require dax_direct_access() to
> > shift from a pfn lookup to a pgmap + offset lookup.
> 
> That is fair for this series, but maybe the other pgmap users shouldn't be
> converted to use this style of API?

That is also fair...

I think my main stumbling block is getting back to the pgmap from the
DAX code. However, I could move that to a dirty-hack that looks it up
via dax_direct_access()+pfn_to_page(), but keep the core request API
based on pgmap-offset.

I.e. it needs some access to a 0-referenced page, but it moves that wart
out-of-line of the main flow. Some of the non-DAX users might need this
hack as well unless I can figure out how to route back to the pgmap
without interrogating the pfn.

I'll try it as an incremental change on top of the current series to make
things easier for Andrew.
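
A rough sketch of that out-of-line lookup; hypothetical, and it assumes
the current six-argument dax_direct_access() signature:

/* Hypothetical: recover the pgmap behind a dax_device offset via the pfn. */
static struct dev_pagemap *dax_pgoff_to_pgmap(struct dax_device *dax_dev,
					      pgoff_t pgoff)
{
	pfn_t pfn;
	long rc;

	rc = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, NULL, &pfn);
	if (rc < 0)
		return NULL;

	/* Touches a possibly 0-referenced page; kept out of the main flow. */
	return pfn_t_to_page(pfn)->pgmap;
}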

Patch

diff --git a/fs/dax.c b/fs/dax.c
index d03c7a952d02..095c9d7b4a1d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -385,20 +385,27 @@  static inline void dax_mapping_set_cow(struct folio *folio)
 	folio->index++;
 }
 
+static struct dev_pagemap *folio_pgmap(struct folio *folio)
+{
+	return folio_page(folio, 0)->pgmap;
+}
+
 /*
  * When it is called in dax_insert_entry(), the cow flag will indicate that
  * whether this entry is shared by multiple files.  If so, set the page->mapping
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address, bool cow)
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_area_struct *vma,
+				      unsigned long address, bool cow)
 {
 	unsigned long size = dax_entry_size(entry), index;
 	struct folio *folio;
 	int i;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		return 0;
 
 	index = linear_page_index(vma, address & ~(size - 1));
 	dax_for_each_folio(entry, folio, i)
@@ -406,9 +413,13 @@  static void dax_associate_entry(void *entry, struct address_space *mapping,
 			dax_mapping_set_cow(folio);
 		} else {
 			WARN_ON_ONCE(folio->mapping);
+			if (!pgmap_request_folios(folio_pgmap(folio), folio, 1))
+				return VM_FAULT_SIGBUS;
 			folio->mapping = mapping;
 			folio->index = index + i;
 		}
+
+	return 0;
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
@@ -702,9 +713,12 @@  static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
 
 	zap = !dax_is_zapped(entry);
 
-	dax_for_each_folio(entry, folio, i)
+	dax_for_each_folio(entry, folio, i) {
+		if (zap)
+			pgmap_release_folios(folio_pgmap(folio), folio, 1);
 		if (!ret && !dax_folio_idle(folio))
 			ret = folio_page(folio, 0);
+	}
 
 	if (zap)
 		dax_zap_entry(xas, entry);
@@ -934,6 +948,7 @@  static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
 	bool cow = dax_fault_is_cow(iter);
 	void *entry = *pentry;
+	vm_fault_t ret = 0;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -954,8 +969,10 @@  static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+		ret = dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
 				cow);
+		if (ret)
+			goto out;
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -978,10 +995,11 @@  static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (cow)
 		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
+	*pentry = entry;
+out:
 	xas_unlock_irq(xas);
 
-	*pentry = entry;
-	return 0;
+	return ret;
 }
 
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7fcaf3180a5b..b87c16577af1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -193,7 +193,11 @@  void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+				    struct dev_pagemap *pgmap);
+bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios);
+void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -223,6 +227,17 @@  static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	return NULL;
 }
 
+static inline bool pgmap_request_folios(struct dev_pagemap *pgmap,
+					struct folio *folio, int nr_folios)
+{
+	return false;
+}
+
+static inline void pgmap_release_folios(struct dev_pagemap *pgmap,
+					struct folio *folio, int nr_folios)
+{
+}
+
 static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
 {
 	return false;
diff --git a/mm/memremap.c b/mm/memremap.c
index f9287babb3ce..87a649ecdc54 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -530,6 +530,76 @@  void zone_device_page_init(struct page *page)
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);
 
+static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
+			     int nr_folios)
+{
+	unsigned long pfn_start, pfn_end;
+
+	pfn_start = page_to_pfn(folio_page(folio, 0));
+	pfn_end = pfn_start + (1 << folio_order(folio)) * nr_folios - 1;
+
+	if (pgmap != xa_load(&pgmap_array, pfn_start))
+		return false;
+
+	if (pfn_end > pfn_start && pgmap != xa_load(&pgmap_array, pfn_end))
+		return false;
+
+	return true;
+}
+
+/**
+ * pgmap_request_folios - activate a contiguous span of folios in @pgmap
+ * @pgmap: host page map for the folio array
+ * @folio: start of the folio list, all subsequent folios have same folio_size()
+ *
+ * Caller is responsible for @pgmap remaining live for the duration of
+ * this call. Caller is also responsible for not racing requests for the
+ * same folios.
+ */
+bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios)
+{
+	struct folio *iter;
+	int i;
+
+	/*
+	 * All of the WARNs below are for catching bugs in future
+	 * development that changes the assumptions of:
+	 * 1/ uniform folios in @pgmap
+	 * 2/ @pgmap death does not race this routine.
+	 */
+	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
+
+	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
+		return false;
+
+	for (iter = folio_next(folio), i = 1; i < nr_folios;
+	     iter = folio_next(iter), i++)
+		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
+			return false;
+
+	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
+		folio_ref_inc(iter);
+		if (folio_ref_count(iter) == 1)
+			percpu_ref_tryget(&pgmap->ref);
+	}
+
+	return true;
+}
+
+void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
+{
+	struct folio *iter;
+	int i;
+
+	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
+		if (!put_devmap_managed_page(&iter->page))
+			folio_put(iter);
+		if (!folio_ref_count(iter))
+			put_dev_pagemap(pgmap);
+	}
+}
+
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_page_refs(struct page *page, int refs)
 {