diff mbox series

[v2,15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()

Message ID 166329939733.2786261.13946962468817639563.stgit@dwillia2-xfh.jf.intel.com (mailing list archive)
State New, archived
Headers show
Series Fix the DAX-gup mistake | expand

Commit Message

Dan Williams Sept. 16, 2022, 3:36 a.m. UTC
Track entries and take pgmap references at mapping insertion time.
Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
references at device destruction or inode eviction time. With this in
place, and the fsdax equivalent already in place, the gup code no longer
needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
before taking a page reference.

In other words, GUP takes additional references on mapped pages. Until
now, DAX in all its forms was failing to take references at mapping
time. With that fixed there is no longer a requirement for gup to manage
@pgmap references. However, that cleanup is saved for a follow-on patch.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/bus.c     |   15 +++++++++-
 drivers/dax/device.c  |   73 +++++++++++++++++++++++++++++--------------------
 drivers/dax/mapping.c |    3 ++
 3 files changed, 60 insertions(+), 31 deletions(-)

Comments

Jason Gunthorpe Sept. 21, 2022, 2:10 p.m. UTC | #1
On Thu, Sep 15, 2022 at 08:36:37PM -0700, Dan Williams wrote:
> Track entries and take pgmap references at mapping insertion time.
> Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
> references at device destruction or inode eviction time. With this in
> place, and the fsdax equivalent already in place, the gup code no longer
> needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
> before taking a page reference.
> 
> In other words, GUP takes additional references on mapped pages. Until
> now, DAX in all its forms was failing to take references at mapping
> time. With that fixed there is no longer a requirement for gup to manage
> @pgmap references. However, that cleanup is saved for a follow-on patch.

A page->pgmap must be valid and stable so long as a page has a
positive refcount. Once we fixed the refcount GUP is automatically
fine. So this explanation seems confusing.

If dax code needs other changes to maintain that invarient it should
be spelled out what those are and why, but the instant we fix the
refcount we can delete the stuff in gup.c and everywhere else.

Jason
Dan Williams Sept. 21, 2022, 3:48 p.m. UTC | #2
Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:37PM -0700, Dan Williams wrote:
> > Track entries and take pgmap references at mapping insertion time.
> > Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
> > references at device destruction or inode eviction time. With this in
> > place, and the fsdax equivalent already in place, the gup code no longer
> > needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
> > before taking a page reference.
> > 
> > In other words, GUP takes additional references on mapped pages. Until
> > now, DAX in all its forms was failing to take references at mapping
> > time. With that fixed there is no longer a requirement for gup to manage
> > @pgmap references. However, that cleanup is saved for a follow-on patch.
> 
> A page->pgmap must be valid and stable so long as a page has a
> positive refcount. Once we fixed the refcount GUP is automatically
> fine. So this explanation seems confusing.

I think while trying to describe the history I made this patch
description confusing.

> If dax code needs other changes to maintain that invarient it should
> be spelled out what those are and why, but the instant we fix the
> refcount we can delete the stuff in gup.c and everywhere else.

How about the following, note that this incorporates new changes I have
in flight after Dave pointed out the problem DAX has with page pins
versus inode lifetime:

---

The fsdax core now manages pgmap references when servicing faults that
install new mappings, and elevates the page reference until it is
zapped. It coordinates with the VFS to make sure that all page
references are dropped before the hosting inode goes out of scope
(iput_final()).

In order to delete the unnecessary pgmap reference taking in mm/gup.c
devdax needs to move to the same model.
Jason Gunthorpe Sept. 21, 2022, 10:23 p.m. UTC | #3
On Wed, Sep 21, 2022 at 08:48:22AM -0700, Dan Williams wrote:

> The fsdax core now manages pgmap references when servicing faults that
> install new mappings, and elevates the page reference until it is
> zapped. It coordinates with the VFS to make sure that all page
> references are dropped before the hosting inode goes out of scope
> (iput_final()).
>
> In order to delete the unnecessary pgmap reference taking in mm/gup.c
> devdax needs to move to the same model.

I think this patch is more about making devdax and fsdax use the same
set of functions and logic so that when it gets to patch 16/17 devdax
doesn't break. That understanding matches the first paragraph, at
least.

I would delete the remark about gup since it is really patch 17 that
allows gup to be fixed by making it so that refcount == 0 means not to
look at the pgmap (instead of refcount == 1 as is now) ?

Jason
Dan Williams Sept. 22, 2022, 12:15 a.m. UTC | #4
Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:48:22AM -0700, Dan Williams wrote:
> 
> > The fsdax core now manages pgmap references when servicing faults that
> > install new mappings, and elevates the page reference until it is
> > zapped. It coordinates with the VFS to make sure that all page
> > references are dropped before the hosting inode goes out of scope
> > (iput_final()).
> >
> > In order to delete the unnecessary pgmap reference taking in mm/gup.c
> > devdax needs to move to the same model.
> 
> I think this patch is more about making devdax and fsdax use the same
> set of functions and logic so that when it gets to patch 16/17 devdax
> doesn't break. That understanding matches the first paragraph, at
> least.
> 
> I would delete the remark about gup since it is really patch 17 that
> allows gup to be fixed by making it so that refcount == 0 means not to
> look at the pgmap (instead of refcount == 1 as is now) ?

Yeah, makes sense.
diff mbox series

Patch

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..35a319a76c82 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -382,9 +382,22 @@  void kill_dev_dax(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct inode *inode = dax_inode(dax_dev);
+	struct page *page;
 
 	kill_dax(dax_dev);
-	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+
+	/*
+	 * New mappings are blocked. Wait for all GUP users to release
+	 * their pins.
+	 */
+	do {
+		page = dax_zap_mappings(inode->i_mapping);
+		if (!page)
+			break;
+		__wait_var_event(page, dax_page_idle(page));
+	} while (true);
+
+	truncate_inode_pages(inode->i_mapping, 0);
 
 	/*
 	 * Dynamic dax region have the pgmap allocated via dev_kzalloc()
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..7f306939807e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -73,38 +73,15 @@  __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
-			      unsigned long fault_size)
-{
-	unsigned long i, nr_pages = fault_size / PAGE_SIZE;
-	struct file *filp = vmf->vma->vm_file;
-	struct dev_dax *dev_dax = filp->private_data;
-	pgoff_t pgoff;
-
-	/* mapping is only set on the head */
-	if (dev_dax->pgmap->vmemmap_shift)
-		nr_pages = 1;
-
-	pgoff = linear_page_index(vmf->vma,
-			ALIGN(vmf->address, fault_size));
-
-	for (i = 0; i < nr_pages; i++) {
-		struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-
-		page = compound_head(page);
-		if (page->mapping)
-			continue;
-
-		page->mapping = filp->f_mapping;
-		page->index = pgoff + i;
-	}
-}
-
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
@@ -128,7 +105,16 @@  static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
@@ -136,10 +122,14 @@  static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
@@ -171,7 +161,16 @@  static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PMD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -180,10 +179,14 @@  static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
@@ -216,7 +219,16 @@  static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PUD_ORDER);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PUD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -494,3 +506,4 @@  MODULE_LICENSE("GPL v2");
 module_init(dax_init);
 module_exit(dax_exit);
 MODULE_ALIAS_DAX_DEVICE(0);
+MODULE_IMPORT_NS(DAX);
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index b5a5196f8831..9981eebb2dc5 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -266,6 +266,7 @@  void dax_unlock_entry(struct xa_state *xas, void *entry)
 	WARN_ON(!dax_is_locked(old));
 	dax_wake_entry(xas, entry, WAKE_NEXT);
 }
+EXPORT_SYMBOL_NS_GPL(dax_unlock_entry, DAX);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -666,6 +667,7 @@  void *dax_grab_mapping_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL_NS_GPL(dax_grab_mapping_entry, DAX);
 
 static void *dax_zap_entry(struct xa_state *xas, void *entry)
 {
@@ -910,6 +912,7 @@  vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	*pentry = entry;
 	return 0;
 }
+EXPORT_SYMBOL_NS_GPL(dax_insert_entry, DAX);
 
 int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		      struct address_space *mapping, void *entry)