diff mbox series

[v8,20/20] device/dax: Properly refcount device dax pages when mapping

Message ID 9d9d33b418dd1aab9323203488305085389f62c1.1739850794.git-series.apopple@nvidia.com
State New
Headers show
Series fs/dax: Fix ZONE_DEVICE page reference counts | expand

Commit Message

Alistair Popple Feb. 18, 2025, 3:55 a.m. UTC
Device DAX pages are currently not reference counted when mapped,
instead relying on the devmap PTE bit to ensure mapping code will not
get/put references. This requires special handling in various page
table walkers, particularly GUP, to manage references on the
underlying pgmap to ensure the pages remain valid.

However there is no reason these pages can't be refcounted properly at
map time. Doning so eliminates the need for the devmap PTE bit,
freeing up a precious PTE bit. It also simplifies GUP as it no longer
needs to manage the special pgmap references and can instead just
treat the pages normally as defined by vm_normal_page().

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/dax/device.c | 15 +++++++++------
 mm/memremap.c        | 13 ++++++-------
 2 files changed, 15 insertions(+), 13 deletions(-)

Comments

Gerald Schaefer Feb. 20, 2025, 6:33 p.m. UTC | #1
On Tue, 18 Feb 2025 14:55:36 +1100
Alistair Popple <apopple@nvidia.com> wrote:

[...]
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 9a8879b..532a52a 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio)
>  {
>  	struct dev_pagemap *pgmap = folio->pgmap;
>  
> -	if (WARN_ON_ONCE(!pgmap->ops))
> -		return;
> -
> -	if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
> -			 !pgmap->ops->page_free))
> +	if (WARN_ON_ONCE((!pgmap->ops &&
> +			  pgmap->type != MEMORY_DEVICE_GENERIC) ||
> +			 (pgmap->ops && !pgmap->ops->page_free &&
> +			  pgmap->type != MEMORY_DEVICE_FS_DAX)))

Playing around with dcssblk, adding devm_memremap_pages() and
pgmap.type = MEMORY_DEVICE_FS_DAX, similar to the other two existing
FS_DAX drivers drivers/nvdimm/pmem.c and fs/fuse/virtio_fs.c, I hit
this warning when executing binaries from DAX-mounted fs.

I do not set up pgmap->ops, similar to fs/fuse/virtio_fs.c, and I don't see
why they would be needed here anyway, at least for MEMORY_DEVICE_FS_DAX.
drivers/nvdimm/pmem.c does set up pgmap->ops, but only ->memory_failure,
which is still good enough to not trigger the warning here, probably just
by chance.

Now I wonder:
1) What is this check / warning good for, when this function only ever
   calls pgmap->ops->page_free(), but not for MEMORY_DEVICE_FS_DAX and
   not for MEMORY_DEVICE_GENERIC (the latter only after this patch)?
2) Is the warning also seen for virtio DAX mappings (added Vivek and
   Stefan on CC)? No pgmap->ops set up there, so I would guess "yes",
   and already before this series, with the old check / warning.
3) Could this be changed to only check / warn if pgmap->ops (or maybe
   rather pgmap->ops->page_free) is not set up, but not for
   MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX where this is not
   being called?
4) Or is there any reason why pgmap->ops would be required for
   MEMORY_DEVICE_FS_DAX?

Apart from the warning, we would also miss out on the
wake_up_var(&folio->page) in the MEMORY_DEVICE_FS_DAX case, when no
pgmap->ops was set up. IIUC, even before this change / series (i.e.
for virtio DAX only, since dcssblk was not using ZONE_DEVICE before,
and pmem seems to work by chance because they have ops->memory_failure).

>  		return;
>  
>  	mem_cgroup_uncharge(folio);
> @@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio)
>  	 * zero which indicating the page has been removed from the file
>  	 * system mapping.
>  	 */
> -	if (pgmap->type != MEMORY_DEVICE_FS_DAX)
> +	if (pgmap->type != MEMORY_DEVICE_FS_DAX &&
> +	    pgmap->type != MEMORY_DEVICE_GENERIC)
>  		folio->mapping = NULL;
>  
>  	switch (pgmap->type) {
> @@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio)
>  		 * Reset the refcount to 1 to prepare for handing out the page
>  		 * again.
>  		 */
> -		pgmap->ops->page_free(folio_page(folio, 0));

Ok, this is probably the reason why you adjusted the check above, since
no more pgmap->ops needed for MEMORY_DEVICE_GENERIC.
Still, the MEMORY_DEVICE_FS_DAX case also does not seem to need
pgmap->ops, and at least the existing virtio DAX should already be
affected, and of course future dcssblk DAX.

But maybe that should be addressed in a separate patch, since your changes
here seem consistent, and not change or worsen anything wrt !pgmap->ops
and MEMORY_DEVICE_FS_DAX.

>  		folio_set_count(folio, 1);
>  		break;
>  

For reference, this is call trace I see when I hit the warning:

[  283.567945] ------------[ cut here ]------------
[  283.567947] WARNING: CPU: 12 PID: 878 at mm/memremap.c:436 free_zone_device_folio+0x6e/0x140
[  283.567959] Modules linked in:
[  283.567963] CPU: 12 UID: 0 PID: 878 Comm: ls Not tainted 6.14.0-rc3-next-20250220-00012-gd072dabf62e8-dirty #44
[  283.567968] Hardware name: IBM 3931 A01 704 (z/VM 7.4.0)
[  283.567971] Krnl PSW : 0704d00180000000 000001ec0548b44a (free_zone_device_folio+0x72/0x140)
[  283.567978]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[  283.567982] Krnl GPRS: 0000000000000038 0000000000000000 0000000000000003 000001ec06cc42e8
[  283.567986]            00000004cc38e400 0000000000000000 0000000000000003 0000000093eacd00
[  283.567990]            000000009a68413f 0000016614010940 000000009553a640 0000016614010940
[  283.567994]            0000000000000000 0000000000000000 000001ec0548b416 0000016c05da3bf8
[  283.568004] Krnl Code: 000001ec0548b43e: a70e0003		chi	%r0,3
                          000001ec0548b442: a7840006		brc	8,000001ec0548b44e
                         #000001ec0548b446: af000000		mc	0,0
                         >000001ec0548b44a: a7f4005f		brc	15,000001ec0548b508
                          000001ec0548b44e: c00400000008	brcl	0,000001ec0548b45e
                          000001ec0548b454: b904002b		lgr	%r2,%r11
                          000001ec0548b458: c0e5001dcd84	brasl	%r14,000001ec05844f60
                          000001ec0548b45e: 9101b01f		tm	31(%r11),1
[  283.568035] Call Trace:
[  283.568038]  [<000001ec0548b44a>] free_zone_device_folio+0x72/0x140 
[  283.568042] ([<000001ec0548b416>] free_zone_device_folio+0x3e/0x140)
[  283.568045]  [<000001ec057a4c1c>] wp_page_copy+0x34c/0x6e0 
[  283.568050]  [<000001ec057ac640>] __handle_mm_fault+0x220/0x4d0 
[  283.568054]  [<000001ec057ac97e>] handle_mm_fault+0x8e/0x160 
[  283.568057]  [<000001ec054ca006>] do_exception+0x1a6/0x450 
[  283.568061]  [<000001ec06264992>] __do_pgm_check+0x132/0x1e0 
[  283.568065]  [<000001ec0627057e>] pgm_check_handler+0x11e/0x170 
[  283.568069] Last Breaking-Event-Address:
[  283.568070]  [<000001ec0548b428>] free_zone_device_folio+0x50/0x140
[  283.568074] ---[ end trace 0000000000000000 ]---
diff mbox series

Patch

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index bc871a3..328231c 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -125,11 +125,12 @@  static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn),
+					vmf->flags & FAULT_FLAG_WRITE);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -168,11 +169,12 @@  static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+	return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
+				vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -213,11 +215,12 @@  static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+	return vmf_insert_folio_pud(vmf, page_folio(pfn_t_to_page(pfn)),
+				vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/mm/memremap.c b/mm/memremap.c
index 9a8879b..532a52a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -460,11 +460,10 @@  void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
 
-	if (WARN_ON_ONCE(!pgmap->ops))
-		return;
-
-	if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
-			 !pgmap->ops->page_free))
+	if (WARN_ON_ONCE((!pgmap->ops &&
+			  pgmap->type != MEMORY_DEVICE_GENERIC) ||
+			 (pgmap->ops && !pgmap->ops->page_free &&
+			  pgmap->type != MEMORY_DEVICE_FS_DAX)))
 		return;
 
 	mem_cgroup_uncharge(folio);
@@ -494,7 +493,8 @@  void free_zone_device_folio(struct folio *folio)
 	 * zero which indicating the page has been removed from the file
 	 * system mapping.
 	 */
-	if (pgmap->type != MEMORY_DEVICE_FS_DAX)
+	if (pgmap->type != MEMORY_DEVICE_FS_DAX &&
+	    pgmap->type != MEMORY_DEVICE_GENERIC)
 		folio->mapping = NULL;
 
 	switch (pgmap->type) {
@@ -509,7 +509,6 @@  void free_zone_device_folio(struct folio *folio)
 		 * Reset the refcount to 1 to prepare for handing out the page
 		 * again.
 		 */
-		pgmap->ops->page_free(folio_page(folio, 0));
 		folio_set_count(folio, 1);
 		break;