Message ID | 9d9d33b418dd1aab9323203488305085389f62c1.1739850794.git-series.apopple@nvidia.com |
---|---|
State | New |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
On Tue, 18 Feb 2025 14:55:36 +1100 Alistair Popple <apopple@nvidia.com> wrote: [...] > diff --git a/mm/memremap.c b/mm/memremap.c > index 9a8879b..532a52a 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio) > { > struct dev_pagemap *pgmap = folio->pgmap; > > - if (WARN_ON_ONCE(!pgmap->ops)) > - return; > - > - if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX && > - !pgmap->ops->page_free)) > + if (WARN_ON_ONCE((!pgmap->ops && > + pgmap->type != MEMORY_DEVICE_GENERIC) || > + (pgmap->ops && !pgmap->ops->page_free && > + pgmap->type != MEMORY_DEVICE_FS_DAX))) Playing around with dcssblk, adding devm_memremap_pages() and pgmap.type = MEMORY_DEVICE_FS_DAX, similar to the other two existing FS_DAX drivers drivers/nvdimm/pmem.c and fs/fuse/virtio_fs.c, I hit this warning when executing binaries from DAX-mounted fs. I do not set up pgmap->ops, similar to fs/fuse/virtio_fs.c, and I don't see why they would be needed here anyway, at least for MEMORY_DEVICE_FS_DAX. drivers/nvdimm/pmem.c does set up pgmap->ops, but only ->memory_failure, which is still good enough to not trigger the warning here, probably just by chance. Now I wonder: 1) What is this check / warning good for, when this function only ever calls pgmap->ops->page_free(), but not for MEMORY_DEVICE_FS_DAX and not for MEMORY_DEVICE_GENERIC (the latter only after this patch)? 2) Is the warning also seen for virtio DAX mappings (added Vivek and Stefan on CC)? No pgmap->ops set up there, so I would guess "yes", and already before this series, with the old check / warning. 3) Could this be changed to only check / warn if pgmap->ops (or maybe rather pgmap->ops->page_free) is not set up, but not for MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX where this is not being called? 4) Or is there any reason why pgmap->ops would be required for MEMORY_DEVICE_FS_DAX? Apart from the warning, we would also miss out on the wake_up_var(&folio->page) in the MEMORY_DEVICE_FS_DAX case, when no pgmap->ops was set up. IIUC, even before this change / series (i.e. for virtio DAX only, since dcssblk was not using ZONE_DEVICE before, and pmem seems to work by chance because they have ops->memory_failure). > return; > > mem_cgroup_uncharge(folio); > @@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio) > * zero which indicating the page has been removed from the file > * system mapping. > */ > - if (pgmap->type != MEMORY_DEVICE_FS_DAX) > + if (pgmap->type != MEMORY_DEVICE_FS_DAX && > + pgmap->type != MEMORY_DEVICE_GENERIC) > folio->mapping = NULL; > > switch (pgmap->type) { > @@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio) > * Reset the refcount to 1 to prepare for handing out the page > * again. > */ > - pgmap->ops->page_free(folio_page(folio, 0)); Ok, this is probably the reason why you adjusted the check above, since no more pgmap->ops needed for MEMORY_DEVICE_GENERIC. Still, the MEMORY_DEVICE_FS_DAX case also does not seem to need pgmap->ops, and at least the existing virtio DAX should already be affected, and of course future dcssblk DAX. But maybe that should be addressed in a separate patch, since your changes here seem consistent, and not change or worsen anything wrt !pgmap->ops and MEMORY_DEVICE_FS_DAX. > folio_set_count(folio, 1); > break; > For reference, this is call trace I see when I hit the warning: [ 283.567945] ------------[ cut here ]------------ [ 283.567947] WARNING: CPU: 12 PID: 878 at mm/memremap.c:436 free_zone_device_folio+0x6e/0x140 [ 283.567959] Modules linked in: [ 283.567963] CPU: 12 UID: 0 PID: 878 Comm: ls Not tainted 6.14.0-rc3-next-20250220-00012-gd072dabf62e8-dirty #44 [ 283.567968] Hardware name: IBM 3931 A01 704 (z/VM 7.4.0) [ 283.567971] Krnl PSW : 0704d00180000000 000001ec0548b44a (free_zone_device_folio+0x72/0x140) [ 283.567978] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [ 283.567982] Krnl GPRS: 0000000000000038 0000000000000000 0000000000000003 000001ec06cc42e8 [ 283.567986] 00000004cc38e400 0000000000000000 0000000000000003 0000000093eacd00 [ 283.567990] 000000009a68413f 0000016614010940 000000009553a640 0000016614010940 [ 283.567994] 0000000000000000 0000000000000000 000001ec0548b416 0000016c05da3bf8 [ 283.568004] Krnl Code: 000001ec0548b43e: a70e0003 chi %r0,3 000001ec0548b442: a7840006 brc 8,000001ec0548b44e #000001ec0548b446: af000000 mc 0,0 >000001ec0548b44a: a7f4005f brc 15,000001ec0548b508 000001ec0548b44e: c00400000008 brcl 0,000001ec0548b45e 000001ec0548b454: b904002b lgr %r2,%r11 000001ec0548b458: c0e5001dcd84 brasl %r14,000001ec05844f60 000001ec0548b45e: 9101b01f tm 31(%r11),1 [ 283.568035] Call Trace: [ 283.568038] [<000001ec0548b44a>] free_zone_device_folio+0x72/0x140 [ 283.568042] ([<000001ec0548b416>] free_zone_device_folio+0x3e/0x140) [ 283.568045] [<000001ec057a4c1c>] wp_page_copy+0x34c/0x6e0 [ 283.568050] [<000001ec057ac640>] __handle_mm_fault+0x220/0x4d0 [ 283.568054] [<000001ec057ac97e>] handle_mm_fault+0x8e/0x160 [ 283.568057] [<000001ec054ca006>] do_exception+0x1a6/0x450 [ 283.568061] [<000001ec06264992>] __do_pgm_check+0x132/0x1e0 [ 283.568065] [<000001ec0627057e>] pgm_check_handler+0x11e/0x170 [ 283.568069] Last Breaking-Event-Address: [ 283.568070] [<000001ec0548b428>] free_zone_device_folio+0x50/0x140 [ 283.568074] ---[ end trace 0000000000000000 ]---
diff --git a/drivers/dax/device.c b/drivers/dax/device.c index bc871a3..328231c 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -125,11 +125,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); + pfn = phys_to_pfn_t(phys, 0); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_mixed(vmf->vma, vmf->address, pfn); + return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), + vmf->flags & FAULT_FLAG_WRITE); } static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, @@ -168,11 +169,12 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); + pfn = phys_to_pfn_t(phys, 0); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); + return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)), + vmf->flags & FAULT_FLAG_WRITE); } #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD @@ -213,11 +215,12 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); + pfn = phys_to_pfn_t(phys, 0); dax_set_mapping(vmf, pfn, fault_size); - return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); + return vmf_insert_folio_pud(vmf, page_folio(pfn_t_to_page(pfn)), + vmf->flags & FAULT_FLAG_WRITE); } #else static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, diff --git a/mm/memremap.c b/mm/memremap.c index 9a8879b..532a52a 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio) { struct dev_pagemap *pgmap = folio->pgmap; - if (WARN_ON_ONCE(!pgmap->ops)) - return; - - if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX && - !pgmap->ops->page_free)) + if (WARN_ON_ONCE((!pgmap->ops && + pgmap->type != MEMORY_DEVICE_GENERIC) || + (pgmap->ops && !pgmap->ops->page_free && + pgmap->type != MEMORY_DEVICE_FS_DAX))) return; mem_cgroup_uncharge(folio); @@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio) * zero which indicating the page has been removed from the file * system mapping. */ - if (pgmap->type != MEMORY_DEVICE_FS_DAX) + if (pgmap->type != MEMORY_DEVICE_FS_DAX && + pgmap->type != MEMORY_DEVICE_GENERIC) folio->mapping = NULL; switch (pgmap->type) { @@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio) * Reset the refcount to 1 to prepare for handing out the page * again. */ - pgmap->ops->page_free(folio_page(folio, 0)); folio_set_count(folio, 1); break;
Device DAX pages are currently not reference counted when mapped, instead relying on the devmap PTE bit to ensure mapping code will not get/put references. This requires special handling in various page table walkers, particularly GUP, to manage references on the underlying pgmap to ensure the pages remain valid. However there is no reason these pages can't be refcounted properly at map time. Doning so eliminates the need for the devmap PTE bit, freeing up a precious PTE bit. It also simplifies GUP as it no longer needs to manage the special pgmap references and can instead just treat the pages normally as defined by vm_normal_page(). Signed-off-by: Alistair Popple <apopple@nvidia.com> --- drivers/dax/device.c | 15 +++++++++------ mm/memremap.c | 13 ++++++------- 2 files changed, 15 insertions(+), 13 deletions(-)