Message ID | cover.425da7c4e76c2749d0ad1734f972b06114e02d52.1736221254.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
Series | fs/dax: Fix ZONE_DEVICE page reference counts |
On Tue, 7 Jan 2025 14:42:16 +1100 Alistair Popple <apopple@nvidia.com> wrote:

> Device and FS DAX pages have always maintained their own page
> reference counts without following the normal rules for page reference
> counting. In particular pages are considered free when the refcount
> hits one rather than zero and refcounts are not added when mapping the
> page.
>
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
>
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW-defined PTE bit on architectures
> that have devmap PTE bits defined.
>
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvement. It also
> enables support for compound ZONE_DEVICE pages which is one of my
> primary motivators for doing this work.
>

https://lkml.kernel.org/r/wysuus23bqmjtwkfu3zutqtmkse3ki3erf45x32yezlrl24qto@xlqt7qducyld
made me expect merge/build/runtime issues, however this series merges
and builds OK on mm-unstable. Did something change? What's the story
here?

Oh well, it built so I'll ship it!
Andrew Morton wrote:
> On Tue, 7 Jan 2025 14:42:16 +1100 Alistair Popple <apopple@nvidia.com> wrote:
>
> > Device and FS DAX pages have always maintained their own page
> > reference counts without following the normal rules for page reference
> > counting. In particular pages are considered free when the refcount
> > hits one rather than zero and refcounts are not added when mapping the
> > page.
> >
> > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > mechanism for allowing GUP to hold references on the page (see
> > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > DAX pages need their own reference counting scheme.
> >
> > By treating the refcounts on these pages the same way as normal pages
> > we can remove a lot of special checks. In particular pXd_trans_huge()
> > becomes the same as pXd_leaf(), although I haven't made that change
> > here. It also frees up a valuable SW-defined PTE bit on architectures
> > that have devmap PTE bits defined.
> >
> > It also almost certainly allows further clean-up of the devmap managed
> > functions, but I have left that as a future improvement. It also
> > enables support for compound ZONE_DEVICE pages which is one of my
> > primary motivators for doing this work.
> >
>
> https://lkml.kernel.org/r/wysuus23bqmjtwkfu3zutqtmkse3ki3erf45x32yezlrl24qto@xlqt7qducyld
> made me expect merge/build/runtime issues, however this series merges
> and builds OK on mm-unstable. Did something change? What's the story
> here?
>
> Oh well, it built so I'll ship it!

So my plan is to review this latest set on top of -next as is and then
rebase (or ask Alistair to rebase) on a mainline tag so I can identify
the merge conflicts with -mm and communicate those to Linus.

I will double check that you have pulled these back out of mm-unstable
before doing that to avoid double-commit conflicts in -next, but for
now exposure in mm-unstable is good to flush out issues.
On Tue, Jan 07, 2025 at 02:42:16PM +1100, Alistair Popple wrote:
> Main updates since v4:
>
> - Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
>   means smaps/pagemap may contain DAX pages.
>
> - Fixed rmap accounting of PUD mapped pages.
>
> - Minor code clean-ups.
>
> Main updates since v3:
>
> - Rebased onto next-20241216.

Hi Alistair-

This set passes the ndctl/dax unit tests when applied to next-20241216

Tested-by: Alison Schofield <alison.schofield@intel.com>

-- snip
Main updates since v4:

- Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
  means smaps/pagemap may contain DAX pages.

- Fixed rmap accounting of PUD mapped pages.

- Minor code clean-ups.

Main updates since v3:

- Rebased onto next-20241216. The rebase wasn't too difficult, but in
  the interests of getting this out sooner for Andrew to look at as
  requested by him I have yet to extensively build/run test this
  version of the series.

- Fixed a bunch of build breakages reported by John Hubbard and the
  kernel test robot due to various combinations of CONFIG options.

- Split the rmap changes into a separate patch as suggested by David H.

- Reworded the description for the P2PDMA change.

Main updates since v2:

- Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
  and have them pass the vmf struct.

- Separate out the device DAX changes.

- Restore the page share mapping counting and associated warnings.

- Rework truncate to require file-systems to have previously called
  dax_break_layout() to remove the address space mapping for a
  page. This found several bugs which are fixed by the first half of
  the series. The motivation for this was initially to allow the FS
  DAX page-cache mappings to hold a reference on the page.

  However that turned out to be a dead-end (see the comments on patch
  21), but it found several bugs and I think overall it is an
  improvement so I have left it here.

Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero and refcounts are not added when mapping the
page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.
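As an illustration only (this is not code from the series), the two refcounting conventions contrasted above can be modelled in a few lines of userspace C. `struct toy_page` and every `toy_*` helper are hypothetical names invented for this sketch; they do not correspond to kernel structures or APIs, and the model ignores locking, atomics and folios entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page. NOT the kernel's struct page. */
struct toy_page {
	int refcount;
};

/*
 * Legacy DAX convention as described in the cover letter: the page is
 * considered free when the refcount is one, and mapping it does not
 * take a reference.
 */
bool toy_legacy_is_free(const struct toy_page *p)
{
	return p->refcount == 1;
}

void toy_legacy_map(struct toy_page *p)
{
	(void)p;	/* no reference taken on map */
}

/* A GUP-style pin does take a reference under either convention. */
void toy_pin(struct toy_page *p)
{
	p->refcount++;
}

void toy_unpin(struct toy_page *p)
{
	p->refcount--;
}

/*
 * Normal convention: the page is free when the refcount is zero, and
 * each mapping holds its own reference.
 */
bool toy_normal_is_free(const struct toy_page *p)
{
	return p->refcount == 0;
}

void toy_normal_map(struct toy_page *p)
{
	p->refcount++;
}

void toy_normal_unmap(struct toy_page *p)
{
	p->refcount--;
}
```

In this model a mapped legacy page still looks free by its refcount alone, so mapped state has to be tracked somewhere else, whereas under the normal convention the mapping reference keeps the count above zero by itself.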
By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW-defined PTE bit on architectures
that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Cc: lina@asahilina.net
Cc: zhang.lyra@gmail.com
Cc: gerald.schaefer@linux.ibm.com
Cc: dan.j.williams@intel.com
Cc: vishal.l.verma@intel.com
Cc: dave.jiang@intel.com
Cc: logang@deltatee.com
Cc: bhelgaas@google.com
Cc: jack@suse.cz
Cc: jgg@ziepe.ca
Cc: catalin.marinas@arm.com
Cc: will@kernel.org
Cc: mpe@ellerman.id.au
Cc: npiggin@gmail.com
Cc: dave.hansen@linux.intel.com
Cc: ira.weiny@intel.com
Cc: willy@infradead.org
Cc: djwong@kernel.org
Cc: tytso@mit.edu
Cc: linmiaohe@huawei.com
Cc: david@redhat.com
Cc: peterx@redhat.com
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: jhubbard@nvidia.com
Cc: hch@lst.de
Cc: david@fromorbit.com

Alistair Popple (25):
  fuse: Fix dax truncate/punch_hole fault path
  fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
  fs/dax: Don't skip locked entries when scanning entries
  fs/dax: Refactor wait for dax idle page
  fs/dax: Create a common implementation to break DAX layouts
  fs/dax: Always remove DAX page-cache entries when breaking layouts
  fs/dax: Ensure all pages are idle prior to filesystem unmount
  fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
  mm/gup: Remove redundant check for PCI P2PDMA page
  mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
  mm: Allow compound zone device pages
  mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
  mm/memory: Add vmf_insert_page_mkwrite()
  rmap: Add support for PUD sized mappings to rmap
  huge_memory: Add vmf_insert_folio_pud()
  huge_memory: Add vmf_insert_folio_pmd()
  memremap: Add is_devdax_page() and is_fsdax_page() helpers
  mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  proc/task_mmu: Mark devdax and fsdax pages as always unpinned
  mm/mlock: Skip ZONE_DEVICE PMDs during mlock
  fs/dax: Properly refcount fs dax pages
  device/dax: Properly refcount device dax pages when mapping
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits
  Revert "riscv: mm: Add support for ZONE_DEVICE"

 Documentation/mm/arch_pgtable_helpers.rst     |   6 +-
 arch/arm64/Kconfig                            |   1 +-
 arch/arm64/include/asm/pgtable-prot.h         |   1 +-
 arch/arm64/include/asm/pgtable.h              |  24 +-
 arch/powerpc/Kconfig                          |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  52 +---
 arch/powerpc/include/asm/book3s/64/radix.h    |  14 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c            |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   5 +-
 arch/powerpc/mm/pgtable.c                     |   2 +-
 arch/riscv/Kconfig                            |   1 +-
 arch/riscv/include/asm/pgtable-64.h           |  20 +-
 arch/riscv/include/asm/pgtable-bits.h         |   1 +-
 arch/riscv/include/asm/pgtable.h              |  17 +-
 arch/x86/Kconfig                              |   1 +-
 arch/x86/include/asm/pgtable.h                |  51 +---
 arch/x86/include/asm/pgtable_types.h          |   5 +-
 drivers/dax/device.c                          |  15 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |   3 +-
 drivers/nvdimm/pmem.c                         |   4 +-
 drivers/pci/p2pdma.c                          |  19 +-
 fs/dax.c                                      | 363 ++++++++++++++-----
 fs/ext4/inode.c                               |  43 +--
 fs/fuse/dax.c                                 |  35 +--
 fs/fuse/virtio_fs.c                           |   3 +-
 fs/proc/task_mmu.c                            |   2 +-
 fs/userfaultfd.c                              |   2 +-
 fs/xfs/xfs_inode.c                            |  40 +-
 fs/xfs/xfs_inode.h                            |   3 +-
 fs/xfs/xfs_super.c                            |  18 +-
 include/linux/dax.h                           |  37 ++-
 include/linux/huge_mm.h                       |  12 +-
 include/linux/memremap.h                      |  28 +-
 include/linux/migrate.h                       |   4 +-
 include/linux/mm.h                            |  40 +--
 include/linux/mm_types.h                      |  14 +-
 include/linux/mmzone.h                        |  12 +-
 include/linux/page-flags.h                    |   6 +-
 include/linux/pfn_t.h                         |  20 +-
 include/linux/pgtable.h                       |  21 +-
 include/linux/rmap.h                          |  15 +-
 lib/test_hmm.c                                |   3 +-
 mm/Kconfig                                    |   4 +-
 mm/debug_vm_pgtable.c                         |  59 +---
 mm/gup.c                                      | 176 +---------
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              | 220 +++++++-----
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   2 +-
 mm/madvise.c                                  |   8 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memory-failure.c                           |   6 +-
 mm/memory.c                                   | 118 ++++--
 mm/memremap.c                                 |  59 +--
 mm/migrate_device.c                           |   9 +-
 mm/mlock.c                                    |   2 +-
 mm/mm_init.c                                  |  23 +-
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   5 +-
 mm/page_vma_mapped.c                          |   5 +-
 mm/pagewalk.c                                 |  14 +-
 mm/pgtable-generic.c                          |   7 +-
 mm/rmap.c                                     |  65 ++-
 mm/swap.c                                     |   2 +-
 mm/truncate.c                                 |  16 +-
 mm/userfaultfd.c                              |   5 +-
 mm/vmscan.c                                   |   5 +-
 70 files changed, 889 insertions(+), 929 deletions(-)

base-commit: e25c8d66f6786300b680866c0e0139981273feba