
[v6,00/26] fs/dax: Fix ZONE_DEVICE page reference counts

Message ID cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com

Message

Alistair Popple Jan. 10, 2025, 6 a.m. UTC
Main updates since v5:

 - Reworked patch 1 based on Dan's feedback.

 - Fixed build issues on PPC and when CONFIG_PGTABLE_HAS_HUGE_LEAVES
   is not defined.

 - Minor comment formatting and documentation fixes.

 - Remove PTE_DEVMAP definitions from Loongarch which were added since
   this series was initially written.

Main updates since v4:

 - Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
   means smaps/pagemap may contain DAX pages.

 - Fixed rmap accounting of PUD mapped pages.

 - Minor code clean-ups.

Main updates since v3:

 - Rebased onto next-20241216. The rebase wasn't too difficult, but in
   the interest of getting this out sooner for Andrew to look at, as he
   requested, I have yet to extensively build and run-test this version
   of the series.

 - Fixed a bunch of build breakages reported by John Hubbard and the
   kernel test robot due to various combinations of CONFIG options.

 - Split the rmap changes into a separate patch as suggested by David H.

 - Reworded the description for the P2PDMA change.

Main updates since v2:

 - Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
   and have them pass the vmf struct.

 - Separate out the device DAX changes.

 - Restore the page share mapping counting and associated warnings.

 - Rework truncate to require file-systems to have previously called
   dax_break_layout() to remove the address space mapping for a
   page. This found several bugs which are fixed by the first half of
   the series. The motivation for this was initially to allow the FS
   DAX page-cache mappings to hold a reference on the page.

   However that turned out to be a dead-end (see the comments on patch
   21), but it found several bugs and I think overall it is an
   improvement so I have left it here.

Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero and refcounts are not added when mapping the
page.
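
To make the difference between the two conventions concrete, here is a
minimal userspace sketch. It is a toy model only: struct toy_page and every
helper below are hypothetical names made up for illustration, not kernel
APIs, and the real code paths are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; hypothetical, not the kernel's struct page. */
struct toy_page {
	int refcount;
};

/* Old DAX convention: an idle page rests at refcount 1. */
static inline bool toy_old_dax_page_is_free(const struct toy_page *p)
{
	return p->refcount == 1;
}

/* Old DAX convention: mapping the page takes no reference. */
static inline void toy_old_dax_map(struct toy_page *p)
{
	(void)p;
}

/* Normal convention: a page is free once the refcount reaches 0. */
static inline bool toy_normal_page_is_free(const struct toy_page *p)
{
	return p->refcount == 0;
}

/* Normal convention: each mapping holds a reference. */
static inline void toy_normal_map(struct toy_page *p)
{
	p->refcount++;
}

static inline void toy_normal_unmap(struct toy_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}
```

In this model a page mapped under the old convention still looks idle by
its refcount alone, which is why a separate tracking mechanism is needed.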

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable software-defined PTE bit on
architectures that have devmap PTE bits defined.
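
As a rough illustration of why the two predicates converge, consider the
toy model below. The bit positions, the TOY_* names and the toy_pxd_*()
helpers are all invented for this sketch; each architecture defines its own
layout, and this is not meant to mirror any of them exactly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_LEAF   (1u << 0)	/* entry maps a huge page directly */
#define TOY_DEVMAP (1u << 1)	/* software bit marking a ZONE_DEVICE mapping */

static inline bool toy_pxd_leaf(uint32_t entry)
{
	return (entry & TOY_LEAF) != 0;
}

static inline bool toy_pxd_devmap(uint32_t entry)
{
	return (entry & TOY_DEVMAP) != 0;
}

/* With a devmap bit: a transparent huge entry is a leaf that is not devmap. */
static inline bool toy_pxd_trans_huge_old(uint32_t entry)
{
	return (entry & (TOY_LEAF | TOY_DEVMAP)) == TOY_LEAF;
}

/* Without a devmap bit the check collapses to the plain leaf test. */
static inline bool toy_pxd_trans_huge_new(uint32_t entry)
{
	return toy_pxd_leaf(entry);
}
```

Once no entry ever carries TOY_DEVMAP, the old and new predicates agree on
every entry, and the bit itself is free for other uses.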

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>

---

Cc: lina@asahilina.net
Cc: zhang.lyra@gmail.com
Cc: gerald.schaefer@linux.ibm.com
Cc: dan.j.williams@intel.com
Cc: vishal.l.verma@intel.com
Cc: dave.jiang@intel.com
Cc: logang@deltatee.com
Cc: bhelgaas@google.com
Cc: jack@suse.cz
Cc: jgg@ziepe.ca
Cc: catalin.marinas@arm.com
Cc: will@kernel.org
Cc: mpe@ellerman.id.au
Cc: npiggin@gmail.com
Cc: dave.hansen@linux.intel.com
Cc: ira.weiny@intel.com
Cc: willy@infradead.org
Cc: djwong@kernel.org
Cc: tytso@mit.edu
Cc: linmiaohe@huawei.com
Cc: david@redhat.com
Cc: peterx@redhat.com
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: jhubbard@nvidia.com
Cc: hch@lst.de
Cc: david@fromorbit.com
Cc: chenhuacai@kernel.org
Cc: kernel@xen0n.name
Cc: loongarch@lists.linux.dev

Alistair Popple (26):
  fuse: Fix dax truncate/punch_hole fault path
  fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
  fs/dax: Don't skip locked entries when scanning entries
  fs/dax: Refactor wait for dax idle page
  fs/dax: Create a common implementation to break DAX layouts
  fs/dax: Always remove DAX page-cache entries when breaking layouts
  fs/dax: Ensure all pages are idle prior to filesystem unmount
  fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
  mm/gup: Remove redundant check for PCI P2PDMA page
  mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
  mm: Allow compound zone device pages
  mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
  mm/memory: Add vmf_insert_page_mkwrite()
  rmap: Add support for PUD sized mappings to rmap
  huge_memory: Add vmf_insert_folio_pud()
  huge_memory: Add vmf_insert_folio_pmd()
  memremap: Add is_devdax_page() and is_fsdax_page() helpers
  mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  proc/task_mmu: Mark devdax and fsdax pages as always unpinned
  mm/mlock: Skip ZONE_DEVICE PMDs during mlock
  fs/dax: Properly refcount fs dax pages
  device/dax: Properly refcount device dax pages when mapping
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits
  Revert "riscv: mm: Add support for ZONE_DEVICE"
  Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support"

 Documentation/mm/arch_pgtable_helpers.rst     |   6 +-
 arch/arm64/Kconfig                            |   1 +-
 arch/arm64/include/asm/pgtable-prot.h         |   1 +-
 arch/arm64/include/asm/pgtable.h              |  24 +-
 arch/loongarch/Kconfig                        |   1 +-
 arch/loongarch/include/asm/pgtable-bits.h     |   6 +-
 arch/loongarch/include/asm/pgtable.h          |  19 +-
 arch/powerpc/Kconfig                          |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  53 +---
 arch/powerpc/include/asm/book3s/64/radix.h    |  14 +-
 arch/powerpc/mm/book3s64/hash_hugepage.c      |   2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   3 +-
 arch/powerpc/mm/book3s64/hugetlbpage.c        |   2 +-
 arch/powerpc/mm/book3s64/pgtable.c            |  10 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   5 +-
 arch/powerpc/mm/pgtable.c                     |   2 +-
 arch/riscv/Kconfig                            |   1 +-
 arch/riscv/include/asm/pgtable-64.h           |  20 +-
 arch/riscv/include/asm/pgtable-bits.h         |   1 +-
 arch/riscv/include/asm/pgtable.h              |  17 +-
 arch/x86/Kconfig                              |   1 +-
 arch/x86/include/asm/pgtable.h                |  51 +---
 arch/x86/include/asm/pgtable_types.h          |   5 +-
 drivers/dax/device.c                          |  15 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |   3 +-
 drivers/nvdimm/pmem.c                         |   4 +-
 drivers/pci/p2pdma.c                          |  19 +-
 fs/dax.c                                      | 363 ++++++++++++++-----
 fs/ext4/inode.c                               |  43 +--
 fs/fuse/dax.c                                 |  30 +--
 fs/fuse/dir.c                                 |   2 +-
 fs/fuse/file.c                                |   4 +-
 fs/fuse/virtio_fs.c                           |   3 +-
 fs/proc/task_mmu.c                            |   2 +-
 fs/userfaultfd.c                              |   2 +-
 fs/xfs/xfs_inode.c                            |  40 +-
 fs/xfs/xfs_inode.h                            |   3 +-
 fs/xfs/xfs_super.c                            |  18 +-
 include/linux/dax.h                           |  37 ++-
 include/linux/huge_mm.h                       |  12 +-
 include/linux/memremap.h                      |  28 +-
 include/linux/migrate.h                       |   4 +-
 include/linux/mm.h                            |  40 +--
 include/linux/mm_types.h                      |  16 +-
 include/linux/mmzone.h                        |  12 +-
 include/linux/page-flags.h                    |   6 +-
 include/linux/pfn_t.h                         |  20 +-
 include/linux/pgtable.h                       |  21 +-
 include/linux/rmap.h                          |  15 +-
 lib/test_hmm.c                                |   3 +-
 mm/Kconfig                                    |   4 +-
 mm/debug_vm_pgtable.c                         |  59 +---
 mm/gup.c                                      | 176 +---------
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              | 220 +++++++-----
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   2 +-
 mm/madvise.c                                  |   8 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memory-failure.c                           |   6 +-
 mm/memory.c                                   | 118 ++++--
 mm/memremap.c                                 |  59 +--
 mm/migrate_device.c                           |   9 +-
 mm/mlock.c                                    |   2 +-
 mm/mm_init.c                                  |  23 +-
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   5 +-
 mm/page_vma_mapped.c                          |   5 +-
 mm/pagewalk.c                                 |  14 +-
 mm/pgtable-generic.c                          |   7 +-
 mm/rmap.c                                     |  67 +++-
 mm/swap.c                                     |   2 +-
 mm/truncate.c                                 |  16 +-
 mm/userfaultfd.c                              |   5 +-
 mm/vmscan.c                                   |   5 +-
 77 files changed, 895 insertions(+), 961 deletions(-)

base-commit: e25c8d66f6786300b680866c0e0139981273feba

Comments

Dan Williams Jan. 10, 2025, 7:05 a.m. UTC | #1
Alistair Popple wrote:
> Main updates since v5:
> 
>  - Reworked patch 1 based on Dan's feedback.
> 
>  - Fixed build issues on PPC and when CONFIG_PGTABLE_HAS_HUGE_LEAVES
>    is not defined.
> 
>  - Minor comment formatting and documentation fixes.
> 
>  - Remove PTE_DEVMAP definitions from Loongarch which were added since
>    this series was initially written.
[..]
> 
> base-commit: e25c8d66f6786300b680866c0e0139981273feba

If this is going to go through nvdimm.git I will need it against a
mainline tag baseline. Linus will want to see the merge conflicts.

Otherwise, if that merge commit is too messy, or you would rather not
rebase, then it needs to go one of two ways:

- Andrew's tree which is the only tree I know of that can carry
  patches relative to linux-next.

- Wait for v6.14-rc1 and get this into nvdimm.git early in the cycle
  when the conflict storm will be low.

The last time I attempted the merge conflict resolution, with v4, the
conflicts were not *that* bad. However, that rebase may need to keep some
definitions around to avoid compile breakage and to avoid having to expand
the merge commit to carry things like the Loongarch PTE_DEVMAP removal,
i.e. move some of the after-the-fact cleanups to a post-merge branch.

Andrew Morton Jan. 11, 2025, 1:30 a.m. UTC | #2
On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> >  - Remove PTE_DEVMAP definitions from Loongarch which were added since
> >    this series was initially written.
> [..]
> > 
> > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> 
> If this is going to go through nvdimm.git I will need it against a
> mainline tag baseline. Linus will want to see the merge conflicts.
> 
> Otherwise, if that merge commit is too messy, or you would rather not
> rebase, then it needs to go one of two ways:
> 
> - Andrew's tree which is the only tree I know of that can carry
>   patches relative to linux-next.

I used to be able to do that but haven't got around to setting up such
a thing with mm.git.  This is the first time the need has arisen,
really.

> - Wait for v6.14-rc1 

I'm thinking so.  Darrick's review comments indicate that we'll be seeing a v7.

> and get this into nvdimm.git early in the cycle
>   when the conflict storm will be low.

erk.  This patchset hits mm/ a lot, and nvdimm hardly at all.  Is it
not practical to carry this in mm.git?

Dan Williams Jan. 11, 2025, 3:35 a.m. UTC | #3
Andrew Morton wrote:
> On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > >  - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > >    this series was initially written.
> > [..]
> > > 
> > > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> > 
> > If this is going to go through nvdimm.git I will need it against a
> > mainline tag baseline. Linus will want to see the merge conflicts.
> > 
> > Otherwise, if that merge commit is too messy, or you would rather not
> > rebase, then it needs to go one of two ways:
> > 
> > - Andrew's tree which is the only tree I know of that can carry
> >   patches relative to linux-next.
> 
> I used to be able to do that but haven't got around to setting up such
> a thing with mm.git.  This is the first time the need has arisen,
> really.

Oh, good to know.

> 
> > - Wait for v6.14-rc1 
> 
> I'm thinking so.  Darrick's review comments indicate that we'll be seeing a v7.
> 
> > and get this into nvdimm.git early in the cycle
> >   when the conflict storm will be low.
> 
> erk.  This patchset hits mm/ a lot, and nvdimm hardly at all.  Is it
> not practical to carry this in mm.git?

I'm totally fine with it going through mm.git. nvdimm.git is just the
historical path for touches to fs/dax.c, and git blame points mostly to
me for the issues Alistair is fixing. I am happy to review and ack and
watch this go through mm.git.

Alistair Popple Jan. 13, 2025, 1:05 a.m. UTC | #4
On Fri, Jan 10, 2025 at 07:35:57PM -0800, Dan Williams wrote:
> Andrew Morton wrote:
> > On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > >  - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > > >    this series was initially written.
> > > [..]
> > > > 
> > > > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> > > 
> > > If this is going to go through nvdimm.git I will need it against a
> > > mainline tag baseline. Linus will want to see the merge conflicts.
> > > 
> > > Otherwise, if that merge commit is too messy, or you would rather not
> > > rebase, then it needs to go one of two ways:
> > > 
> > > - Andrew's tree which is the only tree I know of that can carry
> > >   patches relative to linux-next.
> > 
> > I used to be able to do that but haven't got around to setting up such
> > a thing with mm.git.  This is the first time the need has arisen,
> > really.
> 
> Oh, good to know.
> 
> > 
> > > - Wait for v6.14-rc1 
> > 
> > I'm thinking so.  Darrick's review comments indicate that we'll be seeing a v7.

I'm ok with that. It could do with a decent soak in linux-next anyway given it
touches a lot of mm and fs.

Once v6.14-rc1 is released I will do a rebase on top of that.

> > > and get this into nvdimm.git early in the cycle
> > >   when the conflict storm will be low.
> > 
> > erk.  This patchset hits mm/ a lot, and nvdimm hardly at all.  Is it
> > not practical to carry this in mm.git?
> 
> I'm totally fine with it going through mm.git. nvdimm.git is just the
> historical path for touches to fs/dax.c, and git blame points mostly to
> me for the issues Alistair is fixing. I am happy to review and ack and
> watch this go through mm.git.