mbox series

[00/13] fs/dax: Fix FS DAX page reference counts

Message ID cover.66009f59a7fe77320d413011386c3ae5c2ee82eb.1719386613.git-series.apopple@nvidia.com
Headers show
Series fs/dax: Fix FS DAX page reference counts | expand

Message

Alistair Popple June 27, 2024, 12:54 a.m. UTC
FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular pages are considered free when the refcount hits one rather
than zero and refcounts are not added when mapping the page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW define PTE bit on architectures
that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvment.

This is an update to the original RFC rebased onto v6.10-rc5. Unlike
the original RFC it passes the same number of ndctl test suite
(https://github.com/pmem/ndctl) tests as my current development
environment does without these patches.

I am not intimately familiar with the FS DAX code so would appreciate
some careful review there. In particular I have not given any thought
at all to CONFIG_FS_DAX_LIMITED.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

Alistair Popple (13):
  mm/gup.c: Remove redundant check for PCI P2PDMA page
  pci/p2pdma: Don't initialise page refcount to one
  fs/dax: Refactor wait for dax idle page
  fs/dax: Add dax_page_free callback
  mm: Allow compound zone device pages
  mm/memory: Add dax_insert_pfn
  huge_memory: Allow mappings of PUD sized pages
  huge_memory: Allow mappings of PMD sized pages
  gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  fs/dax: Properly refcount fs dax pages
  huge_memory: Remove dead vmf_insert_pXd code
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits

 Documentation/mm/arch_pgtable_helpers.rst     |   6 +-
 arch/arm64/Kconfig                            |   1 +-
 arch/arm64/include/asm/pgtable-prot.h         |   1 +-
 arch/arm64/include/asm/pgtable.h              |  24 +--
 arch/powerpc/Kconfig                          |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  52 +----
 arch/powerpc/include/asm/book3s/64/radix.h    |  14 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c            |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   5 +-
 arch/powerpc/mm/pgtable.c                     |   2 +-
 arch/x86/Kconfig                              |   1 +-
 arch/x86/include/asm/pgtable.h                |  50 +----
 arch/x86/include/asm/pgtable_types.h          |   5 +-
 drivers/dax/device.c                          |  12 +-
 drivers/dax/super.c                           |   2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |   2 +-
 drivers/nvdimm/pmem.c                         |   9 +-
 drivers/pci/p2pdma.c                          |   4 +-
 fs/dax.c                                      | 204 +++++++---------
 fs/ext4/inode.c                               |   5 +-
 fs/fuse/dax.c                                 |   4 +-
 fs/fuse/virtio_fs.c                           |   8 +-
 fs/userfaultfd.c                              |   2 +-
 fs/xfs/xfs_inode.c                            |   4 +-
 include/linux/dax.h                           |  11 +-
 include/linux/huge_mm.h                       |  17 +-
 include/linux/memremap.h                      |  23 +-
 include/linux/migrate.h                       |   2 +-
 include/linux/mm.h                            |  40 +---
 include/linux/page-flags.h                    |   6 +-
 include/linux/pfn_t.h                         |  20 +--
 include/linux/pgtable.h                       |  21 +--
 include/linux/rmap.h                          |  14 +-
 lib/test_hmm.c                                |   2 +-
 mm/Kconfig                                    |   4 +-
 mm/debug_vm_pgtable.c                         |  59 +-----
 mm/gup.c                                      | 178 +--------------
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              | 248 +++++++------------
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   2 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memory-failure.c                           |   6 +-
 mm/memory.c                                   | 114 ++++++---
 mm/memremap.c                                 |  38 +---
 mm/migrate_device.c                           |   6 +-
 mm/mlock.c                                    |   2 +-
 mm/mm_init.c                                  |   5 +-
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   5 +-
 mm/page_vma_mapped.c                          |   5 +-
 mm/pgtable-generic.c                          |   7 +-
 mm/rmap.c                                     |  48 ++++-
 mm/swap.c                                     |   2 +-
 mm/userfaultfd.c                              |   2 +-
 mm/vmscan.c                                   |   5 +-
 59 files changed, 485 insertions(+), 869 deletions(-)

base-commit: f2661062f16b2de5d7b6a5c42a9a5c96326b8454

Comments

Dan Williams June 27, 2024, 6:58 a.m. UTC | #1
Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.
> 
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
> 
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW define PTE bit on architectures
> that have devmap PTE bits defined.
> 
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvment.
> 
> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> the original RFC it passes the same number of ndctl test suite
> (https://github.com/pmem/ndctl) tests as my current development
> environment does without these patches.

Are you seeing the 'mmap.sh' test fail even without these patches?

I see this with the patches, will try without in the morning.

 EXT4-fs (pmem0): unmounting filesystem 26ea1463-343a-464f-9f16-91cb176dbdc7.
 XFS (pmem0): Mounting V5 Filesystem 554953fd-c9f4-460f-bc37-f43979986b68
 XFS (pmem0): Ending clean mount
 Oops: general protection fault, probably for non-canonical address 0xdead000000000518: 00
T SMP PTI
 CPU: 15 PID: 1295 Comm: mmap Tainted: G           OE    N 6.10.0-rc5+ #261
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240524-3.fc40 05/24/2024
 RIP: 0010:folio_mark_dirty+0x25/0x60
 Code: 90 90 90 90 90 0f 1f 44 00 00 53 48 89 fb e8 22 18 02 00 48 85 c0 74 26 48 89 c7 48
0 02 00 74 05 f0 80 63 02 fd <48> 8b 87 18 01 00 00 48 89 de 5b 48 8b 40 18 e9 77 90 c0 00
 RSP: 0018:ffffb073022f7b08 EFLAGS: 00010246
 RAX: 004ffff800002000 RBX: ffffd0d005000300 RCX: 0400000000000040
 RDX: 0000000000000000 RSI: 00007f4006200000 RDI: dead000000000400
 RBP: 0000000000000000 R08: ffff9a4b04504a30 R09: 000fffffffffffff
 R10: ffffd0d005000300 R11: 0000000000000000 R12: 00007f4006200000
 R13: ffff9a4b7c96c000 R14: ffff9a4b7daba440 R15: ffffb073022f7cb0
 FS:  00007f4046351740(0000) GS:ffff9a4d77780000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f40461ff000 CR3: 000000027aea6000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x26
  ? die_addr+0x38/0x60
  ? exc_general_protection+0x143/0x420
  ? asm_exc_general_protection+0x22/0x30
  ? folio_mark_dirty+0x25/0x60
  ? folio_mark_dirty+0xe/0x60
  unmap_page_range+0xea5/0x1550
  unmap_vmas+0xf8/0x1e0
  unmap_region.constprop.0+0xd7/0x150
  ? lock_is_held_type+0xd5/0x130
  do_vmi_align_munmap.isra.0+0x3f4/0x580
  ? mas_walk+0x101/0x1b0
  __vm_munmap+0xa6/0x170
  __x64_sys_munmap+0x17/0x20
  do_syscall_64+0x75/0x190
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

$ faddr2line vmlinux folio_mark_dirty+0x25
folio_mark_dirty+0x25/0x58:
folio_mark_dirty at mm/page-writeback.c:2860
Alistair Popple June 27, 2024, 7:15 a.m. UTC | #2
Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>> 
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>> 
>> By treating the refcounts on these pages the same way as normal pages
>> we can remove a lot of special checks. In particular pXd_trans_huge()
>> becomes the same as pXd_leaf(), although I haven't made that change
>> here. It also frees up a valuable SW define PTE bit on architectures
>> that have devmap PTE bits defined.
>> 
>> It also almost certainly allows further clean-up of the devmap managed
>> functions, but I have left that as a future improvment.
>> 
>> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> the original RFC it passes the same number of ndctl test suite
>> (https://github.com/pmem/ndctl) tests as my current development
>> environment does without these patches.
>
> Are you seeing the 'mmap.sh' test fail even without these patches?

No. But I also don't see it failing with these patches :)

For reference this is what I see on my test machine with or without:

[1/70] Generating version.h with a custom command
 1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
 2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
 3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77
 4/13 ndctl:dax / dax-dev                   OK               0.02s
 5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
 6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
 7/13 ndctl:dax / device-dax                OK              13.40s
 8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
>>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem

 9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
>>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh

13/13 ndctl:dax / mmap.sh                   OK             107.57s

Ok:                 6   
Expected Fail:      0   
Fail:               2   
Unexpected Pass:    0   
Skipped:            5   
Timeout:            0   

I have been using QEMU for my testing. Maybe I missed some condition in
the unmap path though so will take another look.

> I see this with the patches, will try without in the morning.
>
>  EXT4-fs (pmem0): unmounting filesystem 26ea1463-343a-464f-9f16-91cb176dbdc7.
>  XFS (pmem0): Mounting V5 Filesystem 554953fd-c9f4-460f-bc37-f43979986b68
>  XFS (pmem0): Ending clean mount
>  Oops: general protection fault, probably for non-canonical address 0xdead000000000518: 00
> T SMP PTI
>  CPU: 15 PID: 1295 Comm: mmap Tainted: G           OE    N 6.10.0-rc5+ #261
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240524-3.fc40 05/24/2024
>  RIP: 0010:folio_mark_dirty+0x25/0x60
>  Code: 90 90 90 90 90 0f 1f 44 00 00 53 48 89 fb e8 22 18 02 00 48 85 c0 74 26 48 89 c7 48
> 0 02 00 74 05 f0 80 63 02 fd <48> 8b 87 18 01 00 00 48 89 de 5b 48 8b 40 18 e9 77 90 c0 00
>  RSP: 0018:ffffb073022f7b08 EFLAGS: 00010246
>  RAX: 004ffff800002000 RBX: ffffd0d005000300 RCX: 0400000000000040
>  RDX: 0000000000000000 RSI: 00007f4006200000 RDI: dead000000000400
>  RBP: 0000000000000000 R08: ffff9a4b04504a30 R09: 000fffffffffffff
>  R10: ffffd0d005000300 R11: 0000000000000000 R12: 00007f4006200000
>  R13: ffff9a4b7c96c000 R14: ffff9a4b7daba440 R15: ffffb073022f7cb0
>  FS:  00007f4046351740(0000) GS:ffff9a4d77780000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f40461ff000 CR3: 000000027aea6000 CR4: 00000000000006f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? __die_body.cold+0x19/0x26
>   ? die_addr+0x38/0x60
>   ? exc_general_protection+0x143/0x420
>   ? asm_exc_general_protection+0x22/0x30
>   ? folio_mark_dirty+0x25/0x60
>   ? folio_mark_dirty+0xe/0x60
>   unmap_page_range+0xea5/0x1550
>   unmap_vmas+0xf8/0x1e0
>   unmap_region.constprop.0+0xd7/0x150
>   ? lock_is_held_type+0xd5/0x130
>   do_vmi_align_munmap.isra.0+0x3f4/0x580
>   ? mas_walk+0x101/0x1b0
>   __vm_munmap+0xa6/0x170
>   __x64_sys_munmap+0x17/0x20
>   do_syscall_64+0x75/0x190
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> $ faddr2line vmlinux folio_mark_dirty+0x25
> folio_mark_dirty+0x25/0x58:
> folio_mark_dirty at mm/page-writeback.c:2860
Dan Williams June 27, 2024, 8:24 p.m. UTC | #3
Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > Alistair Popple wrote:
> >> FS DAX pages have always maintained their own page reference counts
> >> without following the normal rules for page reference counting. In
> >> particular pages are considered free when the refcount hits one rather
> >> than zero and refcounts are not added when mapping the page.
> >> 
> >> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> >> mechanism for allowing GUP to hold references on the page (see
> >> get_dev_pagemap). However there doesn't seem to be any reason why FS
> >> DAX pages need their own reference counting scheme.
> >> 
> >> By treating the refcounts on these pages the same way as normal pages
> >> we can remove a lot of special checks. In particular pXd_trans_huge()
> >> becomes the same as pXd_leaf(), although I haven't made that change
> >> here. It also frees up a valuable SW define PTE bit on architectures
> >> that have devmap PTE bits defined.
> >> 
> >> It also almost certainly allows further clean-up of the devmap managed
> >> functions, but I have left that as a future improvment.
> >> 
> >> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> >> the original RFC it passes the same number of ndctl test suite
> >> (https://github.com/pmem/ndctl) tests as my current development
> >> environment does without these patches.
> >
> > Are you seeing the 'mmap.sh' test fail even without these patches?
> 
> No. But I also don't see it failing with these patches :)
> 
> For reference this is what I see on my test machine with or without:
> 
> [1/70] Generating version.h with a custom command
>  1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
>  2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
>  3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77

I really need to get this test built as a service as this shows a
pre-req is missing, and it's not quite fair to expect submitters to put
it all together.

>  4/13 ndctl:dax / dax-dev                   OK               0.02s
>  5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
>  6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
>  7/13 ndctl:dax / device-dax                OK              13.40s
>  8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
> >>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem
> 
>  9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
> 11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
> 12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
> >>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh
> 
> 13/13 ndctl:dax / mmap.sh                   OK             107.57s

I need to think through why this one might false succeed, but that can
wait until we get this series reviewed. For now my failure is stable
which allows it to be bisected.

> 
> Ok:                 6   
> Expected Fail:      0   
> Fail:               2   
> Unexpected Pass:    0   
> Skipped:            5   
> Timeout:            0   
> 
> I have been using QEMU for my testing. Maybe I missed some condition in
> the unmap path though so will take another look.

I was able to bisect to:

[PATCH 10/13] fs/dax: Properly refcount fs dax pages

...I will prioritize that one in my review queue.
Alistair Popple June 28, 2024, 12:06 a.m. UTC | #4
Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> 
>> Dan Williams <dan.j.williams@intel.com> writes:
>> 
>> > Alistair Popple wrote:
>> >> FS DAX pages have always maintained their own page reference counts
>> >> without following the normal rules for page reference counting. In
>> >> particular pages are considered free when the refcount hits one rather
>> >> than zero and refcounts are not added when mapping the page.
>> >> 
>> >> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> >> mechanism for allowing GUP to hold references on the page (see
>> >> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> >> DAX pages need their own reference counting scheme.
>> >> 
>> >> By treating the refcounts on these pages the same way as normal pages
>> >> we can remove a lot of special checks. In particular pXd_trans_huge()
>> >> becomes the same as pXd_leaf(), although I haven't made that change
>> >> here. It also frees up a valuable SW define PTE bit on architectures
>> >> that have devmap PTE bits defined.
>> >> 
>> >> It also almost certainly allows further clean-up of the devmap managed
>> >> functions, but I have left that as a future improvment.
>> >> 
>> >> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> >> the original RFC it passes the same number of ndctl test suite
>> >> (https://github.com/pmem/ndctl) tests as my current development
>> >> environment does without these patches.
>> >
>> > Are you seeing the 'mmap.sh' test fail even without these patches?
>> 
>> No. But I also don't see it failing with these patches :)
>> 
>> For reference this is what I see on my test machine with or without:
>> 
>> [1/70] Generating version.h with a custom command
>>  1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
>>  2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
>>  3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77
>
> I really need to get this test built as a service as this shows a
> pre-req is missing, and it's not quite fair to expect submitters to put
> it all together.

Ok. I didn't dig into why this was being skipped but I might if I find
some time. The rest of the tests seemed more relevant anyway and turned
up enough bugs with my initial implementation to keep me busy which gave
me some confidence.

If I'm being honest though I found the whole test setup a bit of a
pain. In particular remembering you have to manually (re)build the
special test versions of the modules tripped me up a few times until I
updated my build scripts. But I got there in the end.

>>  4/13 ndctl:dax / dax-dev                   OK               0.02s
>>  5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
>>  6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
>>  7/13 ndctl:dax / device-dax                OK              13.40s
>>  8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
>> >>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem
>> 
>>  9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
>> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
>> 11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
>> 12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
>> >>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh
>> 
>> 13/13 ndctl:dax / mmap.sh                   OK             107.57s
>
> I need to think through why this one might false succeed, but that can
> wait until we get this series reviewed. For now my failure is stable
> which allows it to be bisected.
>
>> 
>> Ok:                 6   
>> Expected Fail:      0   
>> Fail:               2   
>> Unexpected Pass:    0   
>> Skipped:            5   
>> Timeout:            0   
>> 
>> I have been using QEMU for my testing. Maybe I missed some condition in
>> the unmap path though so will take another look.
>
> I was able to bisect to:

I could have guessed that one, as it's pretty much the crux of this
series given it's the one that switches everything away from
pXX_devmap. That means pXX_leaf/_trans_huge will start returning true
for DAX pages.

Based on your dump I'm guessing I missed some case in the
zap_pXX_range() path. It could be helpful to narrow down which of the
pXX paths is crashing but I will take another look there.

> [PATCH 10/13] fs/dax: Properly refcount fs dax pages
>
> ...I will prioritize that one in my review queue.

Thanks!
Dave Chinner July 1, 2024, 4:24 a.m. UTC | #5
On Thu, Jun 27, 2024 at 10:54:15AM +1000, Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.
> 
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
> 
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW define PTE bit on architectures
> that have devmap PTE bits defined.
> 
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvment.
> 
> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> the original RFC it passes the same number of ndctl test suite
> (https://github.com/pmem/ndctl) tests as my current development
> environment does without these patches.

I strongly suggest running fstests on pmem devices with '-o
dax=always' mount options to get much more comprehensive fsdax test
coverage. That exercises a lot of the weird mmap corner cases that
cause problems so it would be good to actually test that nothing new
got broken in FSDAX by this patchset.

-Dave.
Alistair Popple July 1, 2024, 8:33 a.m. UTC | #6
Dave Chinner <david@fromorbit.com> writes:

> On Thu, Jun 27, 2024 at 10:54:15AM +1000, Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>> 
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>> 
>> By treating the refcounts on these pages the same way as normal pages
>> we can remove a lot of special checks. In particular pXd_trans_huge()
>> becomes the same as pXd_leaf(), although I haven't made that change
>> here. It also frees up a valuable SW define PTE bit on architectures
>> that have devmap PTE bits defined.
>> 
>> It also almost certainly allows further clean-up of the devmap managed
>> functions, but I have left that as a future improvment.
>> 
>> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> the original RFC it passes the same number of ndctl test suite
>> (https://github.com/pmem/ndctl) tests as my current development
>> environment does without these patches.
>
> I strongly suggest running fstests on pmem devices with '-o
> dax=always' mount options to get much more comprehensive fsdax test
> coverage. That exercises a lot of the weird mmap corner cases that
> cause problems so it would be good to actually test that nothing new
> got broken in FSDAX by this patchset.

Thanks Dave, I will do that and report back. I suspect it will turn up
something, given Dan was seeing a crash with these patches.

 - Alistair

> -Dave.