Message ID | 20240822135018.1931258-5-kernel@pankajraghav.com (mailing list archive)
---|---
State | Accepted, archived
Series | enable bs > ps in XFS
Hi, "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes: > From: Luis Chamberlain <mcgrof@kernel.org> > > split_folio() and split_folio_to_list() assume order 0, to support > minorder for non-anonymous folios, we must expand these to check the > folio mapping order and use that. > > Set new_order to be at least minimum folio order if it is set in > split_huge_page_to_list() so that we can maintain minimum folio order > requirement in the page cache. > > Update the debugfs write files used for testing to ensure the order > is respected as well. We simply enforce the min order when a file > mapping is used. > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> > Reviewed-by: Hannes Reinecke <hare@suse.de> > Reviewed-by: Zi Yan <ziy@nvidia.com> > Tested-by: David Howells <dhowells@redhat.com> This causes the following warning on s390 with linux-next starting from next-20240827: [ 112.690518] BUG: Bad page map in process ksm01 pte:a5801317 pmd:99054000 [ 112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801 [ 112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff) [ 112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000 [ 112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000 [ 112.690548] page dumped because: bad pte [ 112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102 [ 112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0 [ 112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441 [ 112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0) [ 112.690568] Call Trace: [ 112.690571] [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0 [ 112.690579] [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0 [ 112.690584] [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870 [ 112.690598] [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0 [ 112.690601] [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0 [ 112.690603] [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0 [ 112.690605] [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0 [ 112.690607] [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0 [ 112.690609] [<000003ffe0405724>] exit_mmap+0xc4/0x3e0 [ 112.690613] [<000003ffe0154aa2>] mmput+0x72/0x170 [ 112.690616] [<000003ffe015e2c6>] exit_mm+0xd6/0x150 [ 112.690618] [<000003ffe015e52c>] do_exit+0x1ec/0x490 [ 112.690620] [<000003ffe015e9a4>] do_group_exit+0x44/0xc0 [ 112.690621] [<000003ffe016f000>] get_signal+0x7f0/0x800 [ 112.690624] [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320 [ 112.690628] [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170 [ 112.690632] [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0 [ 112.690634] [<000003ffe0ec303c>] system_call+0x74/0x98 [ 112.690638] Disabling lock debugging due to kernel taint To reproduce, running the ksm01 testsuite from ltp seems to be enough. The splat is always triggered immediately. The output from ksm01 is: tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s mem.c:422: TINFO: wait for all children to stop. mem.c:388: TINFO: child 0 stops. mem.c:388: TINFO: child 1 stops. mem.c:388: TINFO: child 2 stops. 
mem.c:495: TINFO: KSM merging... mem.c:434: TINFO: resume all children. mem.c:422: TINFO: wait for all children to stop. mem.c:344: TINFO: child 0 continues... mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c' mem.c:344: TINFO: child 1 continues... mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a' mem.c:344: TINFO: child 2 continues... mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a' mem.c:400: TINFO: child 1 stops. mem.c:400: TINFO: child 2 stops. mem.c:400: TINFO: child 0 stops. Test timeouted, sending SIGKILL! tst_test.c:1700: TINFO: Killed the leftover descendant processes tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 tst_test.c:1708: TBROK: Test killed! (timeout?) Thanks Sven
On Thu, Aug 29, 2024 at 12:51:25PM +0200, Sven Schnelle wrote: > Hi, > > "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes: > > > From: Luis Chamberlain <mcgrof@kernel.org> > > > > split_folio() and split_folio_to_list() assume order 0, to support > > minorder for non-anonymous folios, we must expand these to check the > > folio mapping order and use that. > > > > Set new_order to be at least minimum folio order if it is set in > > split_huge_page_to_list() so that we can maintain minimum folio order > > requirement in the page cache. > > > > Update the debugfs write files used for testing to ensure the order > > is respected as well. We simply enforce the min order when a file > > mapping is used. > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> > > Reviewed-by: Hannes Reinecke <hare@suse.de> > > Reviewed-by: Zi Yan <ziy@nvidia.com> > > Tested-by: David Howells <dhowells@redhat.com> > > This causes the following warning on s390 with linux-next starting from > next-20240827: > > [ 112.690518] BUG: Bad page map in process ksm01 pte:a5801317 pmd:99054000 > [ 112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801 > [ 112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff) > [ 112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000 > [ 112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000 > [ 112.690548] page dumped because: bad pte > [ 112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102 > [ 112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0 > [ 112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441 > [ 112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0) > [ 112.690568] Call Trace: > [ 112.690571] [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0 > [ 112.690579] [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0 > [ 112.690584] [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870 > [ 112.690598] [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0 > [ 112.690601] [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0 > [ 112.690603] [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0 > [ 112.690605] [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0 > [ 112.690607] [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0 > [ 112.690609] [<000003ffe0405724>] exit_mmap+0xc4/0x3e0 > [ 112.690613] [<000003ffe0154aa2>] mmput+0x72/0x170 > [ 112.690616] [<000003ffe015e2c6>] exit_mm+0xd6/0x150 > [ 112.690618] [<000003ffe015e52c>] do_exit+0x1ec/0x490 > [ 112.690620] [<000003ffe015e9a4>] do_group_exit+0x44/0xc0 > [ 112.690621] [<000003ffe016f000>] get_signal+0x7f0/0x800 > [ 112.690624] [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320 > [ 112.690628] [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170 > [ 112.690632] [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0 > [ 112.690634] [<000003ffe0ec303c>] system_call+0x74/0x98 > [ 112.690638] Disabling lock debugging due to kernel taint > > To reproduce, running the ksm01 testsuite from ltp seems to be > enough. The splat is always triggered immediately. 
The output from ksm01 > is: > > tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' > tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f > tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x > tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s > mem.c:422: TINFO: wait for all children to stop. > mem.c:388: TINFO: child 0 stops. > mem.c:388: TINFO: child 1 stops. > mem.c:388: TINFO: child 2 stops. > mem.c:495: TINFO: KSM merging... > mem.c:434: TINFO: resume all children. > mem.c:422: TINFO: wait for all children to stop. > mem.c:344: TINFO: child 0 continues... > mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c' > mem.c:344: TINFO: child 1 continues... > mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a' > mem.c:344: TINFO: child 2 continues... > mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a' > mem.c:400: TINFO: child 1 stops. > mem.c:400: TINFO: child 2 stops. > mem.c:400: TINFO: child 0 stops. > Test timeouted, sending SIGKILL! > tst_test.c:1700: TINFO: Killed the leftover descendant processes > tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > tst_test.c:1708: TBROK: Test killed! (timeout?) Thanks for the report and test reproducer, I was able to reproduce on x86_64 easily with ltp/testcases/kernel/mem/ksm/ksm01 as well: ltp/testcases/kernel/mem/ksm/ksm01 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6fc20 tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #56 SMP PREEMPT_DYNAMIC Tue Aug 27 08:10:26 UTC 2024 x86_64 tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s mem.c:422: TINFO: wait for all children to stop. mem.c:388: TINFO: child 0 stops. mem.c:388: TINFO: child 1 stops. mem.c:388: TINFO: child 2 stops. mem.c:495: TINFO: KSM merging... mem.c:434: TINFO: resume all children. mem.c:422: TINFO: wait for all children to stop. mem.c:344: TINFO: child 0 continues... mem.c:344: TINFO: child 2 continues... mem.c:344: TINFO: child 1 continues... mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a' mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c' mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a' mem.c:400: TINFO: child 1 stops. mem.c:400: TINFO: child 0 stops. mem.c:400: TINFO: child 2 stops. Test timeouted, sending SIGKILL! tst_test.c:1700: TINFO: Killed the leftover descendant processes tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 tst_test.c:1708: TBROK: Test killed! (timeout?) 
Summary: passed 0 failed 0 broken 1 skipped 0 warnings 0 With vm debugging however I get more information about the issue: Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff) Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------ Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509! Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace: Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: <TASK> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? die+0x32/0x80 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_trap+0xd9/0x100 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_error_trap+0x6a/0x90 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? exc_invalid_op+0x4c/0x60 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? 
asm_exc_invalid_op+0x16/0x20 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ksm_scan_thread+0x175b/0x1d30 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_ksm_scan_thread+0x10/0x10 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kthread+0xda/0x110 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork+0x2d/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork_asm+0x1a/0x30 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: </TASK> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Modules linked in: xfs nvme_fabrics 9p kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd pcspkr 9pnet_virtio joydev virtio_balloon virtio_console evdev button serio_raw drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover nvme psmouse crc32_pclmul crc32c_intel nvme_core virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ---[ end trace 0000000000000000 ]--- Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554 Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine cmp_and_merge_page() on the split case: } else if (split) { /* * We are here if we tried to merge two pages and * failed because they both belonged to the same * compound page. We will split the page now, but no * merging will take place. * We do not want to add the cost of a full lock; if * the page is locked, it is better to skip it and * perhaps try again later. 
		 */
		if (!trylock_page(page))
			return;
		split_huge_page(page);
		unlock_page(page);
	}

The trylock_page() is faulty. I'm digging in further.

  Luis
On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote: > With vm debugging however I get more information about the issue: > > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff) > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------ > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509! This is in folio_unlock(). We're trying to unlock a folio which isn't locked! > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace: > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: <TASK> > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? die+0x32/0x80 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_trap+0xd9/0x100 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_error_trap+0x6a/0x90 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? 
exc_invalid_op+0x4c/0x60 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? asm_exc_invalid_op+0x16/0x20 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ksm_scan_thread+0x175b/0x1d30 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_ksm_scan_thread+0x10/0x10 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kthread+0xda/0x110 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork+0x2d/0x50 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork_asm+0x1a/0x30 > Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: </TASK> [...] > Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine > cmp_and_merge_page() on the split case: > > } else if (split) { > /* > * We are here if we tried to merge two pages and > * failed because they both belonged to the same > * compound page. We will split the page now, but no > * merging will take place. > * We do not want to add the cost of a full lock; if > * the page is locked, it is better to skip it and > * perhaps try again later. > */ > if (!trylock_page(page)) > return; > split_huge_page(page); > unlock_page(page); Obviously the page is locked when we call split_huge_page(). There's an assert inside it. And the lock bit is _supposed_ to be transferred to the head page of the page which is being split. My guess is that this is messed up somehow; we're perhaps transferring the lock bit to the wrong page?
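To make the expectation concrete, here is a minimal kernel-style sketch of the KSM pattern quoted above. This is illustrative only (the function name is made up, and it assumes the historical split_huge_page() contract that the folio containing the precise page passed in is the one left locked on return):

static void ksm_split_pattern(struct page *page)
{
	/* Lock the precise page; the lock bit lives on its folio. */
	if (!trylock_page(page))
		return;

	/*
	 * After the split, the caller expects the (possibly new, smaller)
	 * folio that now contains 'page' to still be locked, so that the
	 * unlock below pairs with the trylock above. If the split leaves
	 * a different folio locked instead, this unlock trips
	 * VM_BUG_ON_FOLIO(!folio_test_locked(folio)) in folio_unlock().
	 */
	split_huge_page(page);
	unlock_page(page);
}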
On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote: > @@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add > bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins); > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > unsigned int new_order); > +int split_folio_to_list(struct folio *folio, struct list_head *list); > static inline int split_huge_page(struct page *page) > { > - return split_huge_page_to_list_to_order(page, NULL, 0); > + return split_folio(page_folio(page)); Oh! You can't do this! split_huge_page() takes a precise page, NOT a folio. That page is locked. When we return from split_huge_page(), the new folio which contains the precise page is locked. You've made it so that the caller's page's folio won't necessarily be locked. More testing was needed ;-P > } > void deferred_split_folio(struct folio *folio); > > @@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page) > { > return 0; > } > + > +static inline int split_folio_to_list(struct folio *folio, struct list_head *list) > +{ > + return 0; > +} > + > static inline void deferred_split_folio(struct folio *folio) {} > #define split_huge_pmd(__vma, __pmd, __address) \ > do { } while (0) > @@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order) > return split_folio_to_list_to_order(folio, NULL, new_order); > } > > -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0) > -#define split_folio(f) split_folio_to_order(f, 0) > - > #endif /* _LINUX_HUGE_MM_H */ > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index cf8e34f62976f..06384b85a3a20 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins) > * released, or if some unexpected race happened (e.g., anon VMA disappeared, > * truncation). > * > + * Callers should ensure that the order respects the address space mapping > + * min-order if one is set for non-anonymous folios. > + * > * Returns -EINVAL when trying to split to an order that is incompatible > * with the folio. Splitting to order 0 is compatible with all folios. 
> */ > @@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > mapping = NULL; > anon_vma_lock_write(anon_vma); > } else { > + unsigned int min_order; > gfp_t gfp; > > mapping = folio->mapping; > @@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > goto out; > } > > + min_order = mapping_min_folio_order(folio->mapping); > + if (new_order < min_order) { > + VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u", > + min_order); > + ret = -EINVAL; > + goto out; > + } > + > gfp = current_gfp_context(mapping_gfp_mask(mapping) & > GFP_RECLAIM_MASK); > > @@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > return ret; > } > > +int split_folio_to_list(struct folio *folio, struct list_head *list) > +{ > + unsigned int min_order = 0; > + > + if (folio_test_anon(folio)) > + goto out; > + > + if (!folio->mapping) { > + if (folio_test_pmd_mappable(folio)) > + count_vm_event(THP_SPLIT_PAGE_FAILED); > + return -EBUSY; > + } > + > + min_order = mapping_min_folio_order(folio->mapping); > +out: > + return split_huge_page_to_list_to_order(&folio->page, list, > + min_order); > +} > + > void __folio_undo_large_rmappable(struct folio *folio) > { > struct deferred_split *ds_queue; > @@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, > struct vm_area_struct *vma = vma_lookup(mm, addr); > struct folio_walk fw; > struct folio *folio; > + struct address_space *mapping; > + unsigned int target_order = new_order; > > if (!vma) > break; > @@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, > if (!is_transparent_hugepage(folio)) > goto next; > > - if (new_order >= folio_order(folio)) > + if (!folio_test_anon(folio)) { > + mapping = folio->mapping; > + target_order = max(new_order, > + mapping_min_folio_order(mapping)); > + } > + > + if (target_order >= folio_order(folio)) > goto next; > > total++; > @@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, > folio_get(folio); > folio_walk_end(&fw, vma); > > - if (!split_folio_to_order(folio, new_order)) > + if (!folio_test_anon(folio) && folio->mapping != mapping) > + goto unlock; > + > + if (!split_folio_to_order(folio, target_order)) > split++; > > +unlock: > + > folio_unlock(folio); > folio_put(folio); > > @@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, > pgoff_t index; > int nr_pages = 1; > unsigned long total = 0, split = 0; > + unsigned int min_order; > + unsigned int target_order; > > file = getname_kernel(file_path); > if (IS_ERR(file)) > @@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, > file_path, off_start, off_end); > > mapping = candidate->f_mapping; > + min_order = mapping_min_folio_order(mapping); > + target_order = max(new_order, min_order); > > for (index = off_start; index < off_end; index += nr_pages) { > struct folio *folio = filemap_get_folio(mapping, index); > @@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, > total++; > nr_pages = folio_nr_pages(folio); > > - if (new_order >= folio_order(folio)) > + if (target_order >= folio_order(folio)) > goto next; > > if (!folio_trylock(folio)) > goto next; > > - if (!split_folio_to_order(folio, new_order)) > + if (folio->mapping != mapping) > + goto unlock; > + > + if 
(!split_folio_to_order(folio, target_order)) > split++; > > +unlock: > folio_unlock(folio); > next: > folio_put(folio); > -- > 2.44.1 >
On 29 Aug 2024, at 15:55, Matthew Wilcox wrote: > On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote: >> With vm debugging however I get more information about the issue: >> >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff) >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------ >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509! > > This is in folio_unlock(). We're trying to unlock a folio which isn't > locked! > >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace: >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: <TASK> >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? die+0x32/0x80 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_trap+0xd9/0x100 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? do_error_trap+0x6a/0x90 >> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? 
folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? exc_invalid_op+0x4c/0x60
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? asm_exc_invalid_op+0x16/0x20
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ksm_scan_thread+0x175b/0x1d30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_ksm_scan_thread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kthread+0xda/0x110
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork+0x2d/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ret_from_fork_asm+0x1a/0x30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: </TASK>
> [...]
>> Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
>> cmp_and_merge_page() on the split case:
>>
>> 	} else if (split) {
>> 		/*
>> 		 * We are here if we tried to merge two pages and
>> 		 * failed because they both belonged to the same
>> 		 * compound page. We will split the page now, but no
>> 		 * merging will take place.
>> 		 * We do not want to add the cost of a full lock; if
>> 		 * the page is locked, it is better to skip it and
>> 		 * perhaps try again later.
>> 		 */
>> 		if (!trylock_page(page))
>> 			return;
>> 		split_huge_page(page);
>> 		unlock_page(page);
>
> Obviously the page is locked when we call split_huge_page(). There's
> an assert inside it. And the lock bit is _supposed_ to be transferred
> to the head page of the page which is being split. My guess is that
> this is messed up somehow; we're perhaps transferring the lock bit to
> the wrong page?

The issue is that the change to split_huge_page() makes
split_huge_page_to_list_to_order() unlock the wrong subpage.
split_huge_page() used to pass the “page” pointer to
split_huge_page_to_list_to_order(), which keeps that “page” still
locked. But this patch changes the “page” passed into
split_huge_page_to_list_to_order() to always be the head page.
This fixes the crash on my x86 VM, but it can be improved:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7c50aeed0522..eff5d2fb5d4e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-	return split_folio(page_folio(page));
-}
+int split_huge_page(struct page *page);
 void deferred_split_folio(struct folio *folio);

 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29af9451d92..4d723dab4336 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }

+int split_huge_page(struct page *page)
+{
+	unsigned int min_order = 0;
+	struct folio *folio = page_folio(page);
+
+	if (folio_test_anon(folio))
+		goto out;
+
+	if (!folio->mapping) {
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+		return -EBUSY;
+	}
+
+	min_order = mapping_min_folio_order(folio->mapping);
+out:
+	return split_huge_page_to_list_to_order(page, NULL, min_order);
+}
+
 int split_folio_to_list(struct folio *folio, struct list_head *list)
 {
 	unsigned int min_order = 0;
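Reduced to the two call sites, the difference is the following (an illustrative fragment, not a complete patch; the min_order computation is elided):

	/*
	 * Regressed path: split_huge_page() went through split_folio(), which
	 * always handed the head page down, so after the split the lock could
	 * be left on a folio other than the one containing the caller's page.
	 */
	ret = split_huge_page_to_list_to_order(&folio->page, NULL, min_order);

	/*
	 * With the fix above: pass the caller's precise page, so the folio
	 * that contains it is the one left locked across the split.
	 */
	ret = split_huge_page_to_list_to_order(page, NULL, min_order);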
On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote: > The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order() > unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer > to split_huge_page_to_list_to_order(), which keeps that “page” still locked. > But this patch changes the “page” passed into split_huge_page_to_list_to_order() > always to the head page. > > This fixes the crash on my x86 VM, but it can be improved: > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 7c50aeed0522..eff5d2fb5d4e 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins); > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > unsigned int new_order); > int split_folio_to_list(struct folio *folio, struct list_head *list); > -static inline int split_huge_page(struct page *page) > -{ > - return split_folio(page_folio(page)); > -} > +int split_huge_page(struct page *page); > void deferred_split_folio(struct folio *folio); > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c29af9451d92..4d723dab4336 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > return ret; > } > > +int split_huge_page(struct page *page) > +{ > + unsigned int min_order = 0; > + struct folio *folio = page_folio(page); > + > + if (folio_test_anon(folio)) > + goto out; > + > + if (!folio->mapping) { > + if (folio_test_pmd_mappable(folio)) > + count_vm_event(THP_SPLIT_PAGE_FAILED); > + return -EBUSY; > + } > + > + min_order = mapping_min_folio_order(folio->mapping); > +out: > + return split_huge_page_to_list_to_order(page, NULL, min_order); > +} > + > int split_folio_to_list(struct folio *folio, struct list_head *list) > { > unsigned int min_order = 0; Confirmed, and also although you suggest it can be improved, I thought that we could do that by sharing more code and putting things in the headers, the below also fixes this but tries to share more code, but I think it is perhaps less easier to understand than your patch. So I think your patch is cleaner and easier as a fix. 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..99cd9c7bf55b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))

 #define split_folio(f) split_folio_to_list(f, NULL)
+#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)

 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
@@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
-int split_folio_to_list(struct folio *folio, struct list_head *list);
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_folio(page_folio(page));
+	return split_page_folio_to_list(page, page_folio(page), NULL);
 }
 void deferred_split_folio(struct folio *folio);

@@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
 	return 0;
 }

-static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+static inline int split_page_folio_to_list(struct page *page,
+					   struct folio *folio,
+					   struct list_head *list)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b115bfe63b52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }

-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list)
 {
 	unsigned int min_order = 0;

@@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)

 	min_order = mapping_min_folio_order(folio->mapping);
 out:
-	return split_huge_page_to_list_to_order(&folio->page, list,
-						min_order);
+	return split_huge_page_to_list_to_order(page, list, min_order);
 }

 void __folio_undo_large_rmappable(struct folio *folio)
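One subtlety in the variant above: split_folio(f) is defined before split_folio_to_list(f, list), yet still expands through it, because the preprocessor rescans macro expansions at each use site, so the definition order of the two macros does not matter. A standalone illustration of that rescanning behavior (hypothetical macro names, plain C):

#include <stdio.h>

#define outer(x) inner(x)	/* refers to inner() before it is defined */
#define inner(x) ((x) * 2)	/* defined later; rescanning still finds it */

int main(void)
{
	printf("%d\n", outer(21));	/* prints 42 */
	return 0;
}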
Luis Chamberlain <mcgrof@kernel.org> writes:

> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>>
>> This fixes the crash on my x86 VM, but it can be improved:

It also fixes the issue on s390x. Thanks!
On Thu, Aug 29, 2024 at 04:41:48PM -0700, Luis Chamberlain wrote: > On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote: > > The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order() > > unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer > > to split_huge_page_to_list_to_order(), which keeps that “page” still locked. > > But this patch changes the “page” passed into split_huge_page_to_list_to_order() > > always to the head page. > > > > This fixes the crash on my x86 VM, but it can be improved: > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index 7c50aeed0522..eff5d2fb5d4e 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins); > > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > > unsigned int new_order); > > int split_folio_to_list(struct folio *folio, struct list_head *list); > > -static inline int split_huge_page(struct page *page) > > -{ > > - return split_folio(page_folio(page)); > > -} > > +int split_huge_page(struct page *page); > > void deferred_split_folio(struct folio *folio); > > > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index c29af9451d92..4d723dab4336 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > > return ret; > > } > > > > +int split_huge_page(struct page *page) > > +{ > > + unsigned int min_order = 0; > > + struct folio *folio = page_folio(page); > > + > > + if (folio_test_anon(folio)) > > + goto out; > > + > > + if (!folio->mapping) { > > + if (folio_test_pmd_mappable(folio)) > > + count_vm_event(THP_SPLIT_PAGE_FAILED); > > + return -EBUSY; > > + } > > + > > + min_order = mapping_min_folio_order(folio->mapping); > > +out: > > + return split_huge_page_to_list_to_order(page, NULL, min_order); > > +} > > + > > int split_folio_to_list(struct folio *folio, struct list_head *list) > > { > > unsigned int min_order = 0; > > > Confirmed, and also although you suggest it can be improved, I thought > that we could do that by sharing more code and putting things in the > headers, the below also fixes this but tries to share more code, but > I think it is perhaps less easier to understand than your patch. > > So I think your patch is cleaner and easier as a fix. I reproduced it in arm64 as well. And both fixes provided solved the issue. 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index c275aa9cc105..99cd9c7bf55b 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; > (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) > > #define split_folio(f) split_folio_to_list(f, NULL) > +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list) > > #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES > #define HPAGE_PMD_SHIFT PMD_SHIFT > @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add > bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins); > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > unsigned int new_order); > -int split_folio_to_list(struct folio *folio, struct list_head *list); > +int split_page_folio_to_list(struct page *page, struct folio *folio, > + struct list_head *list); > static inline int split_huge_page(struct page *page) > { > - return split_folio(page_folio(page)); > + return split_page_folio_to_list(page, page_folio(page), NULL); > } > void deferred_split_folio(struct folio *folio); > > @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page) > return 0; > } > > -static inline int split_folio_to_list(struct folio *folio, struct list_head *list) > +static inline int split_page_folio_to_list(struct page *page, > + struct folio *folio, > + struct list_head *list) > { > return 0; > } > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 169f1a71c95d..b115bfe63b52 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > return ret; > } > > -int split_folio_to_list(struct folio *folio, struct list_head *list) > +int split_page_folio_to_list(struct page *page, struct folio *folio, > + struct list_head *list) > { > unsigned int min_order = 0; > > @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list) > > min_order = mapping_min_folio_order(folio->mapping); > out: > - return split_huge_page_to_list_to_order(&folio->page, list, > - min_order); > + return split_huge_page_to_list_to_order(page, list, min_order); > } > > void __folio_undo_large_rmappable(struct folio *folio)
On 30/08/2024 01:41, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>>
>> This fixes the crash on my x86 VM, but it can be improved:
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7c50aeed0522..eff5d2fb5d4e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> 		unsigned int new_order);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> -static inline int split_huge_page(struct page *page)
>> -{
>> -	return split_folio(page_folio(page));
>> -}
>> +int split_huge_page(struct page *page);
>> void deferred_split_folio(struct folio *folio);
>>
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29af9451d92..4d723dab4336 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> 	return ret;
>> }
>>
>> +int split_huge_page(struct page *page)
>> +{
>> +	unsigned int min_order = 0;
>> +	struct folio *folio = page_folio(page);
>> +
>> +	if (folio_test_anon(folio))
>> +		goto out;
>> +
>> +	if (!folio->mapping) {
>> +		if (folio_test_pmd_mappable(folio))
>> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
>> +		return -EBUSY;
>> +	}
>> +
>> +	min_order = mapping_min_folio_order(folio->mapping);
>> +out:
>> +	return split_huge_page_to_list_to_order(page, NULL, min_order);
>> +}
>> +
>> int split_folio_to_list(struct folio *folio, struct list_head *list)
>> {
>> 	unsigned int min_order = 0;
>
>
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
>

It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

How about we extract the code that returns the min order so that we don't repeat.

Something like this:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..d27febd5c639 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_folio(page_folio(page));
+	struct folio *folio = page_folio(page);
+	int ret = min_order_for_split(folio);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * split_huge_page() locks the page before splitting and
+	 * expects the same page that has been split to be locked when
+	 * returned. split_folio_to_list() cannot be used here because
+	 * it converts the page to folio and passes the head page to be
+	 * split.
+	 */
+	return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b167e036d01b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }

-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int min_order_for_split(struct folio *folio)
 {
-	unsigned int min_order = 0;
-
 	if (folio_test_anon(folio))
-		goto out;
+		return 0;

 	if (!folio->mapping) {
 		if (folio_test_pmd_mappable(folio))
@@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 		return -EBUSY;
 	}

-	min_order = mapping_min_folio_order(folio->mapping);
-out:
-	return split_huge_page_to_list_to_order(&folio->page, list,
-						min_order);
+	return mapping_min_folio_order(folio->mapping);
+}
+
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	int ret = min_order_for_split(folio);
+
+	if (ret)
+		return ret;
+
+	return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }

 void __folio_undo_large_rmappable(struct folio *folio)

> So I think your patch is cleaner and easier as a fix.
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>
> #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>
> #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> 		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
> static inline int split_huge_page(struct page *page)
> {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
> }
> void deferred_split_folio(struct folio *folio);
>
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
> 	return 0;
> }
>
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
> {
> 	return 0;
> }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> 	return ret;
> }
>
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
> {
> 	unsigned int min_order = 0;
>
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>
> 	min_order = mapping_min_folio_order(folio->mapping);
> out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -						min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
> }
>
> void __folio_undo_large_rmappable(struct folio *folio)
On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

Agreed.

> How about we extract the code that returns the min order so that we don't repeat.
> Something like this:

Of all solutions I like this the most, Zi, do you have any preference?

  Luis
On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
We do that in the rmap code.
But this is not a performance path. We should keep this as simple as
possible.
On 30 Aug 2024, at 10:59, Pankaj Raghav wrote: > On 30/08/2024 01:41, Luis Chamberlain wrote: >> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote: >>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order() >>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer >>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked. >>> But this patch changes the “page” passed into split_huge_page_to_list_to_order() >>> always to the head page. >>> >>> This fixes the crash on my x86 VM, but it can be improved: >>> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>> index 7c50aeed0522..eff5d2fb5d4e 100644 >>> --- a/include/linux/huge_mm.h >>> +++ b/include/linux/huge_mm.h >>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins); >>> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, >>> unsigned int new_order); >>> int split_folio_to_list(struct folio *folio, struct list_head *list); >>> -static inline int split_huge_page(struct page *page) >>> -{ >>> - return split_folio(page_folio(page)); >>> -} >>> +int split_huge_page(struct page *page); >>> void deferred_split_folio(struct folio *folio); >>> >>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index c29af9451d92..4d723dab4336 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, >>> return ret; >>> } >>> >>> +int split_huge_page(struct page *page) >>> +{ >>> + unsigned int min_order = 0; >>> + struct folio *folio = page_folio(page); >>> + >>> + if (folio_test_anon(folio)) >>> + goto out; >>> + >>> + if (!folio->mapping) { >>> + if (folio_test_pmd_mappable(folio)) >>> + count_vm_event(THP_SPLIT_PAGE_FAILED); >>> + return -EBUSY; >>> + } >>> + >>> + min_order = mapping_min_folio_order(folio->mapping); >>> +out: >>> + return split_huge_page_to_list_to_order(page, NULL, min_order); >>> +} >>> + >>> int split_folio_to_list(struct folio *folio, struct list_head *list) >>> { >>> unsigned int min_order = 0; >> >> >> Confirmed, and also although you suggest it can be improved, I thought >> that we could do that by sharing more code and putting things in the >> headers, the below also fixes this but tries to share more code, but >> I think it is perhaps less easier to understand than your patch. >> > It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`. > > How about we extract the code that returns the min order so that we don't repeat. > > Something like this: > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index c275aa9cc105..d27febd5c639 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add > bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins); > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > unsigned int new_order); > +int min_order_for_split(struct folio *folio); > int split_folio_to_list(struct folio *folio, struct list_head *list); Since split_folio() is no longer used below, this line can be removed. 
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	struct folio *folio = page_folio(page);
> +	int ret = min_order_for_split(folio);
> +
> +	if (ret)
> +		return ret;

min_order_for_split() returns -EBUSY, 0, or a positive min_order. This
if statement should be "if (ret < 0)"?

> +
> +	/*
> +	 * split_huge_page() locks the page before splitting and
> +	 * expects the same page that has been split to be locked when
> +	 * returned. split_folio_to_list() cannot be used here because
> +	 * it converts the page to folio and passes the head page to be
> +	 * split.
> +	 */
> +	return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b167e036d01b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int min_order_for_split(struct folio *folio)
>  {
> -	unsigned int min_order = 0;
> -
>  	if (folio_test_anon(folio))
> -		goto out;
> +		return 0;
>
>  	if (!folio->mapping) {
>  		if (folio_test_pmd_mappable(folio))
> @@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  		return -EBUSY;
>  	}
>
> -	min_order = mapping_min_folio_order(folio->mapping);
> -out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -			min_order);
> +	return mapping_min_folio_order(folio->mapping);
> +}
> +
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	int ret = min_order_for_split(folio);
> +
> +	if (ret)
> +		return ret;

Ditto.

> +
> +	return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
>  void __folio_undo_large_rmappable(struct folio *folio)

--
Best Regards,
Yan, Zi
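
For reference, folding the "if (ret < 0)" fix into the proposed wrapper gives roughly the following; this is a sketch assembled from the diff above, not a separately posted patch:

/*
 * min_order_for_split() returns a negative errno (-EBUSY when the
 * folio has been truncated, i.e. has no mapping) or a non-negative
 * minimum order, so only bail out on negative values.
 */
static inline int split_huge_page(struct page *page)
{
	struct folio *folio = page_folio(page);
	int ret = min_order_for_split(folio);

	if (ret < 0)
		return ret;

	/*
	 * Pass the original page rather than the folio head page so
	 * that the subpage the caller locked is still the one locked
	 * when the split returns.
	 */
	return split_huge_page_to_list_to_order(page, NULL, ret);
}

With plain "if (ret)", a positive min_order from a file mapping would be returned to the caller as if it were an error, and the folio would never be split.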
On 30 Aug 2024, at 13:12, Luis Chamberlain wrote:

> On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
>> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> Agreed..
>
>> How about we extract the code that returns the min order so that we don't repeat.
>> Something like this:
>
> Of all solutions I like this the most, Zi, do you have any preference?

No preference, as long as Pankaj fixes the "if (ret)" issue I mentioned
in the other email.

--
Best Regards,
Yan, Zi
Hi Luis,

Greetings!

I used Syzkaller and found that there is a task hang in soft_offline_page
in the Linux-next tree - next-20240902.

After bisection, the first bad commit is:
"
fd031210c9ce mm: split a folio in minimum folio order chunks
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log

"
[ 447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
[ 447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 447.977858]  soft_offline_page+0x97/0xc0
[ 447.978281]  do_madvise.part.0+0x1a45/0x2a30
[ 447.978742]  ? __pfx___lock_acquire+0x10/0x10
[ 447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
[ 447.979716]  ? __this_cpu_preempt_check+0x21/0x30
[ 447.980225]  ? __this_cpu_preempt_check+0x21/0x30
[ 447.980729]  ? lock_release+0x441/0x870
[ 447.981160]  ? __this_cpu_preempt_check+0x21/0x30
[ 447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[ 447.982321]  ? lockdep_hardirqs_on+0x89/0x110
[ 447.982771]  ? trace_hardirqs_on+0x51/0x60
[ 447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[ 447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[ 447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
[ 447.984673]  __x64_sys_madvise+0x139/0x180
[ 447.984997]  x64_sys_call+0x19a5/0x2140
[ 447.985307]  do_syscall_64+0x6d/0x140
[ 447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 447.986011] RIP: 0033:0x7f782623ee5d
[ 447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
[ 447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
[ 447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
[ 447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
[ 447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
[ 447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
[ 447.988799]  </TASK>
[ 447.988921]
[ 447.988921] Showing all locks held in the system:
[ 447.989237] 1 lock held by khungtaskd/33:
[ 447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
[ 447.989947] 1 lock held by repro/628:
[ 447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
[ 447.990611]
[ 447.990701] =============================================
"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem, or
if you already have a reproduction environment, please ignore the
following information.
How to reproduce:

git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version

You could use the command below to log in; there is no password for root:

ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer
the reproducer binary to the vm as below and reproduce the problem there:

gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for the target kernel:

Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage  // x should be equal to or less than the number of cpus your pc has

Fill the bzImage file into the above start3.sh to load the target kernel in the vm.

Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:

git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> [ snip: tags, diffstat, and the full patch, quoted verbatim; the quoted
>   patch is identical to the diff at the end of this thread ]
>
> --
> 2.44.1
On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:

Hi Yi,

>
> I used Syzkaller and found that there is a task hang in soft_offline_page in the Linux-next tree - next-20240902.

I don't know if it is related, but we had a fix for this commit for an
ltp failure due to locking issues that is in next-20240905 but not in
next-20240902.

Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/

Is this reproducible also on next-20240905?

> After bisection, the first bad commit is:
> "
> fd031210c9ce mm: split a folio in minimum folio order chunks
> "
>
> [ snip: repro links and dmesg, quoted verbatim from the report above ]
>
> I hope you find it useful.
>
> Regards,
> Yi Lai
Hi,

I have tried running the repro.c for 1 hour using the next-20240905
kernel. The issue cannot be reproduced.

Thank you.

Regards,
Yi Lai

On Fri, Sep 06, 2024 at 08:01:20AM +0000, Pankaj Raghav (Samsung) wrote:
> On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:
> Hi Yi,
>
> >
> > I used Syzkaller and found that there is a task hang in soft_offline_page in the Linux-next tree - next-20240902.
>
> I don't know if it is related, but we had a fix for this commit for an
> ltp failure due to locking issues that is in next-20240905 but not in
> next-20240902.
>
> Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/
>
> Is this reproducible also on next-20240905?
>
> [ snip: quoted syzkaller report ]
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c32058cacfec..70424d55da088 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf8e34f62976f..06384b85a3a20 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
  * released, or if some unexpected race happened (e.g., anon VMA disappeared,
  * truncation).
  *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
@@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (folio_test_anon(folio))
+		goto out;
+
+	if (!folio->mapping) {
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+		return -EBUSY;
+	}
+
+	min_order = mapping_min_folio_order(folio->mapping);
+out:
+	return split_huge_page_to_list_to_order(&folio->page, list,
+						min_order);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct folio_walk fw;
 		struct folio *folio;
+		struct address_space *mapping;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
-		if (new_order >= folio_order(folio))
+		if (!folio_test_anon(folio)) {
+			mapping = folio->mapping;
+			target_order = max(new_order,
+					   mapping_min_folio_order(mapping));
+		}
+
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		total++;
@@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		folio_get(folio);
 		folio_walk_end(&fw, vma);
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio) && folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
+
 		folio_unlock(folio);
 		folio_put(folio);
 
@@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	pgoff_t index;
 	int nr_pages = 1;
 	unsigned long total = 0, split = 0;
+	unsigned int min_order;
+	unsigned int target_order;
 
 	file = getname_kernel(file_path);
 	if (IS_ERR(file))
@@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		 file_path, off_start, off_end);
 
 	mapping = candidate->f_mapping;
+	min_order = mapping_min_folio_order(mapping);
+	target_order = max(new_order, min_order);
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
@@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
-		if (new_order >= folio_order(folio))
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
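
As a usage note: the debugfs knob touched by the last hunks takes a request of the form "<path>,<off_start>,<off_end>[,<new_order>]" (per the transhuge admin guide), and with this patch the requested order is clamped up to the mapping's minimum folio order. Below is a minimal userspace sketch of exercising it, assuming debugfs is mounted at /sys/kernel/debug; the file path and offsets are hypothetical, for illustration only:

/*
 * Ask the kernel to split file-backed folios in a page-offset range.
 * With target_order = max(new_order, min_order) above, requesting
 * order 0 on a mapping whose min folio order is 4 splits to order 4.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical LBS file and offset range */
	const char *req = "/mnt/xfs/testfile,0x0,0x200,0";
	int fd = open("/sys/kernel/debug/split_huge_pages", O_WRONLY);

	if (fd < 0) {
		perror("open");	/* needs root and CONFIG_DEBUG_FS=y */
		return 1;
	}
	if (write(fd, req, strlen(req)) != (ssize_t)strlen(req))
		perror("write");
	close(fd);
	return 0;
}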