[v13,04/10] mm: split a folio in minimum folio order chunks

Message ID	20240822135018.1931258-5-kernel@pankajraghav.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> To: brauner@kernel.org, akpm@linux-foundation.org Cc: chandan.babu@oracle.com, linux-fsdevel@vger.kernel.org, djwong@kernel.org, hare@suse.de, gost.dev@samsung.com, linux-xfs@vger.kernel.org, kernel@pankajraghav.com, hch@lst.de, david@fromorbit.com, Zi Yan <ziy@nvidia.com>, yang@os.amperecomputing.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org, john.g.garry@oracle.com, cl@os.amperecomputing.com, p.raghav@samsung.com, mcgrof@kernel.org, ryan.roberts@arm.com, David Howells <dhowells@redhat.com> Subject: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Date: Thu, 22 Aug 2024 15:50:12 +0200 Message-ID: <20240822135018.1931258-5-kernel@pankajraghav.com> In-Reply-To: <20240822135018.1931258-1-kernel@pankajraghav.com> References: <20240822135018.1931258-1-kernel@pankajraghav.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	enable bs > ps in XFS \| expand [v13,00/10] enable bs > ps in XFS [v13,01/10] fs: Allow fine-grained control of folio sizes [v13,02/10] filemap: allocate mapping_min_order folios in the page cache [v13,03/10] readahead: allocate folios with mapping_min_order in readahead [v13,04/10] mm: split a folio in minimum folio order chunks [v13,05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() [v13,06/10] iomap: fix iomap_dio_zero() for fs bs > system page size [v13,07/10] xfs: use kvmalloc for xattr buffers [v13,08/10] xfs: expose block size in stat [v13,09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() [v13,10/10] xfs: enable block size larger than page size support

Pankaj Raghav (Samsung) Aug. 22, 2024, 1:50 p.m. UTC

From: Luis Chamberlain <mcgrof@kernel.org>

split_folio() and split_folio_to_list() assume order 0, to support
minorder for non-anonymous folios, we must expand these to check the
folio mapping order and use that.

Set new_order to be at least minimum folio order if it is set in
split_huge_page_to_list() so that we can maintain minimum folio order
requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: David Howells <dhowells@redhat.com>
---
 include/linux/huge_mm.h | 14 +++++++---
 mm/huge_memory.c        | 60 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 66 insertions(+), 8 deletions(-)

Sven Schnelle Aug. 29, 2024, 10:51 a.m. UTC | #1

Hi,

"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:

> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: David Howells <dhowells@redhat.com>

This causes the following warning on s390 with linux-next starting from
next-20240827:

[  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
[  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
[  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
[  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
[  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
[  112.690548] page dumped because: bad pte
[  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
[  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
[  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
[  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[  112.690568] Call Trace:
[  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
[  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
[  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
[  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
[  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
[  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
[  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
[  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
[  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
[  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
[  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
[  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
[  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
[  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
[  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
[  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
[  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
[  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
[  112.690638] Disabling lock debugging due to kernel taint

To reproduce, running the ksm01 testsuite from ltp seems to be
enough. The splat is always triggered immediately. The output from ksm01
is:

tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:344: TINFO: child 2 continues...
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 2 stops.
mem.c:400: TINFO: child 0 stops.
Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks
Sven

Luis Chamberlain Aug. 29, 2024, 6:46 p.m. UTC | #2

On Thu, Aug 29, 2024 at 12:51:25PM +0200, Sven Schnelle wrote:
> Hi,
> 
> "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:
> 
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0, to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain minimum folio order
> > requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Tested-by: David Howells <dhowells@redhat.com>
> 
> This causes the following warning on s390 with linux-next starting from
> next-20240827:
> 
> [  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
> [  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
> [  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
> [  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
> [  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
> [  112.690548] page dumped because: bad pte
> [  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
> [  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> [  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
> [  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
> [  112.690568] Call Trace:
> [  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
> [  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
> [  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
> [  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
> [  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
> [  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
> [  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
> [  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
> [  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
> [  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
> [  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
> [  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
> [  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
> [  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
> [  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
> [  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
> [  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
> [  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
> [  112.690638] Disabling lock debugging due to kernel taint
> 
> To reproduce, running the ksm01 testsuite from ltp seems to be
> enough. The splat is always triggered immediately. The output from ksm01
> is:
> 
> tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
> tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
> tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
> tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:388: TINFO: child 0 stops.
> mem.c:388: TINFO: child 1 stops.
> mem.c:388: TINFO: child 2 stops.
> mem.c:495: TINFO: KSM merging...
> mem.c:434: TINFO: resume all children.
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:344: TINFO: child 0 continues...
> mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
> mem.c:344: TINFO: child 1 continues...
> mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
> mem.c:344: TINFO: child 2 continues...
> mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
> mem.c:400: TINFO: child 1 stops.
> mem.c:400: TINFO: child 2 stops.
> mem.c:400: TINFO: child 0 stops.
> Test timeouted, sending SIGKILL!
> tst_test.c:1700: TINFO: Killed the leftover descendant processes
> tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks for the report and test reproducer, I was able to reproduce on
x86_64 easily with ltp/testcases/kernel/mem/ksm/ksm01 as well:

ltp/testcases/kernel/mem/ksm/ksm01
tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6fc20
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #56 SMP
PREEMPT_DYNAMIC Tue Aug 27 08:10:26 UTC 2024 x86_64
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:344: TINFO: child 2 continues...
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 0 stops.
mem.c:400: TINFO: child 2 stops.


Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try
exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Summary:
passed   0
failed   0
broken   1
skipped  0
warnings 0


With vm debugging however I get more information about the issue:

Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Modules linked in: xfs nvme_fabrics 9p kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd pcspkr 9pnet_virtio joydev virtio_balloon virtio_console evdev button serio_raw drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover nvme psmouse crc32_pclmul crc32c_intel nvme_core virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ---[ end trace 0000000000000000 ]---
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554

Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
cmp_and_merge_page() on the split case:

                } else if (split) {                                             
                        /*                                                      
                         * We are here if we tried to merge two pages and       
                         * failed because they both belonged to the same        
                         * compound page. We will split the page now, but no    
                         * merging will take place.                             
                         * We do not want to add the cost of a full lock; if    
                         * the page is locked, it is better to skip it and      
                         * perhaps try again later.                             
                         */                                                     
                        if (!trylock_page(page))                                
                                return;                                         
                        split_huge_page(page);                                  
                        unlock_page(page);                                      
                }

The trylock_page() is faulty. I'm digging in further.

  Luis

Matthew Wilcox Aug. 29, 2024, 7:55 p.m. UTC | #3

On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
> With vm debugging however I get more information about the issue:
> 
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!

This is in folio_unlock().  We're trying to unlock a folio which isn't
locked!

> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
[...]
> Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
> cmp_and_merge_page() on the split case:
> 
>                 } else if (split) {                                             
>                         /*                                                      
>                          * We are here if we tried to merge two pages and       
>                          * failed because they both belonged to the same        
>                          * compound page. We will split the page now, but no    
>                          * merging will take place.                             
>                          * We do not want to add the cost of a full lock; if    
>                          * the page is locked, it is better to skip it and      
>                          * perhaps try again later.                             
>                          */                                                     
>                         if (!trylock_page(page))                                
>                                 return;                                         
>                         split_huge_page(page);                                  
>                         unlock_page(page);                                      

Obviously the page is locked when we call split_huge_page().  There's
an assert inside it.  And the lock bit is _supposed_ to be transferred
to the head page of the page which is being split.  My guess is that
this is messed up somehow; we're perhaps transferring the lock bit to
the wrong page?

Matthew Wilcox Aug. 29, 2024, 10:11 p.m. UTC | #4

On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote:
> @@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_folio_to_list(struct folio *folio, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_huge_page_to_list_to_order(page, NULL, 0);
> +	return split_folio(page_folio(page));

Oh!  You can't do this!

split_huge_page() takes a precise page, NOT a folio.  That page is
locked.  When we return from split_huge_page(), the new folio which
contains the precise page is locked.

You've made it so that the caller's page's folio won't necessarily
be locked.  More testing was needed ;-P

>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> +
> +static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	return 0;
> +}
> +
>  static inline void deferred_split_folio(struct folio *folio) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
> @@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>  	return split_folio_to_list_to_order(folio, NULL, new_order);
>  }
>  
> -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> -#define split_folio(f) split_folio_to_order(f, 0)
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cf8e34f62976f..06384b85a3a20 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   * released, or if some unexpected race happened (e.g., anon VMA disappeared,
>   * truncation).
>   *
> + * Callers should ensure that the order respects the address space mapping
> + * min-order if one is set for non-anonymous folios.
> + *
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> @@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		mapping = NULL;
>  		anon_vma_lock_write(anon_vma);
>  	} else {
> +		unsigned int min_order;
>  		gfp_t gfp;
>  
>  		mapping = folio->mapping;
> @@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  			goto out;
>  		}
>  
> +		min_order = mapping_min_folio_order(folio->mapping);
> +		if (new_order < min_order) {
> +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> +				     min_order);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>  		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>  							GFP_RECLAIM_MASK);
>  
> @@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (folio_test_anon(folio))
> +		goto out;
> +
> +	if (!folio->mapping) {
> +		if (folio_test_pmd_mappable(folio))
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> +		return -EBUSY;
> +	}
> +
> +	min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +	return split_huge_page_to_list_to_order(&folio->page, list,
> +							min_order);
> +}
> +
>  void __folio_undo_large_rmappable(struct folio *folio)
>  {
>  	struct deferred_split *ds_queue;
> @@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		struct vm_area_struct *vma = vma_lookup(mm, addr);
>  		struct folio_walk fw;
>  		struct folio *folio;
> +		struct address_space *mapping;
> +		unsigned int target_order = new_order;
>  
>  		if (!vma)
>  			break;
> @@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> -		if (new_order >= folio_order(folio))
> +		if (!folio_test_anon(folio)) {
> +			mapping = folio->mapping;
> +			target_order = max(new_order,
> +					   mapping_min_folio_order(mapping));
> +		}
> +
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		total++;
> @@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		folio_get(folio);
>  		folio_walk_end(&fw, vma);
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (!folio_test_anon(folio) && folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
> +
>  		folio_unlock(folio);
>  		folio_put(folio);
>  
> @@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  	pgoff_t index;
>  	int nr_pages = 1;
>  	unsigned long total = 0, split = 0;
> +	unsigned int min_order;
> +	unsigned int target_order;
>  
>  	file = getname_kernel(file_path);
>  	if (IS_ERR(file))
> @@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		 file_path, off_start, off_end);
>  
>  	mapping = candidate->f_mapping;
> +	min_order = mapping_min_folio_order(mapping);
> +	target_order = max(new_order, min_order);
>  
>  	for (index = off_start; index < off_end; index += nr_pages) {
>  		struct folio *folio = filemap_get_folio(mapping, index);
> @@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		total++;
>  		nr_pages = folio_nr_pages(folio);
>  
> -		if (new_order >= folio_order(folio))
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		if (!folio_trylock(folio))
>  			goto next;
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
>  		folio_unlock(folio);
>  next:
>  		folio_put(folio);
> -- 
> 2.44.1
>

Zi Yan Aug. 29, 2024, 10:12 p.m. UTC | #5

On 29 Aug 2024, at 15:55, Matthew Wilcox wrote:

> On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
>> With vm debugging however I get more information about the issue:
>>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
>
> This is in folio_unlock().  We're trying to unlock a folio which isn't
> locked!
>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
> [...]
>> Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
>> cmp_and_merge_page() on the split case:
>>
>>                 } else if (split) {
>>                         /*
>>                          * We are here if we tried to merge two pages and
>>                          * failed because they both belonged to the same
>>                          * compound page. We will split the page now, but no
>>                          * merging will take place.
>>                          * We do not want to add the cost of a full lock; if
>>                          * the page is locked, it is better to skip it and
>>                          * perhaps try again later.
>>                          */
>>                         if (!trylock_page(page))
>>                                 return;
>>                         split_huge_page(page);
>>                         unlock_page(page);
>
> Obviously the page is locked when we call split_huge_page().  There's
> an assert inside it.  And the lock bit is _supposed_ to be transferred
> to the head page of the page which is being split.  My guess is that
> this is messed up somehow; we're perhaps transferring the lock bit to
> the wrong page?

The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
But this patch changes the “page” passed into split_huge_page_to_list_to_order()
always to the head page.

This fixes the crash on my x86 VM, but it can be improved:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7c50aeed0522..eff5d2fb5d4e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-       return split_folio(page_folio(page));
-}
+int split_huge_page(struct page *page);
 void deferred_split_folio(struct folio *folio);

 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29af9451d92..4d723dab4336 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

+int split_huge_page(struct page *page)
+{
+       unsigned int min_order = 0;
+       struct folio *folio = page_folio(page);
+
+       if (folio_test_anon(folio))
+               goto out;
+
+       if (!folio->mapping) {
+               if (folio_test_pmd_mappable(folio))
+                       count_vm_event(THP_SPLIT_PAGE_FAILED);
+               return -EBUSY;
+       }
+
+       min_order = mapping_min_folio_order(folio->mapping);
+out:
+       return split_huge_page_to_list_to_order(page, NULL, min_order);
+}
+
 int split_folio_to_list(struct folio *folio, struct list_head *list)
 {
        unsigned int min_order = 0;

Luis Chamberlain Aug. 29, 2024, 11:41 p.m. UTC | #6

On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> always to the head page.
> 
> This fixes the crash on my x86 VM, but it can be improved:
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7c50aeed0522..eff5d2fb5d4e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> -static inline int split_huge_page(struct page *page)
> -{
> -       return split_folio(page_folio(page));
> -}
> +int split_huge_page(struct page *page);
>  void deferred_split_folio(struct folio *folio);
> 
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29af9451d92..4d723dab4336 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
> 
> +int split_huge_page(struct page *page)
> +{
> +       unsigned int min_order = 0;
> +       struct folio *folio = page_folio(page);
> +
> +       if (folio_test_anon(folio))
> +               goto out;
> +
> +       if (!folio->mapping) {
> +               if (folio_test_pmd_mappable(folio))
> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> +               return -EBUSY;
> +       }
> +
> +       min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> +}
> +
>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>  {
>         unsigned int min_order = 0;


Confirmed, and also although you suggest it can be improved, I thought
that we could do that by sharing more code and putting things in the
headers, the below also fixes this but tries to share more code, but
I think it is perhaps less easier to understand than your patch.

So I think your patch is cleaner and easier as a fix.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..99cd9c7bf55b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
+#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
@@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
-int split_folio_to_list(struct folio *folio, struct list_head *list);
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_folio(page_folio(page));
+	return split_page_folio_to_list(page, page_folio(page), NULL);
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
 	return 0;
 }
 
-static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+static inline int split_page_folio_to_list(struct page *page,
+					   struct folio *folio,
+					   struct list_head *list)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b115bfe63b52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list)
 {
 	unsigned int min_order = 0;
 
@@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 
 	min_order = mapping_min_folio_order(folio->mapping);
 out:
-	return split_huge_page_to_list_to_order(&folio->page, list,
-							min_order);
+	return split_huge_page_to_list_to_order(page, list, min_order);
 }
 
 void __folio_undo_large_rmappable(struct folio *folio)

Sven Schnelle Aug. 30, 2024, 5:57 a.m. UTC | #7

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>> 
>> This fixes the crash on my x86 VM, but it can be improved:

It also fixes the issue on s390x. Thanks!

Daniel Gomez Aug. 30, 2024, 11:58 a.m. UTC | #8

On Thu, Aug 29, 2024 at 04:41:48PM -0700, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> > The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> > unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> > to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> > But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> > always to the head page.
> > 
> > This fixes the crash on my x86 VM, but it can be improved:
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 7c50aeed0522..eff5d2fb5d4e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
> >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >                 unsigned int new_order);
> >  int split_folio_to_list(struct folio *folio, struct list_head *list);
> > -static inline int split_huge_page(struct page *page)
> > -{
> > -       return split_folio(page_folio(page));
> > -}
> > +int split_huge_page(struct page *page);
> >  void deferred_split_folio(struct folio *folio);
> > 
> >  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c29af9451d92..4d723dab4336 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >         return ret;
> >  }
> > 
> > +int split_huge_page(struct page *page)
> > +{
> > +       unsigned int min_order = 0;
> > +       struct folio *folio = page_folio(page);
> > +
> > +       if (folio_test_anon(folio))
> > +               goto out;
> > +
> > +       if (!folio->mapping) {
> > +               if (folio_test_pmd_mappable(folio))
> > +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> > +               return -EBUSY;
> > +       }
> > +
> > +       min_order = mapping_min_folio_order(folio->mapping);
> > +out:
> > +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> > +}
> > +
> >  int split_folio_to_list(struct folio *folio, struct list_head *list)
> >  {
> >         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
> So I think your patch is cleaner and easier as a fix.

I reproduced it in arm64 as well. And both fixes provided solved the issue.

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

Pankaj Raghav (Samsung) Aug. 30, 2024, 2:59 p.m. UTC | #9

On 30/08/2024 01:41, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>>
>> This fixes the crash on my x86 VM, but it can be improved:
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7c50aeed0522..eff5d2fb5d4e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>                 unsigned int new_order);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>> -static inline int split_huge_page(struct page *page)
>> -{
>> -       return split_folio(page_folio(page));
>> -}
>> +int split_huge_page(struct page *page);
>>  void deferred_split_folio(struct folio *folio);
>>
>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29af9451d92..4d723dab4336 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>         return ret;
>>  }
>>
>> +int split_huge_page(struct page *page)
>> +{
>> +       unsigned int min_order = 0;
>> +       struct folio *folio = page_folio(page);
>> +
>> +       if (folio_test_anon(folio))
>> +               goto out;
>> +
>> +       if (!folio->mapping) {
>> +               if (folio_test_pmd_mappable(folio))
>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>> +               return -EBUSY;
>> +       }
>> +
>> +       min_order = mapping_min_folio_order(folio->mapping);
>> +out:
>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>> +}
>> +
>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>  {
>>         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

How about we extract the code that returns the min order so that we don't repeat.

Something like this:


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..d27febd5c639 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
+int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-       return split_folio(page_folio(page));
+       struct folio *folio = page_folio(page);
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       /*
+        * split_huge_page() locks the page before splitting and
+        * expects the same page that has been split to be locked when
+        * returned. split_folio_to_list() cannot be used here because
+        * it converts the page to folio and passes the head page to be
+        * split.
+        */
+       return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b167e036d01b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int min_order_for_split(struct folio *folio)
 {
-       unsigned int min_order = 0;
-
        if (folio_test_anon(folio))
-               goto out;
+               return 0;

        if (!folio->mapping) {
                if (folio_test_pmd_mappable(folio))
@@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
                return -EBUSY;
        }

-       min_order = mapping_min_folio_order(folio->mapping);
-out:
-       return split_huge_page_to_list_to_order(&folio->page, list,
-                                                       min_order);
+       return mapping_min_folio_order(folio->mapping);
+}
+
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }

 void __folio_undo_large_rmappable(struct folio *folio)

> So I think your patch is cleaner and easier as a fix.
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

Luis Chamberlain Aug. 30, 2024, 5:12 p.m. UTC | #10

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

Agreed..

> How about we extract the code that returns the min order so that we don't repeat.
> Something like this:

Of all solutions I like this the most, Zi, do you have any preference?

  Luis

Matthew Wilcox Aug. 30, 2024, 10:42 p.m. UTC | #11

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

We do that in the rmap code.

But this is not a performance path.  We should keep this as simple as
possible.

Zi Yan Aug. 31, 2024, 10:35 p.m. UTC | #12

On 30 Aug 2024, at 10:59, Pankaj Raghav wrote:

> On 30/08/2024 01:41, Luis Chamberlain wrote:
>> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>>> always to the head page.
>>>
>>> This fixes the crash on my x86 VM, but it can be improved:
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 7c50aeed0522..eff5d2fb5d4e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>                 unsigned int new_order);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> -static inline int split_huge_page(struct page *page)
>>> -{
>>> -       return split_folio(page_folio(page));
>>> -}
>>> +int split_huge_page(struct page *page);
>>>  void deferred_split_folio(struct folio *folio);
>>>
>>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c29af9451d92..4d723dab4336 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>         return ret;
>>>  }
>>>
>>> +int split_huge_page(struct page *page)
>>> +{
>>> +       unsigned int min_order = 0;
>>> +       struct folio *folio = page_folio(page);
>>> +
>>> +       if (folio_test_anon(folio))
>>> +               goto out;
>>> +
>>> +       if (!folio->mapping) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>>> +               return -EBUSY;
>>> +       }
>>> +
>>> +       min_order = mapping_min_folio_order(folio->mapping);
>>> +out:
>>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>>> +}
>>> +
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>>  {
>>>         unsigned int min_order = 0;
>>
>>
>> Confirmed, and also although you suggest it can be improved, I thought
>> that we could do that by sharing more code and putting things in the
>> headers, the below also fixes this but tries to share more code, but
>> I think it is perhaps less easier to understand than your patch.
>>
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> How about we extract the code that returns the min order so that we don't repeat.
>
> Something like this:
>
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..d27febd5c639 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
> +int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);

Since split_folio() is no longer used below, this line can be removed.

>  static inline int split_huge_page(struct page *page)
>  {
> -       return split_folio(page_folio(page));
> +       struct folio *folio = page_folio(page);
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

min_order_for_split() returns -EBUSY, 0, and a positive min_order. This if
statement should be "if (ret < 0)"?

> +
> +       /*
> +        * split_huge_page() locks the page before splitting and
> +        * expects the same page that has been split to be locked when
> +        * returned. split_folio_to_list() cannot be used here because
> +        * it converts the page to folio and passes the head page to be
> +        * split.
> +        */
> +       return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b167e036d01b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
>
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int min_order_for_split(struct folio *folio)
>  {
> -       unsigned int min_order = 0;
> -
>         if (folio_test_anon(folio))
> -               goto out;
> +               return 0;
>
>         if (!folio->mapping) {
>                 if (folio_test_pmd_mappable(folio))
> @@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>                 return -EBUSY;
>         }
>
> -       min_order = mapping_min_folio_order(folio->mapping);
> -out:
> -       return split_huge_page_to_list_to_order(&folio->page, list,
> -                                                       min_order);
> +       return mapping_min_folio_order(folio->mapping);
> +}
> +
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

Ditto.

> +
> +       return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
>  void __folio_undo_large_rmappable(struct folio *folio)
>


--
Best Regards,
Yan, Zi

Zi Yan Aug. 31, 2024, 10:38 p.m. UTC | #13

On 30 Aug 2024, at 13:12, Luis Chamberlain wrote:

> On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
>> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> Agreed..
>
>> How about we extract the code that returns the min order so that we don't repeat.
>> Something like this:
>
> Of all solutions I like this the most, Zi, do you have any preference?

No preference, as long as Pankaj fixed the "if (ret)" issue I mentioned in the
other email.

--
Best Regards,
Yan, Zi

Lai, Yi Sept. 6, 2024, 6:52 a.m. UTC | #14

Hi Luis,

Greetings!

I used Syzkaller and found that there is task hang in soft_offline_page in Linux-next tree - next-20240902.

After bisection and the first bad commit is:
"
fd031210c9ce mm: split a folio in minimum folio order chunks
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log

"
[  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
[  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  447.977858]  soft_offline_page+0x97/0xc0
[  447.978281]  do_madvise.part.0+0x1a45/0x2a30
[  447.978742]  ? __pfx___lock_acquire+0x10/0x10
[  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
[  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
[  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
[  447.980729]  ? lock_release+0x441/0x870
[  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
[  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
[  447.982771]  ? trace_hardirqs_on+0x51/0x60
[  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
[  447.984673]  __x64_sys_madvise+0x139/0x180
[  447.984997]  x64_sys_call+0x19a5/0x2140
[  447.985307]  do_syscall_64+0x6d/0x140
[  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  447.986011] RIP: 0033:0x7f782623ee5d
[  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
[  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
[  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
[  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
[  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
[  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
[  447.988799]  </TASK>
[  447.988921]
[  447.988921] Showing all locks held in the system:
[  447.989237] 1 lock held by khungtaskd/33:
[  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
[  447.989947] 1 lock held by repro/628:
[  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
[  447.990611]
[  447.990701] =============================================

"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
> 
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
> 
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: David Howells <dhowells@redhat.com>
> ---
>  include/linux/huge_mm.h | 14 +++++++---
>  mm/huge_memory.c        | 60 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 66 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfec..70424d55da088 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -96,6 +96,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
> +#define split_folio(f) split_folio_to_list(f, NULL)
> +
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
>  #define HPAGE_PUD_SHIFT PUD_SHIFT
> @@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_folio_to_list(struct folio *folio, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_huge_page_to_list_to_order(page, NULL, 0);
> +	return split_folio(page_folio(page));
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> +
> +static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	return 0;
> +}
> +
>  static inline void deferred_split_folio(struct folio *folio) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
> @@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>  	return split_folio_to_list_to_order(folio, NULL, new_order);
>  }
>  
> -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> -#define split_folio(f) split_folio_to_order(f, 0)
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cf8e34f62976f..06384b85a3a20 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   * released, or if some unexpected race happened (e.g., anon VMA disappeared,
>   * truncation).
>   *
> + * Callers should ensure that the order respects the address space mapping
> + * min-order if one is set for non-anonymous folios.
> + *
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> @@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		mapping = NULL;
>  		anon_vma_lock_write(anon_vma);
>  	} else {
> +		unsigned int min_order;
>  		gfp_t gfp;
>  
>  		mapping = folio->mapping;
> @@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  			goto out;
>  		}
>  
> +		min_order = mapping_min_folio_order(folio->mapping);
> +		if (new_order < min_order) {
> +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> +				     min_order);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>  		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>  							GFP_RECLAIM_MASK);
>  
> @@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (folio_test_anon(folio))
> +		goto out;
> +
> +	if (!folio->mapping) {
> +		if (folio_test_pmd_mappable(folio))
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> +		return -EBUSY;
> +	}
> +
> +	min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +	return split_huge_page_to_list_to_order(&folio->page, list,
> +							min_order);
> +}
> +
>  void __folio_undo_large_rmappable(struct folio *folio)
>  {
>  	struct deferred_split *ds_queue;
> @@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		struct vm_area_struct *vma = vma_lookup(mm, addr);
>  		struct folio_walk fw;
>  		struct folio *folio;
> +		struct address_space *mapping;
> +		unsigned int target_order = new_order;
>  
>  		if (!vma)
>  			break;
> @@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> -		if (new_order >= folio_order(folio))
> +		if (!folio_test_anon(folio)) {
> +			mapping = folio->mapping;
> +			target_order = max(new_order,
> +					   mapping_min_folio_order(mapping));
> +		}
> +
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		total++;
> @@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		folio_get(folio);
>  		folio_walk_end(&fw, vma);
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (!folio_test_anon(folio) && folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
> +
>  		folio_unlock(folio);
>  		folio_put(folio);
>  
> @@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  	pgoff_t index;
>  	int nr_pages = 1;
>  	unsigned long total = 0, split = 0;
> +	unsigned int min_order;
> +	unsigned int target_order;
>  
>  	file = getname_kernel(file_path);
>  	if (IS_ERR(file))
> @@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		 file_path, off_start, off_end);
>  
>  	mapping = candidate->f_mapping;
> +	min_order = mapping_min_folio_order(mapping);
> +	target_order = max(new_order, min_order);
>  
>  	for (index = off_start; index < off_end; index += nr_pages) {
>  		struct folio *folio = filemap_get_folio(mapping, index);
> @@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		total++;
>  		nr_pages = folio_nr_pages(folio);
>  
> -		if (new_order >= folio_order(folio))
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		if (!folio_trylock(folio))
>  			goto next;
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
>  		folio_unlock(folio);
>  next:
>  		folio_put(folio);
> -- 
> 2.44.1
>

Pankaj Raghav (Samsung) Sept. 6, 2024, 8:01 a.m. UTC | #15

On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:
Hi Yi,

> 
> I used Syzkaller and found that there is task hang in soft_offline_page in Linux-next tree - next-20240902.

I don't know if it is related, but we had a fix for this commit for a
ltp failure due to locking issues that is there in next-20240905 but not
in next-20240902.

Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/

Is this reproducible also on next-20240905?

> 
> After bisection and the first bad commit is:
> "
> fd031210c9ce mm: split a folio in minimum folio order chunks
> "
> 
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> 
> "
> [  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
> [  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [  447.977858]  soft_offline_page+0x97/0xc0
> [  447.978281]  do_madvise.part.0+0x1a45/0x2a30
> [  447.978742]  ? __pfx___lock_acquire+0x10/0x10
> [  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
> [  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.980729]  ? lock_release+0x441/0x870
> [  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> [  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
> [  447.982771]  ? trace_hardirqs_on+0x51/0x60
> [  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> [  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> [  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> [  447.984673]  __x64_sys_madvise+0x139/0x180
> [  447.984997]  x64_sys_call+0x19a5/0x2140
> [  447.985307]  do_syscall_64+0x6d/0x140
> [  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  447.986011] RIP: 0033:0x7f782623ee5d
> [  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> [  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
> [  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
> [  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
> [  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
> [  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
> [  447.988799]  </TASK>
> [  447.988921]
> [  447.988921] Showing all locks held in the system:
> [  447.989237] 1 lock held by khungtaskd/33:
> [  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
> [  447.989947] 1 lock held by repro/628:
> [  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
> [  447.990611]
> [  447.990701] =============================================
> 
> "
> 
> I hope you find it useful.
> 
> Regards,
> Yi Lai
>

Lai, Yi Sept. 9, 2024, 9:06 a.m. UTC | #16

Hi,

I have tried running the repro.c for 1 hour using next-20240905 kernel. Issue
cannot be reproduced.

Thank you.

Regards,
Yi Lai

On Fri, Sep 06, 2024 at 08:01:20AM +0000, Pankaj Raghav (Samsung) wrote:
> On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:
> Hi Yi,
> 
> > 
> > I used Syzkaller and found that there is task hang in soft_offline_page in Linux-next tree - next-20240902.
> 
> I don't know if it is related, but we had a fix for this commit for a
> ltp failure due to locking issues that is there in next-20240905 but not
> in next-20240902.
> 
> Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/
> 
> Is this reproducible also on next-20240905?
> 
> > 
> > After bisection and the first bad commit is:
> > "
> > fd031210c9ce mm: split a folio in minimum folio order chunks
> > "
> > 
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> > 
> > "
> > [  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
> > [  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [  447.977858]  soft_offline_page+0x97/0xc0
> > [  447.978281]  do_madvise.part.0+0x1a45/0x2a30
> > [  447.978742]  ? __pfx___lock_acquire+0x10/0x10
> > [  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
> > [  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.980729]  ? lock_release+0x441/0x870
> > [  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> > [  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
> > [  447.982771]  ? trace_hardirqs_on+0x51/0x60
> > [  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> > [  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> > [  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> > [  447.984673]  __x64_sys_madvise+0x139/0x180
> > [  447.984997]  x64_sys_call+0x19a5/0x2140
> > [  447.985307]  do_syscall_64+0x6d/0x140
> > [  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  447.986011] RIP: 0033:0x7f782623ee5d
> > [  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> > [  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
> > [  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
> > [  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
> > [  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
> > [  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
> > [  447.988799]  </TASK>
> > [  447.988921]
> > [  447.988921] Showing all locks held in the system:
> > [  447.989237] 1 lock held by khungtaskd/33:
> > [  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
> > [  447.989947] 1 lock held by repro/628:
> > [  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
> > [  447.990611]
> > [  447.990701] =============================================
> > 
> > "
> > 
> > I hope you find it useful.
> > 
> > Regards,
> > Yi Lai
> >

[v13,04/10] mm: split a folio in minimum folio order chunks

Commit Message

Comments

Patch