Message ID | 20240425113746.335530-6-kernel@pankajraghav.com (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | enable bs > ps in XFS | expand |
On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > From: Pankaj Raghav <p.raghav@samsung.com> > > Splitting a larger folio with a base order is supported using > split_huge_page_to_list_to_order() API. However, using that API for LBS > is resulting in an NULL ptr dereference error in the writeback path [1]. > > Refuse to split a folio if it has minimum folio order requirement until > we can start using split_huge_page_to_list_to_order() API. Splitting the > folio can be added as a later optimization. > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df Obviously this has to be tracked down and fixed before this patchset can be merged ... I think I have some ideas. Let me look a bit. How would I go about reproducing this?
On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > > From: Pankaj Raghav <p.raghav@samsung.com> > > > > Splitting a larger folio with a base order is supported using > > split_huge_page_to_list_to_order() API. However, using that API for LBS > > is resulting in an NULL ptr dereference error in the writeback path [1]. > > > > Refuse to split a folio if it has minimum folio order requirement until > > we can start using split_huge_page_to_list_to_order() API. Splitting the > > folio can be added as a later optimization. > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > > Obviously this has to be tracked down and fixed before this patchset can > be merged ... I think I have some ideas. Let me look a bit. How > would I go about reproducing this? Using kdevops this is easy: make defconfig-lbs-xfs-small -j $(nproc) make -j $(nproc) make fstests make linux make fstests-baseline TESTS=generic/447 COUNT=10 tail -f guestfs/*-xfs-reflink-16k-4ks/console.log or sudo virsh list sudo virsh console ${foo}-xfs-reflink-16k-4ks Where $foo is the value of CONFIG_KDEVOPS_HOSTS_PREFIX in .config for your kdevops run. Otherwise if you wanna run things manually the above uses an lbs branch called large-block-minorder on kdevops [0] based on v6.9-rc5 with: a) Fixes we know we need b) this patch series minus this patch c) A truncation enablement patch Note that the above also uses an fstests git tree with the fstests changes we also have posted as fixes and some new tests which have been posted [1]. You will then want to run: ./check -s xfs_reflink_16k_4ks -I 10 generic/447 The configuration for xfs_reflink_16k_4ks follows: cat /var/lib/xfstests/configs/min-xfs-reflink-16k-4ks.config [default] FSTYP=xfs TEST_DIR=/media/test SCRATCH_MNT=/media/scratch RESULT_BASE=$PWD/results/$HOST/$(uname -r) DUMP_CORRUPT_FS=1 CANON_DEVS=yes RECREATE_TEST_DEV=true SOAK_DURATION=9900 [xfs_reflink_16k_4ks] TEST_DEV=/dev/loop16 SCRATCH_DEV_POOL="/dev/loop5 /dev/loop6 /dev/loop7 /dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11 /dev/loop12" MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -b size=16384, -s size=4k' USE_EXTERNAL=no LOGWRITES_DEV=/dev/loop15 I didn't have time to verify if the above commands for kdevops worked but... in theory its possible it may, because you know, May is right around the corner, and May... the force be with us. [0] https://github.com/linux-kdevops/linux/tree/large-block-minorder [1] https://github.com/linux-kdevops/fstests Luis
On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > > From: Pankaj Raghav <p.raghav@samsung.com> > > > > Splitting a larger folio with a base order is supported using > > split_huge_page_to_list_to_order() API. However, using that API for LBS > > is resulting in an NULL ptr dereference error in the writeback path [1]. > > > > Refuse to split a folio if it has minimum folio order requirement until > > we can start using split_huge_page_to_list_to_order() API. Splitting the > > folio can be added as a later optimization. > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > > Obviously this has to be tracked down and fixed before this patchset can > be merged ... I think I have some ideas. Let me look a bit. How > would I go about reproducing this? I am able to reproduce it in a VM with 4G RAM and running generic/447 (sometimes you have to run it twice) on a 16K BS on a 4K PS system. I have a suspicion on this series: https://lore.kernel.org/linux-fsdevel/20240215063649.2164017-1-hch@lst.de/ but I am still unsure why this is happening when we split with LBS configurations. If you have kdevops installed, then go with Luis's suggestion, or else this is my local config. This is the diff I applied instead of this patch: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9859aa4f7553..63ee7b6ed03d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, { struct folio *folio = page_folio(page); struct deferred_split *ds_queue = get_deferred_split_queue(folio); + unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping); + + if (!folio_test_anon(folio)) + new_order = max_t(unsigned int, mapping_min_order, new_order); /* reset xarray order to new order after split */ XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); struct anon_vma *anon_vma = NULL; @@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, goto out; } + // XXX: Remove it later + VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio); gfp = current_gfp_context(mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK); (END) xfstests is based on https://github.com/kdave/xfstests/tree/v2024.04.14 xfstests config: [default] FSTYP=xfs RESULT_BASE=/root/results/ DUMP_CORRUPT_FS=1 CANON_DEVS=yes RECREATE_TEST_DEV=true TEST_DEV=/dev/nvme0n1 TEST_DIR=/media/test SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/media/scratch LOGWRITES_DEV=/dev/vdc [16k_4ks] MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -b size=16k, -s size=4k' [nix-shell:~]# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS vdb 254:16 0 32G 0 disk /media/scratch vdc 254:32 0 32G 0 disk nvme0n1 259:0 0 32G 0 disk /media/test $ ./check -s 16k_4ks generic/447 BT: [ 74.170698] BUG: KASAN: null-ptr-deref in filemap_get_folios_tag+0x14b/0x510 [ 74.170938] Write of size 4 at addr 0000000000000036 by task kworker/u16:6/284 [ 74.170938] [ 74.170938] CPU: 0 PID: 284 Comm: kworker/u16:6 Not tainted 6.9.0-rc4-00011-g4676d00b6f6f #7 [ 74.170938] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 74.170938] Workqueue: writeback wb_workfn (flush-254:16) [ 74.170938] Call Trace: [ 74.170938] <TASK> [ 74.170938] dump_stack_lvl+0x51/0x70 [ 74.170938] kasan_report+0xab/0xe0 [ 74.170938] ? filemap_get_folios_tag+0x14b/0x510 [ 74.170938] kasan_check_range+0x35/0x1b0 [ 74.170938] filemap_get_folios_tag+0x14b/0x510 [ 74.170938] ? __pfx_filemap_get_folios_tag+0x10/0x10 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] writeback_iter+0x508/0xcc0 [ 74.170938] ? __pfx_iomap_do_writepage+0x10/0x10 [ 74.170938] write_cache_pages+0x80/0x100 [ 74.170938] ? __pfx_write_cache_pages+0x10/0x10 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? _raw_spin_lock+0x87/0xe0 [ 74.170938] iomap_writepages+0x85/0xe0 [ 74.170938] xfs_vm_writepages+0xe3/0x140 [xfs] [ 74.170938] ? __pfx_xfs_vm_writepages+0x10/0x10 [xfs] [ 74.170938] ? kasan_save_track+0x10/0x30 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? __kasan_kmalloc+0x7b/0x90 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? virtqueue_add_split+0x605/0x1b00 [ 74.170938] do_writepages+0x176/0x740 [ 74.170938] ? __pfx_do_writepages+0x10/0x10 [ 74.170938] ? __pfx_virtqueue_add_split+0x10/0x10 [ 74.170938] ? __pfx_update_sd_lb_stats.constprop.0+0x10/0x10 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? virtqueue_add_sgs+0xfe/0x130 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? virtblk_add_req+0x15c/0x280 [ 74.170938] __writeback_single_inode+0x9f/0x840 [ 74.170938] ? wbc_attach_and_unlock_inode+0x345/0x5d0 [ 74.170938] writeback_sb_inodes+0x491/0xce0 [ 74.170938] ? __pfx_wb_calc_thresh+0x10/0x10 [ 74.170938] ? __pfx_writeback_sb_inodes+0x10/0x10 [ 74.170938] ? __wb_calc_thresh+0x1a0/0x3c0 [ 74.170938] ? __pfx_down_read_trylock+0x10/0x10 [ 74.170938] ? wb_over_bg_thresh+0x16b/0x5e0 [ 74.170938] ? __pfx_move_expired_inodes+0x10/0x10 [ 74.170938] __writeback_inodes_wb+0xb7/0x200 [ 74.170938] wb_writeback+0x2c4/0x660 [ 74.170938] ? __pfx_wb_writeback+0x10/0x10 [ 74.170938] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 74.170938] wb_workfn+0x54e/0xaf0 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? __pfx_wb_workfn+0x10/0x10 [ 74.170938] ? __pfx___schedule+0x10/0x10 [ 74.170938] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 74.170938] process_one_work+0x622/0x1020 [ 74.170938] worker_thread+0x844/0x10e0 [ 74.170938] ? srso_return_thunk+0x5/0x5f [ 74.170938] ? __kthread_parkme+0x82/0x150 [ 74.170938] ? __pfx_worker_thread+0x10/0x10 [ 74.170938] kthread+0x2b4/0x380 [ 74.170938] ? __pfx_kthread+0x10/0x10 [ 74.170938] ret_from_fork+0x30/0x70 [ 74.170938] ? __pfx_kthread+0x10/0x10 [ 74.170938] ret_from_fork_asm+0x1a/0x30 [ 74.170938] </TASK>
On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > > > From: Pankaj Raghav <p.raghav@samsung.com> > > > > > > using that API for LBS is resulting in an NULL ptr dereference > > > error in the writeback path [1]. > > > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > > > > How would I go about reproducing this? > > Using kdevops this is easy: > > make defconfig-lbs-xfs-small -j $(nproc) > make -j $(nproc) I forgot after the above: make bringup > make fstests > make linux > make fstests-baseline TESTS=generic/447 COUNT=10 > tail -f guestfs/*-xfs-reflink-16k-4ks/console.log The above tail command needs sudo prefix for now too. > or > sudo virsh list > sudo virsh console ${foo}-xfs-reflink-16k-4ks > > Where $foo is the value of CONFIG_KDEVOPS_HOSTS_PREFIX in .config for > your kdevops run. > > I didn't have time to verify if the above commands for kdevops worked I did now, I forgot to git push commits to kdevops yesterday but I confirm the above steps can be used to repdroduce this issue now. Luis
On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: > On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: > > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > > > > From: Pankaj Raghav <p.raghav@samsung.com> > > > > > > > > using that API for LBS is resulting in an NULL ptr dereference > > > > error in the writeback path [1]. > > > > > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > > > > > > How would I go about reproducing this? Well so the below fixes this but I am not sure if this is correct. folio_mark_dirty() at least says that a folio should not be truncated while its running. I am not sure if we should try to split folios then even though we check for writeback once. truncate_inode_partial_folio() will folio_wait_writeback() but it will split_folio() before checking for claiming to fail to truncate with folio_test_dirty(). But since the folio is locked its not clear why this should be possible. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 83955362d41c..90195506211a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, if (new_order >= folio_order(folio)) return -EINVAL; - if (folio_test_writeback(folio)) + if (folio_test_dirty(folio) || folio_test_writeback(folio)) return -EBUSY; if (!folio_test_anon(folio)) {
On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote: > On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: > > On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: > > > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > > > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > > > > > From: Pankaj Raghav <p.raghav@samsung.com> > > > > > > > > > > using that API for LBS is resulting in an NULL ptr dereference > > > > > error in the writeback path [1]. > > > > > > > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > > > > > > > > How would I go about reproducing this? > > Well so the below fixes this but I am not sure if this is correct. > folio_mark_dirty() at least says that a folio should not be truncated > while its running. I am not sure if we should try to split folios then > even though we check for writeback once. truncate_inode_partial_folio() > will folio_wait_writeback() but it will split_folio() before checking > for claiming to fail to truncate with folio_test_dirty(). But since the > folio is locked its not clear why this should be possible. > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 83955362d41c..90195506211a 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > if (new_order >= folio_order(folio)) > return -EINVAL; > > - if (folio_test_writeback(folio)) > + if (folio_test_dirty(folio) || folio_test_writeback(folio)) > return -EBUSY; > > if (!folio_test_anon(folio)) { I wondered what code path is causing this and triggering this null pointer, so I just sprinkled a check here: VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio); The answer was: kcompactd() --> migrate_pages_batch() --> try_split_folio --> split_folio_to_list() --> split_huge_page_to_list_to_order() Since it took running fstests generic/447 twice to reproduce the crash with a minorder and 16k block size, modified generic/447 as followed and it helps to speed up the reproducer with just running the test once: diff --git a/tests/generic/447 b/tests/generic/447 index 16b814ee7347..43050b58e8ba 100755 --- a/tests/generic/447 +++ b/tests/generic/447 @@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1 testdir="$SCRATCH_MNT/test-$seq" mkdir "$testdir" +runfile="$tmp.compaction" +touch $runfile +while [ -e $runfile ]; do + echo 1 > /proc/sys/vm/compact_memory + sleep 10 +done & +compaction_pid=$! + + # Setup for one million blocks, but we'll accept stress testing down to # 2^17 blocks... that should be plenty for anyone. fnr=20 @@ -69,6 +81,8 @@ _scratch_cycle_mount echo "Delete file1" rm -rf $testdir/file1 +rm -f $runfile +wait > /dev/null 2>&1 # success, all done status=0 exit And I verified that moving the check only to the migrate_pages_batch() path also fixes the crash: diff --git a/mm/migrate.c b/mm/migrate.c index 73a052a382f1..83b528eb7100 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f int rc; folio_lock(folio); + if (folio_test_dirty(folio)) { + rc = -EBUSY; + goto out; + } rc = split_folio_to_list(folio, split_folios); +out: folio_unlock(folio); if (!rc) list_move_tail(&folio->lru, split_folios); However I'd like compaction folks to review this. I see some indications in the code that migration can race with truncation but we feel fine by it by taking the folio lock. However here we have a case where we see the folio clearly locked and the folio is dirty. Other migraiton code seems to write back the code and can wait, here we just move on. Further reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping as completely unmovable") seems to hint that migration is safe if the mapping either does not exist or the mapping does exist but has mapping->a_ops->migrate_folio so I'd like further feedback on this. Another thing which requires review is if we we split a folio but not down to order 0 but to the new min order, does the accounting on migrate_pages_batch() require changing? And most puzzling, why do we not see this with regular large folios, but we do see it with minorder ? Luis
On 28 Apr 2024, at 23:56, Luis Chamberlain wrote: > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote: >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: >>>>>> From: Pankaj Raghav <p.raghav@samsung.com> >>>>>> >>>>>> using that API for LBS is resulting in an NULL ptr dereference >>>>>> error in the writeback path [1]. >>>>>> >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df >>>>> >>>>> How would I go about reproducing this? >> >> Well so the below fixes this but I am not sure if this is correct. >> folio_mark_dirty() at least says that a folio should not be truncated >> while its running. I am not sure if we should try to split folios then >> even though we check for writeback once. truncate_inode_partial_folio() >> will folio_wait_writeback() but it will split_folio() before checking >> for claiming to fail to truncate with folio_test_dirty(). But since the >> folio is locked its not clear why this should be possible. >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 83955362d41c..90195506211a 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, >> if (new_order >= folio_order(folio)) >> return -EINVAL; >> >> - if (folio_test_writeback(folio)) >> + if (folio_test_dirty(folio) || folio_test_writeback(folio)) >> return -EBUSY; >> >> if (!folio_test_anon(folio)) { > > I wondered what code path is causing this and triggering this null > pointer, so I just sprinkled a check here: > > VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio); > > The answer was: > > kcompactd() --> migrate_pages_batch() > --> try_split_folio --> split_folio_to_list() --> > split_huge_page_to_list_to_order() > There are 3 try_split_folio() in migrate_pages_batch(). First one is to split anonymous large folios that are on deferred split list, so not related; second one is to split THPs when thp migration is not supported, but this is compaction, so not related; third one is to split large folios when there is no same size free page in the system, and this should be the one. > Since it took running fstests generic/447 twice to reproduce the crash > with a minorder and 16k block size, modified generic/447 as followed and > it helps to speed up the reproducer with just running the test once: > > diff --git a/tests/generic/447 b/tests/generic/447 > index 16b814ee7347..43050b58e8ba 100755 > --- a/tests/generic/447 > +++ b/tests/generic/447 > @@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1 > testdir="$SCRATCH_MNT/test-$seq" > mkdir "$testdir" > > +runfile="$tmp.compaction" > +touch $runfile > +while [ -e $runfile ]; do > + echo 1 > /proc/sys/vm/compact_memory > + sleep 10 > +done & > +compaction_pid=$! > + > + > # Setup for one million blocks, but we'll accept stress testing down to > # 2^17 blocks... that should be plenty for anyone. > fnr=20 > @@ -69,6 +81,8 @@ _scratch_cycle_mount > echo "Delete file1" > rm -rf $testdir/file1 > > +rm -f $runfile > +wait > /dev/null 2>&1 > # success, all done > status=0 > exit > > And I verified that moving the check only to the migrate_pages_batch() > path also fixes the crash: > > diff --git a/mm/migrate.c b/mm/migrate.c > index 73a052a382f1..83b528eb7100 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > int rc; > > folio_lock(folio); > + if (folio_test_dirty(folio)) { > + rc = -EBUSY; > + goto out; > + } > rc = split_folio_to_list(folio, split_folios); > +out: > folio_unlock(folio); > if (!rc) > list_move_tail(&folio->lru, split_folios); > > However I'd like compaction folks to review this. I see some indications > in the code that migration can race with truncation but we feel fine by > it by taking the folio lock. However here we have a case where we see > the folio clearly locked and the folio is dirty. Other migraiton code > seems to write back the code and can wait, here we just move on. Further > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping > as completely unmovable") seems to hint that migration is safe if the > mapping either does not exist or the mapping does exist but has > mapping->a_ops->migrate_folio so I'd like further feedback on this. During migration, all page table entries pointing to this dirty folio are invalid, and accesses to this folio will cause page fault and wait on the migration entry. I am not sure we need to skip dirty folios. > Another thing which requires review is if we we split a folio but not > down to order 0 but to the new min order, does the accounting on > migrate_pages_batch() require changing? And most puzzling, why do we What accounting are you referring to? split code should take care of it. > not see this with regular large folios, but we do see it with minorder ? I wonder if the split code handles folio->mapping->i_pages properly. Does the i_pages store just folio pointers or also need all tail page pointers? I am no expert in fs, thus need help. -- Best Regards, Yan, Zi
On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote: > On 28 Apr 2024, at 23:56, Luis Chamberlain wrote: > > > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote: > >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: > >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: > >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > >>>>>> From: Pankaj Raghav <p.raghav@samsung.com> > >>>>>> > >>>>>> using that API for LBS is resulting in an NULL ptr dereference > >>>>>> error in the writeback path [1]. > >>>>>> > >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > >>>>> > >>>>> How would I go about reproducing this? > >> > >> Well so the below fixes this but I am not sure if this is correct. > >> folio_mark_dirty() at least says that a folio should not be truncated > >> while its running. I am not sure if we should try to split folios then > >> even though we check for writeback once. truncate_inode_partial_folio() > >> will folio_wait_writeback() but it will split_folio() before checking > >> for claiming to fail to truncate with folio_test_dirty(). But since the > >> folio is locked its not clear why this should be possible. > >> > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> index 83955362d41c..90195506211a 100644 > >> --- a/mm/huge_memory.c > >> +++ b/mm/huge_memory.c > >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > >> if (new_order >= folio_order(folio)) > >> return -EINVAL; > >> > >> - if (folio_test_writeback(folio)) > >> + if (folio_test_dirty(folio) || folio_test_writeback(folio)) > >> return -EBUSY; > >> > >> if (!folio_test_anon(folio)) { > > > > I wondered what code path is causing this and triggering this null > > pointer, so I just sprinkled a check here: > > > > VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio); > > > > The answer was: > > > > kcompactd() --> migrate_pages_batch() > > --> try_split_folio --> split_folio_to_list() --> > > split_huge_page_to_list_to_order() > > > > There are 3 try_split_folio() in migrate_pages_batch(). This is only true for linux-next, for v6.9-rc5 off of which this testing is based on there are only two. > First one is to split anonymous large folios that are on deferred > split list, so not related; This is in linux-next and not v6.9-rc5. > second one is to split THPs when thp migration is not supported, but > this is compaction, so not related; third one is to split large folios > when there is no same size free page in the system, and this should be > the one. Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This also helps us enhance the reproducer further, which I'll do next. > > And I verified that moving the check only to the migrate_pages_batch() > > path also fixes the crash: > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index 73a052a382f1..83b528eb7100 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > > int rc; > > > > folio_lock(folio); > > + if (folio_test_dirty(folio)) { > > + rc = -EBUSY; > > + goto out; > > + } > > rc = split_folio_to_list(folio, split_folios); > > +out: > > folio_unlock(folio); > > if (!rc) > > list_move_tail(&folio->lru, split_folios); > > > > However I'd like compaction folks to review this. I see some indications > > in the code that migration can race with truncation but we feel fine by > > it by taking the folio lock. However here we have a case where we see > > the folio clearly locked and the folio is dirty. Other migraiton code > > seems to write back the code and can wait, here we just move on. Further > > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping > > as completely unmovable") seems to hint that migration is safe if the > > mapping either does not exist or the mapping does exist but has > > mapping->a_ops->migrate_folio so I'd like further feedback on this. > > During migration, all page table entries pointing to this dirty folio > are invalid, and accesses to this folio will cause page fault and > wait on the migration entry. I am not sure we need to skip dirty folios. I see.. thanks! > > Another thing which requires review is if we we split a folio but not > > down to order 0 but to the new min order, does the accounting on > > migrate_pages_batch() require changing? And most puzzling, why do we > > What accounting are you referring to? split code should take care of it. The folio order can change after split, and so I was concerned about the nr_pages used in migrate_pages_batch(). But I see now that when migrate_folio_unmap() first failed we try to split the folio, and if successful I see now we the caller will again call migrate_pages_batch() with a retry attempt of 1 only to the split folios. I also see the nr_pages is just local to each list for each loop, first on the from list to unmap and afte on the unmap list so we move the folios. > > not see this with regular large folios, but we do see it with minorder ? > > I wonder if the split code handles folio->mapping->i_pages properly. > Does the i_pages store just folio pointers or also need all tail page > pointers? I am no expert in fs, thus need help. mapping->i_pages stores folio pointers in the page cache or swap/dax/shadow entries (xa_is_value(folio)). The folios however can be special and we special-case them with shmem_mapping(mapping) checks. split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow entries, and we also bail out on shmem_mapping(mapping) already. Luis
On Mon, Apr 29, 2024 at 05:31:04PM -0700, Luis Chamberlain wrote: > On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote: > > On 28 Apr 2024, at 23:56, Luis Chamberlain wrote: > > > diff --git a/mm/migrate.c b/mm/migrate.c > > > index 73a052a382f1..83b528eb7100 100644 > > > --- a/mm/migrate.c > > > +++ b/mm/migrate.c > > > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > > > int rc; > > > > > > folio_lock(folio); > > > + if (folio_test_dirty(folio)) { > > > + rc = -EBUSY; > > > + goto out; > > > + } > > > rc = split_folio_to_list(folio, split_folios); > > > +out: > > > folio_unlock(folio); > > > if (!rc) > > > list_move_tail(&folio->lru, split_folios); > > > > > > However I'd like compaction folks to review this. I see some indications > > > in the code that migration can race with truncation but we feel fine by > > > it by taking the folio lock. However here we have a case where we see > > > the folio clearly locked and the folio is dirty. Other migraiton code > > > seems to write back the code and can wait, here we just move on. Further > > > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping > > > as completely unmovable") seems to hint that migration is safe if the > > > mapping either does not exist or the mapping does exist but has > > > mapping->a_ops->migrate_folio so I'd like further feedback on this. > > > > During migration, all page table entries pointing to this dirty folio > > are invalid, and accesses to this folio will cause page fault and > > wait on the migration entry. I am not sure we need to skip dirty folios. That would seem to explain why we get this, if we allow these to continue, ie, without the hunk above, so it begs to ask, are we sure we are waiting for migration to complete if we're doing writeback? Apr 29 17:28:54 min-xfs-reflink-16k-4ks unknown: run fstests generic/447 at 2024-04-29 17:28:54 Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled. Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65 Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount Apr 29 17:28:56 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Unmounting Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65 Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled. Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem e29c3eee-18d1-4fec-9a17-076b3eccbd74 Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: supervisor read access in kernel mode Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: error_code(0x0000) - not-present page Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PGD 0 P4D 0 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CPU: 3 PID: 2245 Comm: kworker/u36:5 Not tainted 6.9.0-rc5+ #26 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Workqueue: writeback wb_workfn (flush-7:5) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Code: ad 05 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53 All code ======== 0: ad lods %ds:(%rsi),%eax 1: 05 86 00 48 89 add $0x89480086,%eax 6: c3 ret 7: 48 3d 06 04 00 00 cmp $0x406,%rax d: 74 e8 je 0xfffffffffffffff7 f: 48 81 fb 02 04 00 00 cmp $0x402,%rbx 16: 0f 84 d0 00 00 00 je 0xec 1c: 48 85 db test %rbx,%rbx 1f: 0f 84 04 01 00 00 je 0x129 25: f6 c3 01 test $0x1,%bl 28: 75 c4 jne 0xffffffffffffffee 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction 2d: 85 c0 test %eax,%eax 2f: 0f 84 b7 00 00 00 je 0xec 35: 8d 50 01 lea 0x1(%rax),%edx 38: 48 8d 73 34 lea 0x34(%rbx),%rsi 3c: f0 lock 3d: 0f .byte 0xf 3e: b1 53 mov $0x53,%cl Code starting with the faulting instruction =========================================== 0: 8b 43 34 mov 0x34(%rbx),%eax 3: 85 c0 test %eax,%eax 5: 0f 84 b7 00 00 00 je 0xc2 b: 8d 50 01 lea 0x1(%rax),%edx e: 48 8d 73 34 lea 0x34(%rbx),%rsi 12: f0 lock 13: 0f .byte 0xf 14: b1 53 mov $0x53,%cl Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RSP: 0018:ffffad0a0396b8f8 EFLAGS: 00010246 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 00000000000ba200 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9f408d20d488 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R13: ffffad0a0396bbb8 R14: ffffad0a0396bcb8 R15: ffff9f40633bfc00 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: FS: 0000000000000000(0000) GS:ffff9f40bbcc0000(0000) knlGS:0000000000000000 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036 CR3: 000000010bd44006 CR4: 0000000000770ef0 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PKRU: 55555554 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Call Trace: Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: <TASK> Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? page_fault_oops (arch/x86/mm/fault.c:713) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? write_cache_pages (mm/page-writeback.c:2569) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1)) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (mm/filemap.c:1972 mm/filemap.c:2222) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: write_cache_pages (mm/page-writeback.c:2568) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: iomap_writepages (fs/iomap/buffered-io.c:1984) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: do_writepages (mm/page-writeback.c:2612) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_single_inode (fs/fs-writeback.c:1659) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? _raw_spin_lock (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:134 (discriminator 4) kernel/locking/spinlock.c:154 (discriminator 4)) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_sb_inodes (fs/fs-writeback.c:1943) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_writeback (fs/fs-writeback.c:2119) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1)) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: process_one_work (kernel/workqueue.c:3254) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2)) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: kthread (kernel/kthread.c:388) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork (arch/x86/kernel/process.c:147) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257) Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: </TASK> Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Modules linked in: xfs nvme_fabrics nvme_core t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sunrpc 9p netfs kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_balloon pcspkr 9pnet_virtio virtio_console button evdev joydev serio_raw loop drm dm_mod autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover failover virtio_blk crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036 Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ---[ end trace 0000000000000000 ]---
On 29 Apr 2024, at 20:31, Luis Chamberlain wrote: > On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote: >> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote: >> >>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote: >>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: >>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: >>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: >>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: >>>>>>>> From: Pankaj Raghav <p.raghav@samsung.com> >>>>>>>> >>>>>>>> using that API for LBS is resulting in an NULL ptr dereference >>>>>>>> error in the writeback path [1]. >>>>>>>> >>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df >>>>>>> >>>>>>> How would I go about reproducing this? >>>> >>>> Well so the below fixes this but I am not sure if this is correct. >>>> folio_mark_dirty() at least says that a folio should not be truncated >>>> while its running. I am not sure if we should try to split folios then >>>> even though we check for writeback once. truncate_inode_partial_folio() >>>> will folio_wait_writeback() but it will split_folio() before checking >>>> for claiming to fail to truncate with folio_test_dirty(). But since the >>>> folio is locked its not clear why this should be possible. >>>> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>> index 83955362d41c..90195506211a 100644 >>>> --- a/mm/huge_memory.c >>>> +++ b/mm/huge_memory.c >>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, >>>> if (new_order >= folio_order(folio)) >>>> return -EINVAL; >>>> >>>> - if (folio_test_writeback(folio)) >>>> + if (folio_test_dirty(folio) || folio_test_writeback(folio)) >>>> return -EBUSY; >>>> >>>> if (!folio_test_anon(folio)) { >>> >>> I wondered what code path is causing this and triggering this null >>> pointer, so I just sprinkled a check here: >>> >>> VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio); >>> >>> The answer was: >>> >>> kcompactd() --> migrate_pages_batch() >>> --> try_split_folio --> split_folio_to_list() --> >>> split_huge_page_to_list_to_order() >>> >> >> There are 3 try_split_folio() in migrate_pages_batch(). > > This is only true for linux-next, for v6.9-rc5 off of which this testing > is based on there are only two. > >> First one is to split anonymous large folios that are on deferred >> split list, so not related; > > This is in linux-next and not v6.9-rc5. > >> second one is to split THPs when thp migration is not supported, but >> this is compaction, so not related; third one is to split large folios >> when there is no same size free page in the system, and this should be >> the one. > > Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This > also helps us enhance the reproducer further, which I'll do next. > >>> And I verified that moving the check only to the migrate_pages_batch() >>> path also fixes the crash: >>> >>> diff --git a/mm/migrate.c b/mm/migrate.c >>> index 73a052a382f1..83b528eb7100 100644 >>> --- a/mm/migrate.c >>> +++ b/mm/migrate.c >>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f >>> int rc; >>> >>> folio_lock(folio); >>> + if (folio_test_dirty(folio)) { >>> + rc = -EBUSY; >>> + goto out; >>> + } >>> rc = split_folio_to_list(folio, split_folios); >>> +out: >>> folio_unlock(folio); >>> if (!rc) >>> list_move_tail(&folio->lru, split_folios); >>> >>> However I'd like compaction folks to review this. I see some indications >>> in the code that migration can race with truncation but we feel fine by >>> it by taking the folio lock. However here we have a case where we see >>> the folio clearly locked and the folio is dirty. Other migraiton code >>> seems to write back the code and can wait, here we just move on. Further >>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping >>> as completely unmovable") seems to hint that migration is safe if the >>> mapping either does not exist or the mapping does exist but has >>> mapping->a_ops->migrate_folio so I'd like further feedback on this. >> >> During migration, all page table entries pointing to this dirty folio >> are invalid, and accesses to this folio will cause page fault and >> wait on the migration entry. I am not sure we need to skip dirty folios. > > I see.. thanks! > >>> Another thing which requires review is if we we split a folio but not >>> down to order 0 but to the new min order, does the accounting on >>> migrate_pages_batch() require changing? And most puzzling, why do we >> >> What accounting are you referring to? split code should take care of it. > > The folio order can change after split, and so I was concerned about the > nr_pages used in migrate_pages_batch(). But I see now that when > migrate_folio_unmap() first failed we try to split the folio, and if > successful I see now we the caller will again call migrate_pages_batch() > with a retry attempt of 1 only to the split folios. I also see the > nr_pages is just local to each list for each loop, first on the from > list to unmap and afte on the unmap list so we move the folios. > >>> not see this with regular large folios, but we do see it with minorder ? >> >> I wonder if the split code handles folio->mapping->i_pages properly. >> Does the i_pages store just folio pointers or also need all tail page >> pointers? I am no expert in fs, thus need help. > > mapping->i_pages stores folio pointers in the page cache or > swap/dax/shadow entries (xa_is_value(folio)). The folios however can be > special and we special-case them with shmem_mapping(mapping) checks. > split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow > entries, and we also bail out on shmem_mapping(mapping) already. Hmm, I misunderstood the issue above. To clarify it, the error comes out when a page cache folio with minorder is split to order-0, an NULL ptr defer shows up in the writeback path. I thought the folio was split to non-0 order. split_huge_page_to_list_to_order() should be fine, since splitting to order-0 is not changed after my patches. I wonder if you can isolate the issue by just splitting a dirty minorder page cache folio instead of having folio split and migration going on together. You probably can use the debugfs to do that. Depending on the result, we can narrow down the cause of the issue. -- Best Regards, Yan, Zi
On Mon, Apr 29, 2024 at 10:43:16PM -0400, Zi Yan wrote: > On 29 Apr 2024, at 20:31, Luis Chamberlain wrote: > > > On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote: > >> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote: > >> > >>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote: > >>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote: > >>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote: > >>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote: > >>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote: > >>>>>>>> From: Pankaj Raghav <p.raghav@samsung.com> > >>>>>>>> > >>>>>>>> using that API for LBS is resulting in an NULL ptr dereference > >>>>>>>> error in the writeback path [1]. > >>>>>>>> > >>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df > >>>>>>> > >>>>>>> How would I go about reproducing this? > >>>> > >>>> Well so the below fixes this but I am not sure if this is correct. > >>>> folio_mark_dirty() at least says that a folio should not be truncated > >>>> while its running. I am not sure if we should try to split folios then > >>>> even though we check for writeback once. truncate_inode_partial_folio() > >>>> will folio_wait_writeback() but it will split_folio() before checking > >>>> for claiming to fail to truncate with folio_test_dirty(). But since the > >>>> folio is locked its not clear why this should be possible. > >>>> > >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >>>> index 83955362d41c..90195506211a 100644 > >>>> --- a/mm/huge_memory.c > >>>> +++ b/mm/huge_memory.c > >>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > >>>> if (new_order >= folio_order(folio)) > >>>> return -EINVAL; > >>>> > >>>> - if (folio_test_writeback(folio)) > >>>> + if (folio_test_dirty(folio) || folio_test_writeback(folio)) > >>>> return -EBUSY; > >>>> > >>>> if (!folio_test_anon(folio)) { > >>> > >>> I wondered what code path is causing this and triggering this null > >>> pointer, so I just sprinkled a check here: > >>> > >>> VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio); > >>> > >>> The answer was: > >>> > >>> kcompactd() --> migrate_pages_batch() > >>> --> try_split_folio --> split_folio_to_list() --> > >>> split_huge_page_to_list_to_order() > >>> > >> > >> There are 3 try_split_folio() in migrate_pages_batch(). > > > > This is only true for linux-next, for v6.9-rc5 off of which this testing > > is based on there are only two. > > > >> First one is to split anonymous large folios that are on deferred > >> split list, so not related; > > > > This is in linux-next and not v6.9-rc5. > > > >> second one is to split THPs when thp migration is not supported, but > >> this is compaction, so not related; third one is to split large folios > >> when there is no same size free page in the system, and this should be > >> the one. > > > > Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This > > also helps us enhance the reproducer further, which I'll do next. > > > >>> And I verified that moving the check only to the migrate_pages_batch() > >>> path also fixes the crash: > >>> > >>> diff --git a/mm/migrate.c b/mm/migrate.c > >>> index 73a052a382f1..83b528eb7100 100644 > >>> --- a/mm/migrate.c > >>> +++ b/mm/migrate.c > >>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > >>> int rc; > >>> > >>> folio_lock(folio); > >>> + if (folio_test_dirty(folio)) { > >>> + rc = -EBUSY; > >>> + goto out; > >>> + } > >>> rc = split_folio_to_list(folio, split_folios); > >>> +out: > >>> folio_unlock(folio); > >>> if (!rc) > >>> list_move_tail(&folio->lru, split_folios); > >>> > >>> However I'd like compaction folks to review this. I see some indications > >>> in the code that migration can race with truncation but we feel fine by > >>> it by taking the folio lock. However here we have a case where we see > >>> the folio clearly locked and the folio is dirty. Other migraiton code > >>> seems to write back the code and can wait, here we just move on. Further > >>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping > >>> as completely unmovable") seems to hint that migration is safe if the > >>> mapping either does not exist or the mapping does exist but has > >>> mapping->a_ops->migrate_folio so I'd like further feedback on this. > >> > >> During migration, all page table entries pointing to this dirty folio > >> are invalid, and accesses to this folio will cause page fault and > >> wait on the migration entry. I am not sure we need to skip dirty folios. > > > > I see.. thanks! > > > >>> Another thing which requires review is if we we split a folio but not > >>> down to order 0 but to the new min order, does the accounting on > >>> migrate_pages_batch() require changing? And most puzzling, why do we > >> > >> What accounting are you referring to? split code should take care of it. > > > > The folio order can change after split, and so I was concerned about the > > nr_pages used in migrate_pages_batch(). But I see now that when > > migrate_folio_unmap() first failed we try to split the folio, and if > > successful I see now we the caller will again call migrate_pages_batch() > > with a retry attempt of 1 only to the split folios. I also see the > > nr_pages is just local to each list for each loop, first on the from > > list to unmap and afte on the unmap list so we move the folios. > > > >>> not see this with regular large folios, but we do see it with minorder ? > >> > >> I wonder if the split code handles folio->mapping->i_pages properly. > >> Does the i_pages store just folio pointers or also need all tail page > >> pointers? I am no expert in fs, thus need help. > > > > mapping->i_pages stores folio pointers in the page cache or > > swap/dax/shadow entries (xa_is_value(folio)). The folios however can be > > special and we special-case them with shmem_mapping(mapping) checks. > > split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow > > entries, and we also bail out on shmem_mapping(mapping) already. > > Hmm, I misunderstood the issue above. To clarify it, the error comes out > when a page cache folio with minorder is split to order-0, No, min order is used. In order to support splits with min order we require an out of tree patch not yet posted: https://github.com/linux-kdevops/linux/commit/e77a2a4fd6d9aa7e2641d5ea456ad0522c1e8a04 The important part is if no order is specified we use the min order: int split_folio_to_list(struct folio *folio, struct list_head *list) { unsigned int min_order = 0; if (!folio_test_anon(folio)) min_order = mapping_min_folio_order(folio->mapping); return split_huge_page_to_list_to_order(&folio->page, list, min_order); } and so compaction's try_split_folio() --> split_folio_to_list(folio, split_folios) will use the min order implicitly due to the above. So yes, we see a null ptr deref on the writeback path when min order is set. > I wonder if you can isolate the issue by just splitting a dirty minorder > page cache folio instead of having folio split and migration going on together. > You probably can use the debugfs to do that. Depending on the result, > we can narrow down the cause of the issue. That's what I had tried with my new fstsest test but now I see where it also failed -- on 4k filesystems it was trying to split to order 0 and that had no issues as you pointed out. We can now fine tune the test very well. I can now reproduce the crash on plain on boring vanilla linux v6.9-rc6 on a plain xfs filesystem with 4k block size on x86_64 doing this: You may want the following appended to your kernel command line: dyndbg='file mm/huge_memory.c +p' mkfs.xfs /dev/vdd mkdir -p /media/scratch/ mount /dev/vdd /media/scratch/ while true; do dd if=/dev/zero of=$FILE bs=4M count=200 2> /dev/null; done & while true; do sleep 2; echo $FILE,0x0,0x4000,2 > /sys/kernel/debug/split_huge_pages 0x400000 2> /dev/null; done The crash: Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: SGI XFS with ACLs, security attributes, realtime, scrub, repair, quota, fatal assert, debug enabled Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Mounting V5 Filesystem d1f9e444-f61c-4439-a2bf-61a13f6d8e81 Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Ending clean mount Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: huge_memory: split file-backed THPs in file: /media/scratch/foo, page offset: [0x0 - 0x200000] Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: supervisor read access in kernel mode Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: error_code(0x0000) - not-present page Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PGD 0 P4D 0 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CPU: 4 PID: 89 Comm: kworker/u37:2 Not tainted 6.9.0-rc6 #10 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Workqueue: writeback wb_workfn (flush-254:48) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Code: bd 06 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53 All code ======== 0: bd 06 86 00 48 mov $0x48008606,%ebp 5: 89 c3 mov %eax,%ebx 7: 48 3d 06 04 00 00 cmp $0x406,%rax d: 74 e8 je 0xfffffffffffffff7 f: 48 81 fb 02 04 00 00 cmp $0x402,%rbx 16: 0f 84 d0 00 00 00 je 0xec 1c: 48 85 db test %rbx,%rbx 1f: 0f 84 04 01 00 00 je 0x129 25: f6 c3 01 test $0x1,%bl 28: 75 c4 jne 0xffffffffffffffee 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction 2d: 85 c0 test %eax,%eax 2f: 0f 84 b7 00 00 00 je 0xec 35: 8d 50 01 lea 0x1(%rax),%edx 38: 48 8d 73 34 lea 0x34(%rbx),%rsi 3c: f0 lock 3d: 0f .byte 0xf 3e: b1 53 mov $0x53,%cl Code starting with the faulting instruction =========================================== 0: 8b 43 34 mov 0x34(%rbx),%eax 3: 85 c0 test %eax,%eax 5: 0f 84 b7 00 00 00 je 0xc2 b: 8d 50 01 lea 0x1(%rax),%edx e: 48 8d 73 34 lea 0x34(%rbx),%rsi 12: f0 lock 13: 0f .byte 0xf 14: b1 53 mov $0x53,%cl Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RSP: 0018:ffffa8f0c07cb8f8 EFLAGS: 00010246 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000018000 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff987380564920 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R13: ffffa8f0c07cbbb8 R14: ffffa8f0c07cbcb8 R15: ffff98738c4ea800 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: FS: 0000000000000000(0000) GS:ffff9873fbd00000(0000) knlGS:0000000000000000 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CR2: 0000000000000036 CR3: 000000011aca8003 CR4: 0000000000770ef0 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PKRU: 55555554 Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Call Trace: Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: <TASK> Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? page_fault_oops (arch/x86/mm/fault.c:713) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1)) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (mm/filemap.c:1968 mm/filemap.c:2218) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: write_cache_pages (mm/page-writeback.c:2568) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: iomap_writepages (fs/iomap/buffered-io.c:1984) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: do_writepages (mm/page-writeback.c:2612) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? update_sd_lb_stats.constprop.0 (kernel/sched/fair.c:9902 (discriminator 2) kernel/sched/fair.c:10583 (discriminator 2)) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_single_inode (fs/fs-writeback.c:1659) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_sb_inodes (fs/fs-writeback.c:1943) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_writeback (fs/fs-writeback.c:2119) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1)) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: process_one_work (kernel/workqueue.c:3254) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2)) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4)) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: kthread (kernel/kthread.c:388) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork (arch/x86/kernel/process.c:147) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257) Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: </TASK> The full decoded crash on v6.9-rc6: https://gist.github.com/mcgrof/c44aaed21b99ae4ecf3d7fc6a1bb00bc Luis
On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote: > 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction > RBX: 0000000000000002 RCX: 0000000000018000 Thanks, got it. I'll send a patch in the morning, but I know exactly what the problem is. You're seeing sibling entries tagged as dirty. That shouldn't happen; we should only see folios tagged as dirty. The bug is in node_set_marks() which calls node_mark_all(). This works fine when splitting to order 0, but we should only mark the first entry of each order. eg if we split to order 3, we should tag slots 0, 8, 16, 24, .., 56.
On Wed, May 01, 2024 at 05:13:59AM +0100, Matthew Wilcox wrote: > On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote: > > 2a:* 8b 43 34 mov 0x34(%rbx),%eax <-- trapping instruction > > RBX: 0000000000000002 RCX: 0000000000018000 > > Thanks, got it. I'll send a patch in the morning, but I know exactly > what the problem is. You're seeing sibling entries tagged as dirty. > That shouldn't happen; we should only see folios tagged as dirty. > The bug is in node_set_marks() which calls node_mark_all(). This works > fine when splitting to order 0, but we should only mark the first entry > of each order. eg if we split to order 3, we should tag slots 0, 8, > 16, 24, .., 56. Confirmed: +++ b/lib/test_xarray.c @@ -1789,8 +1789,10 @@ static void check_split_1(struct xarray *xa, unsigned lon g index, { XA_STATE_ORDER(xas, xa, index, new_order); unsigned int i; + void *entry; xa_store_order(xa, index, order, xa, GFP_KERNEL); + xa_set_mark(xa, index, XA_MARK_1); xas_split_alloc(&xas, xa, order, GFP_KERNEL); xas_lock(&xas); @@ -1807,6 +1809,12 @@ static void check_split_1(struct xarray *xa, unsigned long index, xa_set_mark(xa, index, XA_MARK_0); XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0)); + xas_set_order(&xas, index, 0); + rcu_read_lock(); + xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1) + XA_BUG_ON(xa, xa_is_internal(entry)); + rcu_read_unlock(); + xa_destroy(xa); } spits out: $ ./tools/testing/radix-tree/xarray BUG at check_split_1:1815 xarray: 0x562b4043e580x head 0x50c0095cc082x flags 3000000 marks 1 1 0 0-63: node 0x50c0095cc080x max 0 parent (nil)x shift 3 count 1 values 0 array 0x562b4043e580x list 0x50c0095cc098x 0x50c0095cc098x marks 1 1 0 0-7: node 0x50c0095cc140x offset 0 parent 0x50c0095cc080x shift 0 count 8 values 4 array 0x562b4043e580x list 0x50c0095cc158x 0x50c0095cc158x marks 1 ff 0 0: value 0 (0x0) [0x1x] 1: sibling (slot 0) 2: value 2 (0x2) [0x5x] 3: sibling (slot 2) 4: value 4 (0x4) [0x9x] 5: sibling (slot 4) 6: value 6 (0x6) [0xdx] 7: sibling (slot 6) xarray: ../../../lib/test_xarray.c:1815: check_split_1: Assertion `0' failed. Aborted
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9859aa4f7553..dadf1e68dbdc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3117,6 +3117,15 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, goto out; } + /* + * Do not split if mapping has minimum folio order + * requirement. + */ + if (mapping_min_folio_order(mapping)) { + ret = -EINVAL; + goto out; + } + gfp = current_gfp_context(mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK);