Message ID | 20250130100050.1868208-7-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
State | Handled Elsewhere
Series | Get rid of PG_reclaim and rename PG_dropbehind
On Thu, Jan 30, 2025 at 12:00:44PM +0200, Kirill A. Shutemov wrote: > The recently introduced PG_dropbehind allows for freeing folios > immediately after writeback. Unlike PG_reclaim, it does not need vmscan > to be involved to get the folio freed. > > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in > pageout(). > > It is safe to leave PG_dropbehind on the folio if, for some reason > (bug?), the folio is not in a writeback state after ->writepage(). > In these cases, the kernel had to clear PG_reclaim as it shared a page > flag bit with PG_readahead. Is it correct to say that leaving PG_dropbehind on folios which doesn't have writeback state after ->writepage() (i.e. store to zswap) is fine because PG_dropbehind is not in PAGE_FLAGS_CHECK_AT_FREE? > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Acked-by: David Hildenbrand <david@redhat.com> > --- > mm/vmscan.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bc1826020159..c97adb0fdaa4 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, > if (shmem_mapping(mapping) && folio_test_large(folio)) > wbc.list = folio_list; > > - folio_set_reclaim(folio); > + folio_set_dropbehind(folio); > + > res = mapping->a_ops->writepage(&folio->page, &wbc); > if (res < 0) > handle_write_error(mapping, folio, res); > if (res == AOP_WRITEPAGE_ACTIVATE) { > - folio_clear_reclaim(folio); > + folio_clear_dropbehind(folio); > return PAGE_ACTIVATE; > } > > - if (!folio_test_writeback(folio)) { > - /* synchronous write or broken a_ops? */ > - folio_clear_reclaim(folio); > - } > trace_mm_vmscan_write_folio(folio); > node_stat_add_folio(folio, NR_VMSCAN_WRITE); > return PAGE_SUCCESS; > -- > 2.47.2 >
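A note on the bit sharing the commit message refers to: PG_readahead is an alias for the same page-flag bit as PG_reclaim, which is why a stale PG_reclaim had to be cleared when no writeback was started, otherwise it would later read as a readahead hint. The sketch below is purely illustrative (it is not the real enum from include/linux/page-flags.h, and it does not answer the PAGE_FLAGS_CHECK_AT_FREE question above); it only shows the shape of the aliasing and why PG_dropbehind, with a bit of its own, does not need the same cleanup:

/*
 * Illustrative only, not the kernel's actual page-flags definitions.
 * Two names on one bit means a leftover "reclaim" hint is later
 * indistinguishable from a readahead marker.
 */
enum sketch_pageflags {
	SKETCH_PG_reclaim,				/* reclaim hint set by vmscan */
	SKETCH_PG_readahead = SKETCH_PG_reclaim,	/* same bit, different meaning */
	SKETCH_PG_dropbehind,				/* dedicated bit: drop after writeback */
};

static inline int sketch_test_readahead(unsigned long flags)
{
	/* also true if only the "reclaim" meaning was ever set */
	return !!(flags & (1UL << SKETCH_PG_readahead));
}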
On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote: > > The recently introduced PG_dropbehind allows for freeing folios > immediately after writeback. Unlike PG_reclaim, it does not need vmscan > to be involved to get the folio freed. > > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in > pageout(). > > It is safe to leave PG_dropbehind on the folio if, for some reason > (bug?), the folio is not in a writeback state after ->writepage(). > In these cases, the kernel had to clear PG_reclaim as it shared a page > flag bit with PG_readahead. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Acked-by: David Hildenbrand <david@redhat.com> > --- > mm/vmscan.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bc1826020159..c97adb0fdaa4 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, > if (shmem_mapping(mapping) && folio_test_large(folio)) > wbc.list = folio_list; > > - folio_set_reclaim(folio); > + folio_set_dropbehind(folio); > + > res = mapping->a_ops->writepage(&folio->page, &wbc); > if (res < 0) > handle_write_error(mapping, folio, res); > if (res == AOP_WRITEPAGE_ACTIVATE) { > - folio_clear_reclaim(folio); > + folio_clear_dropbehind(folio); > return PAGE_ACTIVATE; > } > > - if (!folio_test_writeback(folio)) { > - /* synchronous write or broken a_ops? */ > - folio_clear_reclaim(folio); > - } > trace_mm_vmscan_write_folio(folio); > node_stat_add_folio(folio, NR_VMSCAN_WRITE); > return PAGE_SUCCESS; > -- > 2.47.2 > Hi, I'm seeing following panic with SWAP after this commit: [ 29.672319] Oops: general protection fault, probably for non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not tainted 6.13.0.ptch-g1fe9ea48ec98 #917 [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0 [ 29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55 41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85 ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01 41 89 [ 29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002 [ 29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000 [ 29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3 [ 29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [ 29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3 [ 29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 29.692613] FS: 00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000) knlGS:0000000000000000 [ 29.694339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0 [ 29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 29.700161] PKRU: 55555554 [ 29.700759] Call Trace: [ 29.701296] <TASK> [ 29.701770] ? __die_body+0x1e/0x60 [ 29.702540] ? die_addr+0x3c/0x60 [ 29.703267] ? exc_general_protection+0x18f/0x3c0 [ 29.704290] ? asm_exc_general_protection+0x26/0x30 [ 29.705345] ? __lock_acquire+0x20/0x15d0 [ 29.706215] ? lockdep_hardirqs_on_prepare+0xda/0x190 [ 29.707304] ? 
asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 29.708452] lock_acquire+0xbf/0x2e0 [ 29.709229] ? folio_unmap_invalidate+0x12f/0x220 [ 29.710257] ? __folio_end_writeback+0x15d/0x430 [ 29.711260] ? __folio_end_writeback+0x116/0x430 [ 29.712261] _raw_spin_lock+0x30/0x40 [ 29.713064] ? folio_unmap_invalidate+0x12f/0x220 [ 29.714076] folio_unmap_invalidate+0x12f/0x220 [ 29.715058] folio_end_writeback+0xdf/0x190 [ 29.715967] swap_writepage_bdev_sync+0x1e0/0x450 [ 29.716994] ? __pfx_submit_bio_wait_endio+0x10/0x10 [ 29.718074] swap_writepage+0x46b/0x6b0 [ 29.718917] pageout+0x14b/0x360 [ 29.719628] shrink_folio_list+0x67d/0xec0 [ 29.720519] ? mark_held_locks+0x48/0x80 [ 29.721375] evict_folios+0x2a7/0x9e0 [ 29.722179] try_to_shrink_lruvec+0x19a/0x270 [ 29.723130] lru_gen_shrink_lruvec+0x70/0xc0 [ 29.724060] ? __lock_acquire+0x558/0x15d0 [ 29.724954] shrink_lruvec+0x57/0x780 [ 29.725754] ? find_held_lock+0x2d/0xa0 [ 29.726588] ? rcu_read_unlock+0x17/0x60 [ 29.727449] shrink_node+0x2ad/0x930 [ 29.728229] do_try_to_free_pages+0xbd/0x4e0 [ 29.729160] try_to_free_mem_cgroup_pages+0x123/0x2c0 [ 29.730252] try_charge_memcg+0x222/0x660 [ 29.731128] charge_memcg+0x3c/0x80 [ 29.731888] __mem_cgroup_charge+0x30/0x70 [ 29.732776] shmem_alloc_and_add_folio+0x1a5/0x480 [ 29.733818] ? filemap_get_entry+0x155/0x390 [ 29.734748] shmem_get_folio_gfp+0x28c/0x6c0 [ 29.735680] shmem_write_begin+0x5a/0xc0 [ 29.736535] generic_perform_write+0x12a/0x2e0 [ 29.737503] shmem_file_write_iter+0x86/0x90 [ 29.738428] vfs_write+0x364/0x530 [ 29.739180] ksys_write+0x6c/0xe0 [ 29.739906] do_syscall_64+0x66/0x140 [ 29.740713] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 29.741800] RIP: 0033:0x7fa05c439984 [ 29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 [ 29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984 [ 29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004 [ 29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007 [ 29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000 [ 29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af [ 29.755797] </TASK> [ 29.756285] Modules linked in: zram virtiofs I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is called for swapcache folio and it doesn't work well, following PATCH on top of mm-unstable seems fix it well: diff --git a/mm/filemap.c b/mm/filemap.c index 4fe551037bf7..98493443d120 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio) * invalidation in that case. 
 	 */
 	if (in_task() && folio_trylock(folio)) {
-		if (folio->mapping)
-			folio_unmap_invalidate(folio->mapping, folio, 0);
+		struct address_space *mapping = folio_mapping(folio);
+		if (mapping)
+			folio_unmap_invalidate(mapping, folio, 0);
 		folio_unlock(folio);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e922ceb66c44..4f3e34c52d8b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -565,23 +565,29 @@ int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 	if (!filemap_release_folio(folio, gfp))
 		return -EBUSY;
 
-	spin_lock(&mapping->host->i_lock);
+	if (!folio_test_swapcache(folio)) {
+		spin_lock(&mapping->host->i_lock);
+		BUG_ON(folio_has_private(folio));
+	}
+
 	xa_lock_irq(&mapping->i_pages);
 	if (folio_test_dirty(folio))
 		goto failed;
 
-	BUG_ON(folio_has_private(folio));
 	__filemap_remove_folio(folio, NULL);
 	xa_unlock_irq(&mapping->i_pages);
 	if (mapping_shrinkable(mapping))
 		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
+
+	if (!folio_test_swapcache(folio))
+		spin_unlock(&mapping->host->i_lock);
 
 	filemap_free_folio(mapping, folio);
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
-	spin_unlock(&mapping->host->i_lock);
+	if (!folio_test_swapcache(folio))
+		spin_unlock(&mapping->host->i_lock);
 	return -EBUSY;
 }
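The mm/filemap.c hunk above hinges on the difference between folio->mapping and folio_mapping() for swap-cache folios. A rough sketch of that difference, assuming kernel-internal types and helpers and simplifying the real implementation (this is not the actual kernel function):

/*
 * Simplified sketch: for an anonymous folio in the swap cache,
 * folio->mapping holds the PAGE_MAPPING_ANON-tagged anon_vma pointer,
 * not an address_space, so dereferencing mapping->host->i_lock from it
 * walks into garbage, consistent with the non-canonical address in the
 * fault above. folio_mapping() returns the swap address space instead.
 */
static struct address_space *sketch_folio_mapping(struct folio *folio)
{
	if (folio_test_swapcache(folio))
		return swap_address_space(folio->swap);	/* swap cache mapping */
	if ((unsigned long)folio->mapping & PAGE_MAPPING_ANON)
		return NULL;				/* anon, no file mapping */
	return folio->mapping;
}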
On Sat, Feb 1, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov > <kirill.shutemov@linux.intel.com> wrote: > > > > The recently introduced PG_dropbehind allows for freeing folios > > immediately after writeback. Unlike PG_reclaim, it does not need vmscan > > to be involved to get the folio freed. > > > > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in > > pageout(). > > > > It is safe to leave PG_dropbehind on the folio if, for some reason > > (bug?), the folio is not in a writeback state after ->writepage(). > > In these cases, the kernel had to clear PG_reclaim as it shared a page > > flag bit with PG_readahead. > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Acked-by: David Hildenbrand <david@redhat.com> > > --- > > mm/vmscan.c | 9 +++------ > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index bc1826020159..c97adb0fdaa4 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, > > if (shmem_mapping(mapping) && folio_test_large(folio)) > > wbc.list = folio_list; > > > > - folio_set_reclaim(folio); > > + folio_set_dropbehind(folio); > > + > > res = mapping->a_ops->writepage(&folio->page, &wbc); > > if (res < 0) > > handle_write_error(mapping, folio, res); > > if (res == AOP_WRITEPAGE_ACTIVATE) { > > - folio_clear_reclaim(folio); > > + folio_clear_dropbehind(folio); > > return PAGE_ACTIVATE; > > } > > > > - if (!folio_test_writeback(folio)) { > > - /* synchronous write or broken a_ops? */ > > - folio_clear_reclaim(folio); > > - } > > trace_mm_vmscan_write_folio(folio); > > node_stat_add_folio(folio, NR_VMSCAN_WRITE); > > return PAGE_SUCCESS; > > -- > > 2.47.2 > > > > Hi, I'm seeing following panic with SWAP after this commit: > > [ 29.672319] Oops: general protection fault, probably for > non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI > [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not > tainted 6.13.0.ptch-g1fe9ea48ec98 #917 > [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 > [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0 > [ 29.679891] Code: 90 90 90 90 90 90 90 90 90 90 41 57 41 56 41 55 > 41 54 55 53 48 83 ec 30 8b 2d 10 ac f3 01 44 8b ac 24 88 00 00 00 85 > ed 74 64 <48> 8b 07 49 89 ff 48 3d 20 1d bf 83 74 56 8b 1d 8c f5 b1 01 > 41 89 > [ 29.683852] RSP: 0018:ffffc9000bea3148 EFLAGS: 00010002 > [ 29.684980] RAX: ffff8890874b2940 RBX: 0000000000000200 RCX: 0000000000000000 > [ 29.686510] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00ffff88909a3be3 > [ 29.688031] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 > [ 29.689561] R10: 0000000000000000 R11: 0000000000000020 R12: 00ffff88909a3be3 > [ 29.691087] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > [ 29.692613] FS: 00007fa05c2824c0(0000) GS:ffff88a03fa80000(0000) > knlGS:0000000000000000 > [ 29.694339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 29.695581] CR2: 000055f9abb7fc7d CR3: 00000010932f2002 CR4: 0000000000770eb0 > [ 29.697109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 29.698637] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 29.700161] PKRU: 55555554 > [ 29.700759] Call Trace: > [ 29.701296] <TASK> > [ 29.701770] ? __die_body+0x1e/0x60 > [ 29.702540] ? die_addr+0x3c/0x60 > [ 29.703267] ? 
exc_general_protection+0x18f/0x3c0 > [ 29.704290] ? asm_exc_general_protection+0x26/0x30 > [ 29.705345] ? __lock_acquire+0x20/0x15d0 > [ 29.706215] ? lockdep_hardirqs_on_prepare+0xda/0x190 > [ 29.707304] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 > [ 29.708452] lock_acquire+0xbf/0x2e0 > [ 29.709229] ? folio_unmap_invalidate+0x12f/0x220 > [ 29.710257] ? __folio_end_writeback+0x15d/0x430 > [ 29.711260] ? __folio_end_writeback+0x116/0x430 > [ 29.712261] _raw_spin_lock+0x30/0x40 > [ 29.713064] ? folio_unmap_invalidate+0x12f/0x220 > [ 29.714076] folio_unmap_invalidate+0x12f/0x220 > [ 29.715058] folio_end_writeback+0xdf/0x190 > [ 29.715967] swap_writepage_bdev_sync+0x1e0/0x450 > [ 29.716994] ? __pfx_submit_bio_wait_endio+0x10/0x10 > [ 29.718074] swap_writepage+0x46b/0x6b0 > [ 29.718917] pageout+0x14b/0x360 > [ 29.719628] shrink_folio_list+0x67d/0xec0 > [ 29.720519] ? mark_held_locks+0x48/0x80 > [ 29.721375] evict_folios+0x2a7/0x9e0 > [ 29.722179] try_to_shrink_lruvec+0x19a/0x270 > [ 29.723130] lru_gen_shrink_lruvec+0x70/0xc0 > [ 29.724060] ? __lock_acquire+0x558/0x15d0 > [ 29.724954] shrink_lruvec+0x57/0x780 > [ 29.725754] ? find_held_lock+0x2d/0xa0 > [ 29.726588] ? rcu_read_unlock+0x17/0x60 > [ 29.727449] shrink_node+0x2ad/0x930 > [ 29.728229] do_try_to_free_pages+0xbd/0x4e0 > [ 29.729160] try_to_free_mem_cgroup_pages+0x123/0x2c0 > [ 29.730252] try_charge_memcg+0x222/0x660 > [ 29.731128] charge_memcg+0x3c/0x80 > [ 29.731888] __mem_cgroup_charge+0x30/0x70 > [ 29.732776] shmem_alloc_and_add_folio+0x1a5/0x480 > [ 29.733818] ? filemap_get_entry+0x155/0x390 > [ 29.734748] shmem_get_folio_gfp+0x28c/0x6c0 > [ 29.735680] shmem_write_begin+0x5a/0xc0 > [ 29.736535] generic_perform_write+0x12a/0x2e0 > [ 29.737503] shmem_file_write_iter+0x86/0x90 > [ 29.738428] vfs_write+0x364/0x530 > [ 29.739180] ksys_write+0x6c/0xe0 > [ 29.739906] do_syscall_64+0x66/0x140 > [ 29.740713] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 29.741800] RIP: 0033:0x7fa05c439984 > [ 29.742584] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f > 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 06 0e 00 00 74 13 b8 01 00 00 > 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 > 48 89 > [ 29.746542] RSP: 002b:00007ffece7720f8 EFLAGS: 00000202 ORIG_RAX: > 0000000000000001 > [ 29.748157] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007fa05c439984 > [ 29.749682] RDX: 0000000000002800 RSI: 000055f9cfa08000 RDI: 0000000000000004 > [ 29.751216] RBP: 00007ffece772140 R08: 0000000000002800 R09: 0000000000000007 > [ 29.752743] R10: 0000000000000180 R11: 0000000000000202 R12: 000055f9cfa08000 > [ 29.754262] R13: 0000000000000004 R14: 0000000000002800 R15: 00000000000009af > [ 29.755797] </TASK> > [ 29.756285] Modules linked in: zram virtiofs > > I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is > called for swapcache folio and it doesn't work well, following PATCH > on top of mm-unstable seems fix it well: I think there is a bigger problem here. folio_end_reclaim_write() currently calls folio_unmap_invalidate() to remove the mapping, and that's very different from what __remove_mapping() does in the reclaim path: not only it breaks the swapcache case, the shadow entry is also lost.
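For reference, a rough sketch of the reclaim-side teardown being contrasted here, loosely paraphrasing __remove_mapping() in mm/vmscan.c (the helper name is made up, locking, refcount freezing and memcg swapout accounting are omitted, and this is not a proposed fix). The point is that swap-cache folios are removed through the swap-cache helpers, which release the swap slot and can record a shadow entry for refault detection, whereas folio_unmap_invalidate() only knows the file path and passes a NULL shadow:

/*
 * Illustrative only: loosely follows __remove_mapping(); the real code
 * also freezes the folio refcount under the i_pages lock and handles
 * memcg swapout accounting.
 */
static void sketch_reclaim_remove_mapping(struct address_space *mapping,
					  struct folio *folio)
{
	void *shadow = workingset_eviction(folio, folio_memcg(folio));

	if (folio_test_swapcache(folio)) {
		swp_entry_t swap = folio->swap;

		__delete_from_swap_cache(folio, swap, shadow);
		put_swap_folio(folio, swap);		/* release the swap slot */
	} else {
		__filemap_remove_folio(folio, shadow);	/* shadow kept for refaults */
	}
}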
On Sat, Feb 01, 2025 at 04:01:43PM +0800, Kairui Song wrote: > On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov > <kirill.shutemov@linux.intel.com> wrote: > > > > The recently introduced PG_dropbehind allows for freeing folios > > immediately after writeback. Unlike PG_reclaim, it does not need vmscan > > to be involved to get the folio freed. > > > > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in > > pageout(). > > > > It is safe to leave PG_dropbehind on the folio if, for some reason > > (bug?), the folio is not in a writeback state after ->writepage(). > > In these cases, the kernel had to clear PG_reclaim as it shared a page > > flag bit with PG_readahead. > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Acked-by: David Hildenbrand <david@redhat.com> > > --- > > mm/vmscan.c | 9 +++------ > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index bc1826020159..c97adb0fdaa4 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, > > if (shmem_mapping(mapping) && folio_test_large(folio)) > > wbc.list = folio_list; > > > > - folio_set_reclaim(folio); > > + folio_set_dropbehind(folio); > > + > > res = mapping->a_ops->writepage(&folio->page, &wbc); > > if (res < 0) > > handle_write_error(mapping, folio, res); > > if (res == AOP_WRITEPAGE_ACTIVATE) { > > - folio_clear_reclaim(folio); > > + folio_clear_dropbehind(folio); > > return PAGE_ACTIVATE; > > } > > > > - if (!folio_test_writeback(folio)) { > > - /* synchronous write or broken a_ops? */ > > - folio_clear_reclaim(folio); > > - } > > trace_mm_vmscan_write_folio(folio); > > node_stat_add_folio(folio, NR_VMSCAN_WRITE); > > return PAGE_SUCCESS; > > -- > > 2.47.2 > > > > Hi, I'm seeing following panic with SWAP after this commit: > > [ 29.672319] Oops: general protection fault, probably for > non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI > [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not > tainted 6.13.0.ptch-g1fe9ea48ec98 #917 > [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 > [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0 Ouch. I failed to trigger it my setup. Could you share your reproducer? > I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is > called for swapcache folio and it doesn't work well, following PATCH > on top of mm-unstable seems fix it well: Right. I don't understand swapping good enough. I missed this. > diff --git a/mm/filemap.c b/mm/filemap.c > index 4fe551037bf7..98493443d120 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio) > * invalidation in that case. > */ > if (in_task() && folio_trylock(folio)) { > - if (folio->mapping) > - folio_unmap_invalidate(folio->mapping, folio, 0); > + struct address_space *mapping = folio_mapping(folio); > + if (mapping) > + folio_unmap_invalidate(mapping, folio, 0); > folio_unlock(folio); > } > } Once you do this, folio_unmap_invalidate() will never succeed for swapcache as folio->mapping != mapping check will always be true and it will fail with -EBUSY. I guess we need to do something similar to what __remove_mapping() does for swapcache folios.
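To make the -EBUSY point concrete, here is a paraphrase (not an exact quote) of the mapping-consistency check in folio_unmap_invalidate() that the folio_mapping() change would now always trip for swap-cache folios, since folio->mapping still holds the anon-tagged value while the passed-in mapping is the swap address space:

	/* paraphrased guard, assuming the check Kirill describes */
	if (folio->mapping != mapping)
		return -EBUSY;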
On Mon, 3 Feb 2025 10:39:58 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote: > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 4fe551037bf7..98493443d120 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio) > > * invalidation in that case. > > */ > > if (in_task() && folio_trylock(folio)) { > > - if (folio->mapping) > > - folio_unmap_invalidate(folio->mapping, folio, 0); > > + struct address_space *mapping = folio_mapping(folio); > > + if (mapping) > > + folio_unmap_invalidate(mapping, folio, 0); > > folio_unlock(folio); > > } > > } > > Once you do this, folio_unmap_invalidate() will never succeed for > swapcache as folio->mapping != mapping check will always be true and it > will fail with -EBUSY. > > I guess we need to do something similar to what __remove_mapping() does > for swapcache folios. Thanks, I'll drop the v3 series from mm.git.
On (25/02/03 10:39), Kirill A. Shutemov wrote: > > Hi, I'm seeing following panic with SWAP after this commit: > > > > [ 29.672319] Oops: general protection fault, probably for > > non-canonical address 0xffff88909a3be3: 0000 [#1] PREEMPT SMP NOPTI > > [ 29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not > > tainted 6.13.0.ptch-g1fe9ea48ec98 #917 > > [ 29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 > > [ 29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0 > > Ouch. > > I failed to trigger it my setup. Could you share your reproducer? I'm seeing this as well (backtraces below). My repro is: - 4GB VM with 2 zram devices - one is setup as swap - the other one has ext4 fs on it - I dd large files to it --- xa_lock_irq(&mapping->i_pages): [ 94.609589][ T157] Oops: general protection fault, probably for non-canonical address 0xe01ffbf11020301a: 0000 [#1] PREEMPT SMP KASAN PTI [ 94.611881][ T157] KASAN: maybe wild-memory-access in range [0x00ffff88810180d0-0x00ffff88810180d7] [ 94.613567][ T157] CPU: 1 UID: 0 PID: 157 Comm: kswapd0 Not tainted 6.13.0+ #927 [ 94.614947][ T157] RIP: 0010:__lock_acquire+0x6a/0x1ef0 [ 94.615942][ T157] Code: 08 84 d2 0f 85 ed 13 00 00 44 8b 05 24 30 d5 02 45 85 c0 0f 84 bc 07 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 eb 18 00 00 49 8b 04 24 48 3d a0 8b ac 84 0f 84 [ 94.619668][ T157] RSP: 0018:ffff88810510eec0 EFLAGS: 00010002 [ 94.620835][ T157] RAX: dffffc0000000000 RBX: 1ffff11020a21df5 RCX: 1ffffffff084c092 [ 94.622329][ T157] RDX: 001ffff11020301a RSI: 0000000000000000 RDI: 00ffff88810180d1 [ 94.623779][ T157] RBP: 00ffff88810180d1 R08: 0000000000000001 R09: 0000000000000000 [ 94.625213][ T157] R10: ffffffff8425d0d7 R11: 0000000000000000 R12: 00ffff88810180d1 [ 94.626656][ T157] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 94.628086][ T157] FS: 0000000000000000(0000) GS:ffff88815aa80000(0000) knlGS:0000000000000000 [ 94.629700][ T157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 94.630894][ T157] CR2: 00007f757719c2b0 CR3: 0000000003c82005 CR4: 0000000000770ef0 [ 94.632333][ T157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 94.633796][ T157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 94.635265][ T157] PKRU: 55555554 [ 94.635909][ T157] Call Trace: [ 94.636512][ T157] <TASK> [ 94.637052][ T157] ? show_trace_log_lvl+0x1a7/0x2e0 [ 94.638005][ T157] ? show_trace_log_lvl+0x1a7/0x2e0 [ 94.638960][ T157] ? lock_acquire.part.0+0xfa/0x310 [ 94.639909][ T157] ? __die_body.cold+0x8/0x12 [ 94.640765][ T157] ? die_addr+0x42/0x70 [ 94.641530][ T157] ? exc_general_protection+0x12e/0x210 [ 94.642558][ T157] ? asm_exc_general_protection+0x22/0x30 [ 94.643610][ T157] ? __lock_acquire+0x6a/0x1ef0 [ 94.644506][ T157] ? _raw_spin_unlock_irq+0x24/0x40 [ 94.645468][ T157] ? __wait_for_common+0x2f2/0x610 [ 94.646412][ T157] ? pci_mmcfg_reserved+0x120/0x120 [ 94.647364][ T157] ? submit_bio_noacct_nocheck+0x32e/0x3e0 [ 94.648448][ T157] ? lock_is_held_type+0x81/0xe0 [ 94.649360][ T157] lock_acquire.part.0+0xfa/0x310 [ 94.650288][ T157] ? folio_unmap_invalidate+0x286/0x550 [ 94.651324][ T157] ? __lock_acquire+0x1ef0/0x1ef0 [ 94.652250][ T157] ? submit_bio_wait+0x17c/0x200 [ 94.653166][ T157] ? submit_bio_wait_endio+0x40/0x40 [ 94.654140][ T157] ? lock_acquire+0x18a/0x1f0 [ 94.655008][ T157] _raw_spin_lock+0x2c/0x40 [ 94.655853][ T157] ? 
folio_unmap_invalidate+0x286/0x550 [ 94.656879][ T157] folio_unmap_invalidate+0x286/0x550 [ 94.657866][ T157] folio_end_writeback+0x146/0x190 [ 94.658815][ T157] swap_writepage_bdev_sync+0x312/0x410 [ 94.659840][ T157] ? swap_read_folio_bdev_sync+0x3c0/0x3c0 [ 94.660917][ T157] ? do_raw_spin_lock+0x12a/0x260 [ 94.661845][ T157] ? __rwlock_init+0x150/0x150 [ 94.662726][ T157] ? bio_kmalloc+0x20/0x20 [ 94.663548][ T157] ? swapcache_clear+0xd0/0xd0 [ 94.664431][ T157] swap_writepage+0x2a5/0x720 [ 94.665298][ T157] pageout+0x304/0x6a0 [ 94.666052][ T157] ? get_pte_pfn.isra.0+0x4d0/0x4d0 [ 94.667025][ T157] ? find_held_lock+0x2d/0x110 [ 94.667912][ T157] ? enable_swap_slots_cache+0x90/0x90 [ 94.668925][ T157] ? arch_tlbbatch_flush+0x1f6/0x370 [ 94.669903][ T157] shrink_folio_list+0x19b5/0x2600 [ 94.670856][ T157] ? pageout+0x6a0/0x6a0 [ 94.671649][ T157] ? isolate_folios+0x156/0x320 [ 94.672544][ T157] ? find_held_lock+0x2d/0x110 [ 94.673428][ T157] ? mark_lock+0xcc/0x12c0 [ 94.674258][ T157] ? mark_lock_irq+0x1cd0/0x1cd0 [ 94.675174][ T157] ? reacquire_held_locks+0x4d0/0x4d0 [ 94.676166][ T157] ? mark_held_locks+0x94/0xe0 [ 94.677045][ T157] evict_folios+0x4bb/0x1580 [ 94.677890][ T157] ? isolate_folios+0x320/0x320 [ 94.678787][ T157] ? __lock_acquire+0xc4c/0x1ef0 [ 94.679695][ T157] ? lock_is_held_type+0x81/0xe0 [ 94.680607][ T157] try_to_shrink_lruvec+0x41e/0x9e0 [ 94.681564][ T157] ? __lock_acquire+0xc4c/0x1ef0 [ 94.682482][ T157] ? evict_folios+0x1580/0x1580 [ 94.683390][ T157] ? lock_release+0x105/0x260 [ 94.684255][ T157] lru_gen_shrink_node+0x25d/0x660 [ 94.685202][ T157] ? balance_pgdat+0x5b5/0xf00 [ 94.686083][ T157] ? try_to_shrink_lruvec+0x9e0/0x9e0 [ 94.687076][ T157] ? pgdat_balanced+0xb8/0x110 [ 94.687957][ T157] balance_pgdat+0x532/0xf00 [ 94.688803][ T157] ? shrink_node.part.0+0xc30/0xc30 [ 94.689758][ T157] ? io_schedule_timeout+0x110/0x110 [ 94.690741][ T157] ? reacquire_held_locks+0x4d0/0x4d0 [ 94.691723][ T157] ? __lock_acquire+0x1ef0/0x1ef0 [ 94.692643][ T157] ? zone_watermark_ok_safe+0x32/0x290 [ 94.693650][ T157] ? inactive_is_low.isra.0+0xe0/0xe0 [ 94.694639][ T157] ? do_raw_spin_lock+0x12a/0x260 [ 94.695567][ T157] kswapd+0x2ef/0x4e0 [ 94.696297][ T157] ? balance_pgdat+0xf00/0xf00 [ 94.697176][ T157] ? __kthread_parkme+0xb1/0x1c0 [ 94.698087][ T157] ? balance_pgdat+0xf00/0xf00 [ 94.698971][ T157] kthread+0x38b/0x700 [ 94.699721][ T157] ? kthread_is_per_cpu+0xb0/0xb0 [ 94.700648][ T157] ? lock_acquire+0x18a/0x1f0 [ 94.701516][ T157] ? kthread_is_per_cpu+0xb0/0xb0 [ 94.702438][ T157] ret_from_fork+0x2d/0x70 [ 94.703267][ T157] ? kthread_is_per_cpu+0xb0/0xb0 [ 94.704193][ T157] ret_from_fork_asm+0x11/0x20 [ 94.705074][ T157] </TASK> Also UAF in compactd [ 95.249096][ T146] ================================================================== [ 95.254091][ T146] BUG: KASAN: slab-use-after-free in kcompactd+0x9cd/0xa60 [ 95.257959][ T146] Read of size 4 at addr ffff888105100018 by task kcompactd0/146 [ 95.262100][ T146] [ 95.263347][ T146] CPU: 11 UID: 0 PID: 146 Comm: kcompactd0 Tainted: G D W 6.13.0+ #927 [ 95.263363][ T146] Tainted: [D]=DIE, [W]=WARN [ 95.263367][ T146] Call Trace: [ 95.263379][ T146] <TASK> [ 95.263386][ T146] dump_stack_lvl+0x57/0x80 [ 95.263403][ T146] print_address_description.constprop.0+0x88/0x330 [ 95.263416][ T146] ? kcompactd+0x9cd/0xa60 [ 95.263425][ T146] print_report+0xe2/0x1cc [ 95.263433][ T146] ? __virt_addr_valid+0x1d1/0x3b0 [ 95.263442][ T146] ? kcompactd+0x9cd/0xa60 [ 95.263449][ T146] ? 
kcompactd+0x9cd/0xa60 [ 95.263456][ T146] kasan_report+0xb9/0x180 [ 95.263466][ T146] ? kcompactd+0x9cd/0xa60 [ 95.263476][ T146] kcompactd+0x9cd/0xa60 [ 95.263487][ T146] ? kcompactd_do_work+0x710/0x710 [ 95.263495][ T146] ? prepare_to_swait_exclusive+0x260/0x260 [ 95.263506][ T146] ? __kthread_parkme+0xb1/0x1c0 [ 95.263520][ T146] ? kcompactd_do_work+0x710/0x710 [ 95.263527][ T146] kthread+0x38b/0x700 [ 95.263535][ T146] ? kthread_is_per_cpu+0xb0/0xb0 [ 95.263542][ T146] ? lock_acquire+0x18a/0x1f0 [ 95.263552][ T146] ? kthread_is_per_cpu+0xb0/0xb0 [ 95.263559][ T146] ret_from_fork+0x2d/0x70 [ 95.263569][ T146] ? kthread_is_per_cpu+0xb0/0xb0 [ 95.263576][ T146] ret_from_fork_asm+0x11/0x20 [ 95.263589][ T146] </TASK> [ 95.263592][ T146] [ 95.293474][ T146] Allocated by task 2: [ 95.294209][ T146] kasan_save_stack+0x1e/0x40 [ 95.295111][ T146] kasan_save_track+0x10/0x30 [ 95.295978][ T146] __kasan_slab_alloc+0x62/0x70 [ 95.296860][ T146] kmem_cache_alloc_node_noprof+0xdb/0x2a0 [ 95.297915][ T146] dup_task_struct+0x32/0x550 [ 95.298797][ T146] copy_process+0x309/0x45d0 [ 95.299656][ T146] kernel_clone+0xb7/0x600 [ 95.300451][ T146] kernel_thread+0xb0/0xe0 [ 95.301253][ T146] kthreadd+0x3b5/0x620 [ 95.302019][ T146] ret_from_fork+0x2d/0x70 [ 95.302865][ T146] ret_from_fork_asm+0x11/0x20 [ 95.303724][ T146] [ 95.304146][ T146] Freed by task 0: [ 95.304836][ T146] kasan_save_stack+0x1e/0x40 [ 95.305708][ T146] kasan_save_track+0x10/0x30 [ 95.306569][ T146] kasan_save_free_info+0x37/0x50 [ 95.307515][ T146] __kasan_slab_free+0x33/0x40 [ 95.308402][ T146] kmem_cache_free+0xff/0x480 [ 95.309256][ T146] delayed_put_task_struct+0x15a/0x1d0 [ 95.310258][ T146] rcu_do_batch+0x2ee/0xb70 [ 95.311113][ T146] rcu_core+0x4a6/0xa10 [ 95.311868][ T146] handle_softirqs+0x191/0x650 [ 95.312747][ T146] __irq_exit_rcu+0xaf/0xe0 [ 95.313643][ T146] irq_exit_rcu+0xa/0x20 [ 95.314536][ T146] sysvec_apic_timer_interrupt+0x65/0x80 [ 95.315616][ T146] asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 95.316702][ T146] [ 95.317127][ T146] Last potentially related work creation: [ 95.318155][ T146] kasan_save_stack+0x1e/0x40 [ 95.319006][ T146] kasan_record_aux_stack+0x97/0xa0 [ 95.319947][ T146] __call_rcu_common.constprop.0+0x70/0x7b0 [ 95.321014][ T146] __schedule+0x75d/0x1720 [ 95.321817][ T146] schedule_idle+0x55/0x80 [ 95.322624][ T146] cpu_startup_entry+0x50/0x60 [ 95.323490][ T146] start_secondary+0x1b6/0x210 [ 95.324354][ T146] common_startup_64+0x12c/0x138 [ 95.325248][ T146] [ 95.325669][ T146] The buggy address belongs to the object at ffff888105100000 [ 95.325669][ T146] which belongs to the cache task_struct of size 8200 [ 95.328215][ T146] The buggy address is located 24 bytes inside of [ 95.328215][ T146] freed 8200-byte region [ffff888105100000, ffff888105102008) [ 95.330692][ T146] [ 95.331116][ T146] The buggy address belongs to the physical page: [ 95.332275][ T146] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105100 [ 95.333862][ T146] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 95.335399][ T146] flags: 0x8000000000000040(head|zone=2) [ 95.336425][ T146] page_type: f5(slab) [ 95.337155][ T146] raw: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000 [ 95.338716][ T146] raw: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000 [ 95.340273][ T146] head: 8000000000000040 ffff888100a80c80 dead000000000122 0000000000000000 [ 95.341844][ T146] head: 0000000000000000 0000000000030003 00000000f5000000 0000000000000000 [ 
95.343418][ T146] head: 8000000000000003 ffffea0004144001 ffffffffffffffff 0000000000000000 [ 95.344977][ T146] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ 95.346540][ T146] page dumped because: kasan: bad access detected [ 95.347701][ T146] [ 95.348123][ T146] Memory state around the buggy address: [ 95.349139][ T146] ffff8881050fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 95.350598][ T146] ffff8881050fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 95.352054][ T146] >ffff888105100000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 95.353510][ T146] ^ [ 95.354389][ T146] ffff888105100080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 95.355856][ T146] ffff888105100100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 95.357315][ T146] ==================================================================
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc1826020159..c97adb0fdaa4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 		if (shmem_mapping(mapping) && folio_test_large(folio))
 			wbc.list = folio_list;
 
-		folio_set_reclaim(folio);
+		folio_set_dropbehind(folio);
+
 		res = mapping->a_ops->writepage(&folio->page, &wbc);
 		if (res < 0)
 			handle_write_error(mapping, folio, res);
 		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			folio_clear_reclaim(folio);
+			folio_clear_dropbehind(folio);
 			return PAGE_ACTIVATE;
 		}
 
-		if (!folio_test_writeback(folio)) {
-			/* synchronous write or broken a_ops? */
-			folio_clear_reclaim(folio);
-		}
 		trace_mm_vmscan_write_folio(folio);
 		node_stat_add_folio(folio, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;