diff mbox series

[2/3] fs/buffer: avoid races with folio migrations on __find_get_block_slow()

Message ID 20250330064732.3781046-3-mcgrof@kernel.org (mailing list archive)
State New
Headers show
Series mm: move migration work around to buffer-heads | expand

Commit Message

Luis Chamberlain March 30, 2025, 6:47 a.m. UTC
Filesystems which use buffer-heads where it cannot guarantees that the
there are no other references to the folio, for example with a folio
lock, must use buffer_migrate_folio_norefs() for the address space
mapping migrate_folio() callback. There are only 3 filesystems which use
this callback:

  1) the block device cache
  2) ext4 for its ext4_journalled_aops
  3) nilfs2

The commit ebdf4de5642fb6 ("mm: migrate: fix reference  check race
between __find_get_block() and migration") added a spin lock to
prevent races with page migration which ext4 users were reporting
through the SUSE bugzilla (bnc#1137609 [0]). Although implicit,
the spinlock is only held for users of buffer_migrate_folio_norefs()
which was added by commit 89cb0888ca148 ("mm: migrate: provide
buffer_migrate_page_norefs()") to support page migration on block
device folios. Later commit dae999602eeb ("ext4: stop providing
.writepage hook") made ext4_journalled_aops use the same callback.
It is worth elaborating on why ext4 journalled aops uses this:

    so that buffers cannot be modified under jdb2's hands as that can
    cause data corruption. For example when commit code does writeout
    of transaction buffers in jbd2_journal_write_metadata_buffer(), we
    don't hold page lock or have page writeback bit set or have the
    buffer locked. So page migration code would go and happily migrate
    the page elsewhere while the copy is running thus corrupting data.

Although we don't have exact traces of the filesystem corruption we
can can reproduce fs corruption one ext4 by just removing the spinlock
and stress testing the filesystem with generic/750, we eventually end
up after 3 hours of testing with kdevops using libvirt on the ext4
profile ext4-4k. Things like the below as reported recently [1]:

Mar 28 03:36:37 extra-ext4-4k unknown: run fstests generic/750 at 2025-03-28 03:36:37
<-- etc -->
Mar 28 05:57:09 extra-ext4-4k kernel: EXT4-fs error (device loop5): ext4_get_first_dir_block:3538: inode #5174: comm fsstress: directory missing '.'
Mar 28 06:04:43 extra-ext4-4k kernel: EXT4-fs warning (device loop5): ext4_empty_dir:3088: inode #5176: comm fsstress: directory missing '.'
Mar 28 06:42:05 extra-ext4-4k kernel: EXT4-fs error (device loop5): __ext4_find_entry:1626: inode #5173: comm fsstress: checksumming directory block 0
Mar 28 08:16:43 extra-ext4-4k kernel: EXT4-fs error (device loop5): ext4_find_extent:938: inode #1104560: comm fsstress: pblk 4932229 bad header/extent: invalid magic - magic 8383, entries 33667, max 33667(0), depth 33667(0)

The block device cache is a user of buffer_migrate_folio_norefs()
and it supports large folios, in that case we can sleep on
folio_mc_copy() on page migration on a cond_resched(). So we want
to avoid requiring a spin lock even on the buffer_migrate_folio_norefs()
case so to enable large folios on buffer-head folio migration.
To address this we must avoid races with folio migration in a
different way.

This provides an alternative by avoiding giving away a folio in
__find_get_block_slow() on folio migration candidates so to enable us
to let us later rip out the spin_lock() held on the folio migration
buffer_migrate_folio_norefs() path. We limit the scope of this sanity
check only for filesystems which cannot provide any guarantees that
there are no references to the folio, so only users of the folio
migration callback buffer_migrate_folio_norefs().

Although we have no direct clear semantics to check if a folio is
being evaluated for folio migration we know that folio migration
happens LRU folios [2]. Since folio migration must not be called
with folio_test_writeback() folios we can skip these folios as well.
The other corner case we can be concerned is for a drive implement
mops, but the docs indicate VM seems to use lru for that too.
A real concern to have here is if the check is starving readers or
writers who  want to read a block into the page cache and it
is part of the LRU. The path __getblk_slow() will first try
__find_get_block() which uses __filemap_get_folio() without FGP_CREAT,
and if it fails it will call grow_buffers() which calls again
__filemap_get_folio() but with with FGP_CREAT now, but
__filemap_get_folio() won't create a folio for us if it already exists.
So  if the folio was in LRU __getblk_slow()  will essentially end up
checking again for the folio until its gone from the page cache or
migration ended, effectively preventing a race with folio migration
which is what we want.

This commit and the subsequent one prove to be an alternative to fix
the filesystem corruption noted above.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0]
Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [1]
Link: https://docs.kernel.org/mm/page_migration.html # [2]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/buffer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Luis Chamberlain March 31, 2025, 7:58 p.m. UTC | #1
On Sat, Mar 29, 2025 at 11:47:31PM -0700, Luis Chamberlain wrote:
> Although we don't have exact traces of the filesystem corruption we
> can can reproduce fs corruption one ext4 by just removing the spinlock
> and stress testing the filesystem with generic/750, we eventually end
> up after 3 hours of testing with kdevops using libvirt on the ext4
> profile ext4-4k. Things like the below as reported recently [1]:
> 
> Mar 28 03:36:37 extra-ext4-4k unknown: run fstests generic/750 at 2025-03-28 03:36:37
> <-- etc -->
> Mar 28 05:57:09 extra-ext4-4k kernel: EXT4-fs error (device loop5): ext4_get_first_dir_block:3538: inode #5174: comm fsstress: directory missing '.'
> Mar 28 06:04:43 extra-ext4-4k kernel: EXT4-fs warning (device loop5): ext4_empty_dir:3088: inode #5176: comm fsstress: directory missing '.'
> Mar 28 06:42:05 extra-ext4-4k kernel: EXT4-fs error (device loop5): __ext4_find_entry:1626: inode #5173: comm fsstress: checksumming directory block 0
> Mar 28 08:16:43 extra-ext4-4k kernel: EXT4-fs error (device loop5): ext4_find_extent:938: inode #1104560: comm fsstress: pblk 4932229 bad header/extent: invalid magic - magic 8383, entries 33667, max 33667(0), depth 33667(0)

I reproduced again a corruption with ext4 when we remove the spin lock
alone with generic/750, the trace looks slightly different, and
this time I ran the test with ext4 with 2k block size filesystem. I
also reverted both "mm: migrate: remove folio_migrate_copy()" and
"mm: migrate: support poisoned recover from migrate folio" just in case
the extra check added by the later was helping avoid the corruption. I
also added some debugfs stats one can use to verify exact values of
how buffer_migrate_folio_norefs() can fail. I put this tree on
linux-kdevops 20250321-accel-ext4-bug branch [0] for other folks' convenience.
This time it took ~ 5 hours to reproduce and we start off with a jbd
jbd2_journal_dirty_metadata() kernel warning followed by putting
the filesystem in read-only mode:

Mar 31 06:51:30 extra-ext4-2k kernel: Linux version 6.14.0-rc7-next-20250321-00004-ge4b961f1dadb (mcgrof@beef) (gcc (Debian 14.2.0-16) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #92 SMP PREEMPT_DYNAMIC Mon Mar 31 06:43:08 UTC 2025
Mar 31 06:51:30 extra-ext4-2k kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.14.0-rc7-next-20250321-00004-ge4b961f1dadb root=PARTUUID=503fa6f2-2d5b-4d7e-8cf8-3a811de326ce ro console=tty0 console=tty1 console=ttyS0,115200n8 console=ttyS0

<-- snip -->

Mar 31 06:56:05 extra-ext4-2k unknown: run fstests generic/750 at 2025-03-31 06:56:05
Mar 31 06:56:06 extra-ext4-2k kernel: EXT4-fs (loop5): mounted filesystem 858bf9bf-6582-42d3-b2f5-41c7061033df r/w with ordered data mode. Quota mode: none.
Mar 31 07:02:04 extra-ext4-2k kernel: NOHZ tick-stop error: local softirq work is pending, handler #10!!!
Mar 31 07:35:22 extra-ext4-2k kernel: NOHZ tick-stop error: local softirq work is pending, handler #10!!!
Mar 31 07:53:05 extra-ext4-2k kernel: NOHZ tick-stop error: local softirq work is pending, handler #10!!!
Mar 31 11:47:01 extra-ext4-2k kernel: ------------[ cut here ]------------
Mar 31 11:47:01 extra-ext4-2k kernel: WARNING: CPU: 5 PID: 3092 at fs/jbd2/transaction.c:1552 jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel: Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat crc32c_generic fat kvm_intel kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd 9pnet_virtio virtio_balloon virtio_console button evdev joydev serio_raw nvme_fabrics drm dm_mod nvme_core nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover failover virtio_blk psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Mar 31 11:47:01 extra-ext4-2k kernel: CPU: 5 UID: 0 PID: 3092 Comm: fsstress Not tainted 6.14.0-rc7-next-20250321-00004-ge4b961f1dadb #92 PREEMPT(full) 
Mar 31 11:47:01 extra-ext4-2k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.11-5 01/28/2025
Mar 31 11:47:01 extra-ext4-2k kernel: RIP: 0010:jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel: Code: 30 0f 84 5b fe ff ff 0f 0b 41 bc 8b ff ff ff e9 69 fe ff ff 48 8b 04 24 4c 8b 48 70 4d 39 cf 0f 84 53 ff ff ff e9 22 c5 00 00 <0f> 0b 41 bc e4 ff ff ff e9 41 ff ff ff 0f 0b 90 0f 1f 40 00 90 90
Mar 31 11:47:01 extra-ext4-2k kernel: RSP: 0018:ffffb8bd83d2f7f8 EFLAGS: 00010246
Mar 31 11:47:01 extra-ext4-2k kernel: RAX: 0000000000000001 RBX: ffff9d72cb6d7528 RCX: 00000000000000fd
Mar 31 11:47:01 extra-ext4-2k kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: RBP: ffff9d72cf5127b8 R08: ffff9d72cf5127b8 R09: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R10: ffff9d733c8d983c R11: 0000000000006496 R12: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R13: ffff9d72c258cce8 R14: ffff9d72cb6d7530 R15: ffff9d72c25ff000
Mar 31 11:47:01 extra-ext4-2k kernel: FS:  00007f1f5ae7f740(0000) GS:ffff9d7386349000(0000) knlGS:0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 31 11:47:01 extra-ext4-2k kernel: CR2: 000055a76a03bfa8 CR3: 0000000108f76001 CR4: 0000000000772ef0
Mar 31 11:47:01 extra-ext4-2k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 31 11:47:01 extra-ext4-2k kernel: PKRU: 55555554
Mar 31 11:47:01 extra-ext4-2k kernel: Call Trace:
Mar 31 11:47:01 extra-ext4-2k kernel:  <TASK>
Mar 31 11:47:01 extra-ext4-2k kernel:  ? jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __warn.cold+0x93/0xf8
Mar 31 11:47:01 extra-ext4-2k kernel:  ? jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? report_bug+0xe6/0x170
Mar 31 11:47:01 extra-ext4-2k kernel:  ? jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? handle_bug+0x199/0x260
Mar 31 11:47:01 extra-ext4-2k kernel:  ? exc_invalid_op+0x13/0x60
Mar 31 11:47:01 extra-ext4-2k kernel:  ? asm_exc_invalid_op+0x16/0x20
Mar 31 11:47:01 extra-ext4-2k kernel:  ? jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? jbd2_journal_dirty_metadata+0x9e/0x230 [jbd2]
Mar 31 11:47:01 extra-ext4-2k kernel:  __ext4_handle_dirty_metadata+0x6d/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_ext_insert_extent+0x5c1/0x1510 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_cache_extents+0x5a/0xd0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_find_extent+0x37c/0x3a0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_ext_map_blocks+0x51e/0x1900 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_map_blocks+0x11a/0x4d0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? kmem_cache_alloc_noprof+0x321/0x3e0
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_do_writepages+0x762/0xd40 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_da_write_end+0xa3/0x3c0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_writepages+0xd7/0x1b0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_writepages+0xd7/0x1b0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  do_writepages+0xdd/0x250
Mar 31 11:47:01 extra-ext4-2k kernel:  filemap_fdatawrite_wbc+0x48/0x60
Mar 31 11:47:01 extra-ext4-2k kernel:  __filemap_fdatawrite_range+0x5b/0x80
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_release_file+0x6d/0xa0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  __fput+0xf4/0x2b0
Mar 31 11:47:01 extra-ext4-2k kernel:  __x64_sys_close+0x39/0x80
Mar 31 11:47:01 extra-ext4-2k kernel:  do_syscall_64+0x4b/0x120
Mar 31 11:47:01 extra-ext4-2k kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Mar 31 11:47:01 extra-ext4-2k kernel: RIP: 0033:0x7f1f5af11687
Mar 31 11:47:01 extra-ext4-2k kernel: Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
Mar 31 11:47:01 extra-ext4-2k kernel: RSP: 002b:00007ffc2509e1d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
Mar 31 11:47:01 extra-ext4-2k kernel: RAX: ffffffffffffffda RBX: 00007f1f5ae7f740 RCX: 00007f1f5af11687
Mar 31 11:47:01 extra-ext4-2k kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
Mar 31 11:47:01 extra-ext4-2k kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R13: 00007ffc2509e7c0 R14: 0000000000000003 R15: 0000000000000004
Mar 31 11:47:01 extra-ext4-2k kernel:  </TASK>
Mar 31 11:47:01 extra-ext4-2k kernel: ---[ end trace 0000000000000000 ]---
Mar 31 11:47:01 extra-ext4-2k kernel: ------------[ cut here ]------------
Mar 31 11:47:01 extra-ext4-2k kernel: WARNING: CPU: 5 PID: 3092 at fs/ext4/ext4_jbd2.c:360 __ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel: Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat crc32c_generic fat kvm_intel kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd 9pnet_virtio virtio_balloon virtio_console button evdev joydev serio_raw nvme_fabrics drm dm_mod nvme_core nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover failover virtio_blk psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Mar 31 11:47:01 extra-ext4-2k kernel: CPU: 5 UID: 0 PID: 3092 Comm: fsstress Tainted: G        W           6.14.0-rc7-next-20250321-00004-ge4b961f1dadb #92 PREEMPT(full) 
Mar 31 11:47:01 extra-ext4-2k kernel: Tainted: [W]=WARN
Mar 31 11:47:01 extra-ext4-2k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.11-5 01/28/2025
Mar 31 11:47:01 extra-ext4-2k kernel: RIP: 0010:__ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel: Code: 00 00 00 44 89 fa 4c 89 f6 49 c7 c1 df 2b 97 c0 4c 89 e7 41 bd fb ff ff ff e8 6a ce 05 00 eb 97 48 89 df e8 60 99 8a f5 eb 8a <0f> 0b 48 c7 c2 60 42 96 c0 45 89 e8 48 89 e9 44 89 fe 4c 89 f7 e8
Mar 31 11:47:01 extra-ext4-2k kernel: RSP: 0018:ffffb8bd83d2f838 EFLAGS: 00010286
Mar 31 11:47:01 extra-ext4-2k kernel: RAX: ffff9d72c04ef800 RBX: ffff9d72cf5127b8 RCX: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000ffffffff
Mar 31 11:47:01 extra-ext4-2k kernel: RBP: ffff9d72c258cce8 R08: ffff9d72cf5127b8 R09: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R10: ffff9d733c8d983c R11: 0000000000006496 R12: ffff9d71d05fe378
Mar 31 11:47:01 extra-ext4-2k kernel: R13: 00000000ffffffe4 R14: ffffffffc0964720 R15: 0000000000000556
Mar 31 11:47:01 extra-ext4-2k kernel: FS:  00007f1f5ae7f740(0000) GS:ffff9d7386349000(0000) knlGS:0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 31 11:47:01 extra-ext4-2k kernel: CR2: 000055a76a03bfa8 CR3: 0000000108f76001 CR4: 0000000000772ef0
Mar 31 11:47:01 extra-ext4-2k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 31 11:47:01 extra-ext4-2k kernel: PKRU: 55555554
Mar 31 11:47:01 extra-ext4-2k kernel: Call Trace:
Mar 31 11:47:01 extra-ext4-2k kernel:  <TASK>
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __warn.cold+0x93/0xf8
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? report_bug+0xe6/0x170
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? handle_bug+0x199/0x260
Mar 31 11:47:01 extra-ext4-2k kernel:  ? exc_invalid_op+0x13/0x60
Mar 31 11:47:01 extra-ext4-2k kernel:  ? asm_exc_invalid_op+0x16/0x20
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __ext4_handle_dirty_metadata+0x102/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? __ext4_handle_dirty_metadata+0x6d/0x190 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_ext_insert_extent+0x5c1/0x1510 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_cache_extents+0x5a/0xd0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_find_extent+0x37c/0x3a0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_ext_map_blocks+0x51e/0x1900 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_map_blocks+0x11a/0x4d0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? kmem_cache_alloc_noprof+0x321/0x3e0
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_do_writepages+0x762/0xd40 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_da_write_end+0xa3/0x3c0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ? ext4_writepages+0xd7/0x1b0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_writepages+0xd7/0x1b0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  do_writepages+0xdd/0x250
Mar 31 11:47:01 extra-ext4-2k kernel:  filemap_fdatawrite_wbc+0x48/0x60
Mar 31 11:47:01 extra-ext4-2k kernel:  __filemap_fdatawrite_range+0x5b/0x80
Mar 31 11:47:01 extra-ext4-2k kernel:  ext4_release_file+0x6d/0xa0 [ext4]
Mar 31 11:47:01 extra-ext4-2k kernel:  __fput+0xf4/0x2b0
Mar 31 11:47:01 extra-ext4-2k kernel:  __x64_sys_close+0x39/0x80
Mar 31 11:47:01 extra-ext4-2k kernel:  do_syscall_64+0x4b/0x120
Mar 31 11:47:01 extra-ext4-2k kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Mar 31 11:47:01 extra-ext4-2k kernel: RIP: 0033:0x7f1f5af11687
Mar 31 11:47:01 extra-ext4-2k kernel: Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
Mar 31 11:47:01 extra-ext4-2k kernel: RSP: 002b:00007ffc2509e1d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
Mar 31 11:47:01 extra-ext4-2k kernel: RAX: ffffffffffffffda RBX: 00007f1f5ae7f740 RCX: 00007f1f5af11687
Mar 31 11:47:01 extra-ext4-2k kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
Mar 31 11:47:01 extra-ext4-2k kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
Mar 31 11:47:01 extra-ext4-2k kernel: R13: 00007ffc2509e7c0 R14: 0000000000000003 R15: 0000000000000004
Mar 31 11:47:01 extra-ext4-2k kernel:  </TASK>
Mar 31 11:47:01 extra-ext4-2k kernel: ---[ end trace 0000000000000000 ]---
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs: ext4_ext_grow_indepth:1366: aborting transaction: error 28 in __ext4_handle_dirty_metadata
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5): ext4_ext_grow_indepth:1366: inode #213689: block 9062541: comm fsstress: journal_dirty_metadata failed: handle type 2 started at line 2721, credits 11/0, errcode -28
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5) in ext4_mb_clear_bb:6550: Readonly filesystem
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5) in ext4_reserve_inode_write:5870: Readonly filesystem
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5): mpage_map_and_submit_extent:2336: inode #213689: comm fsstress: mark_inode_dirty error
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5): mpage_map_and_submit_extent:2338: comm fsstress: Failed to mark inode 213689 dirty
Mar 31 11:47:01 extra-ext4-2k kernel: EXT4-fs error (device loop5) in ext4_do_writepages:2751: error 28

So at least we now have 2 different public filesystem corruptions traces with
ext4 without the spin lock.

[0] https://github.com/linux-kdevops/linux/tree/20250321-accel-ext4-bug

  Luis
Jan Kara April 1, 2025, 10:57 a.m. UTC | #2
On Sat 29-03-25 23:47:31, Luis Chamberlain wrote:
> diff --git a/fs/buffer.c b/fs/buffer.c
> index c7abb4a029dc..a4e4455a6ce2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -208,6 +208,15 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
>  	head = folio_buffers(folio);
>  	if (!head)
>  		goto out_unlock;
> +
> +	if (folio->mapping->a_ops->migrate_folio &&
> +	    folio->mapping->a_ops->migrate_folio == buffer_migrate_folio_norefs) {

This is always true for bdev mapping we have here, isn't it?

> +		if (folio_test_lru(folio) &&

Do you expect bdev page cache to contain non-LRU folios? I thought every
pagecache folio is on LRU so this seems pointless as well? But I may be
missing something here.

> +		    folio_test_locked(folio) &&
> +		    !folio_test_writeback(folio))
> +			goto out_unlock;

I find this problematic. It fixes the race with migration, alright
(although IMO we should have a comment very well explaining the interplay
of folio lock and mapping->private_lock to make this work - probably in
buffer_migrate_folio_norefs() - and reference it from here), but there are
places which expect that if __find_get_block() doesn't return anything,
this block is not cached in the buffer cache. And your change breaks this
assumption. Look for example at write_boundary_block(), that will fail to
write the block it should write if it races with someone locking the folio
after your changes. Similarly the code tracking state of deleted metadata
blocks in fs/jbd2/revoke.c will fail to properly update buffer's state if
__find_get_block() suddently starts returning NULL although the buffer is
present in cache. 

> +	}
> +
>  	bh = head;
>  	do {
>  		if (!buffer_mapped(bh))

								Honza
Davidlohr Bueso April 1, 2025, 9:49 p.m. UTC | #3
On Tue, 01 Apr 2025, Jan Kara wrote:

>I find this problematic. It fixes the race with migration, alright
>(although IMO we should have a comment very well explaining the interplay
>of folio lock and mapping->private_lock to make this work - probably in
>buffer_migrate_folio_norefs() - and reference it from here), but there are
>places which expect that if __find_get_block() doesn't return anything,
>this block is not cached in the buffer cache. And your change breaks this
>assumption. Look for example at write_boundary_block(), that will fail to
>write the block it should write if it races with someone locking the folio
>after your changes. Similarly the code tracking state of deleted metadata
>blocks in fs/jbd2/revoke.c will fail to properly update buffer's state if
>__find_get_block() suddently starts returning NULL although the buffer is
>present in cache.

Yeah - one thing I was thinking about, _iff_ failing lookups (__find_get_block()
returning nil) during migration is in fact permitted, was adding a BH_migrate
flag and serialize vs __buffer_migrate_folio() entirely. Semantically there
are no users, and none are added during this window, but as a consequence I
suppose one thread could see the page not cached, act upon that, then see it
cached once the migration is done and get confused(?). So I don't see a problem
here for write_boundary_block() specifically, but I'm probably overlooking others.

Now, if bailing on the lookup is not an option, meaning it must wait for the
migration to complete, I'm not sure large folios will ever be compatible with
the "Various filesystems appear to want __find_get_block to be non-blocking."
comment.

So the below could be tucked in for norefs only (because this is about the addr
space i_private_lock), but this also shortens the hold time; if that matters
at all, of course, vs changing the migration semantics.

Thanks,
Davidlohr

---8<----------------------------
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..f585339ae2e4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -208,6 +208,14 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
	head = folio_buffers(folio);
	if (!head)
		goto out_unlock;
+
+	bh = head;
+	do {
+		if (test_bit(BH_migrate, &bh->b_state))
+			goto out_unlock;
+		bh = bh->b_this_page;
+	} while (bh != head);
+
	bh = head;
	do {
		if (!buffer_mapped(bh))
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 932139c5d46f..e956a1509a05 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -34,6 +34,7 @@ enum bh_state_bits {
	BH_Meta,	/* Buffer contains metadata */
	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_migrate,     /* Buffer is being migrated */

	BH_PrivateStart,/* not a state bit, but the first bit available
			 * for private allocation by other entities
diff --git a/mm/migrate.c b/mm/migrate.c
index a073eb6c5009..0ffa8b478fd3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -846,6 +846,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
	if (!buffer_migrate_lock_buffers(head, mode))
		return -EAGAIN;

+	bh = head;
+	do {
+		set_bit(BH_migrate, &bh->b_state);
+		bh = bh->b_this_page;
+	} while (bh != head);
+
	if (check_refs) {
		bool busy;
		bool invalidated = false;
@@ -861,12 +867,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
			}
			bh = bh->b_this_page;
		} while (bh != head);
+		spin_unlock(&mapping->i_private_lock);
		if (busy) {
			if (invalidated) {
				rc = -EAGAIN;
				goto unlock_buffers;
			}
-			spin_unlock(&mapping->i_private_lock);
			invalidate_bh_lrus();
			invalidated = true;
			goto recheck_buffers;
@@ -884,10 +890,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
	} while (bh != head);

  unlock_buffers:
-	if (check_refs)
-		spin_unlock(&mapping->i_private_lock);
	bh = head;
	do {
+		clear_bit(BH_migrate, &bh->b_state)
		unlock_buffer(bh);
		bh = bh->b_this_page;
	} while (bh != head);
Matthew Wilcox (Oracle) April 2, 2025, 1:58 a.m. UTC | #4
On Tue, Apr 01, 2025 at 02:49:51PM -0700, Davidlohr Bueso wrote:
> So the below could be tucked in for norefs only (because this is about the addr
> space i_private_lock), but this also shortens the hold time; if that matters
> at all, of course, vs changing the migration semantics.

I like this approach a lot better.  One wrinkle is that it doesn't seem
that we need to set the BH_Migrate bit on every buffer; we could define
that it's only set on the head BH, right?
diff mbox series

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index c7abb4a029dc..a4e4455a6ce2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -208,6 +208,15 @@  __find_get_block_slow(struct block_device *bdev, sector_t block)
 	head = folio_buffers(folio);
 	if (!head)
 		goto out_unlock;
+
+	if (folio->mapping->a_ops->migrate_folio &&
+	    folio->mapping->a_ops->migrate_folio == buffer_migrate_folio_norefs) {
+		if (folio_test_lru(folio) &&
+		    folio_test_locked(folio) &&
+		    !folio_test_writeback(folio))
+			goto out_unlock;
+	}
+
 	bh = head;
 	do {
 		if (!buffer_mapped(bh))