
[v2,1/8] migrate: fix skipping metadata buffer heads on migration

Message ID 20250410014945.2140781-2-mcgrof@kernel.org (mailing list archive)
State New
Series mm: enhance migration work around on noref buffer-heads

Commit Message

Luis Chamberlain April 10, 2025, 1:49 a.m. UTC
Filesystems which use buffer-heads and cannot guarantee that there
are no other references to the folio, for example via a folio
lock, must use buffer_migrate_folio_norefs() for the address space
mapping migrate_folio() callback. There are only 3 filesystems which use
this callback:

  1) the block device cache
  2) ext4 for its ext4_journalled_aops, ie, jbd2
  3) nilfs2

jbd2's use of this callback, however, is very race prone; consider
folio migration while reviewing jbd2_journal_write_metadata_buffer()
and the fact that jbd2:

  - does not hold the folio lock
  - does not have the page writeback bit set
  - does not lock the buffer

And so, it can race with folio_set_bh() on folio migration. Commit
ebdf4de5642fb6 ("mm: migrate: fix reference check race between
__find_get_block() and migration") added a spin lock to prevent races
with page migration which ext4 users were reporting through the SUSE
bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
original filesystem corruption, we can reproduce fs corruption on
ext4 by just removing the spinlock and stress testing the filesystem
with generic/750: we eventually hit corruption after 3 hours of testing
with kdevops using libvirt on the ext4-4k and ext4-2k profiles. It
turns out that the spin lock does not, in the end, protect against
corruption; it *helps* reduce the possibility, but ext4 filesystem
corruption can still happen even with the spin lock held. A test was
done using vanilla Linux, adding a udelay(2000) right before the
spin_lock(&bd_mapping->i_private_lock) in __find_get_block_slow(), and
we can reproduce the exact same filesystem corruption issues as observed
without the spinlock with generic/750 [1].

We now have a slew of traces collected for the ext4 corruptions possible
without the current spin lock held [2] [3] [4] but the general pattern
is as follows, as summarized by ChatGPT from all traces:

do_writepages() # write back -->
   ext4_map_blocks() # performs logical to physical block mapping -->
     ext4_ext_insert_extent() # updates extent tree -->
       jbd2_journal_dirty_metadata()  # marks metadata as dirty for
                                      # journaling. This can lead
                                      # to any of the following hints
                                      # as to what happened from
                                      # ext4 / jbd2

         - Directory and extent metadata corruption splats or

         - Failure to handle out-of-space conditions gracefully, with
           cascading metadata errors and eventual filesystem shutdown
           to prevent further damage.

         - Failure to journal new extent metadata during extent tree
           growth, triggered under memory pressure or heavy writeback.
           Commonly results in ENOSPC, journal abort, and read-only
           fallback. **

         - Journal metadata failure during extent tree growth causes
           read-only fallback. Seen repeatedly on small-block (2k)
           filesystems under stress (e.g. fsstress). Triggers errors in
           bitmap and inode updates, and persists in journal replay logs.
           "Error count since last fsck" shows long-term corruption
           footprint.

** Reproduced on vanilla Linux with udelay(2000) **

Call trace (ENOSPC journal failure):
  do_writepages()
    → ext4_do_writepages()
      → ext4_map_blocks()
        → ext4_ext_map_blocks()
          → ext4_ext_insert_extent()
            → __ext4_handle_dirty_metadata()
              → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

And so jbd2 still needs more work to avoid races with folio migration.
So replace the current spin lock solution by just skipping jbd2 buffers
on folio migration. We can identify jbd2 buffers because
__ext4_handle_dirty_metadata() is the only user of set_buffer_meta().
By checking for buffer_meta() and bailing on migration we fix the
existing racy ext4 corruption, resolve the sleeping-while-atomic
complaints originally reported by 0-day [5], and pave the way for
buffer-heads for more users of large folios other than the block
device cache.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0]
Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250408-ext4-jbd-force-corruption # [1]
Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [2]
Link: https://lore.kernel.org/all/Z-rzyrS0Jr7t984Y@bombadil.infradead.org/ # [3]
Link: https://lore.kernel.org/all/Z_AA9SHZdRGq6tb4@bombadil.infradead.org/ # [4]
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com # [5]
Fixes: ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/migrate.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Matthew Wilcox April 10, 2025, 3:18 a.m. UTC | #1
On Wed, Apr 09, 2025 at 06:49:38PM -0700, Luis Chamberlain wrote:
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> +	if (buffer_meta(head))
> +		return -EAGAIN;

This isn't enough on filesystems with bs<PS.  You're only testing the
meta bit on the first buffer_head on the folio and not on the rest of
them.  If this is the right approach to take, then we want:

+++ b/mm/migrate.c
@@ -799,6 +799,8 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
        struct buffer_head *failed_bh;

        do {
+               if (buffer_meta(bh))
+                       goto unlock;
                if (!trylock_buffer(bh)) {
                        if (mode == MIGRATE_ASYNC)
                                goto unlock;
Jan Kara April 10, 2025, 12:05 p.m. UTC | #2
On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> Filesystems which use buffer-heads where it cannot guarantees that there
> are no other references to the folio, for example with a folio
> lock, must use buffer_migrate_folio_norefs() for the address space
> mapping migrate_folio() callback. There are only 3 filesystems which use
> this callback:
> 
>   1) the block device cache

Well, but through this also all simple filesystems that use buffer_heads
for metadata handling...

>   2) ext4 for its ext4_journalled_aops, ie, jbd2
>   3) nilfs2
> 
> jbd2's use of this however callback however is very race prone, consider
> folio migration while reviewing jbd2_journal_write_metadata_buffer()
> and the fact that jbd2:
> 
>   - does not hold the folio lock
>   - does not have have page writeback bit set
>   - does not lock the buffer
>
> And so, it can race with folio_set_bh() on folio migration. The commit
> ebdf4de5642fb6 ("mm: migrate: fix reference  check race between
> __find_get_block() and migration") added a spin lock to prevent races
> with page migration which ext4 users were reporting through the SUSE
> bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
> original filesystem corruption we can can reproduce fs corruption on
> ext4 by just removing the spinlock and stress testing the filesystem
> with generic/750, we eventually end up after 3 hours of testing with
> kdevops using libvirt on the ext4 profiles ext4-4k and ext4-2k.

Correct, jbd2 holds bh reference (its private jh structure attached to
bh->b_private holds it) and that is expected to protect jbd2 from anybody
else mucking with the bh.

> It turns out that the spin lock doesn't in the end protect against
> corruption, it *helps* reduce the possibility, but ext4 filesystem
> corruption can still happen even with the spin lock held. A test was
> done using vanilla Linux and adding a udelay(2000) right before we
> spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> we can reproduce the same exact filesystem corruption issues as observed
> without the spinlock with generic/750 [1].

This is unexpected.

> ** Reproduced on vanilla Linux with udelay(2000) **
> 
> Call trace (ENOSPC journal failure):
>   do_writepages()
>     → ext4_do_writepages()
>       → ext4_map_blocks()
>         → ext4_ext_map_blocks()
>           → ext4_ext_insert_extent()
>             → __ext4_handle_dirty_metadata()
>               → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

Curious. Did you try running e2fsck after the filesystem complained like
this? This complains about journal handle not having enough credits for
needed metadata update. Either we've lost some update to the journal_head
structure (b_modified got accidentally cleared) or some update to extent
tree.

> And so jbd2 still needs more work to avoid races with folio migration.
> So replace the current spin lock solution by just skipping jbd buffers
> on folio migration. We identify jbd buffers as its the only user of
> set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for
> buffer_meta() and bailing on migration we fix the existing racy ext4
> corruption while also removing the spin lock to be held while sleeping
> complaints originally reported by 0-day [5], and paves the way for
> buffer-heads for more users of large folios other than the block
> device cache.

I think we need to understand why private_lock protection does not protect
bh users holding reference like jbd2 from folio migration before papering
over this problem with the hack. Because there are chances other simple
filesystems suffer from the same problem...

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f3ee6d8d5e2e..32fa72ba10b4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> +	if (buffer_meta(head))
> +		return -EAGAIN;
> +
>  	if (!buffer_migrate_lock_buffers(head, mode))
>  		return -EAGAIN;
>  
> @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  			}
>  			bh = bh->b_this_page;
>  		} while (bh != head);
> +		spin_unlock(&mapping->i_private_lock);

No, you've just broken all simple filesystems (like ext2) with this patch.
You can reduce the spinlock critical section only after providing
alternative way to protect them from migration. So this should probably
happen at the end of the series.

								Honza
Luis Chamberlain April 14, 2025, 9:09 p.m. UTC | #3
On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> >  			}
> >  			bh = bh->b_this_page;
> >  		} while (bh != head);
> > +		spin_unlock(&mapping->i_private_lock);
> 
> No, you've just broken all simple filesystems (like ext2) with this patch.
> You can reduce the spinlock critical section only after providing
> alternative way to protect them from migration. So this should probably
> happen at the end of the series.

So you're OK with this spin lock move with the other series in place?

And so we punt the hard-to-reproduce corruption issue as future work
to do? Because the other alternative for now is to just disable
migration for jbd2:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1dc09ed5d403..ef1c3ef68877 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.bmap			= ext4_bmap,
 	.invalidate_folio	= ext4_journalled_invalidate_folio,
 	.release_folio		= ext4_release_folio,
-	.migrate_folio		= buffer_migrate_folio_norefs,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
 	.swap_activate		= ext4_iomap_swap_activate,
Luis Chamberlain April 14, 2025, 10:19 p.m. UTC | #4
On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > >  			}
> > >  			bh = bh->b_this_page;
> > >  		} while (bh != head);
> > > +		spin_unlock(&mapping->i_private_lock);
> > 
> > No, you've just broken all simple filesystems (like ext2) with this patch.
> > You can reduce the spinlock critical section only after providing
> > alternative way to protect them from migration. So this should probably
> > happen at the end of the series.
> 
> So you're OK with this spin lock move with the other series in place?
> 
> And so we punt the hard-to-reproduce corruption issue as future work
> to do? Becuase the other alternative for now is to just disable
> migration for jbd2:
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dc09ed5d403..ef1c3ef68877 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
>  	.bmap			= ext4_bmap,
>  	.invalidate_folio	= ext4_journalled_invalidate_folio,
>  	.release_folio		= ext4_release_folio,
> -	.migrate_folio		= buffer_migrate_folio_norefs,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  	.error_remove_folio	= generic_error_remove_folio,
>  	.swap_activate		= ext4_iomap_swap_activate,

BTW I ask because... is your expectation that the next v3 series also
be a target for Linus' tree as part of a fix for this spinlock
replacement?

  Luis
Davidlohr Bueso April 15, 2025, 1:36 a.m. UTC | #5
On Wed, 09 Apr 2025, Luis Chamberlain wrote:

>corruption can still happen even with the spin lock held. A test was
>done using vanilla Linux and adding a udelay(2000) right before we
>spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
>we can reproduce the same exact filesystem corruption issues as observed
>without the spinlock with generic/750 [1].

fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
not even having to add the artificial delay.

[336534.157119] ------------[ cut here ]------------
[336534.158911] WARNING: CPU: 3 PID: 87221 at fs/jbd2/transaction.c:1552 jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
[336534.160771] Modules linked in: loop sunrpc 9p kvm_intel nls_iso8859_1 nls_cp437 vfat fat crc32c_generic kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd 9pnet_virtio cryptd virtio_balloon virtio_console evdev joydev button nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover virtio_blk failover psmouse serio_raw virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[336534.173218] CPU: 3 UID: 0 PID: 87221 Comm: kworker/u36:8 Not tainted 6.15.0-rc1 #2 PREEMPT(full)
[336534.175146] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2025.02-5 03/28/2025
[336534.176947] Workqueue: writeback wb_workfn (flush-7:5)
[336534.178183] RIP: 0010:jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
[336534.179626] Code: 30 0f 84 5b fe ff ff 0f 0b 41 bc 8b ff ff ff e9 69 fe ff ff 48 8b 04 24 4c 8b 48 70 4d 39 cf 0f 84 53 ff ff ff e9 32 c3 00 00 <0f> 0b 41 bc e4 ff ff ff e9 41 ff ff ff 0f 0b 90 0f 1f 40 00 90 90
[336534.183983] RSP: 0018:ffff9f168d38f548 EFLAGS: 00010246
[336534.185194] RAX: 0000000000000001 RBX: ffff8c0ae8244e10 RCX: 00000000000000fd
[336534.186810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[336534.188399] RBP: ffff8c09db0d8618 R08: ffff8c09db0d8618 R09: 0000000000000000
[336534.189977] R10: ffff8c0b2671a83c R11: 0000000000006989 R12: 0000000000000000
[336534.191243] R13: ffff8c09cc3b33f0 R14: ffff8c0ae8244e18 R15: ffff8c0ad5e0ef00
[336534.192469] FS:  0000000000000000(0000) GS:ffff8c0b95b8d000(0000) knlGS:0000000000000000
[336534.193840] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[336534.194831] CR2: 00007f0ebab4f000 CR3: 000000011e616005 CR4: 0000000000772ef0
[336534.196044] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[336534.197274] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[336534.198473] PKRU: 55555554
[336534.198927] Call Trace:
[336534.199350]  <TASK>
[336534.199701]  __ext4_handle_dirty_metadata+0x5c/0x190 [ext4]
[336534.200626]  ext4_ext_insert_extent+0x575/0x1440 [ext4]
[336534.201465]  ? ext4_cache_extents+0x5a/0xd0 [ext4]
[336534.202243]  ? ext4_find_extent+0x37c/0x3a0 [ext4]
[336534.203024]  ext4_ext_map_blocks+0x50e/0x18d0 [ext4]
[336534.203803]  ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
[336534.204723]  ext4_map_blocks+0x11a/0x4d0 [ext4]
[336534.205442]  ? ext4_alloc_io_end_vec+0x1f/0x70 [ext4]
[336534.206239]  ? kmem_cache_alloc_noprof+0x310/0x3d0
[336534.206982]  ext4_do_writepages+0x762/0xd40 [ext4]
[336534.207706]  ? __pfx_block_write_full_folio+0x10/0x10
[336534.208451]  ? ext4_writepages+0xc6/0x1a0 [ext4]
[336534.209161]  ext4_writepages+0xc6/0x1a0 [ext4]
[336534.209834]  do_writepages+0xdd/0x250
[336534.210378]  ? filemap_get_read_batch+0x170/0x310
[336534.211069]  __writeback_single_inode+0x41/0x330
[336534.211738]  writeback_sb_inodes+0x21b/0x4d0
[336534.212375]  __writeback_inodes_wb+0x4c/0xe0
[336534.212998]  wb_writeback+0x19c/0x320
[336534.213546]  wb_workfn+0x30e/0x440
[336534.214039]  process_one_work+0x188/0x340
[336534.214650]  worker_thread+0x246/0x390
[336534.215196]  ? _raw_spin_lock_irqsave+0x23/0x50
[336534.215879]  ? __pfx_worker_thread+0x10/0x10
[336534.216522]  kthread+0x104/0x250
[336534.217004]  ? __pfx_kthread+0x10/0x10
[336534.217554]  ? _raw_spin_unlock+0x15/0x30
[336534.218140]  ? finish_task_switch.isra.0+0x94/0x290
[336534.218979]  ? __pfx_kthread+0x10/0x10
[336534.220347]  ret_from_fork+0x2d/0x50
[336534.221086]  ? __pfx_kthread+0x10/0x10
[336534.221703]  ret_from_fork_asm+0x1a/0x30
[336534.222415]  </TASK>
[336534.222775] ---[ end trace 0000000000000000 ]---
Christian Brauner April 15, 2025, 9:05 a.m. UTC | #6
On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > >  			}
> > > >  			bh = bh->b_this_page;
> > > >  		} while (bh != head);
> > > > +		spin_unlock(&mapping->i_private_lock);
> > > 
> > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > You can reduce the spinlock critical section only after providing
> > > alternative way to protect them from migration. So this should probably
> > > happen at the end of the series.
> > 
> > So you're OK with this spin lock move with the other series in place?
> > 
> > And so we punt the hard-to-reproduce corruption issue as future work
> > to do? Becuase the other alternative for now is to just disable
> > migration for jbd2:
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 1dc09ed5d403..ef1c3ef68877 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> >  	.bmap			= ext4_bmap,
> >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> >  	.release_folio		= ext4_release_folio,
> > -	.migrate_folio		= buffer_migrate_folio_norefs,
> >  	.is_partially_uptodate  = block_is_partially_uptodate,
> >  	.error_remove_folio	= generic_error_remove_folio,
> >  	.swap_activate		= ext4_iomap_swap_activate,
> 
> BTW I ask because.. are your expectations that the next v3 series also
> be a target for Linus tree as part of a fix for this spinlock
> replacement?

Since this is fixing potential filesystem corruption I will upstream
whatever we need to do to fix this. Ideally we have a minimal fix to
upstream now and a comprehensive fix and cleanup for v6.16.
Jan Kara April 15, 2025, 11:17 a.m. UTC | #7
On Mon 14-04-25 15:19:33, Luis Chamberlain wrote:
> On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > >  			}
> > > >  			bh = bh->b_this_page;
> > > >  		} while (bh != head);
> > > > +		spin_unlock(&mapping->i_private_lock);
> > > 
> > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > You can reduce the spinlock critical section only after providing
> > > alternative way to protect them from migration. So this should probably
> > > happen at the end of the series.
> > 
> > So you're OK with this spin lock move with the other series in place?
> > 
> > And so we punt the hard-to-reproduce corruption issue as future work
> > to do? Becuase the other alternative for now is to just disable
> > migration for jbd2:
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 1dc09ed5d403..ef1c3ef68877 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> >  	.bmap			= ext4_bmap,
> >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> >  	.release_folio		= ext4_release_folio,
> > -	.migrate_folio		= buffer_migrate_folio_norefs,
> >  	.is_partially_uptodate  = block_is_partially_uptodate,
> >  	.error_remove_folio	= generic_error_remove_folio,
> >  	.swap_activate		= ext4_iomap_swap_activate,
> 
> BTW I ask because.. are your expectations that the next v3 series also
> be a target for Linus tree as part of a fix for this spinlock
> replacement?

I have no problem with this (the use of data=journal mode is rare) but I
suspect this won't fix the corruption you're seeing because that happens in
the block device mapping. So to fix that you'd have to disable page
migration for block device mappings (i.e., do the above with def_blk_aops)
and *that* will make a lot of people unhappy.

								Honza
Jan Kara April 15, 2025, 11:23 a.m. UTC | #8
On Mon 14-04-25 14:09:44, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > >  			}
> > >  			bh = bh->b_this_page;
> > >  		} while (bh != head);
> > > +		spin_unlock(&mapping->i_private_lock);
> > 
> > No, you've just broken all simple filesystems (like ext2) with this patch.
> > You can reduce the spinlock critical section only after providing
> > alternative way to protect them from migration. So this should probably
> > happen at the end of the series.
> 
> So you're OK with this spin lock move with the other series in place?

Yes.

> And so we punt the hard-to-reproduce corruption issue as future work
> to do?

Yes, I'm not happy about that but I prefer that over putting band aids
into that code. Even more so because we don't even understand whether they
fix the problem or just make this reproducer stop working...

> Becuase the other alternative for now is to just disable
> migration for jbd2:
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dc09ed5d403..ef1c3ef68877 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
>  	.bmap			= ext4_bmap,
>  	.invalidate_folio	= ext4_journalled_invalidate_folio,
>  	.release_folio		= ext4_release_folio,
> -	.migrate_folio		= buffer_migrate_folio_norefs,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  	.error_remove_folio	= generic_error_remove_folio,
>  	.swap_activate		= ext4_iomap_swap_activate,

This will not disable migration for JBD2 buffers. This will just disable
migration for files with data journalling. See my other reply for details.

								Honza
Jan Kara April 15, 2025, 11:25 a.m. UTC | #9
On Mon 14-04-25 18:36:41, Davidlohr Bueso wrote:
> On Wed, 09 Apr 2025, Luis Chamberlain wrote:
> 
> > corruption can still happen even with the spin lock held. A test was
> > done using vanilla Linux and adding a udelay(2000) right before we
> > spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> > we can reproduce the same exact filesystem corruption issues as observed
> > without the spinlock with generic/750 [1].
> 
> fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
> not even having to add the artificial delay.

OK, so this is using generic/750, isn't it? How long did you have to run it
to trigger this? Because I've never seen this tripping...

								Honza

> 
> [336534.157119] ------------[ cut here ]------------
> [336534.158911] WARNING: CPU: 3 PID: 87221 at fs/jbd2/transaction.c:1552 jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
> [336534.160771] Modules linked in: loop sunrpc 9p kvm_intel nls_iso8859_1 nls_cp437 vfat fat crc32c_generic kvm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd 9pnet_virtio cryptd virtio_balloon virtio_console evdev joydev button nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 md_mod virtio_net net_failover virtio_blk failover psmouse serio_raw virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
> [336534.173218] CPU: 3 UID: 0 PID: 87221 Comm: kworker/u36:8 Not tainted 6.15.0-rc1 #2 PREEMPT(full)
> [336534.175146] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2025.02-5 03/28/2025
> [336534.176947] Workqueue: writeback wb_workfn (flush-7:5)
> [336534.178183] RIP: 0010:jbd2_journal_dirty_metadata+0x21c/0x230 [jbd2]
> [336534.179626] Code: 30 0f 84 5b fe ff ff 0f 0b 41 bc 8b ff ff ff e9 69 fe ff ff 48 8b 04 24 4c 8b 48 70 4d 39 cf 0f 84 53 ff ff ff e9 32 c3 00 00 <0f> 0b 41 bc e4 ff ff ff e9 41 ff ff ff 0f 0b 90 0f 1f 40 00 90 90
> [336534.183983] RSP: 0018:ffff9f168d38f548 EFLAGS: 00010246
> [336534.185194] RAX: 0000000000000001 RBX: ffff8c0ae8244e10 RCX: 00000000000000fd
> [336534.186810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [336534.188399] RBP: ffff8c09db0d8618 R08: ffff8c09db0d8618 R09: 0000000000000000
> [336534.189977] R10: ffff8c0b2671a83c R11: 0000000000006989 R12: 0000000000000000
> [336534.191243] R13: ffff8c09cc3b33f0 R14: ffff8c0ae8244e18 R15: ffff8c0ad5e0ef00
> [336534.192469] FS:  0000000000000000(0000) GS:ffff8c0b95b8d000(0000) knlGS:0000000000000000
> [336534.193840] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [336534.194831] CR2: 00007f0ebab4f000 CR3: 000000011e616005 CR4: 0000000000772ef0
> [336534.196044] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [336534.197274] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [336534.198473] PKRU: 55555554
> [336534.198927] Call Trace:
> [336534.199350]  <TASK>
> [336534.199701]  __ext4_handle_dirty_metadata+0x5c/0x190 [ext4]
> [336534.200626]  ext4_ext_insert_extent+0x575/0x1440 [ext4]
> [336534.201465]  ? ext4_cache_extents+0x5a/0xd0 [ext4]
> [336534.202243]  ? ext4_find_extent+0x37c/0x3a0 [ext4]
> [336534.203024]  ext4_ext_map_blocks+0x50e/0x18d0 [ext4]
> [336534.203803]  ? mpage_map_and_submit_buffers+0x23f/0x270 [ext4]
> [336534.204723]  ext4_map_blocks+0x11a/0x4d0 [ext4]
> [336534.205442]  ? ext4_alloc_io_end_vec+0x1f/0x70 [ext4]
> [336534.206239]  ? kmem_cache_alloc_noprof+0x310/0x3d0
> [336534.206982]  ext4_do_writepages+0x762/0xd40 [ext4]
> [336534.207706]  ? __pfx_block_write_full_folio+0x10/0x10
> [336534.208451]  ? ext4_writepages+0xc6/0x1a0 [ext4]
> [336534.209161]  ext4_writepages+0xc6/0x1a0 [ext4]
> [336534.209834]  do_writepages+0xdd/0x250
> [336534.210378]  ? filemap_get_read_batch+0x170/0x310
> [336534.211069]  __writeback_single_inode+0x41/0x330
> [336534.211738]  writeback_sb_inodes+0x21b/0x4d0
> [336534.212375]  __writeback_inodes_wb+0x4c/0xe0
> [336534.212998]  wb_writeback+0x19c/0x320
> [336534.213546]  wb_workfn+0x30e/0x440
> [336534.214039]  process_one_work+0x188/0x340
> [336534.214650]  worker_thread+0x246/0x390
> [336534.215196]  ? _raw_spin_lock_irqsave+0x23/0x50
> [336534.215879]  ? __pfx_worker_thread+0x10/0x10
> [336534.216522]  kthread+0x104/0x250
> [336534.217004]  ? __pfx_kthread+0x10/0x10
> [336534.217554]  ? _raw_spin_unlock+0x15/0x30
> [336534.218140]  ? finish_task_switch.isra.0+0x94/0x290
> [336534.218979]  ? __pfx_kthread+0x10/0x10
> [336534.220347]  ret_from_fork+0x2d/0x50
> [336534.221086]  ? __pfx_kthread+0x10/0x10
> [336534.221703]  ret_from_fork_asm+0x1a/0x30
> [336534.222415]  </TASK>
> [336534.222775] ---[ end trace 0000000000000000 ]---
Luis Chamberlain April 15, 2025, 3:47 p.m. UTC | #10
On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > >  			}
> > > > >  			bh = bh->b_this_page;
> > > > >  		} while (bh != head);
> > > > > +		spin_unlock(&mapping->i_private_lock);
> > > > 
> > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > You can reduce the spinlock critical section only after providing
> > > > alternative way to protect them from migration. So this should probably
> > > > happen at the end of the series.
> > > 
> > > So you're OK with this spin lock move with the other series in place?
> > > 
> > > And so we punt the hard-to-reproduce corruption issue as future work
> > > to do? Becuase the other alternative for now is to just disable
> > > migration for jbd2:
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > >  	.bmap			= ext4_bmap,
> > >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> > >  	.release_folio		= ext4_release_folio,
> > > -	.migrate_folio		= buffer_migrate_folio_norefs,
> > >  	.is_partially_uptodate  = block_is_partially_uptodate,
> > >  	.error_remove_folio	= generic_error_remove_folio,
> > >  	.swap_activate		= ext4_iomap_swap_activate,
> > 
> > BTW I ask because.. are your expectations that the next v3 series also
> > be a target for Linus tree as part of a fix for this spinlock
> > replacement?
> 
> Since this is fixing potential filesystem corruption I will upstream
> whatever we need to do to fix this. Ideally we have a minimal fix to
> upstream now and a comprehensive fix and cleanup for v6.16.

Despite our efforts we don't yet have an agreement on how to fix the
ext4 corruption, because Jan noted the buffer_meta() check in this patch
is too broad and would affect other filesystems (I have yet to
understand how, but will review).

And so while we have agreement we can remove the spin lock to fix the
sleeping while atomic incurred by large folios for buffer heads by this
patch series, the removal of the spin lock would happen at the end of
this series.

And so the ext4 corruption is an existing issue as-is today; it's
separate from the spin lock removal goal of fixing the sleeping while
atomic problem.

However this series might be quite big for an rc2 or rc3 fix for that
spin lock removal issue. It should bring substantial performance
benefits though, so it might be worth considering. We can re-run tests
with the series adjusted so the spin lock removal happens in the last
patch.

The alternative is to revert the spin lock addition commit for Linus'
tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
between __find_get_block() and migration"), noting that it in fact does
not fix the ext4 corruption as we've observed, and in fact causes a
sleeping while atomic issue with support for large folios for buffer
heads. If we do that then we punt this series to the next development
window, and it would simply not have the spin lock removal in its last
patch.

Jan Kara, Christian, thoughts?

  Luis
Luis Chamberlain April 15, 2025, 4:18 p.m. UTC | #11
On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> > ** Reproduced on vanilla Linux with udelay(2000) **
> > 
> > Call trace (ENOSPC journal failure):
> >   do_writepages()
> >     → ext4_do_writepages()
> >       → ext4_map_blocks()
> >         → ext4_ext_map_blocks()
> >           → ext4_ext_insert_extent()
> >             → __ext4_handle_dirty_metadata()
> >               → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
> 
> Curious. Did you try running e2fsck after the filesystem complained like
> this? This complains about journal handle not having enough credits for
> needed metadata update. Either we've lost some update to the journal_head
> structure (b_modified got accidentally cleared) or some update to extent
> tree.

Just tried it after pkill fsstress and stopping the test:

root@e1-ext4-2k /var/lib/xfstests # umount /dev/loop5
root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5
fsck from util-linux 2.41
e2fsck 1.47.2 (1-Jan-2025)
/dev/loop5 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 26 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 129 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 592 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 1095 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 1416 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 1653 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 1929 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 1965 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 2538 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 2765 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 2831 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 2838 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 3595 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 4659 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 5268 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 6400 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 6830 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 8490 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 8555 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 8658 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 8754 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 8996 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9168 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9430 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9468 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9543 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9632 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 9746 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10043 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10280 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10623 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10651 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10691 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 10708 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 11389 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 11564 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 11578 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 11842 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 11900 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 12721 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 12831 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 13531 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyyyInode 16580 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 16740 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 17123 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 17192 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 17412 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 17519 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyInode 18730 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 18974 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 19118 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyInode 19806 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 19942 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 20303 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 20323 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 21047 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 21919 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 22180 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 22856 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 23462 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 23587 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 23775 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 23916 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 24046 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 24161 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 25576 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 25666 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 25992 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 26404 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 26795 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 26862 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 26868 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 27627 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 27959 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 28014 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 29120 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 29308 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyyyInode 30673 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 31127 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 31332 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 31547 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyInode 32727 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 32888 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 33062 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyyInode 34421 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 34508 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyyyInode 35996 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 36258 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yyInode 36867 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 37166 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 37171 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 37548 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 37732 extent tree (at level 1) could be shorter.  Optimize<y>? yes
yInode 38028 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inode 38099 extent tree (at level 1) could be shorter.  Optimize<y>? yes
....

So I tried:

root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
e2fsck 1.47.2 (1-Jan-2025)
root@e1-ext4-2k /var/lib/xfstests # wc -l log
16411 log

root@e1-ext4-2k /var/lib/xfstests # tail log

Free blocks count wrong for group #609 (62, counted=63).
Fix? yes

Free blocks count wrong (12289, counted=12293).
Fix? yes


/dev/loop5: ***** FILE SYSTEM WAS MODIFIED *****
/dev/loop5: 1310719/1310720 files (10.7% non-contiguous),
10473467/10485760 blocks

  Luis
Jan Kara April 15, 2025, 4:23 p.m. UTC | #12
On Tue 15-04-25 08:47:51, Luis Chamberlain wrote:
> On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> > On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > >  			}
> > > > > >  			bh = bh->b_this_page;
> > > > > >  		} while (bh != head);
> > > > > > +		spin_unlock(&mapping->i_private_lock);
> > > > > 
> > > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > > You can reduce the spinlock critical section only after providing
> > > > > alternative way to protect them from migration. So this should probably
> > > > > happen at the end of the series.
> > > > 
> > > > So you're OK with this spin lock move with the other series in place?
> > > > 
> > > > And so we punt the hard-to-reproduce corruption issue as future work
> > > > to do? Because the other alternative for now is to just disable
> > > > migration for jbd2:
> > > > 
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > >  	.bmap			= ext4_bmap,
> > > >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> > > >  	.release_folio		= ext4_release_folio,
> > > > -	.migrate_folio		= buffer_migrate_folio_norefs,
> > > >  	.is_partially_uptodate  = block_is_partially_uptodate,
> > > >  	.error_remove_folio	= generic_error_remove_folio,
> > > >  	.swap_activate		= ext4_iomap_swap_activate,
> > > 
> > > BTW I ask because.. are your expectations that the next v3 series also
> > > be a target for Linus tree as part of a fix for this spinlock
> > > replacement?
> > 
> > Since this is fixing potential filesystem corruption I will upstream
> > whatever we need to do to fix this. Ideally we have a minimal fix to
> > upstream now and a comprehensive fix and cleanup for v6.16.
> 
> Despite our efforts we don't yet have agreement on how to fix the
> ext4 corruption, because Jan noted the buffer_meta() check in this patch
> is too broad and would affect other filesystems (I have yet to
> understand how, but will review).
> 
> And so while we agree the spin lock can be removed to fix the sleeping
> while atomic issue incurred by large folios for buffer heads, that
> removal would happen at the end of this series.
> 
> And so the ext4 corruption is an existing issue as-is today; it's
> separate from the spin lock removal goal of fixing the sleeping while
> atomic issue.

I agree. Ext4 corruption problems are separate from sleeping in atomic
issues.

> However this series might be quite big for an rc2 or rc3 fix for that
> spin lock removal issue. It should bring substantial performance
> benefits though, so it might be worth considering. We can re-run tests
> with the series adjusted so the spin lock removal happens in the last
> patch.
> 
> The alternative is to revert the spin lock addition commit for Linus'
> tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
> between __find_get_block() and migration"), noting that it in fact does
> not fix the ext4 corruption as we've observed, and in fact causes a
> sleeping while atomic issue with support for large folios for buffer
> heads. If we do that then we punt this series to the next development
> window, and it would simply not have the spin lock removal in its last
> patch.

Well, the commit ebdf4de5642fb6 is 6 years old. At that time there were no
large folios (in fact there were no folios at all ;)) in the page cache and
it does work quite well (I didn't see a corruption report from real users
since then). So I don't like removing that commit because it makes a
"reproducible with a heavy stress test" problem become a "reproduced by
real world workloads" problem.

If you're looking for a fast way to fix the sleep-in-atomic issues, then
I'd suggest just disabling large folios for the block device page cache.
That is the new functionality that actually triggered all these
investigations and sleep-in-atomic reports. And once this patch set gets
merged, we can re-enable large folios in the block device page cache.
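For concreteness, a sketch of what that could look like. This is an
untested illustration, and the assumption that the large folio opt-in
for the block device mapping happens via mapping_set_large_folios() in
bdev_alloc() in block/bdev.c may not match the current tree:

```diff
diff --git a/block/bdev.c b/block/bdev.c
--- a/block/bdev.c
+++ b/block/bdev.c
@@ static struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
-	mapping_set_large_folios(inode->i_mapping);
```

Re-enabling would then just be a matter of restoring that one line once
this series lands.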

								Honza
Jan Kara April 15, 2025, 4:28 p.m. UTC | #13
On Tue 15-04-25 09:18:10, Luis Chamberlain wrote:
> On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> > > ** Reproduced on vanilla Linux with udelay(2000) **
> > > 
> > > Call trace (ENOSPC journal failure):
> > >   do_writepages()
> > >     → ext4_do_writepages()
> > >       → ext4_map_blocks()
> > >         → ext4_ext_map_blocks()
> > >           → ext4_ext_insert_extent()
> > >             → __ext4_handle_dirty_metadata()
> > >               → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
> > 
> > Curious. Did you try running e2fsck after the filesystem complained like
> > this? This complains about journal handle not having enough credits for
> > needed metadata update. Either we've lost some update to the journal_head
> > structure (b_modified got accidentally cleared) or some update to extent
> > tree.
> 
> Just tried it after pkill fsstress and stopping the test:
> 
> root@e1-ext4-2k /var/lib/xfstests # umount /dev/loop5
> root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5
> fsck from util-linux 2.41
> e2fsck 1.47.2 (1-Jan-2025)
> /dev/loop5 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Inode 26 extent tree (at level 1) could be shorter.  Optimize<y>? yes
> Inode 129 extent tree (at level 1) could be shorter.  Optimize<y>? yes
> Inode 592 extent tree (at level 1) could be shorter.  Optimize<y>? yes
> ....

These are harmless. They are not errors, just opportunities for
optimization of the extent tree e2fsck can make.
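As a side note, one quick way to triage a big e2fsck log like this is to
count the harmless extent tree optimizations separately from messages
that point at real inconsistencies. A sketch; the grep patterns are
assumptions based on the output above, and stand-in log lines are
generated here so the snippet is self-contained (with a real run you
would point "log" at the captured file instead):

```shell
# Triage an e2fsck -y log: count harmless extent-tree optimizations
# separately from fixes that indicate real inconsistencies.
log=log
# Stand-in sample lines mimicking the captured log:
cat > "$log" <<'EOF'
Inode 26 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Free blocks count wrong for group #609 (62, counted=63).
Fix? yes
EOF
harmless=$(grep -c 'could be shorter' "$log")
suspect=$(grep -c 'count wrong' "$log")
echo "harmless=$harmless suspect=$suspect"
```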

> So I tried:
> 
> root@e1-ext4-2k /var/lib/xfstests # fsck /dev/loop5 -y 2>&1 > log
> e2fsck 1.47.2 (1-Jan-2025)
> root@e1-ext4-2k /var/lib/xfstests # wc -l log
> 16411 log

Can you share the log please?

> root@e1-ext4-2k /var/lib/xfstests # tail log
> 
> Free blocks count wrong for group #609 (62, counted=63).
> Fix? yes
> 
> Free blocks count wrong (12289, counted=12293).
> Fix? yes

These could potentially indicate some interesting issues but it depends on
what the above e2fsck messages are...

								Honza
Davidlohr Bueso April 15, 2025, 6:14 p.m. UTC | #14
On Tue, 15 Apr 2025, Jan Kara wrote:

>On Mon 14-04-25 18:36:41, Davidlohr Bueso wrote:
>> On Wed, 09 Apr 2025, Luis Chamberlain wrote:
>>
>> > corruption can still happen even with the spin lock held. A test was
>> > done using vanilla Linux and adding a udelay(2000) right before we
>> > spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
>> > we can reproduce the same exact filesystem corruption issues as observed
>> > without the spinlock with generic/750 [1].
>>
>> fyi I was actually able to trigger this on a vanilla 6.15-rc1 kernel,
>> not even having to add the artificial delay.
>
>OK, so this is using generic/750, isn't it? How long did you have to run it
>to trigger this? Because I've never seen this tripping...

Correct, generic/750. It triggered after 90+ hrs of running.
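For anyone trying to reproduce this, the soak amounts to re-running the
test until it fails. A sketch; CHECK defaults to a stand-in here so the
snippet is self-contained, and the real fstests invocation (something
like "./check generic/750" from a configured xfstests tree) is an
assumption about your setup:

```shell
# Re-run a test command until it fails, tracking elapsed time.
# CHECK defaults to a stand-in that fails immediately; for a real soak
# set e.g. CHECK="./check generic/750" inside a configured xfstests tree.
CHECK="${CHECK:-false}"
start=$(date +%s)
passes=0
while $CHECK; do
    passes=$((passes + 1))
    echo "pass $passes, elapsed $(( $(date +%s) - start ))s"
done
echo "failed after $passes passes"
```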
Luis Chamberlain April 15, 2025, 9:06 p.m. UTC | #15
On Tue, Apr 15, 2025 at 06:23:54PM +0200, Jan Kara wrote:
> On Tue 15-04-25 08:47:51, Luis Chamberlain wrote:
> > On Tue, Apr 15, 2025 at 11:05:38AM +0200, Christian Brauner wrote:
> > > On Mon, Apr 14, 2025 at 03:19:33PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Apr 14, 2025 at 02:09:46PM -0700, Luis Chamberlain wrote:
> > > > > On Thu, Apr 10, 2025 at 02:05:38PM +0200, Jan Kara wrote:
> > > > > > > @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
> > > > > > >  			}
> > > > > > >  			bh = bh->b_this_page;
> > > > > > >  		} while (bh != head);
> > > > > > > +		spin_unlock(&mapping->i_private_lock);
> > > > > > 
> > > > > > No, you've just broken all simple filesystems (like ext2) with this patch.
> > > > > > You can reduce the spinlock critical section only after providing
> > > > > > alternative way to protect them from migration. So this should probably
> > > > > > happen at the end of the series.
> > > > > 
> > > > > So you're OK with this spin lock move with the other series in place?
> > > > > 
> > > > > And so we punt the hard-to-reproduce corruption issue as future work
> > > > > to do? Because the other alternative for now is to just disable
> > > > > migration for jbd2:
> > > > > 
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 1dc09ed5d403..ef1c3ef68877 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -3631,7 +3631,6 @@ static const struct address_space_operations ext4_journalled_aops = {
> > > > >  	.bmap			= ext4_bmap,
> > > > >  	.invalidate_folio	= ext4_journalled_invalidate_folio,
> > > > >  	.release_folio		= ext4_release_folio,
> > > > > -	.migrate_folio		= buffer_migrate_folio_norefs,
> > > > >  	.is_partially_uptodate  = block_is_partially_uptodate,
> > > > >  	.error_remove_folio	= generic_error_remove_folio,
> > > > >  	.swap_activate		= ext4_iomap_swap_activate,
> > > > 
> > > > BTW I ask because.. are your expectations that the next v3 series also
> > > > be a target for Linus tree as part of a fix for this spinlock
> > > > replacement?
> > > 
> > > Since this is fixing potential filesystem corruption I will upstream
> > > whatever we need to do to fix this. Ideally we have a minimal fix to
> > > upstream now and a comprehensive fix and cleanup for v6.16.
> > 
> > Despite our efforts we don't yet have agreement on how to fix the
> > ext4 corruption, because Jan noted the buffer_meta() check in this patch
> > is too broad and would affect other filesystems (I have yet to
> > understand how, but will review).
> > 
> > And so while we agree the spin lock can be removed to fix the sleeping
> > while atomic issue incurred by large folios for buffer heads, that
> > removal would happen at the end of this series.
> > 
> > And so the ext4 corruption is an existing issue as-is today; it's
> > separate from the spin lock removal goal of fixing the sleeping while
> > atomic issue.
> 
> I agree. Ext4 corruption problems are separate from sleeping in atomic
> issues.

Glad that's clear.

> > However this series might be quite big for an rc2 or rc3 fix for that
> > spin lock removal issue. It should bring substantial performance
> > benefits though, so it might be worth considering. We can re-run tests
> > with the series adjusted so the spin lock removal happens in the last
> > patch.
> > 
> > The alternative is to revert the spin lock addition commit for Linus'
> > tree, ie commit ebdf4de5642fb6 ("mm: migrate: fix reference check race
> > between __find_get_block() and migration"), noting that it in fact does
> > not fix the ext4 corruption as we've observed, and in fact causes a
> > sleeping while atomic issue with support for large folios for buffer
> > heads. If we do that then we punt this series to the next development
> > window, and it would simply not have the spin lock removal in its last
> > patch.
> 
> Well, the commit ebdf4de5642fb6 is 6 years old. At that time there were no
> large folios (in fact there were no folios at all ;)) in the page cache and
> it does work quite well (I didn't see a corruption report from real users
> since then).

It is still a workaround.

> So I don't like removing that commit because it makes a
> "reproducible with a heavy stress test" problem become a "reproduced by
> real world workloads" problem.

So how about just patches 2 and 8 from this series for Linus' tree, with
the spin lock removal happening in the last patch?

  Luis
Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..32fa72ba10b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -841,6 +841,9 @@  static int __buffer_migrate_folio(struct address_space *mapping,
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;
 
+	if (buffer_meta(head))
+		return -EAGAIN;
+
 	if (!buffer_migrate_lock_buffers(head, mode))
 		return -EAGAIN;
 
@@ -859,12 +862,12 @@  static int __buffer_migrate_folio(struct address_space *mapping,
 			}
 			bh = bh->b_this_page;
 		} while (bh != head);
+		spin_unlock(&mapping->i_private_lock);
 		if (busy) {
 			if (invalidated) {
 				rc = -EAGAIN;
 				goto unlock_buffers;
 			}
-			spin_unlock(&mapping->i_private_lock);
 			invalidate_bh_lrus();
 			invalidated = true;
 			goto recheck_buffers;
@@ -880,10 +883,7 @@  static int __buffer_migrate_folio(struct address_space *mapping,
 		folio_set_bh(bh, dst, bh_offset(bh));
 		bh = bh->b_this_page;
 	} while (bh != head);
-
 unlock_buffers:
-	if (check_refs)
-		spin_unlock(&mapping->i_private_lock);
 	bh = head;
 	do {
 		unlock_buffer(bh);