[v2,0/8] mm: enhance migration work around on noref buffer-heads

Message ID 20250410014945.2140781-1-mcgrof@kernel.org (mailing list archive)

Luis Chamberlain April 10, 2025, 1:49 a.m. UTC
We have an eyesore of a spin lock held during page migration which
was added for an ext4 jbd corruption fix for which we have no clear
public corruption data. We want to remove the spin lock in mm/migrate
to help buffer-head filesystems embrace large folios, since
folio_mc_copy() can cond_resched() on large folios. We've managed
to reproduce a corruption by just removing the spinlock and stressing
ext4 with generic/750; the corruption often shows up after 3 hours.
However, while developing an alternative fix based on feedback [0], we've
come to the conclusion that ext4 on vanilla Linux is still affected. We
still have a lingering jbd2 corruption issue.

The underlying race is in jbd2's use of buffer_migrate_folio_norefs() without
proper synchronization, making it unsafe during folio migration.
ext4 uses jbd2 as its journaling backend. The corruption surfaces in ext4's
metadata operations, like ext4_ext_insert_extent(), when journal metadata fails
to be marked dirty due to the migration race. This leads to ENOSPC, journal
aborts, read-only fallback, and long-term filesystem corruption seen in replay
logs and "error count since last fsck".

This series simply skips folio migration on jbd2 metadata buffers to avoid
races during journal writes that can lead to filesystem corruption. It also
paves the way for jbd2 to eventually overcome this limitation and enable
folio migration, and implements some of the suggested enhancements to
__find_get_block_slow(). The suggested trylock idea is implemented, thereby
potentially reducing i_private_lock contention and leveraging folio_trylock()
when allowed.
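
For readers who want the shape of the trylock idea without reading the
patches, here is a minimal user-space sketch. It is not the code in this
series: struct model_folio, model_find_block() and the two pthread mutexes
are hypothetical stand-ins for a folio, the buffer lookup, the folio lock
and i_private_lock. It only assumes that a caller which may sleep first
tries the folio lock and otherwise falls back to the private lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_BLOCKS 4

/* Hypothetical stand-in for a folio with attached buffers. */
struct model_folio {
	pthread_mutex_t folio_lock;   /* models the folio lock */
	pthread_mutex_t private_lock; /* models i_private_lock */
	int blocks[MODEL_NR_BLOCKS];
};

/* Records which lock protected the lookup, for illustration only. */
enum model_path { MODEL_PATH_FOLIO_LOCK, MODEL_PATH_PRIVATE_LOCK };

/*
 * Look up one block. Callers that may sleep try the folio lock first;
 * if it is contended, or the caller is atomic, use the private lock.
 */
static int model_find_block(struct model_folio *f, size_t idx,
			    bool may_sleep, enum model_path *path)
{
	pthread_mutex_t *held;
	int val;

	assert(idx < MODEL_NR_BLOCKS);

	if (may_sleep && pthread_mutex_trylock(&f->folio_lock) == 0) {
		held = &f->folio_lock;
		*path = MODEL_PATH_FOLIO_LOCK;
	} else {
		pthread_mutex_lock(&f->private_lock);
		held = &f->private_lock;
		*path = MODEL_PATH_PRIVATE_LOCK;
	}

	val = f->blocks[idx];
	pthread_mutex_unlock(held);
	return val;
}
```

In this model, the private lock is only taken when the folio lock is
unavailable or the caller cannot sleep, which is where the reduced
contention would come from.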

The first patch is intended to go through Linus' tree, if agreeable, and then
the rest can be evaluated for fs-next. Although I did not intend to upstream
the debugfs interface, at this point I'm convinced the statistics are extremely
useful while enhancing this path, and should also prove useful in enhancing and
eventually enabling folio migration on jbd2 metadata buffers.

If you want this in tree form, see 20250409-bh-meta-migrate-optimal [1].

[0] https://lore.kernel.org/all/20250330064732.3781046-3-mcgrof@kernel.org/T/#mf2fb79c9ab0d20fab65c65142b7f53680e68d8fa
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20250409-bh-meta-migrate-optimal

Changes on v2:

 - replace heuristic with buffer_meta() check as we're convinced the
   corruption issue still exists and jbd2 metadata buffers still need work
   to enable folio migration
 - implements community feedback and performance suggestions on code
   paving the way to eventually enable jbd2 metadata buffers to leverage
   folio migration
 - adds debugfs interface
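
As a rough illustration of the buffer_meta() based skip, the user-space
sketch below walks a circular ring of buffers and refuses migration if any
of them is marked as metadata. It is a model under stated assumptions, not
the patch: struct model_bh, model_folio_has_meta() and
model_migrate_folio() are hypothetical names, and returning -EAGAIN for
the skip is an assumption made for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head on a folio. */
struct model_bh {
	bool meta;                  /* models the buffer_meta() flag */
	struct model_bh *this_page; /* next buffer in the circular ring */
};

/* Return true if any buffer in the ring is flagged as metadata. */
static bool model_folio_has_meta(const struct model_bh *head)
{
	const struct model_bh *bh = head;

	if (!head)
		return false;

	do {
		if (bh->meta)
			return true;
		bh = bh->this_page;
	} while (bh != head);

	return false;
}

/*
 * Model of the migration entry point: skip folios that carry metadata
 * buffers. The -EAGAIN return value is an assumption for this sketch.
 */
static int model_migrate_folio(const struct model_bh *head)
{
	if (model_folio_has_meta(head))
		return -EAGAIN;

	return 0; /* the actual copy is out of scope for this sketch */
}
```
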

Davidlohr Bueso (6):
  fs/buffer: try to use folio lock for pagecache lookups
  fs/buffer: introduce __find_get_block_nonatomic()
  fs/ocfs2: use sleeping version of __find_get_block()
  fs/jbd2: use sleeping version of __find_get_block()
  fs/ext4: use sleeping version of __find_get_block()
  mm/migrate: enable noref migration for jbd2

Luis Chamberlain (2):
  migrate: fix skipping metadata buffer heads on migration
  mm: add migration buffer-head debugfs interface

 fs/buffer.c                 |  76 ++++++++++----
 fs/ext4/ialloc.c            |   3 +-
 fs/ext4/inode.c             |   2 +
 fs/ext4/mballoc.c           |   3 +-
 fs/jbd2/revoke.c            |  15 +--
 fs/ocfs2/journal.c          |   2 +-
 include/linux/buffer_head.h |   9 ++
 mm/migrate.c                | 192 ++++++++++++++++++++++++++++++++++--
 8 files changed, 266 insertions(+), 36 deletions(-)