diff mbox series

[1/3] mm/migrate: add might_sleep() on __migrate_folio()

Message ID 20250330064732.3781046-2-mcgrof@kernel.org (mailing list archive)
State New
Series mm: move migration work around to buffer-heads

Commit Message

Luis Chamberlain March 30, 2025, 6:47 a.m. UTC
When we do page migration of large folios, folio_mc_copy() can
cond_resched() *iff* we are on a large folio. There's a hairy
bug reported by both 0-day [0] and syzbot [1] where it has been
detected that we can call folio_mc_copy() in atomic context.
Technically speaking, today that should only be possible from
buffer-head filesystems using buffer_migrate_folio_norefs() on
page migration, and the only buffer-head filesystem using large
folios is the block device cache, so only with block devices with
large block sizes. However, tracing shows that folio_mc_copy()
*isn't* being called as often as we'd expect from the
buffer_migrate_folio_norefs() path, as we're likely bailing early
now thanks to the check added by commit 060913999d7a ("mm: migrate:
support poisoned recover from migrate folio").

*Most* folio_mc_copy() calls in turn end up *not* being in atomic
context, and so we won't hit a splat when using:

CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

But we *want* to proactively find callers of __migrate_folio() in
atomic context, so add an explicit might_sleep() to help us root out
large folio atomic callers of migrate_folio().

Link: https://lkml.kernel.org/r/202503101536.27099c77-lkp@intel.com # [0]
Link: https://lkml.kernel.org/r/67e57c41.050a0220.2f068f.0033.GAE@google.com # [1]
Link: https://lkml.kernel.org/r/Z-c6BqCSmAnNxb57@bombadil.infradead.org # [2]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/migrate.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Matthew Wilcox March 30, 2025, 12:04 p.m. UTC | #1
On Sat, Mar 29, 2025 at 11:47:30PM -0700, Luis Chamberlain wrote:
> However tracing shows that folio_mc_copy() *isn't* being called
> as often as we'd expect from buffer_migrate_folio_norefs() path
> as we're likely bailing early now thanks to the check added by commit
> 060913999d7a ("mm: migrate: support poisoned recover from migrate
> folio").

Umm.  You're saying that most folios we try to migrate have extra refs?
That seems unexpected; does it indicate a bug in 060913999d7a?

> +++ b/mm/migrate.c
> @@ -751,6 +751,8 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
>  {
>  	int rc, expected_count = folio_expected_refs(mapping, src);
>  
> +	might_sleep();

We deliberately don't sleep when the folio is only a single page.
So this needs to be:

	might_sleep_if(folio_test_large(folio));

>  	/* Check whether src does not have extra refs before we do more work */
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
> -- 
> 2.47.2
>
Luis Chamberlain March 31, 2025, 6:28 a.m. UTC | #2
On Sun, Mar 30, 2025 at 01:04:02PM +0100, Matthew Wilcox wrote:
> On Sat, Mar 29, 2025 at 11:47:30PM -0700, Luis Chamberlain wrote:
> > However tracing shows that folio_mc_copy() *isn't* being called
> > as often as we'd expect from buffer_migrate_folio_norefs() path
> > as we're likely bailing early now thanks to the check added by commit
> > 060913999d7a ("mm: migrate: support poisoned recover from migrate
> > folio").
> 
> Umm.  You're saying that most folios we try to migrate have extra refs?
> That seems unexpected; does it indicate a bug in 060913999d7a?

I've debugged this further; the migration does succeed and I don't see
any failures due to the new refcheck added by 060913999d7a. I've added
stats in an out-of-tree patch [0] in case folks find this useful; I
could submit it. But the point is that even if you use dd against a
large block device you won't always end up trying to migrate large
folios *right away*, even if you trigger folio migration through
compaction, especially if you use a large bs on dd like bs=1M. Using a
size closer to the logical block size will trigger large folio
migration much faster.
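The I/O pattern described can be sketched as below; the path is a hypothetical stand-in (a plain file), since the real test needs a block device created with a large logical block size, and the compaction trigger needs root on real hardware:

```shell
# Hypothetical stand-in for a large-logical-block-size device node;
# on a real setup this would be an actual block device.
IMG=/tmp/fake-large-bs-bdev

# A bs close to the logical block size issues many small writes,
# populating the block device cache with folios far sooner than a
# single bs=1M stream would.
dd if=/dev/zero of="$IMG" bs=4096 count=256 status=none

# On a real machine, compaction would then attempt the folio
# migrations being traced here:
#   echo 1 > /proc/sys/vm/compact_memory
stat -c %s "$IMG"
```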

Example of the stats:

# cat /sys/kernel/debug/mm/migrate/bh/stats

[buffer_migrate_folio]
                    calls       9874
                  success       9854
                    fails       20

[buffer_migrate_folio_norefs]
                    calls       3694
                  success       1651
                    fails       2043
          no-head-success       532
            no-head-fails       0
                  invalid       2040
                    valid       1119
            valid-success       1119
              valid-fails       0

Success ratios:
buffer_migrate_folio: 99% success (9854/9874)
buffer_migrate_folio_norefs: 44% success (1651/3694)

> > +++ b/mm/migrate.c
> > @@ -751,6 +751,8 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
> >  {
> >  	int rc, expected_count = folio_expected_refs(mapping, src);
> >  
> > +	might_sleep();
> 
> We deliberately don't sleep when the folio is only a single page.
> So this needs to be:
> 
> 	might_sleep_if(folio_test_large(folio));

That does reduce the scope of our test coverage, but sure.

[0] https://lore.kernel.org/all/20250331061306.4073352-1-mcgrof@kernel.org/

  Luis

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..712ddd11f3f0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -751,6 +751,8 @@  static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 {
 	int rc, expected_count = folio_expected_refs(mapping, src);
 
+	might_sleep();
+
 	/* Check whether src does not have extra refs before we do more work */
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;