
[hotfix] mm: fix crashes from deferred split racing folio migration

Message ID 29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com (mailing list archive)
State New
Series [hotfix] mm: fix crashes from deferred split racing folio migration

Commit Message

Hugh Dickins July 2, 2024, 7:40 a.m. UTC
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.

6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.

Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).

Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---
This patch against 6.10-rc6: Kefeng has commits in the mm-tree which
will need adjustment to go over this, but we can both check the
result.  I have wondered whether just reverting 85ce2c517ade and its
subsequent fixups would be better: but that would be a bigger job,
and probably not the right choice.
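
To spell out why the freeze closes the race: folio_try_get() succeeds
only on a non-zero reference count, so while migration holds the count
frozen at 0, deferred_split_scan() cannot pin the folio, and emptying
folio->_deferred_list under split_queue_lock is safe. A sketch,
paraphrasing 6.10's include/linux/page_ref.h rather than quoting it:

	static inline bool folio_try_get(struct folio *folio)
	{
		/* take a reference only if the count is currently non-zero */
		return folio_ref_add_unless(folio, 1, 0);
	}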

 mm/memcontrol.c | 11 -----------
 mm/migrate.c    | 13 +++++++++++++
 2 files changed, 13 insertions(+), 11 deletions(-)

Comments

Baolin Wang July 2, 2024, 9:25 a.m. UTC | #1
On 2024/7/2 15:40, Hugh Dickins wrote:
> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> flags when freeing, yet the flags shown are not bad: PG_locked had been
> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> symptoms implying double free by deferred split and large folio migration.
> 
> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> folio migration") was right to fix the memcg-dependent locking broken in
> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> but missed a subtlety of deferred_split_scan(): it moves folios to its own
> local list to work on them without split_queue_lock, during which time
> folio->_deferred_list is not empty, but even the "right" lock does nothing
> to secure the folio and the list it is on.
> 
> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> while the old folio's reference count is temporarily frozen to 0 - adding
> such a freeze in the !mapping case too (originally, folio lock and
> unmapping and no swap cache left an anon folio unreachable, so no freezing
> was needed there: but the deferred split queue offers a way to reach it).

Thanks Hugh.

But after reading your analysis, I am concerned that the 
folio_undo_large_rmappable() and deferred_split_scan() may still 
encounter a race condition with the local list, even with your patch.

Suppose folio A has already been queued into the local list in 
deferred_split_scan() by thread A, but fails to 'folio_trylock' and then 
releases the reference count. At the same time, folio A can be frozen by 
another thread B in folio_migrate_mapping(). In such a case, 
folio_undo_large_rmappable() would remove folio A from the local list 
without *any* lock protection, creating a race condition with the local 
list iteration in deferred_split_scan().
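
Spelled out as a timeline (the thread labels are mine, purely for
illustration):

	/*
	 *  Thread A: deferred_split_scan()   Thread B: folio_migrate_mapping()
	 *
	 *  folio_try_get(folio A) succeeds
	 *  list_move() folio A to local list
	 *  ...drops split_queue_lock...
	 *  folio_trylock(folio A) fails
	 *  folio_put(folio A)
	 *                                     folio_ref_freeze(folio A, ...) to 0
	 *                                     folio_undo_large_rmappable(folio A)
	 *                                       -> list_del_init() on an entry of
	 *                                          thread A's *local* list, while
	 *                                          thread A may still be iterating
	 */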

Anyway, I think this patch can still fix some possible races. Feel free 
to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> 
> Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org
> ---
> This patch against 6.10-rc6: Kefeng has commits in the mm-tree which
> will need adjustment to go over this, but we can both check the
> result.  I have wondered whether just reverting 85ce2c517ade and its
> subsequent fixups would be better: but that would be a bigger job,
> and probably not the right choice.
> 
>   mm/memcontrol.c | 11 -----------
>   mm/migrate.c    | 13 +++++++++++++
>   2 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..8f2f1bb18c9c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>   
>   	/* Transfer the charge and the css ref */
>   	commit_charge(new, memcg);
> -	/*
> -	 * If the old folio is a large folio and is in the split queue, it needs
> -	 * to be removed from the split queue now, in case getting an incorrect
> -	 * split queue in destroy_large_folio() after the memcg of the old folio
> -	 * is cleared.
> -	 *
> -	 * In addition, the old folio is about to be freed after migration, so
> -	 * removing from the split queue a bit earlier seems reasonable.
> -	 */
> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
> -		folio_undo_large_rmappable(old);
>   	old->memcg_data = 0;
>   }
>   
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a8c6f466e33a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping,
>   		if (folio_ref_count(folio) != expected_count)
>   			return -EAGAIN;
>   
> +		/* Take off deferred split queue while frozen and memcg set */
> +		if (folio_test_large(folio) &&
> +		    folio_test_large_rmappable(folio)) {
> +			if (!folio_ref_freeze(folio, expected_count))
> +				return -EAGAIN;
> +			folio_undo_large_rmappable(folio);
> +			folio_ref_unfreeze(folio, expected_count);
> +		}
> +
>   		/* No turning back from here */
>   		newfolio->index = folio->index;
>   		newfolio->mapping = folio->mapping;
> @@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>   		return -EAGAIN;
>   	}
>   
> +	/* Take off deferred split queue while frozen and memcg set */
> +	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> +		folio_undo_large_rmappable(folio);
> +
>   	/*
>   	 * Now we know that no one else is looking at the folio:
>   	 * no turning back from here.
Hugh Dickins July 2, 2024, 4:15 p.m. UTC | #2
On Tue, 2 Jul 2024, Baolin Wang wrote:
> On 2024/7/2 15:40, Hugh Dickins wrote:
> > Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> > flags when freeing, yet the flags shown are not bad: PG_locked had been
> > set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> > deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> > symptoms implying double free by deferred split and large folio migration.
> > 
> > 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> > folio migration") was right to fix the memcg-dependent locking broken in
> > 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> > but missed a subtlety of deferred_split_scan(): it moves folios to its own
> > local list to work on them without split_queue_lock, during which time
> > folio->_deferred_list is not empty, but even the "right" lock does nothing
> > to secure the folio and the list it is on.
> > 
> > Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> > folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> > while the old folio's reference count is temporarily frozen to 0 - adding
> > such a freeze in the !mapping case too (originally, folio lock and

(I should have added "isolation and" into that list of conditions.)

> > unmapping and no swap cache left an anon folio unreachable, so no freezing
> > was needed there: but the deferred split queue offers a way to reach it).
> 
> Thanks Hugh.
> 
> But after reading your analysis, I am concerned that the
> folio_undo_large_rmappable() and deferred_split_scan() may still encounter a
> race condition with the local list, even with your patch.
> 
> Suppose folio A has already been queued into the local list in
> deferred_split_scan() by thread A, but fails to 'folio_trylock' and then
> releases the reference count. At the same time, folio A can be frozen by
> another thread B in folio_migrate_mapping(). In such a case,
> folio_undo_large_rmappable() would remove folio A from the local list without
> *any* lock protection, creating a race condition with the local list iteration
> in deferred_split_scan().

It's a good doubt to raise, but I think we're okay: because Kirill
designed the deferred split list (and its temporary local list) to
be safe in that way.

You're right that if thread B's freeze to 0 wins the race, thread B
will be messing with a list on thread A's stack while thread A is
quite possibly running; but thread A will not leave that stack frame
without taking again the split_queue_lock which thread B holds while
removing from the list.

We would certainly not want to introduce such a subtlety right now!
But never mind page migration, this is how it has always worked when
racing with the folio being freed at the same time - maybe
deferred_split_scan() wins the freeing race and is the one to remove
folio from deferred split list, or maybe the other freer does that.

I forget whether there was an initial flurry of races to be fixed when
it came in, but it has been stable against concurrent freeing for years.
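
A heavily condensed sketch of the shape I'm relying on (paraphrased
from 6.10's deferred_split_scan() in mm/huge_memory.c, not verbatim):

	struct folio *folio, *next;
	unsigned long flags;
	LIST_HEAD(list);

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	/* take a pin on each folio so it cannot be freed under us */
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
						_deferred_list) {
		if (folio_try_get(folio)) {
			list_move(&folio->_deferred_list, &list);
		} else {
			/* we lost the race with the freer: unqueue here */
			list_del_init(&folio->_deferred_list);
			ds_queue->split_queue_len--;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* work on the local list without split_queue_lock */
	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (folio_trylock(folio)) {
			split_folio(folio);	/* unqueues the folio on success */
			folio_unlock(folio);
		}
		folio_put(folio);
	}

	/* survivors rejoin the queue only under the same lock */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&list, &ds_queue->split_queue);
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

So whichever thread wins the race removes the folio from the local list
under split_queue_lock, and this function cannot return, freeing its
stack, before re-taking that same lock.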

Please think this over again: do not trust my honeyed words!

> 
> Anyway, I think this patch can still fix some possible races. Feel free to
> add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Thanks, but I certainly don't want this to go into the tree if it's
still flawed as you suggest.

Hugh
Baolin Wang July 3, 2024, 1:51 a.m. UTC | #3
On 2024/7/3 00:15, Hugh Dickins wrote:
> On Tue, 2 Jul 2024, Baolin Wang wrote:
>> On 2024/7/2 15:40, Hugh Dickins wrote:
>>> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
>>> flags when freeing, yet the flags shown are not bad: PG_locked had been
>>> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
>>> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
>>> symptoms implying double free by deferred split and large folio migration.
>>>
>>> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
>>> folio migration") was right to fix the memcg-dependent locking broken in
>>> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
>>> but missed a subtlety of deferred_split_scan(): it moves folios to its own
>>> local list to work on them without split_queue_lock, during which time
>>> folio->_deferred_list is not empty, but even the "right" lock does nothing
>>> to secure the folio and the list it is on.
>>>
>>> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
>>> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
>>> while the old folio's reference count is temporarily frozen to 0 - adding
>>> such a freeze in the !mapping case too (originally, folio lock and
> 
> (I should have added "isolation and" into that list of conditions.)
> 
>>> unmapping and no swap cache left an anon folio unreachable, so no freezing
>>> was needed there: but the deferred split queue offers a way to reach it).
>>
>> Thanks Hugh.
>>
>> But after reading your analysis, I am concerned that the
>> folio_undo_large_rmappable() and deferred_split_scan() may still encounter a
>> race condition with the local list, even with your patch.
>>
>> Suppose folio A has already been queued into the local list in
>> deferred_split_scan() by thread A, but fails to 'folio_trylock' and then
>> releases the reference count. At the same time, folio A can be frozen by
>> another thread B in folio_migrate_mapping(). In such a case,
>> folio_undo_large_rmappable() would remove folio A from the local list without
>> *any* lock protection, creating a race condition with the local list iteration
>> in deferred_split_scan().
> 
> It's a good doubt to raise, but I think we're okay: because Kirill
> designed the deferred split list (and its temporary local list) to
> be safe in that way.
> 
> You're right that if thread B's freeze to 0 wins the race, thread B
> will be messing with a list on thread A's stack while thread A is
> quite possibly running; but thread A will not leave that stack frame
> without taking again the split_queue_lock which thread B holds while
> removing from the list.
> 
> We would certainly not want to introduce such a subtlety right now!
> But never mind page migration, this is how it has always worked when
> racing with the folio being freed at the same time - maybe
> deferred_split_scan() wins the freeing race and is the one to remove
> folio from deferred split list, or maybe the other freer does that.

Yes, thanks for explanation. And after thinking more, the 
'list_for_each_entry_safe' in deferred_split_scan() can maintain the 
list iteration safety, so I think this is safe.

> I forget whether there was an initial flurry of races to be fixed when
> it came in, but it has been stable against concurrent freeing for years.
> 
> Please think this over again: do not trust my honeyed words!
> 
>>
>> Anyway, I think this patch can still fix some possible races. Feel free to
>> add:
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Thanks, but I certainly don't want this to go into the tree if it's
> still flawed as you suggest.

Now I have no doubt for this fix, and please continue to keep my 
Reviewed-by tag, thanks.
Andrew Morton July 3, 2024, 2:13 a.m. UTC | #4
On Tue, 2 Jul 2024 09:15:54 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> > 
> > Anyway, I think this patch can still fix some possible races. Feel free to
> > add:
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Thanks, but I certainly don't want this to go into the tree if it's
> still flawed as you suggest.

I queued it for testing but I shall not send it upstream until we have
sorted these things out.
Zi Yan July 3, 2024, 2:30 p.m. UTC | #5
On 2 Jul 2024, at 3:40, Hugh Dickins wrote:

> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> flags when freeing, yet the flags shown are not bad: PG_locked had been
> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> symptoms implying double free by deferred split and large folio migration.
>
> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> folio migration") was right to fix the memcg-dependent locking broken in
> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> but missed a subtlety of deferred_split_scan(): it moves folios to its own
> local list to work on them without split_queue_lock, during which time
> folio->_deferred_list is not empty, but even the "right" lock does nothing
> to secure the folio and the list it is on.
>
> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> while the old folio's reference count is temporarily frozen to 0 - adding
> such a freeze in the !mapping case too (originally, folio lock and
> unmapping and no swap cache left an anon folio unreachable, so no freezing
> was needed there: but the deferred split queue offers a way to reach it).
>
> Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org
> ---
> This patch against 6.10-rc6: Kefeng has commits in the mm-tree which
> will need adjustment to go over this, but we can both check the
> result.  I have wondered whether just reverting 85ce2c517ade and its
> subsequent fixups would be better: but that would be a bigger job,
> and probably not the right choice.
>
>  mm/memcontrol.c | 11 -----------
>  mm/migrate.c    | 13 +++++++++++++
>  2 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..8f2f1bb18c9c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>  	/* Transfer the charge and the css ref */
>  	commit_charge(new, memcg);
> -	/*
> -	 * If the old folio is a large folio and is in the split queue, it needs
> -	 * to be removed from the split queue now, in case getting an incorrect
> -	 * split queue in destroy_large_folio() after the memcg of the old folio
> -	 * is cleared.
> -	 *
> -	 * In addition, the old folio is about to be freed after migration, so
> -	 * removing from the split queue a bit earlier seems reasonable.
> -	 */
> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
> -		folio_undo_large_rmappable(old);
>  	old->memcg_data = 0;
>  }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a8c6f466e33a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		if (folio_ref_count(folio) != expected_count)
>  			return -EAGAIN;
>
> +		/* Take off deferred split queue while frozen and memcg set */
> +		if (folio_test_large(folio) &&
> +		    folio_test_large_rmappable(folio)) {
> +			if (!folio_ref_freeze(folio, expected_count))
> +				return -EAGAIN;
> +			folio_undo_large_rmappable(folio);
> +			folio_ref_unfreeze(folio, expected_count);
> +		}
> +

I wonder if the patch below would make the code look better by using
the same freeze/unfreeze pattern as the file-backed path. After
reading the emails between you and Baolin and checking the code,
I think the patch looks good to me. Feel free to add
Reviewed-by: Zi Yan <ziy@nvidia.com>

BTW, this subtlety is very error prone, as Matthew, Ryan, and I all
encountered errors because of this[1][2]. Matthew has a good summary
of the subtlety:

"the (undocumented) logic in deferred_split_scan() that a folio
with a positive refcount will not be removed from the list."

[1] https://lore.kernel.org/linux-mm/Ze9EFdFLXQEUVtKl@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/

diff --git a/mm/migrate.c b/mm/migrate.c
index a8c6f466e33a..afcc0653dcb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping,

        if (!mapping) {
                /* Anonymous page without mapping */
-               if (folio_ref_count(folio) != expected_count)
+               if (!folio_ref_freeze(folio, expected_count))
                        return -EAGAIN;

                /* Take off deferred split queue while frozen and memcg set */
                if (folio_test_large(folio) &&
-                   folio_test_large_rmappable(folio)) {
-                       if (!folio_ref_freeze(folio, expected_count))
-                               return -EAGAIN;
+                   folio_test_large_rmappable(folio))
                        folio_undo_large_rmappable(folio);
-                       folio_ref_unfreeze(folio, expected_count);
-               }
+
+               folio_ref_unfreeze(folio, expected_count);

                /* No turning back from here */
                newfolio->index = folio->index;


>  		/* No turning back from here */
>  		newfolio->index = folio->index;
>  		newfolio->mapping = folio->mapping;
> @@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		return -EAGAIN;
>  	}
>
> +	/* Take off deferred split queue while frozen and memcg set */
> +	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> +		folio_undo_large_rmappable(folio);
> +
>  	/*
>  	 * Now we know that no one else is looking at the folio:
>  	 * no turning back from here.
> -- 
> 2.35.3


--
Best Regards,
Yan, Zi
David Hildenbrand July 3, 2024, 4:21 p.m. UTC | #6
On 03.07.24 16:30, Zi Yan wrote:
> On 2 Jul 2024, at 3:40, Hugh Dickins wrote:
> 
>> [...]
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 20cb9f5f7446..a8c6f466e33a 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping,
>>   		if (folio_ref_count(folio) != expected_count)
>>   			return -EAGAIN;
>>
>> +		/* Take off deferred split queue while frozen and memcg set */
>> +		if (folio_test_large(folio) &&
>> +		    folio_test_large_rmappable(folio)) {
>> +			if (!folio_ref_freeze(folio, expected_count))
>> +				return -EAGAIN;
>> +			folio_undo_large_rmappable(folio);
>> +			folio_ref_unfreeze(folio, expected_count);
>> +		}
>> +
> 
> I wonder if the patch below would make the code look better by using
> the same freeze/unfreeze pattern as the file-backed path. After
> reading the emails between you and Baolin and checking the code,
> I think the patch looks good to me. Feel free to add
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> 
> BTW, this subtlety is very error prone, as Matthew, Ryan, and I all
> encountered errors because of this[1][2]. Matthew has a good summary
> of the subtlety:
> 
> "the (undocumented) logic in deferred_split_scan() that a folio
> with a positive refcount will not be removed from the list."
> 
> [1] https://lore.kernel.org/linux-mm/Ze9EFdFLXQEUVtKl@casper.infradead.org/
> [2] https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8c6f466e33a..afcc0653dcb7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping,
> 
>          if (!mapping) {
>                  /* Anonymous page without mapping */
> -               if (folio_ref_count(folio) != expected_count)
> +               if (!folio_ref_freeze(folio, expected_count))
>                          return -EAGAIN;
> 
>                  /* Take off deferred split queue while frozen and memcg set */
>                  if (folio_test_large(folio) &&
> -                   folio_test_large_rmappable(folio)) {
> -                       if (!folio_ref_freeze(folio, expected_count))
> -                               return -EAGAIN;
> +                   folio_test_large_rmappable(folio))
>                          folio_undo_large_rmappable(folio);
> -                       folio_ref_unfreeze(folio, expected_count);
> -               }
> +
> +               folio_ref_unfreeze(folio, expected_count);
> 

The downside is freezing order-0, where we don't need to freeze, right?
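
(For reference, the freeze is a cmpxchg on the refcount, where the
order-0 path today gets away with a plain read; a paraphrase of
include/linux/page_ref.h, not verbatim:)

	static inline bool folio_ref_freeze(struct folio *folio, int count)
	{
		/* atomically replace an expected count with 0; fail otherwise */
		return atomic_cmpxchg(&folio->page._refcount, count, 0) == count;
	}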
Zi Yan July 3, 2024, 4:22 p.m. UTC | #7
On 3 Jul 2024, at 12:21, David Hildenbrand wrote:

> On 03.07.24 16:30, Zi Yan wrote:
>> On 2 Jul 2024, at 3:40, Hugh Dickins wrote:
>>
>>> [...]
>>
>> [...]
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a8c6f466e33a..afcc0653dcb7 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping,
>>
>>          if (!mapping) {
>>                  /* Anonymous page without mapping */
>> -               if (folio_ref_count(folio) != expected_count)
>> +               if (!folio_ref_freeze(folio, expected_count))
>>                          return -EAGAIN;
>>
>>                  /* Take off deferred split queue while frozen and memcg set */
>>                  if (folio_test_large(folio) &&
>> -                   folio_test_large_rmappable(folio)) {
>> -                       if (!folio_ref_freeze(folio, expected_count))
>> -                               return -EAGAIN;
>> +                   folio_test_large_rmappable(folio))
>>                          folio_undo_large_rmappable(folio);
>> -                       folio_ref_unfreeze(folio, expected_count);
>> -               }
>> +
>> +               folio_ref_unfreeze(folio, expected_count);
>>
>
> The downside is freezing order-0, where we don't need to freeze, right?

Right. I missed that part. Forget about my change above. Thanks.


--
Best Regards,
Yan, Zi
Andrew Morton July 4, 2024, 2:35 a.m. UTC | #8
On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> flags when freeing, yet the flags shown are not bad: PG_locked had been
> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> symptoms implying double free by deferred split and large folio migration.
> 
> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> folio migration") was right to fix the memcg-dependent locking broken in
> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> but missed a subtlety of deferred_split_scan(): it moves folios to its own
> local list to work on them without split_queue_lock, during which time
> folio->_deferred_list is not empty, but even the "right" lock does nothing
> to secure the folio and the list it is on.
> 
> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> while the old folio's reference count is temporarily frozen to 0 - adding
> such a freeze in the !mapping case too (originally, folio lock and
> unmapping and no swap cache left an anon folio unreachable, so no freezing
> was needed there: but the deferred split queue offers a way to reach it).

There's a conflict when applying Kefeng's "mm: refactor
folio_undo_large_rmappable()"
(https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com)
on top of this hotfix.

--- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
+++ mm/memcontrol.c
@@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
 	 * In addition, the old folio is about to be freed after migration, so
 	 * removing from the split queue a bit earlier seems reasonable.
 	 */
-	if (folio_test_large(old) && folio_test_large_rmappable(old))
-		folio_undo_large_rmappable(old);
+	folio_undo_large_rmappable(old);
 	old->memcg_data = 0;
 }

I'm resolving this by simply dropping the above hunk.  So Kefeng's
patch is now as below.  Please check.

--- a/mm/huge_memory.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/huge_memory.c
@@ -3258,22 +3258,11 @@ out:
 	return ret;
 }
 
-void folio_undo_large_rmappable(struct folio *folio)
+void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
 
-	if (folio_order(folio) <= 1)
-		return;
-
-	/*
-	 * At this point, there is no one trying to add the folio to
-	 * deferred_list. If folio is not in deferred_list, it's safe
-	 * to check without acquiring the split_queue_lock.
-	 */
-	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
-
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
--- a/mm/internal.h~mm-refactor-folio_undo_large_rmappable
+++ a/mm/internal.h
@@ -622,7 +622,22 @@ static inline void folio_set_order(struc
 #endif
 }
 
-void folio_undo_large_rmappable(struct folio *folio);
+void __folio_undo_large_rmappable(struct folio *folio);
+static inline void folio_undo_large_rmappable(struct folio *folio)
+{
+	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
+		return;
+
+	/*
+	 * At this point, there is no one trying to add the folio to
+	 * deferred_list. If folio is not in deferred_list, it's safe
+	 * to check without acquiring the split_queue_lock.
+	 */
+	if (data_race(list_empty(&folio->_deferred_list)))
+		return;
+
+	__folio_undo_large_rmappable(folio);
+}
 
 static inline struct folio *page_rmappable_folio(struct page *page)
 {
--- a/mm/page_alloc.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/page_alloc.c
@@ -2661,8 +2661,7 @@ void free_unref_folios(struct folio_batc
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);
 
-		if (order > 0 && folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
+		folio_undo_large_rmappable(folio);
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*
--- a/mm/swap.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/swap.c
@@ -123,8 +123,7 @@ void __folio_put(struct folio *folio)
 	}
 
 	page_cache_release(folio);
-	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
-		folio_undo_large_rmappable(folio);
+	folio_undo_large_rmappable(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -1021,10 +1020,7 @@ void folios_put_refs(struct folio_batch
 			free_huge_folio(folio);
 			continue;
 		}
-		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
-
+		folio_undo_large_rmappable(folio);
 		__page_cache_release(folio, &lruvec, &flags);
 
 		if (j != i)
--- a/mm/vmscan.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/vmscan.c
@@ -1439,9 +1439,7 @@ free_it:
 		 */
 		nr_reclaimed += nr_pages;
 
-		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
+		folio_undo_large_rmappable(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1848,9 +1846,7 @@ static unsigned int move_folios_to_lru(s
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
-			if (folio_test_large(folio) &&
-			    folio_test_large_rmappable(folio))
-				folio_undo_large_rmappable(folio);
+			folio_undo_large_rmappable(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);
Hugh Dickins July 4, 2024, 3:21 a.m. UTC | #9
On Wed, 3 Jul 2024, Andrew Morton wrote:
> On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> 
> > [...]
> 
> There's a conflict when applying Kefeng's "mm: refactor
> folio_undo_large_rmappable()"
> (https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com)
> on top of this hotfix.

Yes, anticipated in my "below the --- line" comments:
sorry for giving you this nuisance.

And perhaps a conflict with another one of Kefeng's, which deletes a hunk
in mm/migrate.c just above where I add a hunk: and that's indeed how it
should end up, hunk deleted by Kefeng, hunk added by me.

> 
> --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
> +++ mm/memcontrol.c
> @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
>  	 * In addition, the old folio is about to be freed after migration, so
>  	 * removing from the split queue a bit earlier seems reasonable.
>  	 */
> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
> -		folio_undo_large_rmappable(old);
> +	folio_undo_large_rmappable(old);
>  	old->memcg_data = 0;
>  }
> 
> I'm resolving this by simply dropping the above hunk.  So Kefeng's
> patch is now as below.  Please check.

Checked, and that is correct, thank you Andrew.  Correct, but not quite
complete: because I'm sure that if Kefeng had written his patch after
mine, he would have made the equivalent change in mm/migrate.c:

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	/* Take off deferred split queue while frozen and memcg set */
-	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
-		folio_undo_large_rmappable(folio);
+	folio_undo_large_rmappable(folio);
 
 	/*
 	 * Now we know that no one else is looking at the folio:

But there's no harm done if you push out a tree without that additional
mod: we can add it as a fixup afterwards, it's no more than a cleanup.

(I'm on the lookout for an mm.git update, hope to give it a try when it
appears.)

Thanks,
Hugh
Andrew Morton July 4, 2024, 3:28 a.m. UTC | #10
On Wed, 3 Jul 2024 20:21:22 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Wed, 3 Jul 2024, Andrew Morton wrote:
> > On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > 
> > > [...]
> > 
> > There's a conflict when applying Kefeng's "mm: refactor
> > folio_undo_large_rmappable()"
> > (https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com)
> > on top of this hotfix.
> 
> Yes, anticipated in my "below the --- line" comments:
> sorry for giving you this nuisance.

np

> And perhaps a conflict with another one of Kefeng's, which deletes a hunk
> in mm/migrate.c just above where I add a hunk: and that's indeed how it
> should end up, hunk deleted by Kefeng, hunk added by me.

Sorted, I hope.

> > 
> > --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
> > +++ mm/memcontrol.c
> > @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
> >  	 * In addition, the old folio is about to be freed after migration, so
> >  	 * removing from the split queue a bit earlier seems reasonable.
> >  	 */
> > -	if (folio_test_large(old) && folio_test_large_rmappable(old))
> > -		folio_undo_large_rmappable(old);
> > +	folio_undo_large_rmappable(old);
> >  	old->memcg_data = 0;
> >  }
> > 
> > I'm resolving this by simply dropping the above hunk.  So Kefeng's
> > patch is now as below.  Please check.
> 
> Checked, and that is correct, thank you Andrew.

great.

> Correct, but not quite
> complete: because I'm sure that if Kefeng had written his patch after
> mine, he would have made the equivalent change in mm/migrate.c:
> 
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	}
>  
>  	/* Take off deferred split queue while frozen and memcg set */
> -	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> -		folio_undo_large_rmappable(folio);
> +	folio_undo_large_rmappable(folio);
>  
>  	/*
>  	 * Now we know that no one else is looking at the folio:
> 
> But there's no harm done if you push out a tree without that additional
> mod: we can add it as a fixup afterwards, it's no more than a cleanup.

OK, someone please send that along?  I'll queue it as a -fix so a
single line of changelog is all that I shall retain (but more is
welcome!  People can follow the Link:)

> (I'm on the lookout for an mm.git update, hope to give it a try when it
> appears.)

12 seconds ago.
Kefeng Wang July 4, 2024, 6:12 a.m. UTC | #11
On 2024/7/4 11:21, Hugh Dickins wrote:
> On Wed, 3 Jul 2024, Andrew Morton wrote:
>> On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>
...
> 
> And perhaps a conflict with another one of Kefeng's, which deletes a hunk
> in mm/migrate.c just above where I add a hunk: and that's indeed how it
> should end up, hunk deleted by Kefeng, hunk added by me.
> 
>>
>> --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
>> +++ mm/memcontrol.c
>> @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
>>   	 * In addition, the old folio is about to be freed after migration, so
>>   	 * removing from the split queue a bit earlier seems reasonable.
>>   	 */
>> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
>> -		folio_undo_large_rmappable(old);
>> +	folio_undo_large_rmappable(old);
>>   	old->memcg_data = 0;
>>   }
>>
>> I'm resolving this by simply dropping the above hunk.  So Kefeng's
>> patch is now as below.  Please check.
> 
> Checked, and that is correct, thank you Andrew.  Correct, but not quite
> complete: because I'm sure that if Kefeng had written his patch after
> mine, he would have made the equivalent change in mm/migrate.c:
> 

Yes,

> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>   	}
>   
>   	/* Take off deferred split queue while frozen and memcg set */
> -	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> -		folio_undo_large_rmappable(folio);
> +	folio_undo_large_rmappable(folio);
>   
>   	/*
>   	 * Now we know that no one else is looking at the folio:
> 
> But there's no harm done if you push out a tree without that additional
> mod: we can add it as a fixup afterwards, it's no more than a cleanup.
> 
Maybe we could convert to __folio_undo_large_rmappable() for the
!mapping part, which will avoid an unnecessary freeze/unfreeze when the
deferred list is empty.

diff --git a/mm/migrate.c b/mm/migrate.c
index ca65f03acb31..e6af9c25c25b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -412,10 +412,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
        if (!mapping) {
                /* Take off deferred split queue while frozen and memcg set */
                 if (folio_test_large(folio) &&
-                   folio_test_large_rmappable(folio)) {
+                   folio_test_large_rmappable(folio) &&
+                   !data_race(list_empty(&folio->_deferred_list))) {
                         if (!folio_ref_freeze(folio, expected_count))
                                 return -EAGAIN;
-                       folio_undo_large_rmappable(folio);
+                       __folio_undo_large_rmappable(folio);
                         folio_ref_unfreeze(folio, expected_count);
                 }


> (I'm on the lookout for an mm.git update, hope to give it a try when it
> appears.)
> 
> Thanks,
> Hugh
>

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71fe2a95b8bd..8f2f1bb18c9c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7823,17 +7823,6 @@  void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
-	/*
-	 * If the old folio is a large folio and is in the split queue, it needs
-	 * to be removed from the split queue now, in case getting an incorrect
-	 * split queue in destroy_large_folio() after the memcg of the old folio
-	 * is cleared.
-	 *
-	 * In addition, the old folio is about to be freed after migration, so
-	 * removing from the split queue a bit earlier seems reasonable.
-	 */
-	if (folio_test_large(old) && folio_test_large_rmappable(old))
-		folio_undo_large_rmappable(old);
 	old->memcg_data = 0;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 20cb9f5f7446..a8c6f466e33a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -415,6 +415,15 @@  int folio_migrate_mapping(struct address_space *mapping,
 		if (folio_ref_count(folio) != expected_count)
 			return -EAGAIN;
 
+		/* Take off deferred split queue while frozen and memcg set */
+		if (folio_test_large(folio) &&
+		    folio_test_large_rmappable(folio)) {
+			if (!folio_ref_freeze(folio, expected_count))
+				return -EAGAIN;
+			folio_undo_large_rmappable(folio);
+			folio_ref_unfreeze(folio, expected_count);
+		}
+
 		/* No turning back from here */
 		newfolio->index = folio->index;
 		newfolio->mapping = folio->mapping;
@@ -433,6 +442,10 @@  int folio_migrate_mapping(struct address_space *mapping,
 		return -EAGAIN;
 	}
 
+	/* Take off deferred split queue while frozen and memcg set */
+	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+		folio_undo_large_rmappable(folio);
+
 	/*
 	 * Now we know that no one else is looking at the folio:
 	 * no turning back from here.