diff mbox series

[v5] mm/migrate: split source folio if it is on deferred split list

Message ID 20240322193304.522496-1-zi.yan@sent.com (mailing list archive)
State New
Series [v5] mm/migrate: split source folio if it is on deferred split list

Commit Message

Zi Yan March 22, 2024, 7:33 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

If the source folio is on the deferred split list, it is likely that some
of its subpages are not used. Split it before migration to avoid migrating
unused subpages.

Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
did not check whether a THP is on the deferred split list before
migration, so the destination THP is never put on the deferred split list
even if the source THP was. The opportunity to reclaim free pages in a
partially mapped THP during deferred list scanning is lost, but there is
no other harmful consequence[1].

From v4:
1. Simplify _deferred_list check without locking and do not count as
   migration failures. (per Matthew Wilcox)

From v3:
1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
   compilation error (per SeongJae Park).

From v2:
1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].

From v1:
1. Used dst to get correct deferred split list after migration
   (per Ryan Roberts).

[1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
[2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/

Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/migrate.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)


base-commit: 08a487ab26d541a3bd0adaee144f684b724d233b

Comments

Baolin Wang March 26, 2024, 6:19 a.m. UTC | #1
On 2024/3/23 03:33, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> If the source folio is on deferred split list, it is likely some subpages
> are not used. Split it before migration to avoid migrating unused subpages.
> 
> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> did not check if a THP is on deferred split list before migration, thus,
> the destination THP is never put on deferred split list even if the source
> THP might be. The opportunity of reclaiming free pages in a partially
> mapped THP during deferred list scanning is lost, but no other harmful
> consequence is present[1].
> 
>  From v4:
> 1. Simplify _deferred_list check without locking and do not count as
>     migration failures. (per Matthew Wilcox)
> 
>  From v3:
> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>     compilation error (per SeongJae Park).
> 
>  From v2:
> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
> 
>  From v1:
> 1. Used dst to get correct deferred split list after migration
>     (per Ryan Roberts).
> 
> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
> 
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/migrate.c | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ab9856f5931b..6bd9319624a3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>   
>   			cond_resched();
>   
> +			/*
> +			 * The rare folio on the deferred split list should
> +			 * be split now. It should not count as a failure.
> +			 * Only check it without removing it from the list.
> +			 * Since the folio can be on deferred_split_scan()
> +			 * local list and removing it can cause the local list
> +			 * corruption. Folio split process below can handle it
> +			 * with the help of folio_ref_freeze().
> +			 *
> +			 * nr_pages > 2 is needed to avoid checking order-1
> +			 * page cache folios. They exist, in contrast to
> +			 * non-existent order-1 anonymous folios, and do not
> +			 * use _deferred_list.
> +			 */
> +			if (nr_pages > 2 &&
> +			   !list_empty(&folio->_deferred_list)) {
> +				if (try_split_folio(folio, from) == 0) {

IMO, we should move the split folios into the 'split_folios' list 
instead of the 'from' list, otherwise there might be unhandled folios 
remaining in the from list.

> +					stats->nr_thp_split += is_thp;
> +					stats->nr_split++;
> +					continue;
> +				}
> +			}
> +
>   			/*
>   			 * Large folio migration might be unsupported or
>   			 * the allocation might be failed so we should retry
> 
> base-commit: 08a487ab26d541a3bd0adaee144f684b724d233b
Zi Yan March 26, 2024, 1:26 p.m. UTC | #2
On 26 Mar 2024, at 2:19, Baolin Wang wrote:

> On 2024/3/23 03:33, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> If the source folio is on deferred split list, it is likely some subpages
>> are not used. Split it before migration to avoid migrating unused subpages.
>>
>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>> did not check if a THP is on deferred split list before migration, thus,
>> the destination THP is never put on deferred split list even if the source
>> THP might be. The opportunity of reclaiming free pages in a partially
>> mapped THP during deferred list scanning is lost, but no other harmful
>> consequence is present[1].
>>
>>  From v4:
>> 1. Simplify _deferred_list check without locking and do not count as
>>     migration failures. (per Matthew Wilcox)
>>
>>  From v3:
>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>     compilation error (per SeongJae Park).
>>
>>  From v2:
>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>
>>  From v1:
>> 1. Used dst to get correct deferred split list after migration
>>     (per Ryan Roberts).
>>
>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>
>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/migrate.c | 23 +++++++++++++++++++++++
>>   1 file changed, 23 insertions(+)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index ab9856f5931b..6bd9319624a3 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>    			cond_resched();
>>  +			/*
>> +			 * The rare folio on the deferred split list should
>> +			 * be split now. It should not count as a failure.
>> +			 * Only check it without removing it from the list.
>> +			 * Since the folio can be on deferred_split_scan()
>> +			 * local list and removing it can cause the local list
>> +			 * corruption. Folio split process below can handle it
>> +			 * with the help of folio_ref_freeze().
>> +			 *
>> +			 * nr_pages > 2 is needed to avoid checking order-1
>> +			 * page cache folios. They exist, in contrast to
>> +			 * non-existent order-1 anonymous folios, and do not
>> +			 * use _deferred_list.
>> +			 */
>> +			if (nr_pages > 2 &&
>> +			   !list_empty(&folio->_deferred_list)) {
>> +				if (try_split_folio(folio, from) == 0) {
>
> IMO, we should move the split folios into the 'split_folios' list instead of the 'from' list, otherwise there might be unhandled folios remaining in the from list.

Can you elaborate on the actual situation you are thinking about? Thanks.

>
>> +					stats->nr_thp_split += is_thp;
>> +					stats->nr_split++;
>> +					continue;
>> +				}
>> +			}
>> +
>>   			/*
>>   			 * Large folio migration might be unsupported or
>>   			 * the allocation might be failed so we should retry
>>
>> base-commit: 08a487ab26d541a3bd0adaee144f684b724d233b


--
Best Regards,
Yan, Zi
Baolin Wang March 26, 2024, 2:42 p.m. UTC | #3
On 2024/3/26 21:26, Zi Yan wrote:
> On 26 Mar 2024, at 2:19, Baolin Wang wrote:
> 
>> On 2024/3/23 03:33, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> If the source folio is on deferred split list, it is likely some subpages
>>> are not used. Split it before migration to avoid migrating unused subpages.
>>>
>>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>> did not check if a THP is on deferred split list before migration, thus,
>>> the destination THP is never put on deferred split list even if the source
>>> THP might be. The opportunity of reclaiming free pages in a partially
>>> mapped THP during deferred list scanning is lost, but no other harmful
>>> consequence is present[1].
>>>
>>>   From v4:
>>> 1. Simplify _deferred_list check without locking and do not count as
>>>      migration failures. (per Matthew Wilcox)
>>>
>>>   From v3:
>>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>>      compilation error (per SeongJae Park).
>>>
>>>   From v2:
>>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>>
>>>   From v1:
>>> 1. Used dst to get correct deferred split list after migration
>>>      (per Ryan Roberts).
>>>
>>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>>
>>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/migrate.c | 23 +++++++++++++++++++++++
>>>    1 file changed, 23 insertions(+)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index ab9856f5931b..6bd9319624a3 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>>     			cond_resched();
>>>   +			/*
>>> +			 * The rare folio on the deferred split list should
>>> +			 * be split now. It should not count as a failure.
>>> +			 * Only check it without removing it from the list.
>>> +			 * Since the folio can be on deferred_split_scan()
>>> +			 * local list and removing it can cause the local list
>>> +			 * corruption. Folio split process below can handle it
>>> +			 * with the help of folio_ref_freeze().
>>> +			 *
>>> +			 * nr_pages > 2 is needed to avoid checking order-1
>>> +			 * page cache folios. They exist, in contrast to
>>> +			 * non-existent order-1 anonymous folios, and do not
>>> +			 * use _deferred_list.
>>> +			 */
>>> +			if (nr_pages > 2 &&
>>> +			   !list_empty(&folio->_deferred_list)) {
>>> +				if (try_split_folio(folio, from) == 0) {
>>
>> IMO, we should move the split folios into the 'split_folios' list instead of the 'from' list, otherwise there might be unhandled folios remaining in the from list.
> 
> Can you elaborate on the actual situation you are thinking about? Thanks.

Sure.

Suppose there is only one large folio in the from list that needs to be 
migrated, and this large folio is in the _deferred_list, which means it 
needs to be split. Your patch will re-add the split base pages back into 
the 'from' list. However, please see the list_for_each_entry_safe macro:

#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_first_entry(head, typeof(*pos), member),	\
		n = list_next_entry(pos, member);			\
	     !list_entry_is_head(pos, head, member); 			\
	     pos = n, n = list_next_entry(n, member))

The iteration terminates early because the next entry 'n', fetched in
advance, is already the list head, so the split base pages are left in
the 'from' list. This caused the following crash during my migration
testing:

[  412.576943] ------------[ cut here ]------------
[  412.576947] kernel BUG at mm/migrate.c:2634!
[  412.577132] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  412.577201] CPU: 59 PID: 9581 Comm: numa01 Kdump: loaded Tainted: G          E      6.9.0-rc1+ #69
........
[  412.578651] Call Trace:
[  412.578692]  <TASK>
[  412.578730]  ? die+0x33/0x90
[  412.578770]  ? do_trap+0xdf/0x110
[  412.578815]  ? migrate_misplaced_folio+0x1f2/0x2b0
[  412.578875]  ? do_error_trap+0x65/0x80
[  412.578922]  ? migrate_misplaced_folio+0x1f2/0x2b0
[  412.578977]  ? exc_invalid_op+0x4e/0x70
[  412.579048]  ? migrate_misplaced_folio+0x1f2/0x2b0
[  412.579131]  ? asm_exc_invalid_op+0x16/0x20
[  412.579182]  ? migrate_misplaced_folio+0x1f2/0x2b0
[  412.579255]  do_numa_page+0x205/0x5b0
[  412.579305]  __handle_mm_fault+0x2b0/0x6c0
[  412.579354]  handle_mm_fault+0x105/0x270
[  412.579404]  do_user_addr_fault+0x214/0x6b0
[  412.579453]  exc_page_fault+0x64/0x140
[  412.579509]  asm_exc_page_fault+0x22/0x30

2583 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
2584                             int node)
2585 {
		......

2628         if (nr_succeeded) {
2629                 count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
2630                 if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
2631                         mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
2632                                             nr_succeeded);
2633         }
2634         BUG_ON(!list_empty(&migratepages));
2635         return isolated;
2636
2637 out:

After changing as below, the system crash issue is gone.

+++ b/mm/migrate.c
@@ -1668,7 +1668,7 @@ static int migrate_pages_batch(struct list_head *from,
                          */
                         if (nr_pages > 2 &&
                            !list_empty(&folio->_deferred_list)) {
-                               if (try_split_folio(folio, from) == 0) {
+                               if (try_split_folio(folio, split_folios) == 0) {
                                         stats->nr_thp_split += is_thp;
                                         stats->nr_split++;
                                         continue;
Zi Yan March 26, 2024, 2:53 p.m. UTC | #4
On 26 Mar 2024, at 10:42, Baolin Wang wrote:

> On 2024/3/26 21:26, Zi Yan wrote:
>> On 26 Mar 2024, at 2:19, Baolin Wang wrote:
>>
>>> On 2024/3/23 03:33, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> If the source folio is on deferred split list, it is likely some subpages
>>>> are not used. Split it before migration to avoid migrating unused subpages.
>>>>
>>>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>> did not check if a THP is on deferred split list before migration, thus,
>>>> the destination THP is never put on deferred split list even if the source
>>>> THP might be. The opportunity of reclaiming free pages in a partially
>>>> mapped THP during deferred list scanning is lost, but no other harmful
>>>> consequence is present[1].
>>>>
>>>>   From v4:
>>>> 1. Simplify _deferred_list check without locking and do not count as
>>>>      migration failures. (per Matthew Wilcox)
>>>>
>>>>   From v3:
>>>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>>>      compilation error (per SeongJae Park).
>>>>
>>>>   From v2:
>>>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>>>
>>>>   From v1:
>>>> 1. Used dst to get correct deferred split list after migration
>>>>      (per Ryan Roberts).
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>>>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>>>
>>>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>    mm/migrate.c | 23 +++++++++++++++++++++++
>>>>    1 file changed, 23 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index ab9856f5931b..6bd9319624a3 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>>>     			cond_resched();
>>>>   +			/*
>>>> +			 * The rare folio on the deferred split list should
>>>> +			 * be split now. It should not count as a failure.
>>>> +			 * Only check it without removing it from the list.
>>>> +			 * Since the folio can be on deferred_split_scan()
>>>> +			 * local list and removing it can cause the local list
>>>> +			 * corruption. Folio split process below can handle it
>>>> +			 * with the help of folio_ref_freeze().
>>>> +			 *
>>>> +			 * nr_pages > 2 is needed to avoid checking order-1
>>>> +			 * page cache folios. They exist, in contrast to
>>>> +			 * non-existent order-1 anonymous folios, and do not
>>>> +			 * use _deferred_list.
>>>> +			 */
>>>> +			if (nr_pages > 2 &&
>>>> +			   !list_empty(&folio->_deferred_list)) {
>>>> +				if (try_split_folio(folio, from) == 0) {
>>>
>>> IMO, we should move the split folios into the 'split_folios' list instead of the 'from' list, otherwise there might be unhandled folios remaining in the from list.
>>
>> Can you elaborate on the actual situation you are thinking about? Thanks.
>
> Sure.
>
> Suppose there is only one large folio in the from list that needs to be migrated, and this large folio is in the _deferred_list, which means it needs to be split. Your patch will re-add the split base pages back into the 'from' list. However, please see the list_for_each_entry_safe macro:
>
> #define list_for_each_entry_safe(pos, n, head, member)			\
> 	for (pos = list_first_entry(head, typeof(*pos), member),	\
> 		n = list_next_entry(pos, member);			\
> 	     !list_entry_is_head(pos, head, member); 			\
> 	     pos = n, n = list_next_entry(n, member))
>
> It will terminate the iteration early because the next entry 'n' taken out in advance is already the head, leading to the remaining split base pages still in the from list. This can cause the following crash when I did some migration testing:
>
> [  412.576943] ------------[ cut here ]------------
> [  412.576947] kernel BUG at mm/migrate.c:2634!
> [  412.577132] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [  412.577201] CPU: 59 PID: 9581 Comm: numa01 Kdump: loaded Tainted: G          E      6.9.0-rc1+ #69
> ........
> [  412.578651] Call Trace:
> [  412.578692]  <TASK>
> [  412.578730]  ? die+0x33/0x90
> [  412.578770]  ? do_trap+0xdf/0x110
> [  412.578815]  ? migrate_misplaced_folio+0x1f2/0x2b0
> [  412.578875]  ? do_error_trap+0x65/0x80
> [  412.578922]  ? migrate_misplaced_folio+0x1f2/0x2b0
> [  412.578977]  ? exc_invalid_op+0x4e/0x70
> [  412.579048]  ? migrate_misplaced_folio+0x1f2/0x2b0
> [  412.579131]  ? asm_exc_invalid_op+0x16/0x20
> [  412.579182]  ? migrate_misplaced_folio+0x1f2/0x2b0
> [  412.579255]  do_numa_page+0x205/0x5b0
> [  412.579305]  __handle_mm_fault+0x2b0/0x6c0
> [  412.579354]  handle_mm_fault+0x105/0x270
> [  412.579404]  do_user_addr_fault+0x214/0x6b0
> [  412.579453]  exc_page_fault+0x64/0x140
> [  412.579509]  asm_exc_page_fault+0x22/0x30
>
> 2583 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
> 2584                             int node)
> 2585 {
> 		......
>
> 2628         if (nr_succeeded) {
> 2629                 count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> 2630                 if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
> 2631                         mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
> 2632                                             nr_succeeded);
> 2633         }
> 2634         BUG_ON(!list_empty(&migratepages));
> 2635         return isolated;
> 2636
> 2637 out:

Got it. Thanks.

>
> After changing as below, the system crash issue is gone.
>
> +++ b/mm/migrate.c
> @@ -1668,7 +1668,7 @@ static int migrate_pages_batch(struct list_head *from,
>                          */
>                         if (nr_pages > 2 &&
>                            !list_empty(&folio->_deferred_list)) {
> -                               if (try_split_folio(folio, from) == 0) {
> +                               if (try_split_folio(folio, split_folios) == 0) {
>                                         stats->nr_thp_split += is_thp;
>                                         stats->nr_split++;
>                                         continue;

Let me resend with this fix.

--
Best Regards,
Yan, Zi
Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index ab9856f5931b..6bd9319624a3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1652,6 +1652,29 @@  static int migrate_pages_batch(struct list_head *from,
 
 			cond_resched();
 
+			/*
+			 * The rare folio on the deferred split list should
+			 * be split now. It should not count as a failure.
+			 * Only check it without removing it from the list.
+			 * Since the folio can be on deferred_split_scan()
+			 * local list and removing it can cause the local list
+			 * corruption. Folio split process below can handle it
+			 * with the help of folio_ref_freeze().
+			 *
+			 * nr_pages > 2 is needed to avoid checking order-1
+			 * page cache folios. They exist, in contrast to
+			 * non-existent order-1 anonymous folios, and do not
+			 * use _deferred_list.
+			 */
+			if (nr_pages > 2 &&
+			   !list_empty(&folio->_deferred_list)) {
+				if (try_split_folio(folio, from) == 0) {
+					stats->nr_thp_split += is_thp;
+					stats->nr_split++;
+					continue;
+				}
+			}
+
 			/*
 			 * Large folio migration might be unsupported or
 			 * the allocation might be failed so we should retry