diff mbox series

[v3] mm/vmscan: stop the loop if enough pages have been page_out

Message ID 20241010081802.290893-1-chenridong@huaweicloud.com (mailing list archive)
State New
Headers show
Series [v3] mm/vmscan: stop the loop if enough pages have been page_out | expand

Commit Message

Chen Ridong Oct. 10, 2024, 8:18 a.m. UTC
From: Chen Ridong <chenridong@huawei.com>

An issue was found with the following testing step:
1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
2. Mount memcg v1, and create memcg named test_memcg and set
   usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.

It was found that:

cat memory.usage_in_bytes
2144940032
cat memory.memsw.usage_in_bytes
2255056896

free -h
              total        used        free
Mem:           31Gi       2.1Gi        27Gi
Swap:         1.0Gi       618Mi       405Mi

As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
was used, which means that 500M may be wasted because other memcgs can not
use these swap memory.

It can be explained as follows:
1. When entering shrink_inactive_list, it isolates folios from lru from
   tail to head. If it just takes folioN from lru(make it simple).

   inactive lru: folio1<->folio2<->folio3...<->folioN-1
   isolated list: folioN

2. In shrink_page_list function, if folioN is THP, it may be splited and
   added to swap cache folio by folio. After adding to swap cache, it will
   submit io to writeback folio to swap, which is asynchronous.
   When shrink_page_list is finished, the isolated folios list will be
   moved back to the head of inactive lru. The inactive lru may just look
   like this, with 512 filioes have been move to the head of inactive lru.

   folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1

3. When folio writeback io is completed, the folio may be rotated to tail
   of lru. The following lru list is expected, with those filioes that have
   been added to swap cache are rotated to tail of lru. So those folios
   can be reclaimed as soon as possible.

   folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

4. However, shrink_page_list and folio writeback are asynchronous. If THP
   is splited, shrink_page_list loops at least 512 times, which means that
   shrink_page_list is not completed but some folios writeback have been
   completed, and this may lead to failure to rotate these folios to the
   tail of lru. The lru may look likes as below:

   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
   folioN51<->folioN52<->...folioN511<->folioN512

   Although those folios (N1-N50) have been finished writing back, they
   are still at the head of lru. When isolating folios from lru, it scans
   from tail to head, so it is difficult to scan those folios again.

What mentioned above may lead to a large number of folios have been added
to swap cache but can not be reclaimed in time, which may reduce reclaim
efficiency and prevent other memcgs from using this swap memory even if
they trigger OOM.

To fix this issue, it's better to stop looping if THP has been splited and
nr_pageout is greater than nr_to_reclaim.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/vmscan.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Kefeng Wang Oct. 10, 2024, 8:59 a.m. UTC | #1
Hi Ridong,

This should be the first version for upstream, and the issue only
occurred when large folio is spited.

Adding more CCs to see if there's more feedback.


On 2024/10/10 16:18, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
> 
> An issue was found with the following testing step:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> 2. Mount memcg v1, and create memcg named test_memcg and set
>     usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> 
> It was found that:
> 
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
> 
> free -h
>                total        used        free
> Mem:           31Gi       2.1Gi        27Gi
> Swap:         1.0Gi       618Mi       405Mi
> 
> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
> was used, which means that 500M may be wasted because other memcgs can not
> use these swap memory.
> 
> It can be explained as follows:
> 1. When entering shrink_inactive_list, it isolates folios from lru from
>     tail to head. If it just takes folioN from lru(make it simple).
> 
>     inactive lru: folio1<->folio2<->folio3...<->folioN-1
>     isolated list: folioN
> 
> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>     added to swap cache folio by folio. After adding to swap cache, it will
>     submit io to writeback folio to swap, which is asynchronous.
>     When shrink_page_list is finished, the isolated folios list will be
>     moved back to the head of inactive lru. The inactive lru may just look
>     like this, with 512 filioes have been move to the head of inactive lru.
> 
>     folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> 
> 3. When folio writeback io is completed, the folio may be rotated to tail
>     of lru. The following lru list is expected, with those filioes that have
>     been added to swap cache are rotated to tail of lru. So those folios
>     can be reclaimed as soon as possible.
> 
>     folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> 
> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>     is splited, shrink_page_list loops at least 512 times, which means that
>     shrink_page_list is not completed but some folios writeback have been
>     completed, and this may lead to failure to rotate these folios to the
>     tail of lru. The lru may look likes as below:
> 
>     folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>     folioN51<->folioN52<->...folioN511<->folioN512
> 
>     Although those folios (N1-N50) have been finished writing back, they
>     are still at the head of lru. When isolating folios from lru, it scans
>     from tail to head, so it is difficult to scan those folios again.
> 
> What mentioned above may lead to a large number of folios have been added
> to swap cache but can not be reclaimed in time, which may reduce reclaim
> efficiency and prevent other memcgs from using this swap memory even if
> they trigger OOM.
> 
> To fix this issue, it's better to stop looping if THP has been splited and
> nr_pageout is greater than nr_to_reclaim.
> 
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>   mm/vmscan.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..fd8ad251eda2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   	LIST_HEAD(demote_folios);
>   	unsigned int nr_reclaimed = 0;
>   	unsigned int pgactivate = 0;
> -	bool do_demote_pass;
> +	bool do_demote_pass, splited = false;
>   	struct swap_iocb *plug = NULL;
>   
>   	folio_batch_init(&free_folios);
> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   
>   		cond_resched();
>   
> +		/*
> +		 * If a large folio has been split, many folios are added
> +		 * to folio_list. Looping through the entire list takes
> +		 * too much time, which may prevent folios that have completed
> +		 * writeback from rotateing to the tail of the lru. Just
> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
> +		 */
> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> +			break;
> +
>   		folio = lru_to_folio(folio_list);
>   		list_del(&folio->lru);
>   
> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   		if ((nr_pages > 1) && !folio_test_large(folio)) {
>   			sc->nr_scanned -= (nr_pages - 1);
>   			nr_pages = 1;
> +			splited = true;
>   		}
>   
>   		/*
> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   				if (nr_pages > 1 && !folio_test_large(folio)) {
>   					sc->nr_scanned -= (nr_pages - 1);
>   					nr_pages = 1;
> +					splited = true;
>   				}
>   				goto activate_locked;
>   			case PAGE_SUCCESS:
>   				if (nr_pages > 1 && !folio_test_large(folio)) {
>   					sc->nr_scanned -= (nr_pages - 1);
>   					nr_pages = 1;
> +					splited = true;
>   				}
>   				stat->nr_pageout += nr_pages;
>   
> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   		if (nr_pages > 1) {
>   			sc->nr_scanned -= (nr_pages - 1);
>   			nr_pages = 1;
> +			splited = true;
>   		}
>   activate_locked:
>   		/* Not a candidate for swapping, so reclaim swap space. */
chenridong Oct. 10, 2024, 9:28 a.m. UTC | #2
On 2024/10/10 16:59, Kefeng Wang wrote:
> Hi Ridong,
> 
> This should be the first version for upstream, and the issue only
> occurred when large folio is spited.
> 
> Adding more CCs to see if there's more feedback.
> 

Thank you very much, Kefeng, this is the first version.

Best regards,
Ridong
chenridong Oct. 11, 2024, 6:49 a.m. UTC | #3
On 2024/10/11 0:17, Barry Song wrote:
> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> Hi Ridong,
>>
>> This should be the first version for upstream, and the issue only
>> occurred when large folio is spited.
>>
>> Adding more CCs to see if there's more feedback.
>>
>>
>> On 2024/10/10 16:18, Chen Ridong wrote:
>>> From: Chen Ridong <chenridong@huawei.com>
>>>
>>> An issue was found with the following testing step:
>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>
>>> It was found that:
>>>
>>> cat memory.usage_in_bytes
>>> 2144940032
>>> cat memory.memsw.usage_in_bytes
>>> 2255056896
>>>
>>> free -h
>>>                 total        used        free
>>> Mem:           31Gi       2.1Gi        27Gi
>>> Swap:         1.0Gi       618Mi       405Mi
>>>
>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
>>> was used, which means that 500M may be wasted because other memcgs can not
>>> use these swap memory.
>>>
>>> It can be explained as follows:
>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
>>>      tail to head. If it just takes folioN from lru(make it simple).
>>>
>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>      isolated list: folioN
>>>
>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>>>      added to swap cache folio by folio. After adding to swap cache, it will
>>>      submit io to writeback folio to swap, which is asynchronous.
>>>      When shrink_page_list is finished, the isolated folios list will be
>>>      moved back to the head of inactive lru. The inactive lru may just look
>>>      like this, with 512 filioes have been move to the head of inactive lru.
>>>
>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>
>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>      of lru. The following lru list is expected, with those filioes that have
>>>      been added to swap cache are rotated to tail of lru. So those folios
>>>      can be reclaimed as soon as possible.
>>>
>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>
>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>      is splited, shrink_page_list loops at least 512 times, which means that
>>>      shrink_page_list is not completed but some folios writeback have been
>>>      completed, and this may lead to failure to rotate these folios to the
>>>      tail of lru. The lru may look likes as below:
> 
> I assume you’re referring to PMD-mapped THP, but your code also modifies
> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> 
>>>
>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>      folioN51<->folioN52<->...folioN511<->folioN512
>>>
>>>      Although those folios (N1-N50) have been finished writing back, they
>>>      are still at the head of lru. When isolating folios from lru, it scans
>>>      from tail to head, so it is difficult to scan those folios again.
>>>
>>> What mentioned above may lead to a large number of folios have been added
>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
>>> efficiency and prevent other memcgs from using this swap memory even if
>>> they trigger OOM.
>>>
>>> To fix this issue, it's better to stop looping if THP has been splited and
>>> nr_pageout is greater than nr_to_reclaim.
>>>
>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>> ---
>>>    mm/vmscan.c | 16 +++++++++++++++-
>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 749cdc110c74..fd8ad251eda2 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>        LIST_HEAD(demote_folios);
>>>        unsigned int nr_reclaimed = 0;
>>>        unsigned int pgactivate = 0;
>>> -     bool do_demote_pass;
>>> +     bool do_demote_pass, splited = false;
>>>        struct swap_iocb *plug = NULL;
>>>
>>>        folio_batch_init(&free_folios);
>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>
>>>                cond_resched();
>>>
>>> +             /*
>>> +              * If a large folio has been split, many folios are added
>>> +              * to folio_list. Looping through the entire list takes
>>> +              * too much time, which may prevent folios that have completed
>>> +              * writeback from rotateing to the tail of the lru. Just
>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
>>> +              */
>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>> +                     break;
> 
> I’m not entirely sure about the theory behind comparing stat->nr_pageout
> with sc->nr_to_reclaim. However, the condition might still hold true even
> if you’ve split a relatively small “large folio,” such as 16kB?
> 

Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all 
pages that have been pageout can be reclaimed, then enough pages can be 
reclaimed when all pages have finished writeback. Thus, it may not have 
to pageout more.

If a small large folio(16 kB) has been split, it may return early 
without the entire pages in the folio_list being pageout, but I think 
that is fine. It can pageout more pages the next time it enters 
shrink_folio_list if there are not enough pages to reclaimed.

However, if pages that have been pageout are still at the head of the 
LRU, it is difficult to scan these pages again. In this case, not only 
might it "waste" some swap memory but it also has to pageout more pages.

Considering the above, I sent this patch. It may not be a perfect 
solution, but i think it's a good option to consider. And I am wondering 
if anyone has a better solution.

Best regards,
Ridong

>>> +
>>>                folio = lru_to_folio(folio_list);
>>>                list_del(&folio->lru);
>>>
>>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>                if ((nr_pages > 1) && !folio_test_large(folio)) {
>>>                        sc->nr_scanned -= (nr_pages - 1);
>>>                        nr_pages = 1;
>>> +                     splited = true;
>>>                }
>>>
>>>                /*
>>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>                                if (nr_pages > 1 && !folio_test_large(folio)) {
>>>                                        sc->nr_scanned -= (nr_pages - 1);
>>>                                        nr_pages = 1;
>>> +                                     splited = true;
>>>                                }
>>>                                goto activate_locked;
>>>                        case PAGE_SUCCESS:
>>>                                if (nr_pages > 1 && !folio_test_large(folio)) {
>>>                                        sc->nr_scanned -= (nr_pages - 1);
>>>                                        nr_pages = 1;
>>> +                                     splited = true;
>>>                                }
>>>                                stat->nr_pageout += nr_pages;
>>>
>>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>                if (nr_pages > 1) {
>>>                        sc->nr_scanned -= (nr_pages - 1);
>>>                        nr_pages = 1;
>>> +                     splited = true;
>>>                }
>>>    activate_locked:
>>>                /* Not a candidate for swapping, so reclaim swap space. */
>>
Chen Ridong Oct. 21, 2024, 8:14 a.m. UTC | #4
On 2024/10/21 12:44, Barry Song wrote:
> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/11 0:17, Barry Song wrote:
>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> Hi Ridong,
>>>>
>>>> This should be the first version for upstream, and the issue only
>>>> occurred when large folio is spited.
>>>>
>>>> Adding more CCs to see if there's more feedback.
>>>>
>>>>
>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>>
>>>>> An issue was found with the following testing step:
>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>>>
>>>>> It was found that:
>>>>>
>>>>> cat memory.usage_in_bytes
>>>>> 2144940032
>>>>> cat memory.memsw.usage_in_bytes
>>>>> 2255056896
>>>>>
>>>>> free -h
>>>>>                 total        used        free
>>>>> Mem:           31Gi       2.1Gi        27Gi
>>>>> Swap:         1.0Gi       618Mi       405Mi
>>>>>
>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
>>>>> was used, which means that 500M may be wasted because other memcgs can not
>>>>> use these swap memory.
>>>>>
>>>>> It can be explained as follows:
>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
>>>>>      tail to head. If it just takes folioN from lru(make it simple).
>>>>>
>>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>      isolated list: folioN
>>>>>
>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>>>>>      added to swap cache folio by folio. After adding to swap cache, it will
>>>>>      submit io to writeback folio to swap, which is asynchronous.
>>>>>      When shrink_page_list is finished, the isolated folios list will be
>>>>>      moved back to the head of inactive lru. The inactive lru may just look
>>>>>      like this, with 512 filioes have been move to the head of inactive lru.
>>>>>
>>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>>>
>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>>      of lru. The following lru list is expected, with those filioes that have
>>>>>      been added to swap cache are rotated to tail of lru. So those folios
>>>>>      can be reclaimed as soon as possible.
>>>>>
>>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>
>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>>>      is splited, shrink_page_list loops at least 512 times, which means that
>>>>>      shrink_page_list is not completed but some folios writeback have been
>>>>>      completed, and this may lead to failure to rotate these folios to the
>>>>>      tail of lru. The lru may look likes as below:
>>>
>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>
>>>>>
>>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>>>      folioN51<->folioN52<->...folioN511<->folioN512
>>>>>
>>>>>      Although those folios (N1-N50) have been finished writing back, they
>>>>>      are still at the head of lru. When isolating folios from lru, it scans
>>>>>      from tail to head, so it is difficult to scan those folios again.
>>>>>
>>>>> What mentioned above may lead to a large number of folios have been added
>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
>>>>> efficiency and prevent other memcgs from using this swap memory even if
>>>>> they trigger OOM.
>>>>>
>>>>> To fix this issue, it's better to stop looping if THP has been splited and
>>>>> nr_pageout is greater than nr_to_reclaim.
>>>>>
>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>> ---
>>>>>    mm/vmscan.c | 16 +++++++++++++++-
>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>        LIST_HEAD(demote_folios);
>>>>>        unsigned int nr_reclaimed = 0;
>>>>>        unsigned int pgactivate = 0;
>>>>> -     bool do_demote_pass;
>>>>> +     bool do_demote_pass, splited = false;
>>>>>        struct swap_iocb *plug = NULL;
>>>>>
>>>>>        folio_batch_init(&free_folios);
>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>
>>>>>                cond_resched();
>>>>>
>>>>> +             /*
>>>>> +              * If a large folio has been split, many folios are added
>>>>> +              * to folio_list. Looping through the entire list takes
>>>>> +              * too much time, which may prevent folios that have completed
>>>>> +              * writeback from rotateing to the tail of the lru. Just
>>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>> +              */
>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>> +                     break;
>>>
>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>> if you’ve split a relatively small “large folio,” such as 16kB?
>>>
>>
>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
>> pages that have been pageout can be reclaimed, then enough pages can be
>> reclaimed when all pages have finished writeback. Thus, it may not have
>> to pageout more.
>>
>> If a small large folio(16 kB) has been split, it may return early
>> without the entire pages in the folio_list being pageout, but I think
>> that is fine. It can pageout more pages the next time it enters
>> shrink_folio_list if there are not enough pages to reclaimed.
>>
>> However, if pages that have been pageout are still at the head of the
>> LRU, it is difficult to scan these pages again. In this case, not only
>> might it "waste" some swap memory but it also has to pageout more pages.
>>
>> Considering the above, I sent this patch. It may not be a perfect
>> solution, but i think it's a good option to consider. And I am wondering
>> if anyone has a better solution.
> 
> Hi Ridong,
> My overall understanding is that you have failed to describe your problem
> particularly I don't understand what your 3 and 4 mean:
> 
>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>    of lru. The following lru list is expected, with those filioes that have
>>    been added to swap cache are rotated to tail of lru. So those folios
>>  can be reclaimed as soon as possible.
>>
>>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> 
>  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>  >    is splited, shrink_page_list loops at least 512 times, which means that
>  >    shrink_page_list is not completed but some folios writeback have been
>  >    completed, and this may lead to failure to rotate these folios to the
>   >  tail of lru. The lru may look likes as below:
> 
> can you please describe it in a readable approach?
> 
> i feel your below diagram is somehow wrong:
> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> 
> You mentioned "rotate', how could "rotate" makes:
> folioN512<->folioN511<->...filioN1 in (2)
> become
> filioN1<->...folioN511<->folioN512 in (3).
> 

I am sorry for any confusion.

If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
to writeback one by one. it assumed that filioN1,
filioN2,filioN3,...filioN512 are completed in order.

Orignal:
folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1

filioN1 is finished, filioN1 is rotated to the tail of LRU:
folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1

filioN2 is finished:
folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2

filioN3 is finished:
folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3

...

filioN512 is finished:
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

When the filios are finished, the LRU might just like this:
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

> btw, writeback isn't always async. it could be sync for zram and sync_io
> swap. in that case, your patch might change the order of LRU. i mean,
> for example, while a mTHP becomes cold, we always reclaim all of them,
> but not part of them and put back part of small folios to the head of lru.
> 

Yes, This can be changed.
Although it may put back part of small folios to the head of lru, it can
return in time from shrink_folio_list without causing much additional I/O.

If you have understood this issue, do you have any suggestions to fix
it? My patch may not be a perfect way to fix this issue.

Best regards,
Ridong
Barry Song Oct. 21, 2024, 9:42 a.m. UTC | #5
On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
>
>
> On 2024/10/21 12:44, Barry Song wrote:
> > On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/11 0:17, Barry Song wrote:
> >>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>> Hi Ridong,
> >>>>
> >>>> This should be the first version for upstream, and the issue only
> >>>> occurred when large folio is spited.
> >>>>
> >>>> Adding more CCs to see if there's more feedback.
> >>>>
> >>>>
> >>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>> From: Chen Ridong <chenridong@huawei.com>
> >>>>>
> >>>>> An issue was found with the following testing step:
> >>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> >>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
> >>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> >>>>>
> >>>>> It was found that:
> >>>>>
> >>>>> cat memory.usage_in_bytes
> >>>>> 2144940032
> >>>>> cat memory.memsw.usage_in_bytes
> >>>>> 2255056896
> >>>>>
> >>>>> free -h
> >>>>>                 total        used        free
> >>>>> Mem:           31Gi       2.1Gi        27Gi
> >>>>> Swap:         1.0Gi       618Mi       405Mi
> >>>>>
> >>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
> >>>>> was used, which means that 500M may be wasted because other memcgs can not
> >>>>> use these swap memory.
> >>>>>
> >>>>> It can be explained as follows:
> >>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
> >>>>>      tail to head. If it just takes folioN from lru(make it simple).
> >>>>>
> >>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>>      isolated list: folioN
> >>>>>
> >>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
> >>>>>      added to swap cache folio by folio. After adding to swap cache, it will
> >>>>>      submit io to writeback folio to swap, which is asynchronous.
> >>>>>      When shrink_page_list is finished, the isolated folios list will be
> >>>>>      moved back to the head of inactive lru. The inactive lru may just look
> >>>>>      like this, with 512 filioes have been move to the head of inactive lru.
> >>>>>
> >>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> >>>>>
> >>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
> >>>>>      of lru. The following lru list is expected, with those filioes that have
> >>>>>      been added to swap cache are rotated to tail of lru. So those folios
> >>>>>      can be reclaimed as soon as possible.
> >>>>>
> >>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>>>>
> >>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> >>>>>      is splited, shrink_page_list loops at least 512 times, which means that
> >>>>>      shrink_page_list is not completed but some folios writeback have been
> >>>>>      completed, and this may lead to failure to rotate these folios to the
> >>>>>      tail of lru. The lru may look likes as below:
> >>>
> >>> I assume you’re referring to PMD-mapped THP, but your code also modifies
> >>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>
> >>>>>
> >>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> >>>>>      folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>
> >>>>>      Although those folios (N1-N50) have been finished writing back, they
> >>>>>      are still at the head of lru. When isolating folios from lru, it scans
> >>>>>      from tail to head, so it is difficult to scan those folios again.
> >>>>>
> >>>>> What mentioned above may lead to a large number of folios have been added
> >>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
> >>>>> efficiency and prevent other memcgs from using this swap memory even if
> >>>>> they trigger OOM.
> >>>>>
> >>>>> To fix this issue, it's better to stop looping if THP has been splited and
> >>>>> nr_pageout is greater than nr_to_reclaim.
> >>>>>
> >>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> >>>>> ---
> >>>>>    mm/vmscan.c | 16 +++++++++++++++-
> >>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>> --- a/mm/vmscan.c
> >>>>> +++ b/mm/vmscan.c
> >>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>        LIST_HEAD(demote_folios);
> >>>>>        unsigned int nr_reclaimed = 0;
> >>>>>        unsigned int pgactivate = 0;
> >>>>> -     bool do_demote_pass;
> >>>>> +     bool do_demote_pass, splited = false;
> >>>>>        struct swap_iocb *plug = NULL;
> >>>>>
> >>>>>        folio_batch_init(&free_folios);
> >>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>
> >>>>>                cond_resched();
> >>>>>
> >>>>> +             /*
> >>>>> +              * If a large folio has been split, many folios are added
> >>>>> +              * to folio_list. Looping through the entire list takes
> >>>>> +              * too much time, which may prevent folios that have completed
> >>>>> +              * writeback from rotateing to the tail of the lru. Just
> >>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>>>> +              */
> >>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>> +                     break;
> >>>
> >>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
> >>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>> if you’ve split a relatively small “large folio,” such as 16kB?
> >>>
> >>
> >> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> >> pages that have been pageout can be reclaimed, then enough pages can be
> >> reclaimed when all pages have finished writeback. Thus, it may not have
> >> to pageout more.
> >>
> >> If a small large folio(16 kB) has been split, it may return early
> >> without the entire pages in the folio_list being pageout, but I think
> >> that is fine. It can pageout more pages the next time it enters
> >> shrink_folio_list if there are not enough pages to reclaimed.
> >>
> >> However, if pages that have been pageout are still at the head of the
> >> LRU, it is difficult to scan these pages again. In this case, not only
> >> might it "waste" some swap memory but it also has to pageout more pages.
> >>
> >> Considering the above, I sent this patch. It may not be a perfect
> >> solution, but i think it's a good option to consider. And I am wondering
> >> if anyone has a better solution.
> >
> > Hi Ridong,
> > My overall understanding is that you have failed to describe your problem
> > particularly I don't understand what your 3 and 4 mean:
> >
> >> 3. When folio writeback io is completed, the folio may be rotated to tail
> >>    of lru. The following lru list is expected, with those filioes that have
> >>    been added to swap cache are rotated to tail of lru. So those folios
> >>  can be reclaimed as soon as possible.
> >>
> >>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >
> >  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> >  >    is splited, shrink_page_list loops at least 512 times, which means that
> >  >    shrink_page_list is not completed but some folios writeback have been
> >  >    completed, and this may lead to failure to rotate these folios to the
> >   >  tail of lru. The lru may look likes as below:
> >
> > can you please describe it in a readable approach?
> >
> > i feel your below diagram is somehow wrong:
> > folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >
> > You mentioned "rotate', how could "rotate" makes:
> > folioN512<->folioN511<->...filioN1 in (2)
> > become
> > filioN1<->...folioN511<->folioN512 in (3).
> >
>
> I am sorry for any confusion.
>
> If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
> to writeback one by one. it assumed that filioN1,
> filioN2,filioN3,...filioN512 are completed in order.
>
> Orignal:
> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>
> filioN1 is finished, filioN1 is rotated to the tail of LRU:
> folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
>
> filioN2 is finished:
> folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>
> filioN3 is finished:
> folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
>
> ...
>
> filioN512 is finished:
> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>
> When the filios are finished, the LRU might just like this:
> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

understood, thanks!

Let me try to understand the following part:

> 4:
>   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>   folioN51<->folioN52<->...folioN511<->folioN512

 >  Although those folios (N1-N50) have been finished writing back, they
 >  are still at the head of lru. When isolating folios from lru, it scans
 >  from tail to head, so it is difficult to scan those folios again.

What is the reason that "those folios (N1-N50) have finished writing back,
yet they remain at the head of the LRU"? Is it because their writeback_end
occurred while we were still looping in shrink_folio_list(), causing
folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
these folios, which are still in the "folio_list", to the tail of the LRU?

>
> > btw, writeback isn't always async. it could be sync for zram and sync_io
> > swap. in that case, your patch might change the order of LRU. i mean,
> > for example, while a mTHP becomes cold, we always reclaim all of them,
> > but not part of them and put back part of small folios to the head of lru.
> >
>
> Yes, This can be changed.
> Although it may put back part of small folios to the head of lru, it can
> return in time from shrink_folio_list without causing much additional I/O.
>
> If you have understood this issue, do you have any suggestions to fix
> it? My patch may not be a perfect way to fix this issue.
>

My point is that synchronous I/O, like zRAM, doesn't have this issue and
doesn't require this fix, as writeback is always completed without
asynchronous latency.


> Best regards,
> Ridong
>

Thanks
Barry
chenridong Oct. 21, 2024, 9:56 a.m. UTC | #6
On 2024/10/21 17:42, Barry Song wrote:
> On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>>
>>
>> On 2024/10/21 12:44, Barry Song wrote:
>>> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/11 0:17, Barry Song wrote:
>>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>> Hi Ridong,
>>>>>>
>>>>>> This should be the first version for upstream, and the issue only
>>>>>> occurred when large folio is spited.
>>>>>>
>>>>>> Adding more CCs to see if there's more feedback.
>>>>>>
>>>>>>
>>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>>>>
>>>>>>> An issue was found with the following testing step:
>>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>>>>>
>>>>>>> It was found that:
>>>>>>>
>>>>>>> cat memory.usage_in_bytes
>>>>>>> 2144940032
>>>>>>> cat memory.memsw.usage_in_bytes
>>>>>>> 2255056896
>>>>>>>
>>>>>>> free -h
>>>>>>>                 total        used        free
>>>>>>> Mem:           31Gi       2.1Gi        27Gi
>>>>>>> Swap:         1.0Gi       618Mi       405Mi
>>>>>>>
>>>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
>>>>>>> was used, which means that 500M may be wasted because other memcgs can not
>>>>>>> use these swap memory.
>>>>>>>
>>>>>>> It can be explained as follows:
>>>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
>>>>>>>      tail to head. If it just takes folioN from lru(make it simple).
>>>>>>>
>>>>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>>>      isolated list: folioN
>>>>>>>
>>>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>>>>>>>      added to swap cache folio by folio. After adding to swap cache, it will
>>>>>>>      submit io to writeback folio to swap, which is asynchronous.
>>>>>>>      When shrink_page_list is finished, the isolated folios list will be
>>>>>>>      moved back to the head of inactive lru. The inactive lru may just look
>>>>>>>      like this, with 512 filioes have been move to the head of inactive lru.
>>>>>>>
>>>>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>>>>>
>>>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>>>>      of lru. The following lru list is expected, with those filioes that have
>>>>>>>      been added to swap cache are rotated to tail of lru. So those folios
>>>>>>>      can be reclaimed as soon as possible.
>>>>>>>
>>>>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>>>
>>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>>>>>      is splited, shrink_page_list loops at least 512 times, which means that
>>>>>>>      shrink_page_list is not completed but some folios writeback have been
>>>>>>>      completed, and this may lead to failure to rotate these folios to the
>>>>>>>      tail of lru. The lru may look likes as below:
>>>>>
>>>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
>>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>>>
>>>>>>>
>>>>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>>>>>      folioN51<->folioN52<->...folioN511<->folioN512
>>>>>>>
>>>>>>>      Although those folios (N1-N50) have been finished writing back, they
>>>>>>>      are still at the head of lru. When isolating folios from lru, it scans
>>>>>>>      from tail to head, so it is difficult to scan those folios again.
>>>>>>>
>>>>>>> What mentioned above may lead to a large number of folios have been added
>>>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
>>>>>>> efficiency and prevent other memcgs from using this swap memory even if
>>>>>>> they trigger OOM.
>>>>>>>
>>>>>>> To fix this issue, it's better to stop looping if THP has been splited and
>>>>>>> nr_pageout is greater than nr_to_reclaim.
>>>>>>>
>>>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>>>> ---
>>>>>>>    mm/vmscan.c | 16 +++++++++++++++-
>>>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>        LIST_HEAD(demote_folios);
>>>>>>>        unsigned int nr_reclaimed = 0;
>>>>>>>        unsigned int pgactivate = 0;
>>>>>>> -     bool do_demote_pass;
>>>>>>> +     bool do_demote_pass, splited = false;
>>>>>>>        struct swap_iocb *plug = NULL;
>>>>>>>
>>>>>>>        folio_batch_init(&free_folios);
>>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>
>>>>>>>                cond_resched();
>>>>>>>
>>>>>>> +             /*
>>>>>>> +              * If a large folio has been split, many folios are added
>>>>>>> +              * to folio_list. Looping through the entire list takes
>>>>>>> +              * too much time, which may prevent folios that have completed
>>>>>>> +              * writeback from rotateing to the tail of the lru. Just
>>>>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>>>> +              */
>>>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>>>> +                     break;
>>>>>
>>>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
>>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>>>> if you’ve split a relatively small “large folio,” such as 16kB?
>>>>>
>>>>
>>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
>>>> pages that have been pageout can be reclaimed, then enough pages can be
>>>> reclaimed when all pages have finished writeback. Thus, it may not have
>>>> to pageout more.
>>>>
>>>> If a small large folio(16 kB) has been split, it may return early
>>>> without the entire pages in the folio_list being pageout, but I think
>>>> that is fine. It can pageout more pages the next time it enters
>>>> shrink_folio_list if there are not enough pages to reclaimed.
>>>>
>>>> However, if pages that have been pageout are still at the head of the
>>>> LRU, it is difficult to scan these pages again. In this case, not only
>>>> might it "waste" some swap memory but it also has to pageout more pages.
>>>>
>>>> Considering the above, I sent this patch. It may not be a perfect
>>>> solution, but i think it's a good option to consider. And I am wondering
>>>> if anyone has a better solution.
>>>
>>> Hi Ridong,
>>> My overall understanding is that you have failed to describe your problem
>>> particularly I don't understand what your 3 and 4 mean:
>>>
>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>    of lru. The following lru list is expected, with those filioes that have
>>>>    been added to swap cache are rotated to tail of lru. So those folios
>>>>  can be reclaimed as soon as possible.
>>>>
>>>>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>
>>>  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>  >    is splited, shrink_page_list loops at least 512 times, which means that
>>>  >    shrink_page_list is not completed but some folios writeback have been
>>>  >    completed, and this may lead to failure to rotate these folios to the
>>>   >  tail of lru. The lru may look likes as below:
>>>
>>> can you please describe it in a readable approach?
>>>
>>> i feel your below diagram is somehow wrong:
>>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>
>>> You mentioned "rotate', how could "rotate" makes:
>>> folioN512<->folioN511<->...filioN1 in (2)
>>> become
>>> filioN1<->...folioN511<->folioN512 in (3).
>>>
>>
>> I am sorry for any confusion.
>>
>> If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
>> to writeback one by one. it assumed that filioN1,
>> filioN2,filioN3,...filioN512 are completed in order.
>>
>> Orignal:
>> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>
>> filioN1 is finished, filioN1 is rotated to the tail of LRU:
>> folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
>>
>> filioN2 is finished:
>> folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>>
>> filioN3 is finished:
>> folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
>>
>> ...
>>
>> filioN512 is finished:
>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>
>> When the filios are finished, the LRU might just like this:
>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> 
> understood, thanks!
> 
> Let me try to understand the following part:
> 
>> 4:
>>   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>   folioN51<->folioN52<->...folioN511<->folioN512
> 
>  >  Although those folios (N1-N50) have been finished writing back, they
>  >  are still at the head of lru. When isolating folios from lru, it scans
>  >  from tail to head, so it is difficult to scan those folios again.
> 
> What is the reason that "those folios (N1-N50) have finished writing back,
> yet they remain at the head of the LRU"? Is it because their writeback_end
> occurred while we were still looping in shrink_folio_list(), causing
> folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
> these folios, which are still in the "folio_list", to the tail of the LRU?
> 

Yes, you are right.

>>
>>> btw, writeback isn't always async. it could be sync for zram and sync_io
>>> swap. in that case, your patch might change the order of LRU. i mean,
>>> for example, while a mTHP becomes cold, we always reclaim all of them,
>>> but not part of them and put back part of small folios to the head of lru.
>>>
>>
>> Yes, This can be changed.
>> Although it may put back part of small folios to the head of lru, it can
>> return in time from shrink_folio_list without causing much additional I/O.
>>
>> If you have understood this issue, do you have any suggestions to fix
>> it? My patch may not be a perfect way to fix this issue.
>>
> 
> My point is that synchronous I/O, like zRAM, doesn't have this issue and
> doesn't require this fix, as writeback is always completed without
> asynchronous latency.
> 

I have tested zRAM and found no issues.
This is version 1, and I don't know whether this fix will be accepted.
If it is accepted, perhaps this patch could be modified to apply only to
asynchronous io.

Best regards,
Ridong

> 
>> Best regards,
>> Ridong
>>
> 
> Thanks
> Barry
Barry Song Oct. 21, 2024, 10:09 a.m. UTC | #7
On Mon, Oct 21, 2024 at 10:56 PM chenridong <chenridong@huawei.com> wrote:
>
>
>
> On 2024/10/21 17:42, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/21 12:44, Barry Song wrote:
> >>> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/10/11 0:17, Barry Song wrote:
> >>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>
> >>>>>> Hi Ridong,
> >>>>>>
> >>>>>> This should be the first version for upstream, and the issue only
> >>>>>> occurred when large folio is spited.
> >>>>>>
> >>>>>> Adding more CCs to see if there's more feedback.
> >>>>>>
> >>>>>>
> >>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>>>> From: Chen Ridong <chenridong@huawei.com>
> >>>>>>>
> >>>>>>> An issue was found with the following testing step:
> >>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> >>>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
> >>>>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> >>>>>>>
> >>>>>>> It was found that:
> >>>>>>>
> >>>>>>> cat memory.usage_in_bytes
> >>>>>>> 2144940032
> >>>>>>> cat memory.memsw.usage_in_bytes
> >>>>>>> 2255056896
> >>>>>>>
> >>>>>>> free -h
> >>>>>>>                 total        used        free
> >>>>>>> Mem:           31Gi       2.1Gi        27Gi
> >>>>>>> Swap:         1.0Gi       618Mi       405Mi
> >>>>>>>
> >>>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
> >>>>>>> was used, which means that 500M may be wasted because other memcgs can not
> >>>>>>> use these swap memory.
> >>>>>>>
> >>>>>>> It can be explained as follows:
> >>>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
> >>>>>>>      tail to head. If it just takes folioN from lru(make it simple).
> >>>>>>>
> >>>>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>>>>      isolated list: folioN
> >>>>>>>
> >>>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
> >>>>>>>      added to swap cache folio by folio. After adding to swap cache, it will
> >>>>>>>      submit io to writeback folio to swap, which is asynchronous.
> >>>>>>>      When shrink_page_list is finished, the isolated folios list will be
> >>>>>>>      moved back to the head of inactive lru. The inactive lru may just look
> >>>>>>>      like this, with 512 filioes have been move to the head of inactive lru.
> >>>>>>>
> >>>>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> >>>>>>>
> >>>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
> >>>>>>>      of lru. The following lru list is expected, with those filioes that have
> >>>>>>>      been added to swap cache are rotated to tail of lru. So those folios
> >>>>>>>      can be reclaimed as soon as possible.
> >>>>>>>
> >>>>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>>>>>>
> >>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> >>>>>>>      is splited, shrink_page_list loops at least 512 times, which means that
> >>>>>>>      shrink_page_list is not completed but some folios writeback have been
> >>>>>>>      completed, and this may lead to failure to rotate these folios to the
> >>>>>>>      tail of lru. The lru may look likes as below:
> >>>>>
> >>>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
> >>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>>>
> >>>>>>>
> >>>>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> >>>>>>>      folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>>>
> >>>>>>>      Although those folios (N1-N50) have been finished writing back, they
> >>>>>>>      are still at the head of lru. When isolating folios from lru, it scans
> >>>>>>>      from tail to head, so it is difficult to scan those folios again.
> >>>>>>>
> >>>>>>> What mentioned above may lead to a large number of folios have been added
> >>>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
> >>>>>>> efficiency and prevent other memcgs from using this swap memory even if
> >>>>>>> they trigger OOM.
> >>>>>>>
> >>>>>>> To fix this issue, it's better to stop looping if THP has been splited and
> >>>>>>> nr_pageout is greater than nr_to_reclaim.
> >>>>>>>
> >>>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> >>>>>>> ---
> >>>>>>>    mm/vmscan.c | 16 +++++++++++++++-
> >>>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>>>> --- a/mm/vmscan.c
> >>>>>>> +++ b/mm/vmscan.c
> >>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>>>        LIST_HEAD(demote_folios);
> >>>>>>>        unsigned int nr_reclaimed = 0;
> >>>>>>>        unsigned int pgactivate = 0;
> >>>>>>> -     bool do_demote_pass;
> >>>>>>> +     bool do_demote_pass, splited = false;
> >>>>>>>        struct swap_iocb *plug = NULL;
> >>>>>>>
> >>>>>>>        folio_batch_init(&free_folios);
> >>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>>>
> >>>>>>>                cond_resched();
> >>>>>>>
> >>>>>>> +             /*
> >>>>>>> +              * If a large folio has been split, many folios are added
> >>>>>>> +              * to folio_list. Looping through the entire list takes
> >>>>>>> +              * too much time, which may prevent folios that have completed
> >>>>>>> +              * writeback from rotateing to the tail of the lru. Just
> >>>>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>>>>>> +              */
> >>>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>>>> +                     break;
> >>>>>
> >>>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
> >>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>>>> if you’ve split a relatively small “large folio,” such as 16kB?
> >>>>>
> >>>>
> >>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> >>>> pages that have been pageout can be reclaimed, then enough pages can be
> >>>> reclaimed when all pages have finished writeback. Thus, it may not have
> >>>> to pageout more.
> >>>>
> >>>> If a small large folio(16 kB) has been split, it may return early
> >>>> without the entire pages in the folio_list being pageout, but I think
> >>>> that is fine. It can pageout more pages the next time it enters
> >>>> shrink_folio_list if there are not enough pages to reclaimed.
> >>>>
> >>>> However, if pages that have been pageout are still at the head of the
> >>>> LRU, it is difficult to scan these pages again. In this case, not only
> >>>> might it "waste" some swap memory but it also has to pageout more pages.
> >>>>
> >>>> Considering the above, I sent this patch. It may not be a perfect
> >>>> solution, but i think it's a good option to consider. And I am wondering
> >>>> if anyone has a better solution.
> >>>
> >>> Hi Ridong,
> >>> My overall understanding is that you have failed to describe your problem
> >>> particularly I don't understand what your 3 and 4 mean:
> >>>
> >>>> 3. When folio writeback io is completed, the folio may be rotated to tail
> >>>>    of lru. The following lru list is expected, with those filioes that have
> >>>>    been added to swap cache are rotated to tail of lru. So those folios
> >>>>  can be reclaimed as soon as possible.
> >>>>
> >>>>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>>
> >>>  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> >>>  >    is splited, shrink_page_list loops at least 512 times, which means that
> >>>  >    shrink_page_list is not completed but some folios writeback have been
> >>>  >    completed, and this may lead to failure to rotate these folios to the
> >>>   >  tail of lru. The lru may look likes as below:
> >>>
> >>> can you please describe it in a readable approach?
> >>>
> >>> i feel your below diagram is somehow wrong:
> >>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>>
> >>> You mentioned "rotate', how could "rotate" makes:
> >>> folioN512<->folioN511<->...filioN1 in (2)
> >>> become
> >>> filioN1<->...folioN511<->folioN512 in (3).
> >>>
> >>
> >> I am sorry for any confusion.
> >>
> >> If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
> >> to writeback one by one. it assumed that filioN1,
> >> filioN2,filioN3,...filioN512 are completed in order.
> >>
> >> Orignal:
> >> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> >>
> >> filioN1 is finished, filioN1 is rotated to the tail of LRU:
> >> folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
> >>
> >> filioN2 is finished:
> >> folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
> >>
> >> filioN3 is finished:
> >> folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
> >>
> >> ...
> >>
> >> filioN512 is finished:
> >> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>
> >> When the filios are finished, the LRU might just like this:
> >> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >
> > understood, thanks!
> >
> > Let me try to understand the following part:
> >
> >> 4:
> >>   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> >>   folioN51<->folioN52<->...folioN511<->folioN512
> >
> >  >  Although those folios (N1-N50) have been finished writing back, they
> >  >  are still at the head of lru. When isolating folios from lru, it scans
> >  >  from tail to head, so it is difficult to scan those folios again.
> >
> > What is the reason that "those folios (N1-N50) have finished writing back,
> > yet they remain at the head of the LRU"? Is it because their writeback_end
> > occurred while we were still looping in shrink_folio_list(), causing
> > folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
> > these folios, which are still in the "folio_list", to the tail of the LRU?
> >
>
> Yes, you are right.
>
> >>
> >>> btw, writeback isn't always async. it could be sync for zram and sync_io
> >>> swap. in that case, your patch might change the order of LRU. i mean,
> >>> for example, while a mTHP becomes cold, we always reclaim all of them,
> >>> but not part of them and put back part of small folios to the head of lru.
> >>>
> >>
> >> Yes, This can be changed.
> >> Although it may put back part of small folios to the head of lru, it can
> >> return in time from shrink_folio_list without causing much additional I/O.
> >>
> >> If you have understood this issue, do you have any suggestions to fix
> >> it? My patch may not be a perfect way to fix this issue.
> >>
> >
> > My point is that synchronous I/O, like zRAM, doesn't have this issue and
> > doesn't require this fix, as writeback is always completed without
> > asynchronous latency.
> >
>
> I have tested zRAM and found no issues.
> This is version 1, and I don't know whether this fix will be accepted.
> If it is accepted, perhaps this patch could be modified to apply only to
> asynchronous io.

Consider a 2MB THP: when it becomes cold, we detect that it is cold and
decide to page it out. Even if we split it into 512 * 4KiB folios, the entire
2MB is still cold, so we want pageout() to be called for the entire 2MB.
With your current approach, some parts of the 2MB are moved to the
LRU head while we're still paging out other parts, which seems
problematic.

Could we address this in move_folios_to_lru()? Perhaps we could find
a way to detect folios whose writeback has done and move them to the
tail instead of always placing them at the head.

>
> Best regards,
> Ridong
>
> >
> >> Best regards,
> >> Ridong
> >>

Thanks
Barry
Barry Song Oct. 21, 2024, 10:45 a.m. UTC | #8
On Mon, Oct 21, 2024 at 11:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Oct 21, 2024 at 10:56 PM chenridong <chenridong@huawei.com> wrote:
> >
> >
> >
> > On 2024/10/21 17:42, Barry Song wrote:
> > > On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2024/10/21 12:44, Barry Song wrote:
> > >>> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2024/10/11 0:17, Barry Song wrote:
> > >>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> > >>>>>>
> > >>>>>> Hi Ridong,
> > >>>>>>
> > >>>>>> This should be the first version for upstream, and the issue only
> > >>>>>> occurred when large folio is spited.
> > >>>>>>
> > >>>>>> Adding more CCs to see if there's more feedback.
> > >>>>>>
> > >>>>>>
> > >>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
> > >>>>>>> From: Chen Ridong <chenridong@huawei.com>
> > >>>>>>>
> > >>>>>>> An issue was found with the following testing step:
> > >>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> > >>>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
> > >>>>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> > >>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> > >>>>>>>
> > >>>>>>> It was found that:
> > >>>>>>>
> > >>>>>>> cat memory.usage_in_bytes
> > >>>>>>> 2144940032
> > >>>>>>> cat memory.memsw.usage_in_bytes
> > >>>>>>> 2255056896
> > >>>>>>>
> > >>>>>>> free -h
> > >>>>>>>                 total        used        free
> > >>>>>>> Mem:           31Gi       2.1Gi        27Gi
> > >>>>>>> Swap:         1.0Gi       618Mi       405Mi
> > >>>>>>>
> > >>>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
> > >>>>>>> was used, which means that 500M may be wasted because other memcgs can not
> > >>>>>>> use these swap memory.
> > >>>>>>>
> > >>>>>>> It can be explained as follows:
> > >>>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
> > >>>>>>>      tail to head. If it just takes folioN from lru(make it simple).
> > >>>>>>>
> > >>>>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
> > >>>>>>>      isolated list: folioN
> > >>>>>>>
> > >>>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
> > >>>>>>>      added to swap cache folio by folio. After adding to swap cache, it will
> > >>>>>>>      submit io to writeback folio to swap, which is asynchronous.
> > >>>>>>>      When shrink_page_list is finished, the isolated folios list will be
> > >>>>>>>      moved back to the head of inactive lru. The inactive lru may just look
> > >>>>>>>      like this, with 512 filioes have been move to the head of inactive lru.
> > >>>>>>>
> > >>>>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> > >>>>>>>
> > >>>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
> > >>>>>>>      of lru. The following lru list is expected, with those filioes that have
> > >>>>>>>      been added to swap cache are rotated to tail of lru. So those folios
> > >>>>>>>      can be reclaimed as soon as possible.
> > >>>>>>>
> > >>>>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> > >>>>>>>
> > >>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> > >>>>>>>      is splited, shrink_page_list loops at least 512 times, which means that
> > >>>>>>>      shrink_page_list is not completed but some folios writeback have been
> > >>>>>>>      completed, and this may lead to failure to rotate these folios to the
> > >>>>>>>      tail of lru. The lru may look likes as below:
> > >>>>>
> > >>>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
> > >>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> > >>>>>
> > >>>>>>>
> > >>>>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> > >>>>>>>      folioN51<->folioN52<->...folioN511<->folioN512
> > >>>>>>>
> > >>>>>>>      Although those folios (N1-N50) have been finished writing back, they
> > >>>>>>>      are still at the head of lru. When isolating folios from lru, it scans
> > >>>>>>>      from tail to head, so it is difficult to scan those folios again.
> > >>>>>>>
> > >>>>>>> What mentioned above may lead to a large number of folios have been added
> > >>>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
> > >>>>>>> efficiency and prevent other memcgs from using this swap memory even if
> > >>>>>>> they trigger OOM.
> > >>>>>>>
> > >>>>>>> To fix this issue, it's better to stop looping if THP has been splited and
> > >>>>>>> nr_pageout is greater than nr_to_reclaim.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> > >>>>>>> ---
> > >>>>>>>    mm/vmscan.c | 16 +++++++++++++++-
> > >>>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
> > >>>>>>>
> > >>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>>>>>> index 749cdc110c74..fd8ad251eda2 100644
> > >>>>>>> --- a/mm/vmscan.c
> > >>>>>>> +++ b/mm/vmscan.c
> > >>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > >>>>>>>        LIST_HEAD(demote_folios);
> > >>>>>>>        unsigned int nr_reclaimed = 0;
> > >>>>>>>        unsigned int pgactivate = 0;
> > >>>>>>> -     bool do_demote_pass;
> > >>>>>>> +     bool do_demote_pass, splited = false;
> > >>>>>>>        struct swap_iocb *plug = NULL;
> > >>>>>>>
> > >>>>>>>        folio_batch_init(&free_folios);
> > >>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > >>>>>>>
> > >>>>>>>                cond_resched();
> > >>>>>>>
> > >>>>>>> +             /*
> > >>>>>>> +              * If a large folio has been split, many folios are added
> > >>>>>>> +              * to folio_list. Looping through the entire list takes
> > >>>>>>> +              * too much time, which may prevent folios that have completed
> > >>>>>>> +              * writeback from rotateing to the tail of the lru. Just
> > >>>>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
> > >>>>>>> +              */
> > >>>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> > >>>>>>> +                     break;
> > >>>>>
> > >>>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
> > >>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
> > >>>>> if you’ve split a relatively small “large folio,” such as 16kB?
> > >>>>>
> > >>>>
> > >>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> > >>>> pages that have been pageout can be reclaimed, then enough pages can be
> > >>>> reclaimed when all pages have finished writeback. Thus, it may not have
> > >>>> to pageout more.
> > >>>>
> > >>>> If a small large folio(16 kB) has been split, it may return early
> > >>>> without the entire pages in the folio_list being pageout, but I think
> > >>>> that is fine. It can pageout more pages the next time it enters
> > >>>> shrink_folio_list if there are not enough pages to reclaimed.
> > >>>>
> > >>>> However, if pages that have been pageout are still at the head of the
> > >>>> LRU, it is difficult to scan these pages again. In this case, not only
> > >>>> might it "waste" some swap memory but it also has to pageout more pages.
> > >>>>
> > >>>> Considering the above, I sent this patch. It may not be a perfect
> > >>>> solution, but i think it's a good option to consider. And I am wondering
> > >>>> if anyone has a better solution.
> > >>>
> > >>> Hi Ridong,
> > >>> My overall understanding is that you have failed to describe your problem
> > >>> particularly I don't understand what your 3 and 4 mean:
> > >>>
> > >>>> 3. When folio writeback io is completed, the folio may be rotated to tail
> > >>>>    of lru. The following lru list is expected, with those filioes that have
> > >>>>    been added to swap cache are rotated to tail of lru. So those folios
> > >>>>  can be reclaimed as soon as possible.
> > >>>>
> > >>>>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> > >>>
> > >>>  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> > >>>  >    is splited, shrink_page_list loops at least 512 times, which means that
> > >>>  >    shrink_page_list is not completed but some folios writeback have been
> > >>>  >    completed, and this may lead to failure to rotate these folios to the
> > >>>   >  tail of lru. The lru may look likes as below:
> > >>>
> > >>> can you please describe it in a readable approach?
> > >>>
> > >>> i feel your below diagram is somehow wrong:
> > >>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> > >>>
> > >>> You mentioned "rotate', how could "rotate" makes:
> > >>> folioN512<->folioN511<->...filioN1 in (2)
> > >>> become
> > >>> filioN1<->...folioN511<->folioN512 in (3).
> > >>>
> > >>
> > >> I am sorry for any confusion.
> > >>
> > >> If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
> > >> to writeback one by one. it assumed that filioN1,
> > >> filioN2,filioN3,...filioN512 are completed in order.
> > >>
> > >> Orignal:
> > >> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> > >>
> > >> filioN1 is finished, filioN1 is rotated to the tail of LRU:
> > >> folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
> > >>
> > >> filioN2 is finished:
> > >> folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
> > >>
> > >> filioN3 is finished:
> > >> folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
> > >>
> > >> ...
> > >>
> > >> filioN512 is finished:
> > >> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> > >>
> > >> When the filios are finished, the LRU might just like this:
> > >> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> > >
> > > understood, thanks!
> > >
> > > Let me try to understand the following part:
> > >
> > >> 4:
> > >>   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> > >>   folioN51<->folioN52<->...folioN511<->folioN512
> > >
> > >  >  Although those folios (N1-N50) have been finished writing back, they
> > >  >  are still at the head of lru. When isolating folios from lru, it scans
> > >  >  from tail to head, so it is difficult to scan those folios again.
> > >
> > > What is the reason that "those folios (N1-N50) have finished writing back,
> > > yet they remain at the head of the LRU"? Is it because their writeback_end
> > > occurred while we were still looping in shrink_folio_list(), causing
> > > folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
> > > these folios, which are still in the "folio_list", to the tail of the LRU?
> > >
> >
> > Yes, you are right.
> >
> > >>
> > >>> btw, writeback isn't always async. it could be sync for zram and sync_io
> > >>> swap. in that case, your patch might change the order of LRU. i mean,
> > >>> for example, while a mTHP becomes cold, we always reclaim all of them,
> > >>> but not part of them and put back part of small folios to the head of lru.
> > >>>
> > >>
> > >> Yes, This can be changed.
> > >> Although it may put back part of small folios to the head of lru, it can
> > >> return in time from shrink_folio_list without causing much additional I/O.
> > >>
> > >> If you have understood this issue, do you have any suggestions to fix
> > >> it? My patch may not be a perfect way to fix this issue.
> > >>
> > >
> > > My point is that synchronous I/O, like zRAM, doesn't have this issue and
> > > doesn't require this fix, as writeback is always completed without
> > > asynchronous latency.
> > >
> >
> > I have tested zRAM and found no issues.
> > This is version 1, and I don't know whether this fix will be accepted.

Hi Ridong,

Let me clarify further. I don't believe this is the right fix for either
asynchronous or synchronous I/O, as the coldness of the folios
is being overlooked in both cases. It could lead to more refaults
because cold memory isn't having pageout() executed with the
earlier break you're trying to add, even though it works around
the problem you're facing.

We should ensure that pageout() is always called for all subpages,
regardless of whether an mTHP is split. However, we can adjust
how the folio_list is returned to the LRU.

> > If it is accepted, perhaps this patch could be modified to apply only to
> > asynchronous io.
>
> Consider a 2MB THP: when it becomes cold, we detect that it is cold and
> decide to page it out. Even if we split it into 512 * 4KiB folios, the entire
> 2MB is still cold, so we want pageout() to be called for the entire 2MB.
> With your current approach, some parts of the 2MB are moved to the
> LRU head while we're still paging out other parts, which seems
> problematic.
>
> Could we address this in move_folios_to_lru()? Perhaps we could find
> a way to detect folios whose writeback has done and move them to the
> tail instead of always placing them at the head.
>
> >
> > Best regards,
> > Ridong

Thanks
Barry
chenridong Oct. 21, 2024, 12:15 p.m. UTC | #9
On 2024/10/21 18:09, Barry Song wrote:
> On Mon, Oct 21, 2024 at 10:56 PM chenridong <chenridong@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/21 17:42, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/21 12:44, Barry Song wrote:
>>>>> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/10/11 0:17, Barry Song wrote:
>>>>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>
>>>>>>>> Hi Ridong,
>>>>>>>>
>>>>>>>> This should be the first version for upstream, and the issue only
>>>>>>>> occurred when large folio is spited.
>>>>>>>>
>>>>>>>> Adding more CCs to see if there's more feedback.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>>>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>>>>>>
>>>>>>>>> An issue was found with the following testing step:
>>>>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>>>>>>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>>>>>>>
>>>>>>>>> It was found that:
>>>>>>>>>
>>>>>>>>> cat memory.usage_in_bytes
>>>>>>>>> 2144940032
>>>>>>>>> cat memory.memsw.usage_in_bytes
>>>>>>>>> 2255056896
>>>>>>>>>
>>>>>>>>> free -h
>>>>>>>>>                 total        used        free
>>>>>>>>> Mem:           31Gi       2.1Gi        27Gi
>>>>>>>>> Swap:         1.0Gi       618Mi       405Mi
>>>>>>>>>
>>>>>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
>>>>>>>>> was used, which means that 500M may be wasted because other memcgs can not
>>>>>>>>> use these swap memory.
>>>>>>>>>
>>>>>>>>> It can be explained as follows:
>>>>>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
>>>>>>>>>      tail to head. If it just takes folioN from lru(make it simple).
>>>>>>>>>
>>>>>>>>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>>>>>      isolated list: folioN
>>>>>>>>>
>>>>>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>>>>>>>>>      added to swap cache folio by folio. After adding to swap cache, it will
>>>>>>>>>      submit io to writeback folio to swap, which is asynchronous.
>>>>>>>>>      When shrink_page_list is finished, the isolated folios list will be
>>>>>>>>>      moved back to the head of inactive lru. The inactive lru may just look
>>>>>>>>>      like this, with 512 filioes have been move to the head of inactive lru.
>>>>>>>>>
>>>>>>>>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>>>>>>>
>>>>>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>>>>>>      of lru. The following lru list is expected, with those filioes that have
>>>>>>>>>      been added to swap cache are rotated to tail of lru. So those folios
>>>>>>>>>      can be reclaimed as soon as possible.
>>>>>>>>>
>>>>>>>>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>>>>>>>      is splited, shrink_page_list loops at least 512 times, which means that
>>>>>>>>>      shrink_page_list is not completed but some folios writeback have been
>>>>>>>>>      completed, and this may lead to failure to rotate these folios to the
>>>>>>>>>      tail of lru. The lru may look likes as below:
>>>>>>>
>>>>>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
>>>>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>>>>>
>>>>>>>>>
>>>>>>>>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>>>>>>>      folioN51<->folioN52<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>>      Although those folios (N1-N50) have been finished writing back, they
>>>>>>>>>      are still at the head of lru. When isolating folios from lru, it scans
>>>>>>>>>      from tail to head, so it is difficult to scan those folios again.
>>>>>>>>>
>>>>>>>>> What mentioned above may lead to a large number of folios have been added
>>>>>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
>>>>>>>>> efficiency and prevent other memcgs from using this swap memory even if
>>>>>>>>> they trigger OOM.
>>>>>>>>>
>>>>>>>>> To fix this issue, it's better to stop looping if THP has been splited and
>>>>>>>>> nr_pageout is greater than nr_to_reclaim.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>>>>>> ---
>>>>>>>>>    mm/vmscan.c | 16 +++++++++++++++-
>>>>>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>        LIST_HEAD(demote_folios);
>>>>>>>>>        unsigned int nr_reclaimed = 0;
>>>>>>>>>        unsigned int pgactivate = 0;
>>>>>>>>> -     bool do_demote_pass;
>>>>>>>>> +     bool do_demote_pass, splited = false;
>>>>>>>>>        struct swap_iocb *plug = NULL;
>>>>>>>>>
>>>>>>>>>        folio_batch_init(&free_folios);
>>>>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>
>>>>>>>>>                cond_resched();
>>>>>>>>>
>>>>>>>>> +             /*
>>>>>>>>> +              * If a large folio has been split, many folios are added
>>>>>>>>> +              * to folio_list. Looping through the entire list takes
>>>>>>>>> +              * too much time, which may prevent folios that have completed
>>>>>>>>> +              * writeback from rotateing to the tail of the lru. Just
>>>>>>>>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>>>>>> +              */
>>>>>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>>>>>> +                     break;
>>>>>>>
>>>>>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
>>>>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>>>>>> if you’ve split a relatively small “large folio,” such as 16kB?
>>>>>>>
>>>>>>
>>>>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
>>>>>> pages that have been pageout can be reclaimed, then enough pages can be
>>>>>> reclaimed when all pages have finished writeback. Thus, it may not have
>>>>>> to pageout more.
>>>>>>
>>>>>> If a small large folio(16 kB) has been split, it may return early
>>>>>> without the entire pages in the folio_list being pageout, but I think
>>>>>> that is fine. It can pageout more pages the next time it enters
>>>>>> shrink_folio_list if there are not enough pages to reclaimed.
>>>>>>
>>>>>> However, if pages that have been pageout are still at the head of the
>>>>>> LRU, it is difficult to scan these pages again. In this case, not only
>>>>>> might it "waste" some swap memory but it also has to pageout more pages.
>>>>>>
>>>>>> Considering the above, I sent this patch. It may not be a perfect
>>>>>> solution, but i think it's a good option to consider. And I am wondering
>>>>>> if anyone has a better solution.
>>>>>
>>>>> Hi Ridong,
>>>>> My overall understanding is that you have failed to describe your problem
>>>>> particularly I don't understand what your 3 and 4 mean:
>>>>>
>>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>>>    of lru. The following lru list is expected, with those filioes that have
>>>>>>    been added to swap cache are rotated to tail of lru. So those folios
>>>>>>  can be reclaimed as soon as possible.
>>>>>>
>>>>>>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>
>>>>>  > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>>>  >    is splited, shrink_page_list loops at least 512 times, which means that
>>>>>  >    shrink_page_list is not completed but some folios writeback have been
>>>>>  >    completed, and this may lead to failure to rotate these folios to the
>>>>>   >  tail of lru. The lru may look likes as below:
>>>>>
>>>>> can you please describe it in a readable approach?
>>>>>
>>>>> i feel your below diagram is somehow wrong:
>>>>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>
>>>>> You mentioned "rotate', how could "rotate" makes:
>>>>> folioN512<->folioN511<->...filioN1 in (2)
>>>>> become
>>>>> filioN1<->...folioN511<->folioN512 in (3).
>>>>>
>>>>
>>>> I am sorry for any confusion.
>>>>
>>>> If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
>>>> to writeback one by one. it assumed that filioN1,
>>>> filioN2,filioN3,...filioN512 are completed in order.
>>>>
>>>> Orignal:
>>>> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>>
>>>> filioN1 is finished, filioN1 is rotated to the tail of LRU:
>>>> folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
>>>>
>>>> filioN2 is finished:
>>>> folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>>>>
>>>> filioN3 is finished:
>>>> folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
>>>>
>>>> ...
>>>>
>>>> filioN512 is finished:
>>>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>
>>>> When the filios are finished, the LRU might just like this:
>>>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>
>>> understood, thanks!
>>>
>>> Let me try to understand the following part:
>>>
>>>> 4:
>>>>   folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>>   folioN51<->folioN52<->...folioN511<->folioN512
>>>
>>>  >  Although those folios (N1-N50) have been finished writing back, they
>>>  >  are still at the head of lru. When isolating folios from lru, it scans
>>>  >  from tail to head, so it is difficult to scan those folios again.
>>>
>>> What is the reason that "those folios (N1-N50) have finished writing back,
>>> yet they remain at the head of the LRU"? Is it because their writeback_end
>>> occurred while we were still looping in shrink_folio_list(), causing
>>> folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
>>> these folios, which are still in the "folio_list", to the tail of the LRU?
>>>
>>
>> Yes, you are right.
>>
>>>>
>>>>> btw, writeback isn't always async. it could be sync for zram and sync_io
>>>>> swap. in that case, your patch might change the order of LRU. i mean,
>>>>> for example, while a mTHP becomes cold, we always reclaim all of them,
>>>>> but not part of them and put back part of small folios to the head of lru.
>>>>>
>>>>
>>>> Yes, This can be changed.
>>>> Although it may put back part of small folios to the head of lru, it can
>>>> return in time from shrink_folio_list without causing much additional I/O.
>>>>
>>>> If you have understood this issue, do you have any suggestions to fix
>>>> it? My patch may not be a perfect way to fix this issue.
>>>>
>>>
>>> My point is that synchronous I/O, like zRAM, doesn't have this issue and
>>> doesn't require this fix, as writeback is always completed without
>>> asynchronous latency.
>>>
>>
>> I have tested zRAM and found no issues.
>> This is version 1, and I don't know whether this fix will be accepted.
>> If it is accepted, perhaps this patch could be modified to apply only to
>> asynchronous io.
> 
> Consider a 2MB THP: when it becomes cold, we detect that it is cold and
> decide to page it out. Even if we split it into 512 * 4KiB folios, the entire
> 2MB is still cold, so we want pageout() to be called for the entire 2MB.
> With your current approach, some parts of the 2MB are moved to the
> LRU head while we're still paging out other parts, which seems
> problematic.
> 
> Could we address this in move_folios_to_lru()? Perhaps we could find
> a way to detect folios whose writeback has done and move them to the
> tail instead of always placing them at the head.
> 

I'll try to address this issue with this idea.
Thanks you.

Best regards,
Ridong

>>
>> Best regards,
>> Ridong
>>
>>>
>>>> Best regards,
>>>> Ridong
>>>>
> 
> Thanks
> Barry
解 咏梅 Nov. 1, 2024, 8:49 a.m. UTC | #10
Hi Barry & Ridong

Small change for your solution.  Below is my navie idea :P

func shrink_page_list()
{
    loop for iterate the pages on page_list {
         if (page is compound_page && page size > 32 * 4k)
               try_lock(page);
               call split_folio_to_list
               return; // let move_folios_to_lru to push those splited pages back to the head of lru. and wait for the subsequent scan to handle those splited pages one by one
    }
}

BTW: page_unlock is called after folio_start_writeback. if swap backend is based on memory, it's synced I/O and __end_swap_bio_write will be called before the next try_lock(page) after pageout.
That means, there's no racing between folio_end_writeback and shrink_page_list for synced swapout.


Best Regards,
Yongmei.
diff mbox series

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..fd8ad251eda2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
-	bool do_demote_pass;
+	bool do_demote_pass, splited = false;
 	struct swap_iocb *plug = NULL;
 
 	folio_batch_init(&free_folios);
@@ -1065,6 +1065,16 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 		cond_resched();
 
+		/*
+		 * If a large folio has been split, many folios are added
+		 * to folio_list. Looping through the entire list takes
+		 * too much time, which may prevent folios that have completed
+		 * writeback from rotateing to the tail of the lru. Just
+		 * stop looping if nr_pageout is greater than nr_to_reclaim.
+		 */
+		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
+			break;
+
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
 
@@ -1273,6 +1283,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if ((nr_pages > 1) && !folio_test_large(folio)) {
 			sc->nr_scanned -= (nr_pages - 1);
 			nr_pages = 1;
+			splited = true;
 		}
 
 		/*
@@ -1375,12 +1386,14 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 				if (nr_pages > 1 && !folio_test_large(folio)) {
 					sc->nr_scanned -= (nr_pages - 1);
 					nr_pages = 1;
+					splited = true;
 				}
 				goto activate_locked;
 			case PAGE_SUCCESS:
 				if (nr_pages > 1 && !folio_test_large(folio)) {
 					sc->nr_scanned -= (nr_pages - 1);
 					nr_pages = 1;
+					splited = true;
 				}
 				stat->nr_pageout += nr_pages;
 
@@ -1491,6 +1504,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (nr_pages > 1) {
 			sc->nr_scanned -= (nr_pages - 1);
 			nr_pages = 1;
+			splited = true;
 		}
 activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */