diff mbox series

[v7] mm: shrink skip folio mapped by an exiting process

Message ID 20240709123115.117-1-justinjiang@vivo.com (mailing list archive)
State New
Headers show
Series [v7] mm: shrink skip folio mapped by an exiting process | expand

Commit Message

zhiguojiang July 9, 2024, 12:31 p.m. UTC
The releasing process of the non-shared anonymous folio mapped solely by
an exiting process may go through two flows: 1) the anonymous folio is
firstly is swaped-out into swapspace and transformed into a swp_entry
in shrink_folio_list; 2) then the swp_entry is released in the process
exiting flow. This will result in the high cpu load of releasing a
non-shared anonymous folio mapped solely by an exiting process.

When the low system memory and the exiting process exist at the same
time, it will be likely to happen, because the non-shared anonymous
folio mapped solely by an exiting process may be reclaimed by
shrink_folio_list.

This patch is that shrink skips the non-shared anonymous folio solely
mapped by an exting process and this folio is only released directly in
the process exiting flow, which will save swap-out time and alleviate
the load of the process exiting. 

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---

Change log:
v6->v7:
1.Modify tab indentation to space indentation of the continuation
lines of the condition.
v5->v6:
1.Move folio_likely_mapped_shared() under the PTL.
2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
3.Remove folio_test_anon(folio).
v4->v5:
1.Further modify to skip non-shared anonymous folio only.
2.Update comments for pra->referenced = -1.
v3->v4:
1.Modify to skip only the non-shared anonymous folio mapped solely
by an exiting process.
v2->v3:
Nothing.
v1->v2:
1.The VM_EXITING added in v1 patch is removed, because it will fail
to compile in 32-bit system.


Comments from participants and my responses:
[v6->v7]:
1.Matthew Wilcox <willy@infradead.org>
You told me you'd fix the indentation.  You cannot indent both the
continuation lines of the condition and the body of the if by one tab
each!
-->
Modify tab indentation to space indentation of the continuation
lines of the condition.

[v5->v6]:
1.David Hildenbrand <david@redhat.com>
I'm currently working on moving all folio_likely_mapped_shared() under
the PTL, where we are then sure that the folio is actually mapped by
this process (e.g., no concurrent unmapping poisslbe). Can we do the
same here directly? 
-->
You are right. we might use page_vma_mapped_walk_done() to bail out.
(Barry Song)

2.Barry Song <baohua@kernel.org>
By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
&vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
automatically has MMF_OOM_SKIP. What is the purpose of this check?
Is there a better way to determine if a process is an OOM target?
What about check_stable_address_space() ?
-->
Sorry, I overlook the situation with if (is_global_init(p)),
MMF_OOM_SKIP is indeed not suitable. It seems feasible for
check_stable_address_space() replacing MMF_OOM_SKIP.
check_stable_address_space() can indicate oom kill, and
!atomic_read(&vma->vm_mm->mm_users) can indicate the normal
process exiting. 

I also think we actually can remove "folio_test_anon(folio)".
-->
Yes, update in patch v6.

[v4->v5]:
1.Barry Song <baohua@kernel.org>
I don't think this is correct. folio_likely_mapped_shared() is almost
"correct" but not always.
Please explain why you set  pra->referenced =  -1. Please address all
comments before you send a new version.
-->
Update in patch v5.

2.Matthew Wilcox <willy@infradead.org>
How is the file folio similar?  File folios are never written to swap,
and they'll be written back from the page cache whenever the filesystem
decides it's a good time to do so.
-->
What do you mean is that the file folio will not have any relevant
identifier left in memory after it is reclamed in the shrink flow,
and it will not be released again during an exiting process? If that's
the case, I think we only need the anon folio is skipped here. 

[v3->v4]:
1.Barry Song <baohua@kernel.org>
This is clearly version 3, as you previously sent version 2, correct?
-->
Yes.

Could you please describe the specific impact on users, including user
experience and power consumption? How serious is this problem?
-->
At present, I do not have a suitable method to accurately measure the
optimization benefit datas of this modifications, but I believe it
theoretically has some benefits.
Launching large memory app (for example, starting the camera) in multiple
backend scenes may result in the high cpu load of the exiting processes. 

Applications?
-->
Yes, when system is low memory, it more likely to occur.

I'm not completely convinced this patch is correct, but it appears to be
heading in the right direction. Therefore, I expect to see new versions
rather than it being dead.
You changed the file mode to 755, which is incorrect.
-->
Solved.

Why use -1? Is this meant to simulate lock contention to keep the folio
without activating it? Please do have some comments to explain why.
I'm not convinced this change is appropriate for shared folios. It seems
more suitable for exclusive folios used solely by the exiting process.
-->
The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
the folios will be freed soon in the exiting process flow.
Yes, the shared folios can not be simply skipped. I have made relevant
modifications in patch v4 and please help to further review.
https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/

2.David Hildenbrand <david@redhat.com>
but what if it is shared among multiple processes and only one of them
is exiting?
-->
Modify to skip only the non-shared anonymous folio mapped solely
by an exiting process in next version v4.

[v2->v3:]
Nothing.

[v1->v2]:
1.Matthew Wilcox <willy@infradead.org>
What testing have you done of this patch?  How often does it happen?
Are there particular workloads that benefit from this?  (I'm not sure
what "mutil backed-applications" are?)
And I do mean specifically of this patch, because to my eyes it
shouldn't even compile. Except on 32-bit where it'll say "warning:
integer constant out of range".
-->
Yes, I have tested. When the low system memory and the exiting process
exist at the same time, it will happen. This modification can alleviate
the load of the exiting process. 
"mutil backed-applications" means that there are a large number of
the backend applications in the system.
The VM_EXITING added in v1 patch is removed, because it will fail
to compile in 32-bit system.

 mm/rmap.c   | 14 ++++++++++++++
 mm/vmscan.c |  7 ++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

Comments

Barry Song July 9, 2024, 1:02 p.m. UTC | #1
On Tue, Jul 9, 2024 at 8:31 PM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
> The releasing process of the non-shared anonymous folio mapped solely by
> an exiting process may go through two flows: 1) the anonymous folio is
> firstly is swaped-out into swapspace and transformed into a swp_entry
> in shrink_folio_list; 2) then the swp_entry is released in the process
> exiting flow. This will result in the high cpu load of releasing a
> non-shared anonymous folio mapped solely by an exiting process.
>
> When the low system memory and the exiting process exist at the same
> time, it will be likely to happen, because the non-shared anonymous
> folio mapped solely by an exiting process may be reclaimed by
> shrink_folio_list.
>
> This patch is that shrink skips the non-shared anonymous folio solely
> mapped by an exting process and this folio is only released directly in
> the process exiting flow, which will save swap-out time and alleviate
> the load of the process exiting.
>
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>

You should have collected tags such as reviewed-by, acked-by you got in v6
while sending v7.

Again,
Acked-by: Barry Song <baohua@kernel.org>

> ---
>
> Change log:
> v6->v7:
> 1.Modify tab indentation to space indentation of the continuation
> lines of the condition.
> v5->v6:
> 1.Move folio_likely_mapped_shared() under the PTL.
> 2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
> 3.Remove folio_test_anon(folio).
> v4->v5:
> 1.Further modify to skip non-shared anonymous folio only.
> 2.Update comments for pra->referenced = -1.
> v3->v4:
> 1.Modify to skip only the non-shared anonymous folio mapped solely
> by an exiting process.
> v2->v3:
> Nothing.
> v1->v2:
> 1.The VM_EXITING added in v1 patch is removed, because it will fail
> to compile in 32-bit system.
>
>
> Comments from participants and my responses:
> [v6->v7]:
> 1.Matthew Wilcox <willy@infradead.org>
> You told me you'd fix the indentation.  You cannot indent both the
> continuation lines of the condition and the body of the if by one tab
> each!
> -->
> Modify tab indentation to space indentation of the continuation
> lines of the condition.
>
> [v5->v6]:
> 1.David Hildenbrand <david@redhat.com>
> I'm currently working on moving all folio_likely_mapped_shared() under
> the PTL, where we are then sure that the folio is actually mapped by
> this process (e.g., no concurrent unmapping poisslbe). Can we do the
> same here directly?
> -->
> You are right. we might use page_vma_mapped_walk_done() to bail out.
> (Barry Song)
>
> 2.Barry Song <baohua@kernel.org>
> By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
> &vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
> automatically has MMF_OOM_SKIP. What is the purpose of this check?
> Is there a better way to determine if a process is an OOM target?
> What about check_stable_address_space() ?
> -->
> Sorry, I overlook the situation with if (is_global_init(p)),
> MMF_OOM_SKIP is indeed not suitable. It seems feasible for
> check_stable_address_space() replacing MMF_OOM_SKIP.
> check_stable_address_space() can indicate oom kill, and
> !atomic_read(&vma->vm_mm->mm_users) can indicate the normal
> process exiting.
>
> I also think we actually can remove "folio_test_anon(folio)".
> -->
> Yes, update in patch v6.
>
> [v4->v5]:
> 1.Barry Song <baohua@kernel.org>
> I don't think this is correct. folio_likely_mapped_shared() is almost
> "correct" but not always.
> Please explain why you set  pra->referenced =  -1. Please address all
> comments before you send a new version.
> -->
> Update in patch v5.
>
> 2.Matthew Wilcox <willy@infradead.org>
> How is the file folio similar?  File folios are never written to swap,
> and they'll be written back from the page cache whenever the filesystem
> decides it's a good time to do so.
> -->
> What do you mean is that the file folio will not have any relevant
> identifier left in memory after it is reclamed in the shrink flow,
> and it will not be released again during an exiting process? If that's
> the case, I think we only need the anon folio is skipped here.
>
> [v3->v4]:
> 1.Barry Song <baohua@kernel.org>
> This is clearly version 3, as you previously sent version 2, correct?
> -->
> Yes.
>
> Could you please describe the specific impact on users, including user
> experience and power consumption? How serious is this problem?
> -->
> At present, I do not have a suitable method to accurately measure the
> optimization benefit datas of this modifications, but I believe it
> theoretically has some benefits.
> Launching large memory app (for example, starting the camera) in multiple
> backend scenes may result in the high cpu load of the exiting processes.
>
> Applications?
> -->
> Yes, when system is low memory, it more likely to occur.
>
> I'm not completely convinced this patch is correct, but it appears to be
> heading in the right direction. Therefore, I expect to see new versions
> rather than it being dead.
> You changed the file mode to 755, which is incorrect.
> -->
> Solved.
>
> Why use -1? Is this meant to simulate lock contention to keep the folio
> without activating it? Please do have some comments to explain why.
> I'm not convinced this change is appropriate for shared folios. It seems
> more suitable for exclusive folios used solely by the exiting process.
> -->
> The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
> the folios will be freed soon in the exiting process flow.
> Yes, the shared folios can not be simply skipped. I have made relevant
> modifications in patch v4 and please help to further review.
> https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/
>
> 2.David Hildenbrand <david@redhat.com>
> but what if it is shared among multiple processes and only one of them
> is exiting?
> -->
> Modify to skip only the non-shared anonymous folio mapped solely
> by an exiting process in next version v4.
>
> [v2->v3:]
> Nothing.
>
> [v1->v2]:
> 1.Matthew Wilcox <willy@infradead.org>
> What testing have you done of this patch?  How often does it happen?
> Are there particular workloads that benefit from this?  (I'm not sure
> what "mutil backed-applications" are?)
> And I do mean specifically of this patch, because to my eyes it
> shouldn't even compile. Except on 32-bit where it'll say "warning:
> integer constant out of range".
> -->
> Yes, I have tested. When the low system memory and the exiting process
> exist at the same time, it will happen. This modification can alleviate
> the load of the exiting process.
> "mutil backed-applications" means that there are a large number of
> the backend applications in the system.
> The VM_EXITING added in v1 patch is removed, because it will fail
> to compile in 32-bit system.
>
>  mm/rmap.c   | 14 ++++++++++++++
>  mm/vmscan.c |  7 ++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 88156deb46a6..bb9954773cce 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -877,6 +877,20 @@ static bool folio_referenced_one(struct folio *folio,
>                         continue;
>                 }
>
> +               /*
> +                * Skip the non-shared swapbacked folio mapped solely by
> +                * the exiting or OOM-reaped process. This avoids redundant
> +                * swap-out followed by an immediate unmap.
> +                */
> +               if ((!atomic_read(&vma->vm_mm->mm_users) ||
> +                   check_stable_address_space(vma->vm_mm)) &&
> +                   folio_test_swapbacked(folio) &&
> +                   !folio_likely_mapped_shared(folio)) {
> +                       pra->referenced = -1;
> +                       page_vma_mapped_walk_done(&pvmw);
> +                       return false;
> +               }
> +
>                 if (pvmw.pte) {
>                         if (lru_gen_enabled() &&
>                             pte_young(ptep_get(pvmw.pte))) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 80f9a486cf27..1d5f78a3dbeb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio,
>         if (vm_flags & VM_LOCKED)
>                 return FOLIOREF_ACTIVATE;
>
> -       /* rmap lock contention: rotate */
> +       /*
> +        * There are two cases to consider.
> +        * 1) Rmap lock contention: rotate.
> +        * 2) Skip the non-shared swapbacked folio mapped solely by
> +        *    the exiting or OOM-reaped process.
> +        */
>         if (referenced_ptes == -1)
>                 return FOLIOREF_KEEP;
>
> --
> 2.39.0
>
Andrew Morton July 9, 2024, 9:23 p.m. UTC | #2
On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:

> The releasing process of the non-shared anonymous folio mapped solely by
> an exiting process may go through two flows: 1) the anonymous folio is
> firstly is swaped-out into swapspace and transformed into a swp_entry
> in shrink_folio_list; 2) then the swp_entry is released in the process
> exiting flow. This will result in the high cpu load of releasing a
> non-shared anonymous folio mapped solely by an exiting process.
> 
> When the low system memory and the exiting process exist at the same
> time, it will be likely to happen, because the non-shared anonymous
> folio mapped solely by an exiting process may be reclaimed by
> shrink_folio_list.
> 
> This patch is that shrink skips the non-shared anonymous folio solely
> mapped by an exting process and this folio is only released directly in
> the process exiting flow, which will save swap-out time and alleviate
> the load of the process exiting. 

It would be helpful to provide some before-and-after runtime
measurements, please.  It's a performance optimization so please let's
see what effect it has.
zhiguojiang July 10, 2024, 1:46 a.m. UTC | #3
在 2024/7/9 21:02, Barry Song 写道:
> On Tue, Jul 9, 2024 at 8:31 PM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>> The releasing process of the non-shared anonymous folio mapped solely by
>> an exiting process may go through two flows: 1) the anonymous folio is
>> firstly is swaped-out into swapspace and transformed into a swp_entry
>> in shrink_folio_list; 2) then the swp_entry is released in the process
>> exiting flow. This will result in the high cpu load of releasing a
>> non-shared anonymous folio mapped solely by an exiting process.
>>
>> When the low system memory and the exiting process exist at the same
>> time, it will be likely to happen, because the non-shared anonymous
>> folio mapped solely by an exiting process may be reclaimed by
>> shrink_folio_list.
>>
>> This patch is that shrink skips the non-shared anonymous folio solely
>> mapped by an exting process and this folio is only released directly in
>> the process exiting flow, which will save swap-out time and alleviate
>> the load of the process exiting.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> You should have collected tags such as reviewed-by, acked-by you got in v6
> while sending v7.
>
> Again,
> Acked-by: Barry Song <baohua@kernel.org>
Yes, it is alreadly included in patch v7.

Thanks
Zhiguo
>
>> ---
>>
>> Change log:
>> v6->v7:
>> 1.Modify tab indentation to space indentation of the continuation
>> lines of the condition.
>> v5->v6:
>> 1.Move folio_likely_mapped_shared() under the PTL.
>> 2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
>> 3.Remove folio_test_anon(folio).
>> v4->v5:
>> 1.Further modify to skip non-shared anonymous folio only.
>> 2.Update comments for pra->referenced = -1.
>> v3->v4:
>> 1.Modify to skip only the non-shared anonymous folio mapped solely
>> by an exiting process.
>> v2->v3:
>> Nothing.
>> v1->v2:
>> 1.The VM_EXITING added in v1 patch is removed, because it will fail
>> to compile in 32-bit system.
>>
>>
>> Comments from participants and my responses:
>> [v6->v7]:
>> 1.Matthew Wilcox <willy@infradead.org>
>> You told me you'd fix the indentation.  You cannot indent both the
>> continuation lines of the condition and the body of the if by one tab
>> each!
>> -->
>> Modify tab indentation to space indentation of the continuation
>> lines of the condition.
>>
>> [v5->v6]:
>> 1.David Hildenbrand <david@redhat.com>
>> I'm currently working on moving all folio_likely_mapped_shared() under
>> the PTL, where we are then sure that the folio is actually mapped by
>> this process (e.g., no concurrent unmapping poisslbe). Can we do the
>> same here directly?
>> -->
>> You are right. we might use page_vma_mapped_walk_done() to bail out.
>> (Barry Song)
>>
>> 2.Barry Song <baohua@kernel.org>
>> By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
>> &vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
>> automatically has MMF_OOM_SKIP. What is the purpose of this check?
>> Is there a better way to determine if a process is an OOM target?
>> What about check_stable_address_space() ?
>> -->
>> Sorry, I overlook the situation with if (is_global_init(p)),
>> MMF_OOM_SKIP is indeed not suitable. It seems feasible for
>> check_stable_address_space() replacing MMF_OOM_SKIP.
>> check_stable_address_space() can indicate oom kill, and
>> !atomic_read(&vma->vm_mm->mm_users) can indicate the normal
>> process exiting.
>>
>> I also think we actually can remove "folio_test_anon(folio)".
>> -->
>> Yes, update in patch v6.
>>
>> [v4->v5]:
>> 1.Barry Song <baohua@kernel.org>
>> I don't think this is correct. folio_likely_mapped_shared() is almost
>> "correct" but not always.
>> Please explain why you set  pra->referenced =  -1. Please address all
>> comments before you send a new version.
>> -->
>> Update in patch v5.
>>
>> 2.Matthew Wilcox <willy@infradead.org>
>> How is the file folio similar?  File folios are never written to swap,
>> and they'll be written back from the page cache whenever the filesystem
>> decides it's a good time to do so.
>> -->
>> What do you mean is that the file folio will not have any relevant
>> identifier left in memory after it is reclamed in the shrink flow,
>> and it will not be released again during an exiting process? If that's
>> the case, I think we only need the anon folio is skipped here.
>>
>> [v3->v4]:
>> 1.Barry Song <baohua@kernel.org>
>> This is clearly version 3, as you previously sent version 2, correct?
>> -->
>> Yes.
>>
>> Could you please describe the specific impact on users, including user
>> experience and power consumption? How serious is this problem?
>> -->
>> At present, I do not have a suitable method to accurately measure the
>> optimization benefit datas of this modifications, but I believe it
>> theoretically has some benefits.
>> Launching large memory app (for example, starting the camera) in multiple
>> backend scenes may result in the high cpu load of the exiting processes.
>>
>> Applications?
>> -->
>> Yes, when system is low memory, it more likely to occur.
>>
>> I'm not completely convinced this patch is correct, but it appears to be
>> heading in the right direction. Therefore, I expect to see new versions
>> rather than it being dead.
>> You changed the file mode to 755, which is incorrect.
>> -->
>> Solved.
>>
>> Why use -1? Is this meant to simulate lock contention to keep the folio
>> without activating it? Please do have some comments to explain why.
>> I'm not convinced this change is appropriate for shared folios. It seems
>> more suitable for exclusive folios used solely by the exiting process.
>> -->
>> The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
>> the folios will be freed soon in the exiting process flow.
>> Yes, the shared folios can not be simply skipped. I have made relevant
>> modifications in patch v4 and please help to further review.
>> https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/
>>
>> 2.David Hildenbrand <david@redhat.com>
>> but what if it is shared among multiple processes and only one of them
>> is exiting?
>> -->
>> Modify to skip only the non-shared anonymous folio mapped solely
>> by an exiting process in next version v4.
>>
>> [v2->v3:]
>> Nothing.
>>
>> [v1->v2]:
>> 1.Matthew Wilcox <willy@infradead.org>
>> What testing have you done of this patch?  How often does it happen?
>> Are there particular workloads that benefit from this?  (I'm not sure
>> what "mutil backed-applications" are?)
>> And I do mean specifically of this patch, because to my eyes it
>> shouldn't even compile. Except on 32-bit where it'll say "warning:
>> integer constant out of range".
>> -->
>> Yes, I have tested. When the low system memory and the exiting process
>> exist at the same time, it will happen. This modification can alleviate
>> the load of the exiting process.
>> "mutil backed-applications" means that there are a large number of
>> the backend applications in the system.
>> The VM_EXITING added in v1 patch is removed, because it will fail
>> to compile in 32-bit system.
>>
>>   mm/rmap.c   | 14 ++++++++++++++
>>   mm/vmscan.c |  7 ++++++-
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 88156deb46a6..bb9954773cce 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -877,6 +877,20 @@ static bool folio_referenced_one(struct folio *folio,
>>                          continue;
>>                  }
>>
>> +               /*
>> +                * Skip the non-shared swapbacked folio mapped solely by
>> +                * the exiting or OOM-reaped process. This avoids redundant
>> +                * swap-out followed by an immediate unmap.
>> +                */
>> +               if ((!atomic_read(&vma->vm_mm->mm_users) ||
>> +                   check_stable_address_space(vma->vm_mm)) &&
>> +                   folio_test_swapbacked(folio) &&
>> +                   !folio_likely_mapped_shared(folio)) {
>> +                       pra->referenced = -1;
>> +                       page_vma_mapped_walk_done(&pvmw);
>> +                       return false;
>> +               }
>> +
>>                  if (pvmw.pte) {
>>                          if (lru_gen_enabled() &&
>>                              pte_young(ptep_get(pvmw.pte))) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 80f9a486cf27..1d5f78a3dbeb 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio,
>>          if (vm_flags & VM_LOCKED)
>>                  return FOLIOREF_ACTIVATE;
>>
>> -       /* rmap lock contention: rotate */
>> +       /*
>> +        * There are two cases to consider.
>> +        * 1) Rmap lock contention: rotate.
>> +        * 2) Skip the non-shared swapbacked folio mapped solely by
>> +        *    the exiting or OOM-reaped process.
>> +        */
>>          if (referenced_ptes == -1)
>>                  return FOLIOREF_KEEP;
>>
>> --
>> 2.39.0
>>
Barry Song July 10, 2024, 2 a.m. UTC | #4
On Wed, Jul 10, 2024 at 1:46 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> 在 2024/7/9 21:02, Barry Song 写道:
> > On Tue, Jul 9, 2024 at 8:31 PM Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >> The releasing process of the non-shared anonymous folio mapped solely by
> >> an exiting process may go through two flows: 1) the anonymous folio is
> >> firstly is swaped-out into swapspace and transformed into a swp_entry
> >> in shrink_folio_list; 2) then the swp_entry is released in the process
> >> exiting flow. This will result in the high cpu load of releasing a
> >> non-shared anonymous folio mapped solely by an exiting process.
> >>
> >> When the low system memory and the exiting process exist at the same
> >> time, it will be likely to happen, because the non-shared anonymous
> >> folio mapped solely by an exiting process may be reclaimed by
> >> shrink_folio_list.
> >>
> >> This patch is that shrink skips the non-shared anonymous folio solely
> >> mapped by an exting process and this folio is only released directly in
> >> the process exiting flow, which will save swap-out time and alleviate
> >> the load of the process exiting.
> >>
> >> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> > You should have collected tags such as reviewed-by, acked-by you got in v6
> > while sending v7.
> >
> > Again,
> > Acked-by: Barry Song <baohua@kernel.org>
> Yes, it is alreadly included in patch v7.

obviously no. please take a look how people collect tags while sending a
new version:

https://lore.kernel.org/linux-mm/20240704012905.42971-2-ioworker0@gmail.com/


>
> Thanks
> Zhiguo
> >
> >> ---
> >>
> >> Change log:
> >> v6->v7:
> >> 1.Modify tab indentation to space indentation of the continuation
> >> lines of the condition.
> >> v5->v6:
> >> 1.Move folio_likely_mapped_shared() under the PTL.
> >> 2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
> >> 3.Remove folio_test_anon(folio).
> >> v4->v5:
> >> 1.Further modify to skip non-shared anonymous folio only.
> >> 2.Update comments for pra->referenced = -1.
> >> v3->v4:
> >> 1.Modify to skip only the non-shared anonymous folio mapped solely
> >> by an exiting process.
> >> v2->v3:
> >> Nothing.
> >> v1->v2:
> >> 1.The VM_EXITING added in v1 patch is removed, because it will fail
> >> to compile in 32-bit system.
> >>
> >>
> >> Comments from participants and my responses:
> >> [v6->v7]:
> >> 1.Matthew Wilcox <willy@infradead.org>
> >> You told me you'd fix the indentation.  You cannot indent both the
> >> continuation lines of the condition and the body of the if by one tab
> >> each!
> >> -->
> >> Modify tab indentation to space indentation of the continuation
> >> lines of the condition.
> >>
> >> [v5->v6]:
> >> 1.David Hildenbrand <david@redhat.com>
> >> I'm currently working on moving all folio_likely_mapped_shared() under
> >> the PTL, where we are then sure that the folio is actually mapped by
> >> this process (e.g., no concurrent unmapping poisslbe). Can we do the
> >> same here directly?
> >> -->
> >> You are right. we might use page_vma_mapped_walk_done() to bail out.
> >> (Barry Song)
> >>
> >> 2.Barry Song <baohua@kernel.org>
> >> By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
> >> &vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
> >> automatically has MMF_OOM_SKIP. What is the purpose of this check?
> >> Is there a better way to determine if a process is an OOM target?
> >> What about check_stable_address_space() ?
> >> -->
> >> Sorry, I overlook the situation with if (is_global_init(p)),
> >> MMF_OOM_SKIP is indeed not suitable. It seems feasible for
> >> check_stable_address_space() replacing MMF_OOM_SKIP.
> >> check_stable_address_space() can indicate oom kill, and
> >> !atomic_read(&vma->vm_mm->mm_users) can indicate the normal
> >> process exiting.
> >>
> >> I also think we actually can remove "folio_test_anon(folio)".
> >> -->
> >> Yes, update in patch v6.
> >>
> >> [v4->v5]:
> >> 1.Barry Song <baohua@kernel.org>
> >> I don't think this is correct. folio_likely_mapped_shared() is almost
> >> "correct" but not always.
> >> Please explain why you set  pra->referenced =  -1. Please address all
> >> comments before you send a new version.
> >> -->
> >> Update in patch v5.
> >>
> >> 2.Matthew Wilcox <willy@infradead.org>
> >> How is the file folio similar?  File folios are never written to swap,
> >> and they'll be written back from the page cache whenever the filesystem
> >> decides it's a good time to do so.
> >> -->
> >> What do you mean is that the file folio will not have any relevant
> >> identifier left in memory after it is reclamed in the shrink flow,
> >> and it will not be released again during an exiting process? If that's
> >> the case, I think we only need the anon folio is skipped here.
> >>
> >> [v3->v4]:
> >> 1.Barry Song <baohua@kernel.org>
> >> This is clearly version 3, as you previously sent version 2, correct?
> >> -->
> >> Yes.
> >>
> >> Could you please describe the specific impact on users, including user
> >> experience and power consumption? How serious is this problem?
> >> -->
> >> At present, I do not have a suitable method to accurately measure the
> >> optimization benefit datas of this modifications, but I believe it
> >> theoretically has some benefits.
> >> Launching large memory app (for example, starting the camera) in multiple
> >> backend scenes may result in the high cpu load of the exiting processes.
> >>
> >> Applications?
> >> -->
> >> Yes, when system is low memory, it more likely to occur.
> >>
> >> I'm not completely convinced this patch is correct, but it appears to be
> >> heading in the right direction. Therefore, I expect to see new versions
> >> rather than it being dead.
> >> You changed the file mode to 755, which is incorrect.
> >> -->
> >> Solved.
> >>
> >> Why use -1? Is this meant to simulate lock contention to keep the folio
> >> without activating it? Please do have some comments to explain why.
> >> I'm not convinced this change is appropriate for shared folios. It seems
> >> more suitable for exclusive folios used solely by the exiting process.
> >> -->
> >> The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
> >> the folios will be freed soon in the exiting process flow.
> >> Yes, the shared folios can not be simply skipped. I have made relevant
> >> modifications in patch v4 and please help to further review.
> >> https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/
> >>
> >> 2.David Hildenbrand <david@redhat.com>
> >> but what if it is shared among multiple processes and only one of them
> >> is exiting?
> >> -->
> >> Modify to skip only the non-shared anonymous folio mapped solely
> >> by an exiting process in next version v4.
> >>
> >> [v2->v3:]
> >> Nothing.
> >>
> >> [v1->v2]:
> >> 1.Matthew Wilcox <willy@infradead.org>
> >> What testing have you done of this patch?  How often does it happen?
> >> Are there particular workloads that benefit from this?  (I'm not sure
> >> what "mutil backed-applications" are?)
> >> And I do mean specifically of this patch, because to my eyes it
> >> shouldn't even compile. Except on 32-bit where it'll say "warning:
> >> integer constant out of range".
> >> -->
> >> Yes, I have tested. When the low system memory and the exiting process
> >> exist at the same time, it will happen. This modification can alleviate
> >> the load of the exiting process.
> >> "mutil backed-applications" means that there are a large number of
> >> the backend applications in the system.
> >> The VM_EXITING added in v1 patch is removed, because it will fail
> >> to compile in 32-bit system.
> >>
> >>   mm/rmap.c   | 14 ++++++++++++++
> >>   mm/vmscan.c |  7 ++++++-
> >>   2 files changed, 20 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 88156deb46a6..bb9954773cce 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -877,6 +877,20 @@ static bool folio_referenced_one(struct folio *folio,
> >>                          continue;
> >>                  }
> >>
> >> +               /*
> >> +                * Skip the non-shared swapbacked folio mapped solely by
> >> +                * the exiting or OOM-reaped process. This avoids redundant
> >> +                * swap-out followed by an immediate unmap.
> >> +                */
> >> +               if ((!atomic_read(&vma->vm_mm->mm_users) ||
> >> +                   check_stable_address_space(vma->vm_mm)) &&
> >> +                   folio_test_swapbacked(folio) &&
> >> +                   !folio_likely_mapped_shared(folio)) {
> >> +                       pra->referenced = -1;
> >> +                       page_vma_mapped_walk_done(&pvmw);
> >> +                       return false;
> >> +               }
> >> +
> >>                  if (pvmw.pte) {
> >>                          if (lru_gen_enabled() &&
> >>                              pte_young(ptep_get(pvmw.pte))) {
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 80f9a486cf27..1d5f78a3dbeb 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio,
> >>          if (vm_flags & VM_LOCKED)
> >>                  return FOLIOREF_ACTIVATE;
> >>
> >> -       /* rmap lock contention: rotate */
> >> +       /*
> >> +        * There are two cases to consider.
> >> +        * 1) Rmap lock contention: rotate.
> >> +        * 2) Skip the non-shared swapbacked folio mapped solely by
> >> +        *    the exiting or OOM-reaped process.
> >> +        */
> >>          if (referenced_ptes == -1)
> >>                  return FOLIOREF_KEEP;
> >>
> >> --
> >> 2.39.0
> >>
>
Barry Song July 10, 2024, 2:12 a.m. UTC | #5
On Wed, Jul 10, 2024 at 12:31 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
> The releasing process of the non-shared anonymous folio mapped solely by
> an exiting process may go through two flows: 1) the anonymous folio is
> firstly is swaped-out into swapspace and transformed into a swp_entry
> in shrink_folio_list; 2) then the swp_entry is released in the process
> exiting flow. This will result in the high cpu load of releasing a
> non-shared anonymous folio mapped solely by an exiting process.
>
> When the low system memory and the exiting process exist at the same
> time, it will be likely to happen, because the non-shared anonymous
> folio mapped solely by an exiting process may be reclaimed by
> shrink_folio_list.
>
> This patch is that shrink skips the non-shared anonymous folio solely
> mapped by an exting process and this folio is only released directly in
> the process exiting flow, which will save swap-out time and alleviate
> the load of the process exiting.
>
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> ---
>
> Change log:
> v6->v7:
> 1.Modify tab indentation to space indentation of the continuation
> lines of the condition.
> v5->v6:
> 1.Move folio_likely_mapped_shared() under the PTL.
> 2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
> 3.Remove folio_test_anon(folio).
> v4->v5:
> 1.Further modify to skip non-shared anonymous folio only.
> 2.Update comments for pra->referenced = -1.
> v3->v4:
> 1.Modify to skip only the non-shared anonymous folio mapped solely
> by an exiting process.
> v2->v3:
> Nothing.
> v1->v2:
> 1.The VM_EXITING added in v1 patch is removed, because it will fail
> to compile in 32-bit system.
>
>
> Comments from participants and my responses:
> [v6->v7]:
> 1.Matthew Wilcox <willy@infradead.org>
> You told me you'd fix the indentation.  You cannot indent both the
> continuation lines of the condition and the body of the if by one tab
> each!
> -->
> Modify tab indentation to space indentation of the continuation
> lines of the condition.
>
> [v5->v6]:
> 1.David Hildenbrand <david@redhat.com>
> I'm currently working on moving all folio_likely_mapped_shared() under
> the PTL, where we are then sure that the folio is actually mapped by
> this process (e.g., no concurrent unmapping poisslbe). Can we do the
> same here directly?
> -->
> You are right. we might use page_vma_mapped_walk_done() to bail out.
> (Barry Song)
>
> 2.Barry Song <baohua@kernel.org>
> By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
> &vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
> automatically has MMF_OOM_SKIP. What is the purpose of this check?
> Is there a better way to determine if a process is an OOM target?
> What about check_stable_address_space() ?
> -->
> Sorry, I overlook the situation with if (is_global_init(p)),
> MMF_OOM_SKIP is indeed not suitable. It seems feasible for
> check_stable_address_space() replacing MMF_OOM_SKIP.
> check_stable_address_space() can indicate oom kill, and
> !atomic_read(&vma->vm_mm->mm_users) can indicate the normal
> process exiting.
>
> I also think we actually can remove "folio_test_anon(folio)".
> -->
> Yes, update in patch v6.
>
> [v4->v5]:
> 1.Barry Song <baohua@kernel.org>
> I don't think this is correct. folio_likely_mapped_shared() is almost
> "correct" but not always.
> Please explain why you set  pra->referenced =  -1. Please address all
> comments before you send a new version.
> -->
> Update in patch v5.
>
> 2.Matthew Wilcox <willy@infradead.org>
> How is the file folio similar?  File folios are never written to swap,
> and they'll be written back from the page cache whenever the filesystem
> decides it's a good time to do so.
> -->
> What do you mean is that the file folio will not have any relevant
> identifier left in memory after it is reclamed in the shrink flow,
> and it will not be released again during an exiting process? If that's
> the case, I think we only need the anon folio is skipped here.
>
> [v3->v4]:
> 1.Barry Song <baohua@kernel.org>
> This is clearly version 3, as you previously sent version 2, correct?
> -->
> Yes.
>
> Could you please describe the specific impact on users, including user
> experience and power consumption? How serious is this problem?
> -->
> At present, I do not have a suitable method to accurately measure the
> optimization benefit datas of this modifications, but I believe it
> theoretically has some benefits.
> Launching large memory app (for example, starting the camera) in multiple
> backend scenes may result in the high cpu load of the exiting processes.
>
> Applications?
> -->
> Yes, when system is low memory, it more likely to occur.
>
> I'm not completely convinced this patch is correct, but it appears to be
> heading in the right direction. Therefore, I expect to see new versions
> rather than it being dead.
> You changed the file mode to 755, which is incorrect.
> -->
> Solved.
>
> Why use -1? Is this meant to simulate lock contention to keep the folio
> without activating it? Please do have some comments to explain why.
> I'm not convinced this change is appropriate for shared folios. It seems
> more suitable for exclusive folios used solely by the exiting process.
> -->
> The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
> the folios will be freed soon in the exiting process flow.
> Yes, the shared folios can not be simply skipped. I have made relevant
> modifications in patch v4 and please help to further review.
> https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/
>
> 2.David Hildenbrand <david@redhat.com>
> but what if it is shared among multiple processes and only one of them
> is exiting?
> -->
> Modify to skip only the non-shared anonymous folio mapped solely
> by an exiting process in next version v4.
>
> [v2->v3:]
> Nothing.
>
> [v1->v2]:
> 1.Matthew Wilcox <willy@infradead.org>
> What testing have you done of this patch?  How often does it happen?
> Are there particular workloads that benefit from this?  (I'm not sure
> what "mutil backed-applications" are?)
> And I do mean specifically of this patch, because to my eyes it
> shouldn't even compile. Except on 32-bit where it'll say "warning:
> integer constant out of range".
> -->
> Yes, I have tested. When the low system memory and the exiting process
> exist at the same time, it will happen. This modification can alleviate
> the load of the exiting process.
> "mutil backed-applications" means that there are a large number of
> the backend applications in the system.
> The VM_EXITING added in v1 patch is removed, because it will fail
> to compile in 32-bit system.
>
>  mm/rmap.c   | 14 ++++++++++++++
>  mm/vmscan.c |  7 ++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 88156deb46a6..bb9954773cce 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -877,6 +877,20 @@ static bool folio_referenced_one(struct folio *folio,
>                         continue;
>                 }
>
> +               /*
> +                * Skip the non-shared swapbacked folio mapped solely by
> +                * the exiting or OOM-reaped process. This avoids redundant
> +                * swap-out followed by an immediate unmap.
> +                */
> +               if ((!atomic_read(&vma->vm_mm->mm_users) ||
> +                   check_stable_address_space(vma->vm_mm)) &&

You didn't even pass the compilation stage because you're missing 'linux/oom.h'.
It's quite disappointing because I believe in your idea, but you
didn't even build it
before sending.

@@ -75,6 +75,7 @@
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/mm_inline.h>
+#include <linux/oom.h>

> +                   folio_test_swapbacked(folio) &&
> +                   !folio_likely_mapped_shared(folio)) {
> +                       pra->referenced = -1;
> +                       page_vma_mapped_walk_done(&pvmw);
> +                       return false;
> +               }
> +
>                 if (pvmw.pte) {
>                         if (lru_gen_enabled() &&
>                             pte_young(ptep_get(pvmw.pte))) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 80f9a486cf27..1d5f78a3dbeb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio,
>         if (vm_flags & VM_LOCKED)
>                 return FOLIOREF_ACTIVATE;
>
> -       /* rmap lock contention: rotate */
> +       /*
> +        * There are two cases to consider.
> +        * 1) Rmap lock contention: rotate.
> +        * 2) Skip the non-shared swapbacked folio mapped solely by
> +        *    the exiting or OOM-reaped process.
> +        */
>         if (referenced_ptes == -1)
>                 return FOLIOREF_KEEP;
>
> --
> 2.39.0
>
zhiguojiang July 10, 2024, 2:41 a.m. UTC | #6
在 2024/7/10 10:12, Barry Song 写道:
> On Wed, Jul 10, 2024 at 12:31 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>> The releasing process of the non-shared anonymous folio mapped solely by
>> an exiting process may go through two flows: 1) the anonymous folio is
>> firstly is swaped-out into swapspace and transformed into a swp_entry
>> in shrink_folio_list; 2) then the swp_entry is released in the process
>> exiting flow. This will result in the high cpu load of releasing a
>> non-shared anonymous folio mapped solely by an exiting process.
>>
>> When the low system memory and the exiting process exist at the same
>> time, it will be likely to happen, because the non-shared anonymous
>> folio mapped solely by an exiting process may be reclaimed by
>> shrink_folio_list.
>>
>> This patch is that shrink skips the non-shared anonymous folio solely
>> mapped by an exting process and this folio is only released directly in
>> the process exiting flow, which will save swap-out time and alleviate
>> the load of the process exiting.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>> ---
>>
>> Change log:
>> v6->v7:
>> 1.Modify tab indentation to space indentation of the continuation
>> lines of the condition.
>> v5->v6:
>> 1.Move folio_likely_mapped_shared() under the PTL.
>> 2.Add check_stable_address_space() to replace MMF_OOM_SKIP.
>> 3.Remove folio_test_anon(folio).
>> v4->v5:
>> 1.Further modify to skip non-shared anonymous folio only.
>> 2.Update comments for pra->referenced = -1.
>> v3->v4:
>> 1.Modify to skip only the non-shared anonymous folio mapped solely
>> by an exiting process.
>> v2->v3:
>> Nothing.
>> v1->v2:
>> 1.The VM_EXITING added in v1 patch is removed, because it will fail
>> to compile in 32-bit system.
>>
>>
>> Comments from participants and my responses:
>> [v6->v7]:
>> 1.Matthew Wilcox <willy@infradead.org>
>> You told me you'd fix the indentation.  You cannot indent both the
>> continuation lines of the condition and the body of the if by one tab
>> each!
>> -->
>> Modify tab indentation to space indentation of the continuation
>> lines of the condition.
>>
>> [v5->v6]:
>> 1.David Hildenbrand <david@redhat.com>
>> I'm currently working on moving all folio_likely_mapped_shared() under
>> the PTL, where we are then sure that the folio is actually mapped by
>> this process (e.g., no concurrent unmapping poisslbe). Can we do the
>> same here directly?
>> -->
>> You are right. we might use page_vma_mapped_walk_done() to bail out.
>> (Barry Song)
>>
>> 2.Barry Song <baohua@kernel.org>
>> By the way, I am not convinced that using test_bit(MMF_OOM_SKIP,
>> &vma->vm_mm->flags) is correct (I think it is wrong). And exit_mmap()
>> automatically has MMF_OOM_SKIP. What is the purpose of this check?
>> Is there a better way to determine if a process is an OOM target?
>> What about check_stable_address_space() ?
>> -->
>> Sorry, I overlook the situation with if (is_global_init(p)),
>> MMF_OOM_SKIP is indeed not suitable. It seems feasible for
>> check_stable_address_space() replacing MMF_OOM_SKIP.
>> check_stable_address_space() can indicate oom kill, and
>> !atomic_read(&vma->vm_mm->mm_users) can indicate the normal
>> process exiting.
>>
>> I also think we actually can remove "folio_test_anon(folio)".
>> -->
>> Yes, update in patch v6.
>>
>> [v4->v5]:
>> 1.Barry Song <baohua@kernel.org>
>> I don't think this is correct. folio_likely_mapped_shared() is almost
>> "correct" but not always.
>> Please explain why you set  pra->referenced =  -1. Please address all
>> comments before you send a new version.
>> -->
>> Update in patch v5.
>>
>> 2.Matthew Wilcox <willy@infradead.org>
>> How is the file folio similar?  File folios are never written to swap,
>> and they'll be written back from the page cache whenever the filesystem
>> decides it's a good time to do so.
>> -->
>> What do you mean is that the file folio will not have any relevant
>> identifier left in memory after it is reclamed in the shrink flow,
>> and it will not be released again during an exiting process? If that's
>> the case, I think we only need the anon folio is skipped here.
>>
>> [v3->v4]:
>> 1.Barry Song <baohua@kernel.org>
>> This is clearly version 3, as you previously sent version 2, correct?
>> -->
>> Yes.
>>
>> Could you please describe the specific impact on users, including user
>> experience and power consumption? How serious is this problem?
>> -->
>> At present, I do not have a suitable method to accurately measure the
>> optimization benefit datas of this modifications, but I believe it
>> theoretically has some benefits.
>> Launching large memory app (for example, starting the camera) in multiple
>> backend scenes may result in the high cpu load of the exiting processes.
>>
>> Applications?
>> -->
>> Yes, when system is low memory, it more likely to occur.
>>
>> I'm not completely convinced this patch is correct, but it appears to be
>> heading in the right direction. Therefore, I expect to see new versions
>> rather than it being dead.
>> You changed the file mode to 755, which is incorrect.
>> -->
>> Solved.
>>
>> Why use -1? Is this meant to simulate lock contention to keep the folio
>> without activating it? Please do have some comments to explain why.
>> I'm not convinced this change is appropriate for shared folios. It seems
>> more suitable for exclusive folios used solely by the exiting process.
>> -->
>> The skiped folios are FOLIOREF_KEEP and added into inactive lru, beacase
>> the folios will be freed soon in the exiting process flow.
>> Yes, the shared folios can not be simply skipped. I have made relevant
>> modifications in patch v4 and please help to further review.
>> https://lore.kernel.org/linux-mm/20240708031517.856-1-justinjiang@vivo.com/
>>
>> 2.David Hildenbrand <david@redhat.com>
>> but what if it is shared among multiple processes and only one of them
>> is exiting?
>> -->
>> Modify to skip only the non-shared anonymous folio mapped solely
>> by an exiting process in next version v4.
>>
>> [v2->v3:]
>> Nothing.
>>
>> [v1->v2]:
>> 1.Matthew Wilcox <willy@infradead.org>
>> What testing have you done of this patch?  How often does it happen?
>> Are there particular workloads that benefit from this?  (I'm not sure
>> what "mutil backed-applications" are?)
>> And I do mean specifically of this patch, because to my eyes it
>> shouldn't even compile. Except on 32-bit where it'll say "warning:
>> integer constant out of range".
>> -->
>> Yes, I have tested. When the low system memory and the exiting process
>> exist at the same time, it will happen. This modification can alleviate
>> the load of the exiting process.
>> "mutil backed-applications" means that there are a large number of
>> the backend applications in the system.
>> The VM_EXITING added in v1 patch is removed, because it will fail
>> to compile in 32-bit system.
>>
>>   mm/rmap.c   | 14 ++++++++++++++
>>   mm/vmscan.c |  7 ++++++-
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 88156deb46a6..bb9954773cce 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -877,6 +877,20 @@ static bool folio_referenced_one(struct folio *folio,
>>                          continue;
>>                  }
>>
>> +               /*
>> +                * Skip the non-shared swapbacked folio mapped solely by
>> +                * the exiting or OOM-reaped process. This avoids redundant
>> +                * swap-out followed by an immediate unmap.
>> +                */
>> +               if ((!atomic_read(&vma->vm_mm->mm_users) ||
>> +                   check_stable_address_space(vma->vm_mm)) &&
> You didn't even pass the compilation stage because you're missing 'linux/oom.h'.
> It's quite disappointing because I believe in your idea, but you
> didn't even build it
> before sending.
>
> @@ -75,6 +75,7 @@
>   #include <linux/memremap.h>
>   #include <linux/userfaultfd_k.h>
>   #include <linux/mm_inline.h>
> +#include <linux/oom.h>
Sorry, I overlooked the compilation of folio_likely_mapped_shared() added
in patch v5. Compiled and Updated have been compeleted in patch v8.

Thanks
Zhiguo
>
>> +                   folio_test_swapbacked(folio) &&
>> +                   !folio_likely_mapped_shared(folio)) {
>> +                       pra->referenced = -1;
>> +                       page_vma_mapped_walk_done(&pvmw);
>> +                       return false;
>> +               }
>> +
>>                  if (pvmw.pte) {
>>                          if (lru_gen_enabled() &&
>>                              pte_young(ptep_get(pvmw.pte))) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 80f9a486cf27..1d5f78a3dbeb 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio,
>>          if (vm_flags & VM_LOCKED)
>>                  return FOLIOREF_ACTIVATE;
>>
>> -       /* rmap lock contention: rotate */
>> +       /*
>> +        * There are two cases to consider.
>> +        * 1) Rmap lock contention: rotate.
>> +        * 2) Skip the non-shared swapbacked folio mapped solely by
>> +        *    the exiting or OOM-reaped process.
>> +        */
>>          if (referenced_ptes == -1)
>>                  return FOLIOREF_KEEP;
>>
>> --
>> 2.39.0
>>
Barry Song July 10, 2024, 3:32 a.m. UTC | #7
On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
> > The releasing process of the non-shared anonymous folio mapped solely by
> > an exiting process may go through two flows: 1) the anonymous folio is
> > firstly is swaped-out into swapspace and transformed into a swp_entry
> > in shrink_folio_list; 2) then the swp_entry is released in the process
> > exiting flow. This will result in the high cpu load of releasing a
> > non-shared anonymous folio mapped solely by an exiting process.
> >
> > When the low system memory and the exiting process exist at the same
> > time, it will be likely to happen, because the non-shared anonymous
> > folio mapped solely by an exiting process may be reclaimed by
> > shrink_folio_list.
> >
> > This patch is that shrink skips the non-shared anonymous folio solely
> > mapped by an exting process and this folio is only released directly in
> > the process exiting flow, which will save swap-out time and alleviate
> > the load of the process exiting.
>
> It would be helpful to provide some before-and-after runtime
> measurements, please.  It's a performance optimization so please let's
> see what effect it has.

Hi Andrew,

This was something I was curious about too, so I created a small test program
that allocates and continuously writes to 256MB of memory. Using QEMU, I set
up a small machine with only 300MB of RAM to trigger kswapd.

qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
 -smp cpus=4 -cpu max \
 -m 300M -kernel arch/arm64/boot/Image
 
The test program will be randomly terminated by its subprocess to trigger
the use case of this patch.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <signal.h>

#define MEMORY_SIZE (256 * 1024 * 1024)

unsigned char *memory;

void allocate_and_write_memory()
{
    memory = (unsigned char *)malloc(MEMORY_SIZE);
    if (memory == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    while (1)
        memset(memory, 0x11, MEMORY_SIZE);
}

int main()
{
    pid_t pid;
    srand(time(NULL));

    pid = fork();

    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (pid == 0) {
        int delay = (rand() % 10000) + 10000;
        usleep(delay * 1000);

	/* kill parent when it is busy on swapping */
        kill(getppid(), SIGKILL);
        _exit(0);
    } else {
        allocate_and_write_memory();

        wait(NULL);

        free(memory);
    }

    return 0;
}

I tracked the number of folios that could be redundantly
swapped out by adding a simple counter as shown below:

@@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
                    check_stable_address_space(vma->vm_mm)) &&
                    folio_test_swapbacked(folio) &&
                    !folio_likely_mapped_shared(folio)) {
+                       static long i, size;
+                       size += folio_size(folio);
+                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
                        pra->referenced = -1;
                        page_vma_mapped_walk_done(&pvmw);
                        return false;


This is what I have observed:

/ # /home/barry/develop/linux/skip_swap_out_test
[   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
[   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
[   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
[   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
[   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
[   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
...
[   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
[   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
[   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480

I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
the process exit.

Despite the numerous mistakes Zhiguo made in sending this patch, it is still
quite valuable. Please consider pulling his v9 into the mm tree for testing.

Thanks
Barry
David Hildenbrand July 10, 2024, 3:59 a.m. UTC | #8
On 10.07.24 05:32, Barry Song wrote:
> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>
>>> The releasing process of the non-shared anonymous folio mapped solely by
>>> an exiting process may go through two flows: 1) the anonymous folio is
>>> firstly is swaped-out into swapspace and transformed into a swp_entry
>>> in shrink_folio_list; 2) then the swp_entry is released in the process
>>> exiting flow. This will result in the high cpu load of releasing a
>>> non-shared anonymous folio mapped solely by an exiting process.
>>>
>>> When the low system memory and the exiting process exist at the same
>>> time, it will be likely to happen, because the non-shared anonymous
>>> folio mapped solely by an exiting process may be reclaimed by
>>> shrink_folio_list.
>>>
>>> This patch is that shrink skips the non-shared anonymous folio solely
>>> mapped by an exting process and this folio is only released directly in
>>> the process exiting flow, which will save swap-out time and alleviate
>>> the load of the process exiting.
>>
>> It would be helpful to provide some before-and-after runtime
>> measurements, please.  It's a performance optimization so please let's
>> see what effect it has.
> 
> Hi Andrew,
> 
> This was something I was curious about too, so I created a small test program
> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> up a small machine with only 300MB of RAM to trigger kswapd.
> 
> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
>   -smp cpus=4 -cpu max \
>   -m 300M -kernel arch/arm64/boot/Image
>   
> The test program will be randomly terminated by its subprocess to trigger
> the use case of this patch.
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <time.h>
> #include <signal.h>
> 
> #define MEMORY_SIZE (256 * 1024 * 1024)
> 
> unsigned char *memory;
> 
> void allocate_and_write_memory()
> {
>      memory = (unsigned char *)malloc(MEMORY_SIZE);
>      if (memory == NULL) {
>          perror("malloc");
>          exit(EXIT_FAILURE);
>      }
> 
>      while (1)
>          memset(memory, 0x11, MEMORY_SIZE);
> }
> 
> int main()
> {
>      pid_t pid;
>      srand(time(NULL));
> 
>      pid = fork();
> 
>      if (pid < 0) {
>          perror("fork");
>          exit(EXIT_FAILURE);
>      }
> 
>      if (pid == 0) {
>          int delay = (rand() % 10000) + 10000;
>          usleep(delay * 1000);
> 
> 	/* kill parent when it is busy on swapping */
>          kill(getppid(), SIGKILL);
>          _exit(0);
>      } else {
>          allocate_and_write_memory();
> 
>          wait(NULL);
> 
>          free(memory);
>      }
> 
>      return 0;
> }
> 
> I tracked the number of folios that could be redundantly
> swapped out by adding a simple counter as shown below:
> 
> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
>                      check_stable_address_space(vma->vm_mm)) &&
>                      folio_test_swapbacked(folio) &&
>                      !folio_likely_mapped_shared(folio)) {
> +                       static long i, size;
> +                       size += folio_size(folio);
> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
>                          pra->referenced = -1;
>                          page_vma_mapped_walk_done(&pvmw);
>                          return false;
> 
> 
> This is what I have observed:
> 
> / # /home/barry/develop/linux/skip_swap_out_test
> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> ...
> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> 
> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
> the process exit.
> 
> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> quite valuable. Please consider pulling his v9 into the mm tree for testing.

BTW, we dropped the folio_test_anon() check, but what about shmem? They 
also do __folio_set_swapbacked()?
Barry Song July 10, 2024, 4:02 a.m. UTC | #9
On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.07.24 05:32, Barry Song wrote:
> > On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>
> >>> The releasing process of the non-shared anonymous folio mapped solely by
> >>> an exiting process may go through two flows: 1) the anonymous folio is
> >>> firstly is swaped-out into swapspace and transformed into a swp_entry
> >>> in shrink_folio_list; 2) then the swp_entry is released in the process
> >>> exiting flow. This will result in the high cpu load of releasing a
> >>> non-shared anonymous folio mapped solely by an exiting process.
> >>>
> >>> When the low system memory and the exiting process exist at the same
> >>> time, it will be likely to happen, because the non-shared anonymous
> >>> folio mapped solely by an exiting process may be reclaimed by
> >>> shrink_folio_list.
> >>>
> >>> This patch is that shrink skips the non-shared anonymous folio solely
> >>> mapped by an exting process and this folio is only released directly in
> >>> the process exiting flow, which will save swap-out time and alleviate
> >>> the load of the process exiting.
> >>
> >> It would be helpful to provide some before-and-after runtime
> >> measurements, please.  It's a performance optimization so please let's
> >> see what effect it has.
> >
> > Hi Andrew,
> >
> > This was something I was curious about too, so I created a small test program
> > that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> > up a small machine with only 300MB of RAM to trigger kswapd.
> >
> > qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
> >   -smp cpus=4 -cpu max \
> >   -m 300M -kernel arch/arm64/boot/Image
> >
> > The test program will be randomly terminated by its subprocess to trigger
> > the use case of this patch.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <sys/types.h>
> > #include <sys/wait.h>
> > #include <time.h>
> > #include <signal.h>
> >
> > #define MEMORY_SIZE (256 * 1024 * 1024)
> >
> > unsigned char *memory;
> >
> > void allocate_and_write_memory()
> > {
> >      memory = (unsigned char *)malloc(MEMORY_SIZE);
> >      if (memory == NULL) {
> >          perror("malloc");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      while (1)
> >          memset(memory, 0x11, MEMORY_SIZE);
> > }
> >
> > int main()
> > {
> >      pid_t pid;
> >      srand(time(NULL));
> >
> >      pid = fork();
> >
> >      if (pid < 0) {
> >          perror("fork");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (pid == 0) {
> >          int delay = (rand() % 10000) + 10000;
> >          usleep(delay * 1000);
> >
> >       /* kill parent when it is busy on swapping */
> >          kill(getppid(), SIGKILL);
> >          _exit(0);
> >      } else {
> >          allocate_and_write_memory();
> >
> >          wait(NULL);
> >
> >          free(memory);
> >      }
> >
> >      return 0;
> > }
> >
> > I tracked the number of folios that could be redundantly
> > swapped out by adding a simple counter as shown below:
> >
> > @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
> >                      check_stable_address_space(vma->vm_mm)) &&
> >                      folio_test_swapbacked(folio) &&
> >                      !folio_likely_mapped_shared(folio)) {
> > +                       static long i, size;
> > +                       size += folio_size(folio);
> > +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
> >                          pra->referenced = -1;
> >                          page_vma_mapped_walk_done(&pvmw);
> >                          return false;
> >
> >
> > This is what I have observed:
> >
> > / # /home/barry/develop/linux/skip_swap_out_test
> > [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> > [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> > [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> > [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> > [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> > [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> > ...
> > [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> > [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> > [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> >
> > I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
> > mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
> > the process exit.
> >
> > Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> > quite valuable. Please consider pulling his v9 into the mm tree for testing.
>
> BTW, we dropped the folio_test_anon() check, but what about shmem? They
> also do __folio_set_swapbacked()?

my point is that the purpose is skipping redundant swap-out, if shmem is single
mapped, they could be also skipped.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand July 10, 2024, 4:04 a.m. UTC | #10
On 10.07.24 06:02, Barry Song wrote:
> On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.07.24 05:32, Barry Song wrote:
>>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>
>>>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>
>>>>> The releasing process of the non-shared anonymous folio mapped solely by
>>>>> an exiting process may go through two flows: 1) the anonymous folio is
>>>>> firstly is swaped-out into swapspace and transformed into a swp_entry
>>>>> in shrink_folio_list; 2) then the swp_entry is released in the process
>>>>> exiting flow. This will result in the high cpu load of releasing a
>>>>> non-shared anonymous folio mapped solely by an exiting process.
>>>>>
>>>>> When the low system memory and the exiting process exist at the same
>>>>> time, it will be likely to happen, because the non-shared anonymous
>>>>> folio mapped solely by an exiting process may be reclaimed by
>>>>> shrink_folio_list.
>>>>>
>>>>> This patch is that shrink skips the non-shared anonymous folio solely
>>>>> mapped by an exting process and this folio is only released directly in
>>>>> the process exiting flow, which will save swap-out time and alleviate
>>>>> the load of the process exiting.
>>>>
>>>> It would be helpful to provide some before-and-after runtime
>>>> measurements, please.  It's a performance optimization so please let's
>>>> see what effect it has.
>>>
>>> Hi Andrew,
>>>
>>> This was something I was curious about too, so I created a small test program
>>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
>>> up a small machine with only 300MB of RAM to trigger kswapd.
>>>
>>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
>>>    -smp cpus=4 -cpu max \
>>>    -m 300M -kernel arch/arm64/boot/Image
>>>
>>> The test program will be randomly terminated by its subprocess to trigger
>>> the use case of this patch.
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <string.h>
>>> #include <sys/types.h>
>>> #include <sys/wait.h>
>>> #include <time.h>
>>> #include <signal.h>
>>>
>>> #define MEMORY_SIZE (256 * 1024 * 1024)
>>>
>>> unsigned char *memory;
>>>
>>> void allocate_and_write_memory()
>>> {
>>>       memory = (unsigned char *)malloc(MEMORY_SIZE);
>>>       if (memory == NULL) {
>>>           perror("malloc");
>>>           exit(EXIT_FAILURE);
>>>       }
>>>
>>>       while (1)
>>>           memset(memory, 0x11, MEMORY_SIZE);
>>> }
>>>
>>> int main()
>>> {
>>>       pid_t pid;
>>>       srand(time(NULL));
>>>
>>>       pid = fork();
>>>
>>>       if (pid < 0) {
>>>           perror("fork");
>>>           exit(EXIT_FAILURE);
>>>       }
>>>
>>>       if (pid == 0) {
>>>           int delay = (rand() % 10000) + 10000;
>>>           usleep(delay * 1000);
>>>
>>>        /* kill parent when it is busy on swapping */
>>>           kill(getppid(), SIGKILL);
>>>           _exit(0);
>>>       } else {
>>>           allocate_and_write_memory();
>>>
>>>           wait(NULL);
>>>
>>>           free(memory);
>>>       }
>>>
>>>       return 0;
>>> }
>>>
>>> I tracked the number of folios that could be redundantly
>>> swapped out by adding a simple counter as shown below:
>>>
>>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
>>>                       check_stable_address_space(vma->vm_mm)) &&
>>>                       folio_test_swapbacked(folio) &&
>>>                       !folio_likely_mapped_shared(folio)) {
>>> +                       static long i, size;
>>> +                       size += folio_size(folio);
>>> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
>>>                           pra->referenced = -1;
>>>                           page_vma_mapped_walk_done(&pvmw);
>>>                           return false;
>>>
>>>
>>> This is what I have observed:
>>>
>>> / # /home/barry/develop/linux/skip_swap_out_test
>>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
>>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
>>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
>>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
>>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
>>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
>>> ...
>>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
>>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
>>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
>>>
>>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
>>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
>>> the process exit.
>>>
>>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
>>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
>>
>> BTW, we dropped the folio_test_anon() check, but what about shmem? They
>> also do __folio_set_swapbacked()?
> 
> my point is that the purpose is skipping redundant swap-out, if shmem is single
> mapped, they could be also skipped.

But they won't get necessarily *freed* when unmapping them. They might 
just continue living in tmpfs? where some other process might just map 
them later?

IMHO, there is a big difference here between anon and shmem. (well, 
anon_shmem would actually be different :) )
Barry Song July 10, 2024, 4:44 a.m. UTC | #11
On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.07.24 06:02, Barry Song wrote:
> > On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 10.07.24 05:32, Barry Song wrote:
> >>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>>
> >>>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>>
> >>>>> The releasing process of the non-shared anonymous folio mapped solely by
> >>>>> an exiting process may go through two flows: 1) the anonymous folio is
> >>>>> firstly is swaped-out into swapspace and transformed into a swp_entry
> >>>>> in shrink_folio_list; 2) then the swp_entry is released in the process
> >>>>> exiting flow. This will result in the high cpu load of releasing a
> >>>>> non-shared anonymous folio mapped solely by an exiting process.
> >>>>>
> >>>>> When the low system memory and the exiting process exist at the same
> >>>>> time, it will be likely to happen, because the non-shared anonymous
> >>>>> folio mapped solely by an exiting process may be reclaimed by
> >>>>> shrink_folio_list.
> >>>>>
> >>>>> This patch is that shrink skips the non-shared anonymous folio solely
> >>>>> mapped by an exting process and this folio is only released directly in
> >>>>> the process exiting flow, which will save swap-out time and alleviate
> >>>>> the load of the process exiting.
> >>>>
> >>>> It would be helpful to provide some before-and-after runtime
> >>>> measurements, please.  It's a performance optimization so please let's
> >>>> see what effect it has.
> >>>
> >>> Hi Andrew,
> >>>
> >>> This was something I was curious about too, so I created a small test program
> >>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> >>> up a small machine with only 300MB of RAM to trigger kswapd.
> >>>
> >>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
> >>>    -smp cpus=4 -cpu max \
> >>>    -m 300M -kernel arch/arm64/boot/Image
> >>>
> >>> The test program will be randomly terminated by its subprocess to trigger
> >>> the use case of this patch.
> >>>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>
> >>> #include <unistd.h>
> >>> #include <string.h>
> >>> #include <sys/types.h>
> >>> #include <sys/wait.h>
> >>> #include <time.h>
> >>> #include <signal.h>
> >>>
> >>> #define MEMORY_SIZE (256 * 1024 * 1024)
> >>>
> >>> unsigned char *memory;
> >>>
> >>> void allocate_and_write_memory()
> >>> {
> >>>       memory = (unsigned char *)malloc(MEMORY_SIZE);
> >>>       if (memory == NULL) {
> >>>           perror("malloc");
> >>>           exit(EXIT_FAILURE);
> >>>       }
> >>>
> >>>       while (1)
> >>>           memset(memory, 0x11, MEMORY_SIZE);
> >>> }
> >>>
> >>> int main()
> >>> {
> >>>       pid_t pid;
> >>>       srand(time(NULL));
> >>>
> >>>       pid = fork();
> >>>
> >>>       if (pid < 0) {
> >>>           perror("fork");
> >>>           exit(EXIT_FAILURE);
> >>>       }
> >>>
> >>>       if (pid == 0) {
> >>>           int delay = (rand() % 10000) + 10000;
> >>>           usleep(delay * 1000);
> >>>
> >>>        /* kill parent when it is busy on swapping */
> >>>           kill(getppid(), SIGKILL);
> >>>           _exit(0);
> >>>       } else {
> >>>           allocate_and_write_memory();
> >>>
> >>>           wait(NULL);
> >>>
> >>>           free(memory);
> >>>       }
> >>>
> >>>       return 0;
> >>> }
> >>>
> >>> I tracked the number of folios that could be redundantly
> >>> swapped out by adding a simple counter as shown below:
> >>>
> >>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
> >>>                       check_stable_address_space(vma->vm_mm)) &&
> >>>                       folio_test_swapbacked(folio) &&
> >>>                       !folio_likely_mapped_shared(folio)) {
> >>> +                       static long i, size;
> >>> +                       size += folio_size(folio);
> >>> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
> >>>                           pra->referenced = -1;
> >>>                           page_vma_mapped_walk_done(&pvmw);
> >>>                           return false;
> >>>
> >>>
> >>> This is what I have observed:
> >>>
> >>> / # /home/barry/develop/linux/skip_swap_out_test
> >>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> >>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> >>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> >>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> >>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> >>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> >>> ...
> >>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> >>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> >>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> >>>
> >>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
> >>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
> >>> the process exit.
> >>>
> >>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> >>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
> >>
> >> BTW, we dropped the folio_test_anon() check, but what about shmem? They
> >> also do __folio_set_swapbacked()?
> >
> > my point is that the purpose is skipping redundant swap-out, if shmem is single
> > mapped, they could be also skipped.
>
> But they won't get necessarily *freed* when unmapping them. They might
> just continue living in tmpfs? where some other process might just map
> them later?
>

You're correct. I overlooked this aspect, focusing on swap and thinking of shmem
solely in terms of swap.

> IMHO, there is a big difference here between anon and shmem. (well,
> anon_shmem would actually be different :) )

Even though anon_shmem behaves similarly to anonymous memory when
releasing memory, it doesn't seem worth the added complexity?

So unfortunately it seems Zhiguo still needs v10 to take folio_test_anon()
back? Sorry for my bad, Zhiguo.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
zhiguojiang July 10, 2024, 6:47 a.m. UTC | #12
在 2024/7/10 12:44, Barry Song 写道:
> [Some people who received this message don't often get email from 21cnbao@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@redhat.com> wrote:
>> On 10.07.24 06:02, Barry Song wrote:
>>> On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
>>>> On 10.07.24 05:32, Barry Song wrote:
>>>>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>>>
>>>>>>> The releasing process of the non-shared anonymous folio mapped solely by
>>>>>>> an exiting process may go through two flows: 1) the anonymous folio is
>>>>>>> firstly is swaped-out into swapspace and transformed into a swp_entry
>>>>>>> in shrink_folio_list; 2) then the swp_entry is released in the process
>>>>>>> exiting flow. This will result in the high cpu load of releasing a
>>>>>>> non-shared anonymous folio mapped solely by an exiting process.
>>>>>>>
>>>>>>> When the low system memory and the exiting process exist at the same
>>>>>>> time, it will be likely to happen, because the non-shared anonymous
>>>>>>> folio mapped solely by an exiting process may be reclaimed by
>>>>>>> shrink_folio_list.
>>>>>>>
>>>>>>> This patch is that shrink skips the non-shared anonymous folio solely
>>>>>>> mapped by an exting process and this folio is only released directly in
>>>>>>> the process exiting flow, which will save swap-out time and alleviate
>>>>>>> the load of the process exiting.
>>>>>> It would be helpful to provide some before-and-after runtime
>>>>>> measurements, please.  It's a performance optimization so please let's
>>>>>> see what effect it has.
>>>>> Hi Andrew,
>>>>>
>>>>> This was something I was curious about too, so I created a small test program
>>>>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
>>>>> up a small machine with only 300MB of RAM to trigger kswapd.
>>>>>
>>>>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
>>>>>     -smp cpus=4 -cpu max \
>>>>>     -m 300M -kernel arch/arm64/boot/Image
>>>>>
>>>>> The test program will be randomly terminated by its subprocess to trigger
>>>>> the use case of this patch.
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>> #include <string.h>
>>>>> #include <sys/types.h>
>>>>> #include <sys/wait.h>
>>>>> #include <time.h>
>>>>> #include <signal.h>
>>>>>
>>>>> #define MEMORY_SIZE (256 * 1024 * 1024)
>>>>>
>>>>> unsigned char *memory;
>>>>>
>>>>> void allocate_and_write_memory()
>>>>> {
>>>>>        memory = (unsigned char *)malloc(MEMORY_SIZE);
>>>>>        if (memory == NULL) {
>>>>>            perror("malloc");
>>>>>            exit(EXIT_FAILURE);
>>>>>        }
>>>>>
>>>>>        while (1)
>>>>>            memset(memory, 0x11, MEMORY_SIZE);
>>>>> }
>>>>>
>>>>> int main()
>>>>> {
>>>>>        pid_t pid;
>>>>>        srand(time(NULL));
>>>>>
>>>>>        pid = fork();
>>>>>
>>>>>        if (pid < 0) {
>>>>>            perror("fork");
>>>>>            exit(EXIT_FAILURE);
>>>>>        }
>>>>>
>>>>>        if (pid == 0) {
>>>>>            int delay = (rand() % 10000) + 10000;
>>>>>            usleep(delay * 1000);
>>>>>
>>>>>         /* kill parent when it is busy on swapping */
>>>>>            kill(getppid(), SIGKILL);
>>>>>            _exit(0);
>>>>>        } else {
>>>>>            allocate_and_write_memory();
>>>>>
>>>>>            wait(NULL);
>>>>>
>>>>>            free(memory);
>>>>>        }
>>>>>
>>>>>        return 0;
>>>>> }
>>>>>
>>>>> I tracked the number of folios that could be redundantly
>>>>> swapped out by adding a simple counter as shown below:
>>>>>
>>>>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
>>>>>                        check_stable_address_space(vma->vm_mm)) &&
>>>>>                        folio_test_swapbacked(folio) &&
>>>>>                        !folio_likely_mapped_shared(folio)) {
>>>>> +                       static long i, size;
>>>>> +                       size += folio_size(folio);
>>>>> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
>>>>>                            pra->referenced = -1;
>>>>>                            page_vma_mapped_walk_done(&pvmw);
>>>>>                            return false;
>>>>>
>>>>>
>>>>> This is what I have observed:
>>>>>
>>>>> / # /home/barry/develop/linux/skip_swap_out_test
>>>>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
>>>>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
>>>>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
>>>>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
>>>>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
>>>>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
>>>>> ...
>>>>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
>>>>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
>>>>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
>>>>>
>>>>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
>>>>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
>>>>> the process exit.
>>>>>
>>>>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
>>>>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
>>>> BTW, we dropped the folio_test_anon() check, but what about shmem? They
>>>> also do __folio_set_swapbacked()?
>>> my point is that the purpose is skipping redundant swap-out, if shmem is single
>>> mapped, they could be also skipped.
>> But they won't get necessarily *freed* when unmapping them. They might
>> just continue living in tmpfs? where some other process might just map
>> them later?
>>
> You're correct. I overlooked this aspect, focusing on swap and thinking of shmem
> solely in terms of swap.
>
>> IMHO, there is a big difference here between anon and shmem. (well,
>> anon_shmem would actually be different :) )
> Even though anon_shmem behaves similarly to anonymous memory when
> releasing memory, it doesn't seem worth the added complexity?
>
> So unfortunately it seems Zhiguo still needs v10 to take folio_test_anon()
> back? Sorry for my bad, Zhiguo.
If folio_test_anon(folio) && folio_test_swapbacked(folio) condition is 
used, can
it means that the folio is anonymous anther than shmem definitely? So does
folio_likely_mapped_shared() need to be removed?
>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> Thanks
> Barry
Thanks
Zhiguo
Barry Song July 10, 2024, 7:11 a.m. UTC | #13
On Wed, Jul 10, 2024 at 6:47 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> 在 2024/7/10 12:44, Barry Song 写道:
> > [Some people who received this message don't often get email from 21cnbao@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@redhat.com> wrote:
> >> On 10.07.24 06:02, Barry Song wrote:
> >>> On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
> >>>> On 10.07.24 05:32, Barry Song wrote:
> >>>>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>>>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>>>>
> >>>>>>> The releasing process of the non-shared anonymous folio mapped solely by
> >>>>>>> an exiting process may go through two flows: 1) the anonymous folio is
> >>>>>>> firstly is swaped-out into swapspace and transformed into a swp_entry
> >>>>>>> in shrink_folio_list; 2) then the swp_entry is released in the process
> >>>>>>> exiting flow. This will result in the high cpu load of releasing a
> >>>>>>> non-shared anonymous folio mapped solely by an exiting process.
> >>>>>>>
> >>>>>>> When the low system memory and the exiting process exist at the same
> >>>>>>> time, it will be likely to happen, because the non-shared anonymous
> >>>>>>> folio mapped solely by an exiting process may be reclaimed by
> >>>>>>> shrink_folio_list.
> >>>>>>>
> >>>>>>> This patch is that shrink skips the non-shared anonymous folio solely
> >>>>>>> mapped by an exting process and this folio is only released directly in
> >>>>>>> the process exiting flow, which will save swap-out time and alleviate
> >>>>>>> the load of the process exiting.
> >>>>>> It would be helpful to provide some before-and-after runtime
> >>>>>> measurements, please.  It's a performance optimization so please let's
> >>>>>> see what effect it has.
> >>>>> Hi Andrew,
> >>>>>
> >>>>> This was something I was curious about too, so I created a small test program
> >>>>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> >>>>> up a small machine with only 300MB of RAM to trigger kswapd.
> >>>>>
> >>>>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
> >>>>>     -smp cpus=4 -cpu max \
> >>>>>     -m 300M -kernel arch/arm64/boot/Image
> >>>>>
> >>>>> The test program will be randomly terminated by its subprocess to trigger
> >>>>> the use case of this patch.
> >>>>>
> >>>>> #include <stdio.h>
> >>>>> #include <stdlib.h>
> >>>>> #include <unistd.h>
> >>>>> #include <string.h>
> >>>>> #include <sys/types.h>
> >>>>> #include <sys/wait.h>
> >>>>> #include <time.h>
> >>>>> #include <signal.h>
> >>>>>
> >>>>> #define MEMORY_SIZE (256 * 1024 * 1024)
> >>>>>
> >>>>> unsigned char *memory;
> >>>>>
> >>>>> void allocate_and_write_memory()
> >>>>> {
> >>>>>        memory = (unsigned char *)malloc(MEMORY_SIZE);
> >>>>>        if (memory == NULL) {
> >>>>>            perror("malloc");
> >>>>>            exit(EXIT_FAILURE);
> >>>>>        }
> >>>>>
> >>>>>        while (1)
> >>>>>            memset(memory, 0x11, MEMORY_SIZE);
> >>>>> }
> >>>>>
> >>>>> int main()
> >>>>> {
> >>>>>        pid_t pid;
> >>>>>        srand(time(NULL));
> >>>>>
> >>>>>        pid = fork();
> >>>>>
> >>>>>        if (pid < 0) {
> >>>>>            perror("fork");
> >>>>>            exit(EXIT_FAILURE);
> >>>>>        }
> >>>>>
> >>>>>        if (pid == 0) {
> >>>>>            int delay = (rand() % 10000) + 10000;
> >>>>>            usleep(delay * 1000);
> >>>>>
> >>>>>         /* kill parent when it is busy on swapping */
> >>>>>            kill(getppid(), SIGKILL);
> >>>>>            _exit(0);
> >>>>>        } else {
> >>>>>            allocate_and_write_memory();
> >>>>>
> >>>>>            wait(NULL);
> >>>>>
> >>>>>            free(memory);
> >>>>>        }
> >>>>>
> >>>>>        return 0;
> >>>>> }
> >>>>>
> >>>>> I tracked the number of folios that could be redundantly
> >>>>> swapped out by adding a simple counter as shown below:
> >>>>>
> >>>>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
> >>>>>                        check_stable_address_space(vma->vm_mm)) &&
> >>>>>                        folio_test_swapbacked(folio) &&
> >>>>>                        !folio_likely_mapped_shared(folio)) {
> >>>>> +                       static long i, size;
> >>>>> +                       size += folio_size(folio);
> >>>>> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
> >>>>>                            pra->referenced = -1;
> >>>>>                            page_vma_mapped_walk_done(&pvmw);
> >>>>>                            return false;
> >>>>>
> >>>>>
> >>>>> This is what I have observed:
> >>>>>
> >>>>> / # /home/barry/develop/linux/skip_swap_out_test
> >>>>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> >>>>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> >>>>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> >>>>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> >>>>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> >>>>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> >>>>> ...
> >>>>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> >>>>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> >>>>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> >>>>>
> >>>>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
> >>>>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
> >>>>> the process exit.
> >>>>>
> >>>>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> >>>>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
> >>>> BTW, we dropped the folio_test_anon() check, but what about shmem? They
> >>>> also do __folio_set_swapbacked()?
> >>> my point is that the purpose is skipping redundant swap-out, if shmem is single
> >>> mapped, they could be also skipped.
> >> But they won't get necessarily *freed* when unmapping them. They might
> >> just continue living in tmpfs? where some other process might just map
> >> them later?
> >>
> > You're correct. I overlooked this aspect, focusing on swap and thinking of shmem
> > solely in terms of swap.
> >
> >> IMHO, there is a big difference here between anon and shmem. (well,
> >> anon_shmem would actually be different :) )
> > Even though anon_shmem behaves similarly to anonymous memory when
> > releasing memory, it doesn't seem worth the added complexity?
> >
> > So unfortunately it seems Zhiguo still needs v10 to take folio_test_anon()
> > back? Sorry for my bad, Zhiguo.
> If folio_test_anon(folio) && folio_test_swapbacked(folio) condition is
> used, can
> it means that the folio is anonymous anther than shmem definitely? So does
> folio_likely_mapped_shared() need to be removed?

No, shared memory (shmem) isn't necessarily shared, and private anonymous
memory isn't necessarily unshared. There is no direct relationship between
them.

In the case of a fork, your private anonymous folio can be shared by
two or more processes before CoW.

> >
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
> > Thanks
> > Barry
> Thanks
> Zhiguo
>
zhiguojiang July 10, 2024, 8:38 a.m. UTC | #14
在 2024/7/10 15:11, Barry Song 写道:
> [Some people who received this message don't often get email from 21cnbao@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Wed, Jul 10, 2024 at 6:47 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> 在 2024/7/10 12:44, Barry Song 写道:
>>> [Some people who received this message don't often get email from 21cnbao@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>
>>> On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@redhat.com> wrote:
>>>> On 10.07.24 06:02, Barry Song wrote:
>>>>> On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>> On 10.07.24 05:32, Barry Song wrote:
>>>>>>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>>>>> On Tue,  9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>>>>>
>>>>>>>>> The releasing process of the non-shared anonymous folio mapped solely by
>>>>>>>>> an exiting process may go through two flows: 1) the anonymous folio is
>>>>>>>>> firstly is swaped-out into swapspace and transformed into a swp_entry
>>>>>>>>> in shrink_folio_list; 2) then the swp_entry is released in the process
>>>>>>>>> exiting flow. This will result in the high cpu load of releasing a
>>>>>>>>> non-shared anonymous folio mapped solely by an exiting process.
>>>>>>>>>
>>>>>>>>> When the low system memory and the exiting process exist at the same
>>>>>>>>> time, it will be likely to happen, because the non-shared anonymous
>>>>>>>>> folio mapped solely by an exiting process may be reclaimed by
>>>>>>>>> shrink_folio_list.
>>>>>>>>>
>>>>>>>>> This patch is that shrink skips the non-shared anonymous folio solely
>>>>>>>>> mapped by an exting process and this folio is only released directly in
>>>>>>>>> the process exiting flow, which will save swap-out time and alleviate
>>>>>>>>> the load of the process exiting.
>>>>>>>> It would be helpful to provide some before-and-after runtime
>>>>>>>> measurements, please.  It's a performance optimization so please let's
>>>>>>>> see what effect it has.
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> This was something I was curious about too, so I created a small test program
>>>>>>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
>>>>>>> up a small machine with only 300MB of RAM to trigger kswapd.
>>>>>>>
>>>>>>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
>>>>>>>      -smp cpus=4 -cpu max \
>>>>>>>      -m 300M -kernel arch/arm64/boot/Image
>>>>>>>
>>>>>>> The test program will be randomly terminated by its subprocess to trigger
>>>>>>> the use case of this patch.
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>> #include <unistd.h>
>>>>>>> #include <string.h>
>>>>>>> #include <sys/types.h>
>>>>>>> #include <sys/wait.h>
>>>>>>> #include <time.h>
>>>>>>> #include <signal.h>
>>>>>>>
>>>>>>> #define MEMORY_SIZE (256 * 1024 * 1024)
>>>>>>>
>>>>>>> unsigned char *memory;
>>>>>>>
>>>>>>> void allocate_and_write_memory()
>>>>>>> {
>>>>>>>         memory = (unsigned char *)malloc(MEMORY_SIZE);
>>>>>>>         if (memory == NULL) {
>>>>>>>             perror("malloc");
>>>>>>>             exit(EXIT_FAILURE);
>>>>>>>         }
>>>>>>>
>>>>>>>         while (1)
>>>>>>>             memset(memory, 0x11, MEMORY_SIZE);
>>>>>>> }
>>>>>>>
>>>>>>> int main()
>>>>>>> {
>>>>>>>         pid_t pid;
>>>>>>>         srand(time(NULL));
>>>>>>>
>>>>>>>         pid = fork();
>>>>>>>
>>>>>>>         if (pid < 0) {
>>>>>>>             perror("fork");
>>>>>>>             exit(EXIT_FAILURE);
>>>>>>>         }
>>>>>>>
>>>>>>>         if (pid == 0) {
>>>>>>>             int delay = (rand() % 10000) + 10000;
>>>>>>>             usleep(delay * 1000);
>>>>>>>
>>>>>>>          /* kill parent when it is busy on swapping */
>>>>>>>             kill(getppid(), SIGKILL);
>>>>>>>             _exit(0);
>>>>>>>         } else {
>>>>>>>             allocate_and_write_memory();
>>>>>>>
>>>>>>>             wait(NULL);
>>>>>>>
>>>>>>>             free(memory);
>>>>>>>         }
>>>>>>>
>>>>>>>         return 0;
>>>>>>> }
>>>>>>>
>>>>>>> I tracked the number of folios that could be redundantly
>>>>>>> swapped out by adding a simple counter as shown below:
>>>>>>>
>>>>>>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
>>>>>>>                         check_stable_address_space(vma->vm_mm)) &&
>>>>>>>                         folio_test_swapbacked(folio) &&
>>>>>>>                         !folio_likely_mapped_shared(folio)) {
>>>>>>> +                       static long i, size;
>>>>>>> +                       size += folio_size(folio);
>>>>>>> +                       pr_err("index: %d skipped folio:%lx total size:%d\n", i++, (unsigned long)folio, size);
>>>>>>>                             pra->referenced = -1;
>>>>>>>                             page_vma_mapped_walk_done(&pvmw);
>>>>>>>                             return false;
>>>>>>>
>>>>>>>
>>>>>>> This is what I have observed:
>>>>>>>
>>>>>>> / # /home/barry/develop/linux/skip_swap_out_test
>>>>>>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
>>>>>>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
>>>>>>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
>>>>>>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
>>>>>>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
>>>>>>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
>>>>>>> ...
>>>>>>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
>>>>>>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
>>>>>>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
>>>>>>>
>>>>>>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
>>>>>>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
>>>>>>> the process exit.
>>>>>>>
>>>>>>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
>>>>>>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
>>>>>> BTW, we dropped the folio_test_anon() check, but what about shmem? They
>>>>>> also do __folio_set_swapbacked()?
>>>>> my point is that the purpose is skipping redundant swap-out, if shmem is single
>>>>> mapped, they could be also skipped.
>>>> But they won't get necessarily *freed* when unmapping them. They might
>>>> just continue living in tmpfs? where some other process might just map
>>>> them later?
>>>>
>>> You're correct. I overlooked this aspect, focusing on swap and thinking of shmem
>>> solely in terms of swap.
>>>
>>>> IMHO, there is a big difference here between anon and shmem. (well,
>>>> anon_shmem would actually be different :) )
>>> Even though anon_shmem behaves similarly to anonymous memory when
>>> releasing memory, it doesn't seem worth the added complexity?
>>>
>>> So unfortunately it seems Zhiguo still needs v10 to take folio_test_anon()
>>> back? Sorry for my bad, Zhiguo.
>> If folio_test_anon(folio) && folio_test_swapbacked(folio) condition is
>> used, can
>> it means that the folio is anonymous anther than shmem definitely? So does
>> folio_likely_mapped_shared() need to be removed?
> No, shared memory (shmem) isn't necessarily shared, and private anonymous
> memory isn't necessarily unshared. There is no direct relationship between
> them.
>
> In the case of a fork, your private anonymous folio can be shared by
> two or more processes before CoW.
Hi,
I have added folio_test_anon(folio) condition in v10.
Thanks
>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>>
>>> Thanks
>>> Barry
>> Thanks
>> Zhiguo
>>
diff mbox series

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 88156deb46a6..bb9954773cce 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -877,6 +877,20 @@  static bool folio_referenced_one(struct folio *folio,
 			continue;
 		}
 
+		/*
+		 * Skip the non-shared swapbacked folio mapped solely by
+		 * the exiting or OOM-reaped process. This avoids redundant
+		 * swap-out followed by an immediate unmap.
+		 */
+		if ((!atomic_read(&vma->vm_mm->mm_users) ||
+		    check_stable_address_space(vma->vm_mm)) &&
+		    folio_test_swapbacked(folio) &&
+		    !folio_likely_mapped_shared(folio)) {
+			pra->referenced = -1;
+			page_vma_mapped_walk_done(&pvmw);
+			return false;
+		}
+
 		if (pvmw.pte) {
 			if (lru_gen_enabled() &&
 			    pte_young(ptep_get(pvmw.pte))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 80f9a486cf27..1d5f78a3dbeb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -863,7 +863,12 @@  static enum folio_references folio_check_references(struct folio *folio,
 	if (vm_flags & VM_LOCKED)
 		return FOLIOREF_ACTIVATE;
 
-	/* rmap lock contention: rotate */
+	/*
+	 * There are two cases to consider.
+	 * 1) Rmap lock contention: rotate.
+	 * 2) Skip the non-shared swapbacked folio mapped solely by
+	 *    the exiting or OOM-reaped process.
+	 */
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;