[v2,0/3] mm: tlb swap entries batch async release

Message ID: 20240731133318.527-1-justinjiang@vivo.com

Message

zhiguojiang July 31, 2024, 1:33 p.m. UTC
The main reason for the prolonged exit of a background process is the
time-consuming release of its swap entries. The proportion of swap memory
occupied by the background process increases with its duration in the
background, and after a period of time, this value can reach 60% or more.
Additionally, the relatively lengthy path for releasing swap entries
further contributes to the longer time required for the background process
to release its swap entries.

In the multiple background applications scenario, when launching a
large-memory application such as a camera, the system may enter a low
memory state, which triggers the killing of multiple background
processes at the same time. Because the exiting processes occupy
multiple CPUs concurrently, the current foreground application's CPU
resources are tight, which may cause issues such as lagging.

To solve this problem, we have introduced an asynchronous swap-entry
release mechanism for multiple exiting processes, which isolates and
caches the swap entries occupied by the exiting processes and hands
them over to an asynchronous kworker to complete the release. This
allows the exiting processes to complete quickly and release their
CPU resources. We have validated this modification on products and
achieved the expected benefits.

It offers several benefits:
1. Alleviates the high system CPU load caused by multiple exiting
   processes running simultaneously.
2. Reduces lock contention in the swap-entry free path by using an
   asynchronous kworker instead of multiple exiting processes in parallel.
3. Releases memory occupied by exiting processes more efficiently.

-v2:
1. Fix a compilation warning with the arch/s390 config.
 Reported-by: kernel test robot <lkp@intel.com>
 Closes: https://lore.kernel.org/oe-kbuild-all/202407311703.8q8sDQ2p-lkp@intel.com/
 Reported-by: kernel test robot <lkp@intel.com>
 Closes: https://lore.kernel.org/oe-kbuild-all/202407311947.VPJNRqad-lkp@intel.com/

-v1:
 https://lore.kernel.org/linux-mm/20240730114426.511-1-justinjiang@vivo.com/

Zhiguo Jiang (3):
  mm: move task_is_dying to h headfile
  mm: tlb: add tlb swap entries batch async release
  mm: s390: fix compilation warning

 arch/s390/include/asm/tlb.h |   8 +
 include/asm-generic/tlb.h   |  44 ++++++
 include/linux/mm_types.h    |  58 +++++++
 include/linux/oom.h         |   6 +
 mm/memcontrol.c             |   6 -
 mm/memory.c                 |   3 +-
 mm/mmu_gather.c             | 297 ++++++++++++++++++++++++++++++++++++
 7 files changed, 415 insertions(+), 7 deletions(-)

Comments

Andrew Morton July 31, 2024, 4:17 p.m. UTC | #1
On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:

> The main reason for the prolonged exit of a background process is the

The kernel really doesn't have a concept of a "background process". 
It's a userspace concept - perhaps "the parent process isn't waiting on
this process via wait()".

I assume here you're referring to an Android userspace concept?  I
expect that when Android "backgrounds" a process, it does lots of
things to that process.  Perhaps scheduling priority, perhaps
alteration of various MM tunables, etc.

So rather than referring to "backgrounding" it would be better to
identify what tuning alterations are made to such processes to bring
about this behavior.

> time-consuming release of its swap entries. The proportion of swap memory
> occupied by the background process increases with its duration in the
> background, and after a period of time, this value can reach 60% or more.

Again, what is it about the tuning of such processes which causes this
behavior?

> Additionally, the relatively lengthy path for releasing swap entries
> further contributes to the longer time required for the background process
> to release its swap entries.
> 
> In the multiple background applications scenario, when launching a
> large-memory application such as a camera, the system may enter a low
> memory state, which triggers the killing of multiple background
> processes at the same time. Because the exiting processes occupy
> multiple CPUs concurrently, the current foreground application's CPU
> resources are tight, which may cause issues such as lagging.
> 
> To solve this problem, we have introduced an asynchronous swap-entry
> release mechanism for multiple exiting processes, which isolates and
> caches the swap entries occupied by the exiting processes and hands
> them over to an asynchronous kworker to complete the release. This
> allows the exiting processes to complete quickly and release their
> CPU resources. We have validated this modification on products and
> achieved the expected benefits.

Dumb question: why can't this be done in userspace?  The exiting
process does fork/exit and lets the child do all this asynchronous freeing?
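Something like this minimal userspace sketch (hedged: it assumes the
forked child simply inherits the copy-on-write mm and carries the
teardown cost at low priority; whether that actually moves the swap
freeing off the critical path is exactly the question):

#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: drop priority, then exit; teardown of the
		 * duplicated mm (and its swap references) happens in
		 * this low-priority context. */
		setpriority(PRIO_PROCESS, 0, 19);
		_exit(0);
	}
	/* Parent: exit immediately, so its visible exit is fast. */
	_exit(0);
}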

> It offers several benefits:
> 1. Alleviates the high system CPU load caused by multiple exiting
>    processes running simultaneously.
> 2. Reduces lock contention in the swap-entry free path by using an
>    asynchronous kworker instead of multiple exiting processes in parallel.

Why is lock contention reduced?  The same amount of work needs to be
done.

> 3. Releases memory occupied by exiting processes more efficiently.

Probably it's slightly less efficient.

There are potential problems with this approach of passing work to a
kernel thread:

- The process will exit while its resources are still allocated.  But
  its parent process assumes those resources are now all freed and the
  parent process then proceeds to allocate resources.  This results in
  a time period where peak resource consumption is higher than it was
  before such a change.

- If all CPUs are running in userspace with realtime policy
  (SCHED_FIFO, for example) then the kworker thread will not run,
  indefinitely.

- Work which should have been accounted to the exiting process will
  instead go unaccounted.  

So please fully address all these potential issues.
zhiguojiang Aug. 1, 2024, 6:30 a.m. UTC | #2
On 2024/8/1 0:17, Andrew Morton wrote:
>
> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
>> The main reason for the prolonged exit of a background process is the
> The kernel really doesn't have a concept of a "background process".
> It's a userspace concept - perhaps "the parent process isn't waiting on
> this process via wait()".
>
> I assume here you're referring to an Android userspace concept?  I
> expect that when Android "backgrounds" a process, it does lots of
> things to that process.  Perhaps scheduling priority, perhaps
> alteration of various MM tunables, etc.
>
> So rather than referring to "backgrounding" it would be better to
> identify what tuning alterations are made to such processes to bring
> about this behavior.
Hi Andrew Morton,

Thank you for your review and comments.

You are right. The "background process" here refers to the process
corresponding to an Android application switched to the background.
In fact, this patch is applicable to any exiting process.

To further explain the concept of "multiple exiting processes": it
refers to different processes owning independent mms, rather than
sharing the same mm.

I will use "mm" to describe the process instead of "background" in
the next version.
>
>> time-consuming release of its swap entries. The proportion of swap memory
>> occupied by the background process increases with its duration in the
>> background, and after a period of time, this value can reach 60% or more.
> Again, what is it about the tuning of such processes which causes this
> behavior?
When the system is low on memory, memory reclaim is triggered, and
anonymous folios in the process are continuously reclaimed, resulting
in an increase in the swap entries occupied by this process. So the
longer the process runs, the more time it takes to release its swap
entries when it is killed.

Testing data of a process occupying different physical memory sizes
at different time points:
Testing Platform: 8GB RAM
Testing procedure:
After booting up, start 15 processes first, and then observe the
physical memory size occupied by the last launched process at
different time points.

Example:
The process launched last: com.qiyi.video
|  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
-------------------------------------------------------------------
|     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
|   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
|   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
|  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
|    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
| Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
min - minute.

Based on the above data, we can see that the swap ratio occupied by
the process gradually increases over time.
>
>> Additionally, the relatively lengthy path for releasing swap entries
>> further contributes to the longer time required for the background process
>> to release its swap entries.
>>
>> In the multiple background applications scenario, when launching a
>> large-memory application such as a camera, the system may enter a low
>> memory state, which triggers the killing of multiple background
>> processes at the same time. Because the exiting processes occupy
>> multiple CPUs concurrently, the current foreground application's CPU
>> resources are tight, which may cause issues such as lagging.
>>
>> To solve this problem, we have introduced an asynchronous swap-entry
>> release mechanism for multiple exiting processes, which isolates and
>> caches the swap entries occupied by the exiting processes and hands
>> them over to an asynchronous kworker to complete the release. This
>> allows the exiting processes to complete quickly and release their
>> CPU resources. We have validated this modification on products and
>> achieved the expected benefits.
> Dumb question: why can't this be done in userspace?  The exiting
> process does fork/exit and lets the child do all this asynchronous freeing?
The optimization of the kernel's swap-entry release logic cannot be
implemented in userspace. The multiple exiting processes here own
independent mms, rather than being parent and child processes sharing
the same mm. Therefore, when the kernel executes multiple exiting
processes simultaneously, they will occupy multiple CPU cores to
complete the work.
>> It offers several benefits:
>> 1. Alleviates the high system CPU load caused by multiple exiting
>>    processes running simultaneously.
>> 2. Reduces lock contention in the swap-entry free path by using an
>>    asynchronous kworker instead of multiple exiting processes in parallel.
> Why is lock contention reduced?  The same amount of work needs to be
> done.
When multiple CPU cores release the different swap entries belonging
to different exiting processes simultaneously, the cluster lock or
swapinfo lock may encounter contention. An asynchronous kworker that
occupies only one CPU core to complete this work reduces the
probability of lock contention and frees the remaining CPU cores for
other, non-exiting processes to use.
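(A hedged sketch of the batching effect on the swapinfo lock: one
lock acquisition can cover a whole batch instead of one entry per
acquisition. swap_entry_free() is a hypothetical stand-in for the
per-entry helper used on the real path.)

/* Sketch: free a batch of entries under a single si->lock hold. */
static void release_batch(struct swap_info_struct *si,
			  swp_entry_t *entries, int nr)
{
	int i;

	spin_lock(&si->lock);
	for (i = 0; i < nr; i++)
		swap_entry_free(si, entries[i]);  /* hypothetical helper */
	spin_unlock(&si->lock);
}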
>
>> 3. Releases memory occupied by exiting processes more efficiently.
> Probably it's slightly less efficient.
We observed that using an asynchronous kworker results in more free
memory earlier. When multiple processes exit simultaneously, due to
competition for CPU cores, these exiting processes remain in a
runnable state for a long time and cannot release their occupied
memory resources in a timely manner.
>
> There are potential problems with this approach of passing work to a
> kernel thread:
>
> - The process will exit while its resources are still allocated.  But
>    its parent process assumes those resources are now all freed and the
>    parent process then proceeds to allocate resources.  This results in
>    a time period where peak resource consumption is higher than it was
>    before such a change.
- I don't think this modification will cause such a problem. Perhaps I
   haven't fully understood your meaning yet. Can you give me a specific
   example?
> - If all CPUs are running in userspace with realtime policy
>    (SCHED_FIFO, for example) then the kworker thread will not run,
>    indefinitely.
- In my clumsy understanding, the execution priority of kernel threads
   should not be lower than that of the exiting process, and the
   asynchronous kworker is only triggered when a process exits. The
   exiting process should not be set to SCHED_FIFO, so when the
   exiting process runs, the asynchronous kworker should also have an
   opportunity to run in a timely manner.
> - Work which should have been accounted to the exiting process will
>    instead go unaccounted.
- You are right, the statistics of process exit time may no longer be
   complete.
> So please fully address all these potential issues.
Thanks
Zhiguo
Barry Song Aug. 1, 2024, 7:36 a.m. UTC | #3
On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> On 2024/8/1 0:17, Andrew Morton wrote:
> >
> > On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >
> >> The main reason for the prolonged exit of a background process is the
> > The kernel really doesn't have a concept of a "background process".
> > It's a userspace concept - perhaps "the parent process isn't waiting on
> > this process via wait()".
> >
> > I assume here you're referring to an Android userspace concept?  I
> > expect that when Android "backgrounds" a process, it does lots of
> > things to that process.  Perhaps scheduling priority, perhaps
> > alteration of various MM tunables, etc.
> >
> > So rather than referring to "backgrounding" it would be better to
> > identify what tuning alterations are made to such processes to bring
> > about this behavior.
> Hi Andrew Morton,
>
> Thank you for your review and comments.
>
> You are right. The "background process" here refers to the process
> corresponding to an Android application switched to the background.
> In fact, this patch is applicable to any exiting process.
>
> To further explain the concept of "multiple exiting processes": it
> refers to different processes owning independent mms, rather than
> sharing the same mm.
>
> I will use "mm" to describe the process instead of "background" in
> the next version.
> >
> >> time-consuming release of its swap entries. The proportion of swap memory
> >> occupied by the background process increases with its duration in the
> >> background, and after a period of time, this value can reach 60% or more.
> > Again, what is it about the tuning of such processes which causes this
> > behavior?
> When the system is low on memory, memory reclaim is triggered, and
> anonymous folios in the process are continuously reclaimed, resulting
> in an increase in the swap entries occupied by this process. So the
> longer the process runs, the more time it takes to release its swap
> entries when it is killed.
>
> Testing data of a process occupying different physical memory sizes
> at different time points:
> Testing Platform: 8GB RAM
> Testing procedure:
> After booting up, start 15 processes first, and then observe the
> physical memory size occupied by the last launched process at
> different time points.
>
> Example:
> The process launched last: com.qiyi.video
> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
> -------------------------------------------------------------------
> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> min - minute.
>
> Based on the above data, we can see that the swap ratio occupied by
> the process gradually increases over time.

If I understand correctly, during zap_pte_range(), if 64.72% of the
anonymous pages are actually swapped out, you end up zapping 100 PTEs
but only freeing about 35 pages of memory. By doing this
asynchronously, you prevent the swap-release operation from blocking
the zapping of normal PTEs that map to memory.

Could you provide data showing the improvements after implementing
asynchronous freeing of swap entries?
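(To make the arithmetic concrete: with a 64.72% swap ratio, roughly
100 - 64.72 ~= 35 of every 100 zapped PTEs free a resident page, and
the rest only release swap slots. A hedged sketch of the loop shape
being discussed, not the kernel's actual zap_pte_range():)

#include <linux/pgtable.h>
#include <linux/swapops.h>

static void zap_ptes_sketch(pte_t *pte, unsigned long addr,
			    unsigned long end)
{
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		if (pte_none(ptent))
			continue;
		if (pte_present(ptent)) {
			/* ~35 of every 100 PTEs in the example above:
			 * a resident folio is unmapped and freed. */
		} else if (is_swap_pte(ptent)) {
			/* ~65 of every 100 PTEs: only a swap slot to
			 * free (e.g. via free_swap_and_cache()); the
			 * series defers this work to a kworker. */
		}
	}
}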


> >
> >> Additionally, the relatively lengthy path for releasing swap entries
> >> further contributes to the longer time required for the background process
> >> to release its swap entries.
> >>
> >> In the multiple background applications scenario, when launching a
> >> large-memory application such as a camera, the system may enter a low
> >> memory state, which triggers the killing of multiple background
> >> processes at the same time. Because the exiting processes occupy
> >> multiple CPUs concurrently, the current foreground application's CPU
> >> resources are tight, which may cause issues such as lagging.
> >>
> >> To solve this problem, we have introduced an asynchronous swap-entry
> >> release mechanism for multiple exiting processes, which isolates and
> >> caches the swap entries occupied by the exiting processes and hands
> >> them over to an asynchronous kworker to complete the release. This
> >> allows the exiting processes to complete quickly and release their
> >> CPU resources. We have validated this modification on products and
> >> achieved the expected benefits.
> > Dumb question: why can't this be done in userspace?  The exiting
> > process does fork/exit and lets the child do all this asynchronous freeing?
> The optimization of the kernel's swap-entry release logic cannot be
> implemented in userspace. The multiple exiting processes here own
> independent mms, rather than being parent and child processes sharing
> the same mm. Therefore, when the kernel executes multiple exiting
> processes simultaneously, they will occupy multiple CPU cores to
> complete the work.
> >> It offers several benefits:
> >> 1. Alleviates the high system CPU load caused by multiple exiting
> >>    processes running simultaneously.
> >> 2. Reduces lock contention in the swap-entry free path by using an
> >>    asynchronous kworker instead of multiple exiting processes in parallel.
> > Why is lock contention reduced?  The same amount of work needs to be
> > done.
> When multiple CPU cores release the different swap entries belonging
> to different exiting processes simultaneously, the cluster lock or
> swapinfo lock may encounter contention. An asynchronous kworker that
> occupies only one CPU core to complete this work reduces the
> probability of lock contention and frees the remaining CPU cores for
> other, non-exiting processes to use.
> >
> >> 3. Releases memory occupied by exiting processes more efficiently.
> > Probably it's slightly less efficient.
> We observed that using an asynchronous kworker results in more free
> memory earlier. When multiple processes exit simultaneously, due to
> competition for CPU cores, these exiting processes remain in a
> runnable state for a long time and cannot release their occupied
> memory resources in a timely manner.
> >
> > There are potential problems with this approach of passing work to a
> > kernel thread:
> >
> > - The process will exit while its resources are still allocated.  But
> >    its parent process assumes those resources are now all freed and the
> >    parent process then proceeds to allocate resources.  This results in
> >    a time period where peak resource consumption is higher than it was
> >    before such a change.
> - I don't think this modification will cause such a problem. Perhaps I
>    haven't fully understood your meaning yet. Can you give me a specific
>    example?

Normally, after completing zap_pte_range(), your swap slots are
returned to the swap file, except for a few slot caches. However,
with the asynchronous approach, even after your process has completely
exited, some swap slots might still not have been released to the
system. This could potentially starve other processes waiting for
swap slots to perform swap-outs. I assume this isn't a critical issue
for you because, in the case of killing processes, freeing up memory
is more important than releasing swap entries?


> > - If all CPUs are running in userspace with realtime policy
> >    (SCHED_FIFO, for example) then the kworker thread will not run,
> >    indefinitely.
> - In my clumsy understanding, the execution priority of kernel threads
>    should not be lower than that of the exiting process, and the
>    asynchronous kworker is only triggered when a process exits. The
>    exiting process should not be set to SCHED_FIFO, so when the
>    exiting process runs, the asynchronous kworker should also have an
>    opportunity to run in a timely manner.
> > - Work which should have been accounted to the exiting process will
> >    instead go unaccounted.
> - You are right, the statistics of process exit time may no longer be
>    complete.
> > So please fully address all these potential issues.
> Thanks
> Zhiguo
>

Thanks
Barry
zhiguojiang Aug. 1, 2024, 10:33 a.m. UTC | #4
On 2024/8/1 15:36, Barry Song wrote:
> On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>> On 2024/8/1 0:17, Andrew Morton wrote:
>>>
>>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>
>>>> The main reason for the prolonged exit of a background process is the
>>> The kernel really doesn't have a concept of a "background process".
>>> It's a userspace concept - perhaps "the parent process isn't waiting on
>>> this process via wait()".
>>>
>>> I assume here you're referring to an Android userspace concept?  I
>>> expect that when Android "backgrounds" a process, it does lots of
>>> things to that process.  Perhaps scheduling priority, perhaps
>>> alteration of various MM tunables, etc.
>>>
>>> So rather than referring to "backgrounding" it would be better to
>>> identify what tuning alterations are made to such processes to bring
>>> about this behavior.
>> Hi Andrew Morton,
>>
>> Thank you for your review and comments.
>>
>> You are right. The "background process" here refers to the process
>> corresponding to an Android application switched to the background.
>> In fact, this patch is applicable to any exiting process.
>>
>> To further explain the concept of "multiple exiting processes": it
>> refers to different processes owning independent mms, rather than
>> sharing the same mm.
>>
>> I will use "mm" to describe the process instead of "background" in
>> the next version.
>>>> time-consuming release of its swap entries. The proportion of swap memory
>>>> occupied by the background process increases with its duration in the
>>>> background, and after a period of time, this value can reach 60% or more.
>>> Again, what is it about the tuning of such processes which causes this
>>> behavior?
>> When the system is low on memory, memory reclaim is triggered, and
>> anonymous folios in the process are continuously reclaimed, resulting
>> in an increase in the swap entries occupied by this process. So the
>> longer the process runs, the more time it takes to release its swap
>> entries when it is killed.
>>
>> Testing data of a process occupying different physical memory sizes
>> at different time points:
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After booting up, start 15 processes first, and then observe the
>> physical memory size occupied by the last launched process at
>> different time points.
>>
>> Example:
>> The process launched last: com.qiyi.video
>> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
>> -------------------------------------------------------------------
>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>> min - minute.
>>
>> Based on the above data, we can see that the swap ratio occupied by
>> the process gradually increases over time.
> If I understand correctly, during zap_pte_range(), if 64.72% of the
> anonymous pages are actually swapped out, you end up zapping 100 PTEs
> but only freeing about 35 pages of memory. By doing this
> asynchronously, you prevent the swap-release operation from blocking
> the zapping of normal PTEs that map to memory.
>
> Could you provide data showing the improvements after implementing
> asynchronous freeing of swap entries?
Hi Barry,

Your understanding is correct. From the perspective of releasing the
physical memory occupied by the exiting process, an asynchronous
kworker releasing swap entries can indeed help the exiting process
release its pte_present memory (e.g. file and anonymous folios)
faster.

In addition, from the perspective of CPU resources, in scenarios
where multiple exiting processes run simultaneously, using an
asynchronous kworker instead of the exiting processes to release swap
entries frees more CPU cores for the current non-exiting, important
processes, thereby improving their user experience. I think this is
the main contribution of this modification.

Example:
When there are multiple processes and system memory is low, starting
the camera processes triggers the instantaneous killing of many
processes, because the camera processes need to allocate a large
amount of memory; this results in multiple exiting processes running
simultaneously. These exiting processes compete with the current
camera processes for CPU resources, and the release of the physical
memory occupied by the multiple exiting processes is slowed by
scheduling, ultimately slowing the camera processes.

By using this optimization, multiple exiting processes can exit
quickly, freeing up their CPU resources and pte_present physical
memory, improving the running speed of the camera processes.

Testing Platform: 8GB RAM
Testing procedure:
After restarting the machine, start 15 app processes first, and then
start the camera app processes; we monitor the cold start and preview
times of the camera app processes.

Test data of camera processes cold start time (unit: millisecond):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
| after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |

Test data of camera processes preview time (unit: millisecond):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
| after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |

Based on the averages of the six sets of test data above, we can see
the benefits of the modified patch:
1. The cold start time of camera app processes is reduced by about 20%.
2. The preview time of camera app processes is reduced by about 42%.
>
>>>> Additionally, the relatively lengthy path for releasing swap entries
>>>> further contributes to the longer time required for the background process
>>>> to release its swap entries.
>>>>
>>>> In the multiple background applications scenario, when launching a
>>>> large-memory application such as a camera, the system may enter a low
>>>> memory state, which triggers the killing of multiple background
>>>> processes at the same time. Because the exiting processes occupy
>>>> multiple CPUs concurrently, the current foreground application's CPU
>>>> resources are tight, which may cause issues such as lagging.
>>>>
>>>> To solve this problem, we have introduced an asynchronous swap-entry
>>>> release mechanism for multiple exiting processes, which isolates and
>>>> caches the swap entries occupied by the exiting processes and hands
>>>> them over to an asynchronous kworker to complete the release. This
>>>> allows the exiting processes to complete quickly and release their
>>>> CPU resources. We have validated this modification on products and
>>>> achieved the expected benefits.
>>> Dumb question: why can't this be done in userspace?  The exiting
>>> process does fork/exit and lets the child do all this asynchronous freeing?
>> The optimization of the kernel's swap-entry release logic cannot be
>> implemented in userspace. The multiple exiting processes here own
>> independent mms, rather than being parent and child processes sharing
>> the same mm. Therefore, when the kernel executes multiple exiting
>> processes simultaneously, they will occupy multiple CPU cores to
>> complete the work.
>>>> It offers several benefits:
>>>> 1. Alleviates the high system CPU load caused by multiple exiting
>>>>    processes running simultaneously.
>>>> 2. Reduces lock contention in the swap-entry free path by using an
>>>>    asynchronous kworker instead of multiple exiting processes in parallel.
>>> Why is lock contention reduced?  The same amount of work needs to be
>>> done.
>> When multiple CPU cores release the different swap entries belonging
>> to different exiting processes simultaneously, the cluster lock or
>> swapinfo lock may encounter contention. An asynchronous kworker that
>> occupies only one CPU core to complete this work reduces the
>> probability of lock contention and frees the remaining CPU cores for
>> other, non-exiting processes to use.
>>>> 3. Releases memory occupied by exiting processes more efficiently.
>>> Probably it's slightly less efficient.
>> We observed that using an asynchronous kworker results in more free
>> memory earlier. When multiple processes exit simultaneously, due to
>> competition for CPU cores, these exiting processes remain in a
>> runnable state for a long time and cannot release their occupied
>> memory resources in a timely manner.
>>> There are potential problems with this approach of passing work to a
>>> kernel thread:
>>>
>>> - The process will exit while its resources are still allocated.  But
>>>     its parent process assumes those resources are now all freed and the
>>>     parent process then proceeds to allocate resources.  This results in
>>>     a time period where peak resource consumption is higher than it was
>>>     before such a change.
>> - I don't think this modification will cause such a problem. Perhaps I
>>     haven't fully understood your meaning yet. Can you give me a specific
>>     example?
> Normally, after completing zap_pte_range(), your swap slots are
> returned to the swap file, except for a few slot caches. However,
> with the asynchronous approach, even after your process has completely
> exited, some swap slots might still not have been released to the
> system. This could potentially starve other processes waiting for
> swap slots to perform swap-outs. I assume this isn't a critical issue
> for you because, in the case of killing processes, freeing up memory
> is more important than releasing swap entries?
I did not encounter issues caused by slow release of swap entries by
the asynchronous kworker during our testing. Normally, the
asynchronous kworker can release the cached swap entries within a
short period of time. Of course, if the system allows, it is worth
raising the priority of the asynchronous kworker appropriately in
order to release swap entries faster, which also benefits the system.

The swapped-out data for swap entries is compressed and stored in
zram memory, so it is relatively important to release the zram memory
corresponding to the swap entries as soon as possible.
>
>>> - If all CPUs are running in userspace with realtime policy
>>>     (SCHED_FIFO, for example) then the kworker thread will not run,
>>>     indefinitely.
>> - In my clumsy understanding, the execution priority of kernel threads
>>    should not be lower than that of the exiting process, and the
>>    asynchronous kworker is only triggered when a process exits. The
>>    exiting process should not be set to SCHED_FIFO, so when the
>>    exiting process runs, the asynchronous kworker should also have an
>>    opportunity to run in a timely manner.
>>> - Work which should have been accounted to the exiting process will
>>>     instead go unaccounted.
>> - You are right, the statistics of process exit time may no longer be
>>     complete.
>>> So please fully address all these potential issues.
>> Thanks
>> Zhiguo
>>
> Thanks
> Barry
Thanks
Zhiguo
Barry Song Aug. 2, 2024, 10:42 a.m. UTC | #5
On Thu, Aug 1, 2024 at 10:33 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> On 2024/8/1 15:36, Barry Song wrote:
> > On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
> >>
> >> On 2024/8/1 0:17, Andrew Morton wrote:
> >>>
> >>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>
> >>>> The main reason for the prolonged exit of a background process is the
> >>> The kernel really doesn't have a concept of a "background process".
> >>> It's a userspace concept - perhaps "the parent process isn't waiting on
> >>> this process via wait()".
> >>>
> >>> I assume here you're referring to an Android userspace concept?  I
> >>> expect that when Android "backgrounds" a process, it does lots of
> >>> things to that process.  Perhaps scheduling priority, perhaps
> >>> alteration of various MM tunables, etc.
> >>>
> >>> So rather than referring to "backgrounding" it would be better to
> >>> identify what tuning alterations are made to such processes to bring
> >>> about this behavior.
> >> Hi Andrew Morton,
> >>
> >> Thank you for your review and comments.
> >>
> >> You are right. The "background process" here refers to the process
> >> corresponding to an Android application switched to the background.
> >> In fact, this patch is applicable to any exiting process.
> >>
> >> To further explain the concept of "multiple exiting processes": it
> >> refers to different processes owning independent mms, rather than
> >> sharing the same mm.
> >>
> >> I will use "mm" to describe the process instead of "background" in
> >> the next version.
> >>>> time-consuming release of its swap entries. The proportion of swap memory
> >>>> occupied by the background process increases with its duration in the
> >>>> background, and after a period of time, this value can reach 60% or more.
> >>> Again, what is it about the tuning of such processes which causes this
> >>> behavior?
> >> When the system is low on memory, memory reclaim is triggered, and
> >> anonymous folios in the process are continuously reclaimed,
> >> resulting in an increase in the swap entries occupied by this
> >> process. So the longer the process runs, the more time it takes to
> >> release its swap entries when it is killed.
> >>
> >> Testing data of a process occupying different physical memory sizes
> >> at different time points:
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After booting up, start 15 processes first, and then observe the
> >> physical memory size occupied by the last launched process at
> >> different time points.
> >>
> >> Example:
> >> The process launched last: com.qiyi.video
> >> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
> >> -------------------------------------------------------------------
> >> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> >> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> >> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> >> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> >> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> >> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> >> min - minute.
> >>
> >> Based on the above data, we can see that the swap ratio occupied by
> >> the process gradually increases over time.
> > If I understand correctly, during zap_pte_range(), if 64.72% of the
> > anonymous pages are actually swapped out, you end up zapping 100 PTEs
> > but only freeing about 35 pages of memory. By doing this
> > asynchronously, you prevent the swap-release operation from blocking
> > the zapping of normal PTEs that map to memory.
> >
> > Could you provide data showing the improvements after implementing
> > asynchronous freeing of swap entries?
> Hi Barry,
>
> Your understanding is correct. From the perspective of releasing the
> physical memory occupied by the exiting process, an asynchronous
> kworker releasing swap entries can indeed help the exiting process
> release its pte_present memory (e.g. file and anonymous folios)
> faster.
>
> In addition, from the perspective of CPU resources, in scenarios
> where multiple exiting processes run simultaneously, using an
> asynchronous kworker instead of the exiting processes to release swap
> entries frees more CPU cores for the current non-exiting, important
> processes, thereby improving their user experience. I think this is
> the main contribution of this modification.
>
> Example:
> When there are multiple processes and system memory is low, starting
> the camera processes triggers the instantaneous killing of many
> processes, because the camera processes need to allocate a large
> amount of memory; this results in multiple exiting processes running
> simultaneously. These exiting processes compete with the current
> camera processes for CPU resources, and the release of the physical
> memory occupied by the multiple exiting processes is slowed by
> scheduling, ultimately slowing the camera processes.
>
> By using this optimization, multiple exiting processes can exit
> quickly, freeing up their CPU resources and pte_present physical
> memory, improving the running speed of the camera processes.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After restarting the machine, start 15 app processes first, and then
> start the camera app processes; we monitor the cold start and preview
> times of the camera app processes.
>
> Test data of camera processes cold start time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>
> Test data of camera processes preview time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>
> Based on the averages of the six sets of test data above, we can see
> the benefits of the modified patch:
> 1. The cold start time of camera app processes is reduced by about 20%.
> 2. The preview time of camera app processes is reduced by about 42%.

This sounds quite promising. I understand that asynchronous releasing
of swap entries can help killed processes free memory more quickly,
allowing your camera app to access it faster. However, I’m unsure
about the impact of swap-related lock contention. My intuition is
that it might not be significant, given that each cluster has its own
lock and the relatively small cluster size helps distribute the swap
locks.

Anyway, I’m very interested in your patchset and can certainly
appreciate its benefits from my own experience working on phones. I’m
quite busy with other issues at the moment, but I hope to provide you
with detailed comments in about two weeks.

> >
> >>>> Additionally, the relatively lengthy path for releasing swap entries
> >>>> further contributes to the longer time required for the background process
> >>>> to release its swap entries.
> >>>>
> >>>> In the multiple background applications scenario, when launching a
> >>>> large-memory application such as a camera, the system may enter a low
> >>>> memory state, which triggers the killing of multiple background
> >>>> processes at the same time. Because the exiting processes occupy
> >>>> multiple CPUs concurrently, the current foreground application's CPU
> >>>> resources are tight, which may cause issues such as lagging.
> >>>>
> >>>> To solve this problem, we have introduced an asynchronous swap-entry
> >>>> release mechanism for multiple exiting processes, which isolates and
> >>>> caches the swap entries occupied by the exiting processes and hands
> >>>> them over to an asynchronous kworker to complete the release. This
> >>>> allows the exiting processes to complete quickly and release their
> >>>> CPU resources. We have validated this modification on products and
> >>>> achieved the expected benefits.
> >>> Dumb question: why can't this be done in userspace?  The exiting
> >>> process does fork/exit and lets the child do all this asynchronous freeing?
> >> The optimization of the kernel's swap-entry release logic cannot be
> >> implemented in userspace. The multiple exiting processes here own
> >> independent mms, rather than being parent and child processes sharing
> >> the same mm. Therefore, when the kernel executes multiple exiting
> >> processes simultaneously, they will occupy multiple CPU cores to
> >> complete the work.
> >>>> It offers several benefits:
> >>>> 1. Alleviates the high system CPU load caused by multiple exiting
> >>>>    processes running simultaneously.
> >>>> 2. Reduces lock contention in the swap-entry free path by using an
> >>>>    asynchronous kworker instead of multiple exiting processes in parallel.
> >>> Why is lock contention reduced?  The same amount of work needs to be
> >>> done.
> >> When multiple CPU cores release the different swap entries belonging
> >> to different exiting processes simultaneously, the cluster lock or
> >> swapinfo lock may encounter contention. An asynchronous kworker that
> >> occupies only one CPU core to complete this work reduces the
> >> probability of lock contention and frees the remaining CPU cores for
> >> other, non-exiting processes to use.
> >>>> 3. Releases memory occupied by exiting processes more efficiently.
> >>> Probably it's slightly less efficient.
> >> We observed that using an asynchronous kworker results in more free
> >> memory earlier. When multiple processes exit simultaneously, due to
> >> competition for CPU cores, these exiting processes remain in a
> >> runnable state for a long time and cannot release their occupied
> >> memory resources in a timely manner.
> >>> There are potential problems with this approach of passing work to a
> >>> kernel thread:
> >>>
> >>> - The process will exit while its resources are still allocated.  But
> >>>     its parent process assumes those resources are now all freed and the
> >>>     parent process then proceeds to allocate resources.  This results in
> >>>     a time period where peak resource consumption is higher than it was
> >>>     before such a change.
> >> - I don't think this modification will cause such a problem. Perhaps I
> >>     haven't fully understood your meaning yet. Can you give me a specific
> >>     example?
> > Normally, after completing zap_pte_range(), your swap slots are
> > returned to the swap file, except for a few slot caches. However,
> > with the asynchronous approach, even after your process has
> > completely exited, some swap slots might still not have been released
> > to the system. This could potentially starve other processes waiting
> > for swap slots to perform swap-outs. I assume this isn't a critical
> > issue for you because, in the case of killing processes, freeing up
> > memory is more important than releasing swap entries?
> I did not encounter issues caused by slow release of swap entries by
> the asynchronous kworker during our testing. Normally, the
> asynchronous kworker can release the cached swap entries within a
> short period of time. Of course, if the system allows, it is worth
> raising the priority of the asynchronous kworker appropriately in
> order to release swap entries faster, which also benefits the system.
>
> The swapped-out data for swap entries is compressed and stored in
> zram memory, so it is relatively important to release the zram memory
> corresponding to the swap entries as soon as possible.
> >
> >>> - If all CPUs are running in userspace with realtime policy
> >>>     (SCHED_FIFO, for example) then the kworker thread will not run,
> >>>     indefinitely.
> >> - In my clumsy understanding, the execution priority of kernel
> >>    threads should not be lower than that of the exiting process,
> >>    and the asynchronous kworker is only triggered when a process
> >>    exits. The exiting process should not be set to SCHED_FIFO, so
> >>    when the exiting process runs, the asynchronous kworker should
> >>    also have an opportunity to run in a timely manner.
> >>> - Work which should have been accounted to the exiting process will
> >>>     instead go unaccounted.
> >> - You are right, the statistics of process exit time may no longer be
> >>     complete.
> >>> So please fully address all these potential issues.
> >> Thanks
> >> Zhiguo

Thanks
Barry
zhiguojiang Aug. 2, 2024, 2:42 p.m. UTC | #6
On 2024/8/2 18:42, Barry Song wrote:
> On Thu, Aug 1, 2024 at 10:33 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> On 2024/8/1 15:36, Barry Song wrote:
>>> On Thu, Aug 1, 2024 at 2:31 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>>> On 2024/8/1 0:17, Andrew Morton wrote:
>>>>>
>>>>> On Wed, 31 Jul 2024 21:33:14 +0800 Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>>
>>>>>> The main reason for the prolonged exit of a background process is the
>>>>> The kernel really doesn't have a concept of a "background process".
>>>>> It's a userspace concept - perhaps "the parent process isn't waiting on
>>>>> this process via wait()".
>>>>>
>>>>> I assume here you're referring to an Android userspace concept?  I
>>>>> expect that when Android "backgrounds" a process, it does lots of
>>>>> things to that process.  Perhaps scheduling priority, perhaps
>>>>> alteration of various MM tunables, etc.
>>>>>
>>>>> So rather than referring to "backgrounding" it would be better to
>>>>> identify what tuning alterations are made to such processes to bring
>>>>> about this behavior.
>>>> Hi Andrew Morton,
>>>>
>>>> Thank you for your review and comments.
>>>>
>>>> You are right. The "background process" here refers to the process
>>>> corresponding to an Android application switched to the background.
>>>> In fact, this patch is applicable to any exiting process.
>>>>
>>>> To further explain the concept of "multiple exiting processes": it
>>>> refers to different processes owning independent mms, rather than
>>>> sharing the same mm.
>>>>
>>>> I will use "mm" to describe the process instead of "background" in
>>>> the next version.
>>>>>> time-consuming release of its swap entries. The proportion of swap memory
>>>>>> occupied by the background process increases with its duration in the
>>>>>> background, and after a period of time, this value can reach 60% or more.
>>>>> Again, what is it about the tuning of such processes which causes this
>>>>> behavior?
>>>> When the system is low on memory, memory reclaim is triggered, and
>>>> anonymous folios in the process are continuously reclaimed,
>>>> resulting in an increase in the swap entries occupied by this
>>>> process. So the longer the process runs, the more time it takes to
>>>> release its swap entries when it is killed.
>>>>
>>>> Testing data of a process occupying different physical memory sizes
>>>> at different time points:
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at
>>>> different time points.
>>>>
>>>> Example:
>>>> The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  | BG 5min | BG 10min | BG 15min |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>>>> min - minute.
>>>>
>>>> Based on the above data, we can see that the swap ratio occupied by
>>>> the process gradually increases over time.
>>> If I understand correctly, during zap_pte_range(), if 64.72% of the
>>> anonymous pages are actually swapped out, you end up zapping 100 PTEs
>>> but only freeing about 35 pages of memory. By doing this
>>> asynchronously, you prevent the swap-release operation from blocking
>>> the zapping of normal PTEs that map to memory.
>>>
>>> Could you provide data showing the improvements after implementing
>>> asynchronous freeing of swap entries?
>> Hi Barry,
>>
>> Your understanding is correct. From the perspective of releasing the
>> physical memory occupied by the exiting process, an asynchronous
>> kworker releasing swap entries can indeed help the exiting process
>> release its pte_present memory (e.g. file and anonymous folios)
>> faster.
>>
>> In addition, from the perspective of CPU resources, in scenarios
>> where multiple exiting processes run simultaneously, using an
>> asynchronous kworker instead of the exiting processes to release swap
>> entries frees more CPU cores for the current non-exiting, important
>> processes, thereby improving their user experience. I think this is
>> the main contribution of this modification.
>>
>> Example:
>> When there are multiple processes and system memory is low, starting
>> the camera processes triggers the instantaneous killing of many
>> processes, because the camera processes need to allocate a large
>> amount of memory; this results in multiple exiting processes running
>> simultaneously. These exiting processes compete with the current
>> camera processes for CPU resources, and the release of the physical
>> memory occupied by the multiple exiting processes is slowed by
>> scheduling, ultimately slowing the camera processes.
>>
>> By using this optimization, multiple exiting processes can exit
>> quickly, freeing up their CPU resources and pte_present physical
>> memory, improving the running speed of the camera processes.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After restarting the machine, start 15 app processes first, and then
>> start the camera app processes; we monitor the cold start and preview
>> times of the camera app processes.
>>
>> Test data of camera processes cold start time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>
>> Test data of camera processes preview time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>
>> Based on the averages of the six sets of test data above, we can see
>> the benefits of the modified patch:
>> 1. The cold start time of camera app processes is reduced by about 20%.
>> 2. The preview time of camera app processes is reduced by about 42%.
> This sounds quite promising. I understand that asynchronous releasing
> of swap entries can help killed processes free memory more quickly,
> allowing your camera app to access it faster. However, I’m unsure
> about the impact of swap-related lock contention. My intuition is
> that it might not be significant, given that each cluster has its own
> lock and the relatively small cluster size helps distribute the swap
> locks.
>
> Anyway, I’m very interested in your patchset and can certainly
> appreciate its benefits from my own experience working on phones. I’m
> quite busy with other issues at the moment, but I hope to provide you
> with detailed comments in about two weeks.
>
>>>>>> Additionally, the relatively lengthy path for releasing swap entries
>>>>>> further contributes to the longer time required for the background process
>>>>>> to release its swap entries.
>>>>>>
>>>>>> In the multiple background applications scenario, when launching a
>>>>>> large-memory application such as a camera, the system may enter a low
>>>>>> memory state, which triggers the killing of multiple background
>>>>>> processes at the same time. Because the exiting processes occupy
>>>>>> multiple CPUs concurrently, the current foreground application's CPU
>>>>>> resources are tight, which may cause issues such as lagging.
>>>>>>
>>>>>> To solve this problem, we have introduced an asynchronous swap-entry
>>>>>> release mechanism for multiple exiting processes, which isolates and
>>>>>> caches the swap entries occupied by the exiting processes and hands
>>>>>> them over to an asynchronous kworker to complete the release. This
>>>>>> allows the exiting processes to complete quickly and release their
>>>>>> CPU resources. We have validated this modification on products and
>>>>>> achieved the expected benefits.
>>>>> Dumb question: why can't this be done in userspace?  The exiting
>>>>> process does fork/exit and lets the child do all this asynchronous freeing?
>>>> The optimization of the kernel's swap-entry release logic cannot be
>>>> implemented in userspace. The multiple exiting processes here own
>>>> independent mms, rather than being parent and child processes sharing
>>>> the same mm. Therefore, when the kernel executes multiple exiting
>>>> processes simultaneously, they will occupy multiple CPU cores to
>>>> complete the work.
>>>>>> It offers several benefits:
>>>>>> 1. Alleviates the high system CPU load caused by multiple exiting
>>>>>>    processes running simultaneously.
>>>>>> 2. Reduces lock contention in the swap-entry free path by using an
>>>>>>    asynchronous kworker instead of multiple exiting processes in parallel.
>>>>> Why is lock contention reduced?  The same amount of work needs to be
>>>>> done.
>>>> When multiple CPU cores release the different swap entries belonging
>>>> to different exiting processes simultaneously, the cluster lock or
>>>> swapinfo lock may encounter contention. An asynchronous kworker that
>>>> occupies only one CPU core to complete this work reduces the
>>>> probability of lock contention and frees the remaining CPU cores for
>>>> other, non-exiting processes to use.
>>>>>> 3. Releases memory occupied by exiting processes more efficiently.
>>>>> Probably it's slightly less efficient.
>>>> We observed that using an asynchronous kworker results in more free
>>>> memory earlier. When multiple processes exit simultaneously, due to
>>>> competition for CPU cores, these exiting processes remain in a
>>>> runnable state for a long time and cannot release their occupied
>>>> memory resources in a timely manner.
>>>>> There are potential problems with this approach of passing work to a
>>>>> kernel thread:
>>>>>
>>>>> - The process will exit while its resources are still allocated.  But
>>>>>      its parent process assumes those resources are now all freed and the
>>>>>      parent process then proceeds to allocate resources.  This results in
>>>>>      a time period where peak resource consumption is higher than it was
>>>>>      before such a change.
>>>> - I don't think this modification will cause such a problem. Perhaps I
>>>>      haven't fully understood your meaning yet. Can you give me a specific
>>>>      example?
>>> Normally, after completing zap_pte_range(), your swap slots are
>>> returned to the swap file, except for a few slot caches. However,
>>> with the asynchronous approach, even after your process has
>>> completely exited, some swap slots might still not have been released
>>> to the system. This could potentially starve other processes waiting
>>> for swap slots to perform swap-outs. I assume this isn't a critical
>>> issue for you because, in the case of killing processes, freeing up
>>> memory is more important than releasing swap entries?
>> I did not encounter issues caused by slow release of swap entries by
>> the asynchronous kworker during our testing. Normally, the
>> asynchronous kworker can release the cached swap entries within a
>> short period of time. Of course, if the system allows, it is worth
>> raising the priority of the asynchronous kworker appropriately in
>> order to release swap entries faster, which also benefits the system.
>>
>> The swapped-out data for swap entries is compressed and stored in
>> zram memory, so it is relatively important to release the zram memory
>> corresponding to the swap entries as soon as possible.
Thank you for your attention; I look forward to your reply.

You are correct: cluster lock contention might not be significant,
due to the relatively small number of entries per cluster. However,
an asynchronous kworker should have some benefit for swapinfo lock
contention, because when multiple exiting processes release their
respective entries at the same time, there will be swapinfo lock
contention in swapcache_free_entries().
>>>>> - If all CPUs are running in userspace with realtime policy
>>>>>      (SCHED_FIFO, for example) then the kworker thread will not run,
>>>>>      indefinitely.
>>>> - In my clumsy understanding, the execution priority of kernel
>>>>    threads should not be lower than that of the exiting process,
>>>>    and the asynchronous kworker is only triggered when a process
>>>>    exits. The exiting process should not be set to SCHED_FIFO, so
>>>>    when the exiting process runs, the asynchronous kworker should
>>>>    also have an opportunity to run in a timely manner.
>>>>> - Work which should have been accounted to the exiting process will
>>>>>      instead go unaccounted.
>>>> - You are right, the statistics of process exit time may no longer be
>>>>      complete.
>>>>> So please fully address all these potential issues.
>>>> Thanks
>>>> Zhiguo
> Thanks
> Barry
Thanks
Zhiguo