diff mbox series

[v3,2/2] mm: tlb: add tlb swap entries batch async release

Message ID 20240805153639.1057-3-justinjiang@vivo.com (mailing list archive)
State New
Headers show
Series mm: tlb swap entries batch async release | expand

Commit Message

zhiguojiang Aug. 5, 2024, 3:36 p.m. UTC
One of the main reasons a process with an independent mm takes so long
to exit is the time-consuming release of its swap entries.
The proportion of the process's memory held in swap grows over time,
because high memory pressure triggers the reclaim of anonymous folios
into swap space; on Android devices, for example, we found this
proportion can reach 60% or more after a while. In addition, the
relatively long path for releasing swap entries further increases the
time needed to release them.

Testing Platform: 8GB RAM
Testing procedure:
After booting up, start 15 processes first, and then observe the
physical memory occupied by the last-launched process at different
points in time.
Example: the last-launched process is com.qiyi.video
|  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
-------------------------------------------------------------------
|     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
|   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
|   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
|  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
|    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
| Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
Note: min - minute.
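(The swap ratio here appears to be VmSwap / (VmRSS + VmSwap); e.g. for
the 0min column, 202692 / (453832 + 202692) ≈ 30.87%.)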

When there are multiple processes with independent mm and memory
pressure in the system is high, launching a process with large memory
requirements is likely to trigger the near-instantaneous killing of
many processes with independent mm. Because these exiting processes
occupy multiple CPU cores and run concurrently, important non-exiting
processes can start to lag.

To solve this problem, we introduce an asynchronous swap-entry release
mechanism for multiple exiting processes: the swap entries occupied by
the exiting processes are isolated and cached, then handed over to an
asynchronous kworker that completes the release. This allows the
exiting processes to finish quickly and release their CPU resources.
We have validated this modification on Android products and achieved
the expected benefits.
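
As a rough illustration of the pattern (a hedged sketch only, not the
code from this patch: the names swap_free_batch, swap_free_worker,
queue_swap_free and the use of system_unbound_wq are invented here,
while the actual patch below uses mmu_swap_gather/mmu_swap_batch and a
dedicated workqueue), the exiting task gathers entries into a batch
and queues it, so that free_swap_and_cache_nr() runs in kworker
context:

/*
 * Illustrative sketch only -- not the code from this patch.
 * Swap entries gathered by the exiting task are freed later
 * in a kworker instead of in the exiting task itself.
 */
#include <linux/workqueue.h>
#include <linux/swap.h>
#include <linux/slab.h>

struct swap_free_batch {
	struct work_struct work;
	unsigned int nr;
	swp_entry_t entries[64];
};

static void swap_free_worker(struct work_struct *w)
{
	struct swap_free_batch *batch =
		container_of(w, struct swap_free_batch, work);
	unsigned int i;

	/* The expensive swap-entry release runs in kworker context. */
	for (i = 0; i < batch->nr; i++)
		free_swap_and_cache_nr(batch->entries[i], 1);
	kfree(batch);
}

/* Called by the exiting task instead of freeing synchronously. */
static void queue_swap_free(struct swap_free_batch *batch)
{
	INIT_WORK(&batch->work, swap_free_worker);
	queue_work(system_unbound_wq, &batch->work);
}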

Testing Platform: 8GB RAM
Testing procedure:
After restarting the machine, start 15 app processes first, and then
start the camera app process; we measure the cold-start and preview
times of the camera app process.

Test data for camera process cold-start time (unit: milliseconds):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
| after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |

Test data for camera process preview time (unit: milliseconds):
|  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
| before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
| after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |

Based on the averages of the six test runs above, the benefits of this
patch are:
1. The cold-start time of the camera app process is reduced by about 20%.
2. The preview time of the camera app process is reduced by about 42%.

It offers several benefits:
1. Alleviates the high system CPU load caused by multiple exiting
   processes running simultaneously.
2. Reduces lock contention in the swap entry free path by using a single
   asynchronous kworker instead of multiple exiting processes running in
   parallel.
3. Releases the pte_present memory occupied by exiting processes more
   efficiently.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---
 arch/s390/include/asm/tlb.h |   8 +
 include/asm-generic/tlb.h   |  44 ++++++
 include/linux/mm_types.h    |  58 +++++++
 mm/memory.c                 |   3 +-
 mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
 5 files changed, 408 insertions(+), 1 deletion(-)

Comments

Barry Song Sept. 4, 2024, 9:16 a.m. UTC | #1
On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>
> One of the main reasons for the prolonged exit of the process with
> independent mm is the time-consuming release of its swap entries.
> The proportion of swap memory occupied by the process increases over
> time due to high memory pressure triggering to reclaim anonymous folio
> into swapspace, e.g., in Android devices, we found this proportion can
> reach 60% or more after a period of time. Additionally, the relatively
> lengthy path for releasing swap entries further contributes to the
> longer time required to release swap entries.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After booting up, start 15 processes first, and then observe the
> physical memory size occupied by the last launched process at different
> time points.
> Example: The process launched last: com.qiyi.video
> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
> -------------------------------------------------------------------
> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> Note: min - minute.
>
> When there are multiple processes with independent mm and the high
> memory pressure in system, if the large memory required process is
> launched at this time, system will is likely to trigger the instantaneous
> killing of many processes with independent mm. Due to multiple exiting
> processes occupying multiple CPU core resources for concurrent execution,
> leading to some issues such as the current non-exiting and important
> processes lagging.
>
> To solve this problem, we have introduced the multiple exiting process
> asynchronous swap entries release mechanism, which isolates and caches
> swap entries occupied by multiple exiting processes, and hands them over
> to an asynchronous kworker to complete the release. This allows the
> exiting processes to complete quickly and release CPU resources. We have
> validated this modification on the Android products and achieved the
> expected benefits.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After restarting the machine, start 15 app processes first, and then
> start the camera app processes, we monitor the cold start and preview
> time datas of the camera app processes.
>
> Test datas of camera processes cold start time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>
> Test datas of camera processes preview time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>
> Base on the average of the six sets of test datas above, we can see that
> the benefit datas of the modified patch:
> 1. The cold start time of camera app processes has reduced by about 20%.
> 2. The preview time of camera app processes has reduced by about 42%.
>
> It offers several benefits:
> 1. Alleviate the high system cpu loading caused by multiple exiting
>    processes running simultaneously.
> 2. Reduce lock competition in swap entry free path by an asynchronous
>    kworker instead of multiple exiting processes parallel execution.
> 3. Release pte_present memory occupied by exiting processes more
>    efficiently.
>
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> ---
>  arch/s390/include/asm/tlb.h |   8 +
>  include/asm-generic/tlb.h   |  44 ++++++
>  include/linux/mm_types.h    |  58 +++++++
>  mm/memory.c                 |   3 +-
>  mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 408 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index e95b2c8081eb..3f681f63390f
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>                 struct page *page, bool delay_rmap, int page_size);
>  static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>                 struct page *page, unsigned int nr_pages, bool delay_rmap);
> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +               swp_entry_t entry, int nr);
>
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>         return false;
>  }
>
> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +               swp_entry_t entry, int nr)
> +{
> +       return false;
> +}
> +
>  static inline void tlb_flush(struct mmu_gather *tlb)
>  {
>         __tlb_flush_mm_lazy(tlb->mm);
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 709830274b75..8b4d516b35b8
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
>  #endif
>
> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> +struct mmu_swap_batch {
> +       struct mmu_swap_batch *next;
> +       unsigned int nr;
> +       unsigned int max;
> +       encoded_swpentry_t encoded_entrys[];
> +};
> +
> +#define MAX_SWAP_GATHER_BATCH  \
> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
> +
> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
> +
> +struct mmu_swap_gather {
> +       /*
> +        * the asynchronous kworker to batch
> +        * release swap entries
> +        */
> +       struct work_struct free_work;
> +
> +       /* batch cache swap entries */
> +       unsigned int batch_count;
> +       struct mmu_swap_batch *active;
> +       struct mmu_swap_batch local;
> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
> +};
> +
> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +               swp_entry_t entry, int nr);
> +#endif
> +
>  /*
>   * struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
> @@ -343,6 +374,18 @@ struct mmu_gather {
>         unsigned int            vma_exec : 1;
>         unsigned int            vma_huge : 1;
>         unsigned int            vma_pfn  : 1;
> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> +       /*
> +        * Two states of releasing swap entries
> +        * asynchronously:
> +        * swp_freeable - may release its swap entries
> +        * asynchronously in the future.
> +        * swp_freeing - is releasing them asynchronously.
> +        */
> +       unsigned int            swp_freeable : 1;
> +       unsigned int            swp_freeing : 1;
> +       unsigned int            swp_disable : 1;
> +#endif
>
>         unsigned int            batch_count;
>
> @@ -354,6 +397,7 @@ struct mmu_gather {
>  #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>         unsigned int page_size;
>  #endif
> +       struct mmu_swap_gather *swp;
>  #endif
>  };
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 165c58b12ccc..2f66303f1519
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -283,6 +283,64 @@ typedef struct {
>         unsigned long val;
>  } swp_entry_t;
>
> +/*
> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
> + *
> + * An 'encoded_swpentry_t' represents a 'swp_entry_t' with its highest
> + * bit indicating extra context-dependent information. Only used in swp_entry
> + * asynchronous release path by mmu_swap_gather.
> + */
> +typedef struct {
> +       unsigned long val;
> +} encoded_swpentry_t;
> +
> +/*
> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
> + * total number of consecutive swap entries associated with the same folio. If this
> + * bit is not set, "nr" is implicitly 1.
> + *
> + * Refer to include/asm/pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
> + * Bit63 can be used here.
> + */
> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
> +
> +static __always_inline encoded_swpentry_t
> +encode_swpentry(swp_entry_t entry, unsigned long flags)
> +{
> +       encoded_swpentry_t ret;
> +
> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> +       ret.val = flags | entry.val;
> +       return ret;
> +}
> +
> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
> +{
> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> +}
> +
> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
> +{
> +       swp_entry_t ret;
> +
> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> +       return ret;
> +}
> +
> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
> +{
> +       encoded_swpentry_t ret;
> +
> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> +       ret.val = nr;
> +       return ret;
> +}
> +
> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
> +{
> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
> +}
> +
>  /**
>   * struct folio - Represents a contiguous set of bytes.
>   * @flags: Identical to the page flags.
> diff --git a/mm/memory.c b/mm/memory.c
> index d6a9dcddaca4..023a8adcb67c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         if (!should_zap_cows(details))
>                                 continue;
>                         rss[MM_SWAPENTS] -= nr;
> -                       free_swap_and_cache_nr(entry, nr);
> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
> +                               free_swap_and_cache_nr(entry, nr);
>                 } else if (is_migration_entry(entry)) {
>                         folio = pfn_swap_entry_folio(entry);
>                         if (!should_zap_folio(details, folio))
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99b3e9408aa0..33dc9d1faff9
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -9,11 +9,303 @@
>  #include <linux/smp.h>
>  #include <linux/swap.h>
>  #include <linux/rmap.h>
> +#include <linux/oom.h>
>
>  #include <asm/pgalloc.h>
>  #include <asm/tlb.h>
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
> +/*
> + * The swp_entry asynchronous release mechanism for multiple processes with
> + * independent mm exiting simultaneously.
> + *
> + * During the multiple exiting processes releasing their own mm simultaneously,
> + * the swap entries in the exiting processes are handled by isolating, caching
> + * and handing over to an asynchronous kworker to complete the release.
> + *
> + * The conditions for the exiting process entering the swp_entry asynchronous
> + * release path:
> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
> + *    to alloc struct mmu_swap_gather frequently.
> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.

Hi Zhiguo,

I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
batched swap entry freeing, even with one process, could be a bottleneck for
a single process, based on the data from this patch:

mm: attempt to batch free swap entries for zap_pte_range()
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
"munmap bandwidth becomes 3X faster."

So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?

> + *
> + * Since the time for determining the number of exiting processes is dynamic,
> + * the exiting process may start to enter the swp_entry asynchronous release
> + * at the beginning or middle stage of the exiting process's swp_entry release
> + * path.
> + *
> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
> + * swap entries in this exiting process need to be fully released by asynchronous
> + * kworker theoretically.

Freeing a slot can indeed release memory from `zRAM`, potentially returning
it to the system for allocation. Your patch frees swap slots asynchronously;
I assume this doesn’t slow down the memory freeing process for `zRAM`, or
could it even slow down the freeing of `zRAM` memory? Freeing compressed
memory might not be as crucial compared to freeing uncompressed memory with
present PTEs?

> + *
> + * The function of the swp_entry asynchronous release:
> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
> + *    running simultaneously.
> + * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
> + *    instead of multiple exiting processes parallel execution.
> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
> + */
> +
> +/*
> + * The min number of exiting processes required for swp_entry asynchronous release
> + */
> +#define NR_MIN_EXITING_PROCESSES 2
> +
> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
> +static struct kmem_cache *swap_gather_cachep;
> +static struct workqueue_struct *swapfree_wq;
> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
> +
> +static int __init tlb_swap_async_free_setup(void)
> +{
> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> +       if (!swapfree_wq)
> +               goto fail;
> +
> +       swap_gather_cachep = kmem_cache_create("swap_gather",
> +               sizeof(struct mmu_swap_gather),
> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
> +               NULL);
> +       if (!swap_gather_cachep)
> +               goto kcache_fail;
> +
> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
> +       return 0;
> +
> +kcache_fail:
> +       destroy_workqueue(swapfree_wq);
> +fail:
> +       return -ENOMEM;
> +}
> +postcore_initcall(tlb_swap_async_free_setup);
> +
> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
> +{
> +       struct mmu_swap_batch *swap_batch, *next;
> +
> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
> +               next = swap_batch->next;
> +               free_page((unsigned long)swap_batch);
> +       }
> +       swap_gather->local.next = NULL;
> +       kmem_cache_free(swap_gather_cachep, swap_gather);
> +}
> +
> +static void tlb_swap_async_free_work(struct work_struct *w)
> +{
> +       int i, nr_multi, nr_free;
> +       swp_entry_t start_entry;
> +       struct mmu_swap_batch *swap_batch;
> +       struct mmu_swap_gather *swap_gather = container_of(w,
> +               struct mmu_swap_gather, free_work);
> +
> +       /* Release swap entries cached in mmu_swap_batch. */
> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
> +           swap_batch = swap_batch->next) {
> +               nr_free = 0;
> +               for (i = 0; i < swap_batch->nr; i++) {
> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
> +                               free_swap_and_cache_nr(start_entry, nr_multi);
> +                               nr_free += 2;
> +                       } else {
> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> +                               free_swap_and_cache_nr(start_entry, 1);
> +                               nr_free++;
> +                       }
> +               }
> +               swap_batch->nr -= nr_free;
> +               VM_BUG_ON(swap_batch->nr);
> +       }
> +       __tlb_swap_gather_free(swap_gather);
> +}
> +
> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
> +{
> +       /*
> +        * Only the exiting processes with the MM_SWAPENTS counter >=
> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
> +        * entries by asynchronous kworker.
> +        */
> +       if (!task_is_dying() ||
> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
> +               return true;
> +
> +       atomic_inc(&nr_exiting_processes);
> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
> +               tlb->swp_freeable = 1;
> +       else
> +               tlb->swp_freeing = 1;
> +
> +       return false;
> +}
> +
> +/**
> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
> + * for swp_entry tear-down.
> + * @tlb: the mmu_swap_gather structure belongs to tlb
> + */
> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
> +{
> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
> +       if (unlikely(!tlb->swp))
> +               return false;
> +
> +       tlb->swp->local.next  = NULL;
> +       tlb->swp->local.nr    = 0;
> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
> +
> +       tlb->swp->active      = &tlb->swp->local;
> +       tlb->swp->batch_count = 0;
> +
> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
> +       return true;
> +}
> +
> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
> +{
> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
> +               return;
> +
> +       tlb->swp = NULL;
> +       tlb->swp_freeable = 0;
> +       tlb->swp_freeing = 0;
> +       tlb->swp_disable = 0;
> +
> +       if (__tlb_swap_gather_mmu_check(tlb))
> +               return;
> +
> +       /*
> +        * If the exiting process meets the conditions of
> +        * swp_entry asynchronous release, an mmu_swap_gather
> +        * structure will be initialized.
> +        */
> +       if (tlb->swp_freeing)
> +               __tlb_swap_gather_init(tlb);
> +}
> +
> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
> +{
> +       queue_work(swapfree_wq, &tlb->swp->free_work);
> +       tlb->swp = NULL;
> +       if (!finish)
> +               __tlb_swap_gather_init(tlb);
> +}
> +
> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
> +{
> +       struct mmu_swap_batch *swap_batch;
> +
> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
> +               goto free;
> +
> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
> +       if (unlikely(!swap_batch))
> +               goto free;
> +
> +       swap_batch->next = NULL;
> +       swap_batch->nr   = 0;
> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
> +
> +       tlb->swp->active->next = swap_batch;
> +       tlb->swp->active = swap_batch;
> +       tlb->swp->batch_count++;
> +       return true;
> +free:
> +       /* batch move to wq */
> +       __tlb_swap_gather_queuework(tlb, false);
> +       return false;
> +}
> +
> +/**
> + * __tlb_remove_swap_entries - the swap entries in exiting process are
> + * isolated, batch cached in struct mmu_swap_batch.
> + * @tlb: the current mmu_gather
> + * @entry: swp_entry to be isolated and cached
> + * @nr: the number of consecutive entries starting from entry parameter.
> + */
> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +                            swp_entry_t entry, int nr)
> +{
> +       struct mmu_swap_batch *swap_batch;
> +       unsigned long flags = 0;
> +       bool ret = false;
> +
> +       if (tlb->swp_disable)
> +               return ret;
> +
> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> +               return ret;
> +
> +       if (tlb->swp_freeable) {
> +               if (atomic_read(&nr_exiting_processes) <
> +                   NR_MIN_EXITING_PROCESSES)
> +                       return ret;
> +               /*
> +                * If the current number of exiting processes
> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
> +                * process with swp_freeable state will enter
> +                * swp_freeing state to start releasing its
> +                * remaining swap entries by the asynchronous
> +                * kworker.
> +                */
> +               tlb->swp_freeable = 0;
> +               tlb->swp_freeing = 1;
> +       }
> +
> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
> +               return ret;
> +
> +       swap_batch = tlb->swp->active;
> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
> +               __tlb_swap_gather_queuework(tlb, false);
> +               return ret;
> +       }
> +
> +       if (likely(nr == 1)) {
> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> +       } else {
> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
> +       }
> +       ret = true;
> +
> +       if (swap_batch->nr >= swap_batch->max - 1) {
> +               if (!__tlb_swap_next_batch(tlb))
> +                       goto exit;
> +               swap_batch = tlb->swp->active;
> +       }
> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
> +exit:
> +       return ret;
> +}
> +
> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
> +{
> +       if (tlb->swp_disable)
> +               return;
> +
> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> +               return;
> +
> +       if (tlb->swp_freeable) {
> +               tlb->swp_freeable = 0;
> +               VM_BUG_ON(tlb->swp_freeing);
> +               goto exit;
> +       }
> +       tlb->swp_freeing = 0;
> +       if (unlikely(!tlb->swp))
> +               goto exit;
> +
> +       __tlb_swap_gather_queuework(tlb, true);
> +exit:
> +       atomic_dec(&nr_exiting_processes);
> +}
>
>  static bool tlb_next_batch(struct mmu_gather *tlb)
>  {
> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>         tlb->local.max  = ARRAY_SIZE(tlb->__pages);
>         tlb->active     = &tlb->local;
>         tlb->batch_count = 0;
> +
> +       tlb->swp_disable = 1;
> +       __tlb_swap_gather_mmu(tlb);
>  #endif
>         tlb->delayed_rmap = 0;
>
> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
>         tlb_batch_list_free(tlb);
> +       __tlb_batch_swap_finish(tlb);
>  #endif
>         dec_tlb_flush_pending(tlb->mm);
>  }
> --
> 2.39.0
>

Thanks
Barry
zhiguojiang Sept. 4, 2024, 11:26 a.m. UTC | #2
On 2024/9/4 17:16, Barry Song wrote:
> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>> One of the main reasons for the prolonged exit of the process with
>> independent mm is the time-consuming release of its swap entries.
>> The proportion of swap memory occupied by the process increases over
>> time due to high memory pressure triggering to reclaim anonymous folio
>> into swapspace, e.g., in Android devices, we found this proportion can
>> reach 60% or more after a period of time. Additionally, the relatively
>> lengthy path for releasing swap entries further contributes to the
>> longer time required to release swap entries.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After booting up, start 15 processes first, and then observe the
>> physical memory size occupied by the last launched process at different
>> time points.
>> Example: The process launched last: com.qiyi.video
>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
>> -------------------------------------------------------------------
>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>> Note: min - minute.
>>
>> When there are multiple processes with independent mm and the high
>> memory pressure in system, if the large memory required process is
>> launched at this time, system will is likely to trigger the instantaneous
>> killing of many processes with independent mm. Due to multiple exiting
>> processes occupying multiple CPU core resources for concurrent execution,
>> leading to some issues such as the current non-exiting and important
>> processes lagging.
>>
>> To solve this problem, we have introduced the multiple exiting process
>> asynchronous swap entries release mechanism, which isolates and caches
>> swap entries occupied by multiple exiting processes, and hands them over
>> to an asynchronous kworker to complete the release. This allows the
>> exiting processes to complete quickly and release CPU resources. We have
>> validated this modification on the Android products and achieved the
>> expected benefits.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After restarting the machine, start 15 app processes first, and then
>> start the camera app processes, we monitor the cold start and preview
>> time datas of the camera app processes.
>>
>> Test datas of camera processes cold start time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>
>> Test datas of camera processes preview time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>
>> Base on the average of the six sets of test datas above, we can see that
>> the benefit datas of the modified patch:
>> 1. The cold start time of camera app processes has reduced by about 20%.
>> 2. The preview time of camera app processes has reduced by about 42%.
>>
>> It offers several benefits:
>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>     processes running simultaneously.
>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>     kworker instead of multiple exiting processes parallel execution.
>> 3. Release pte_present memory occupied by exiting processes more
>>     efficiently.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>> ---
>>   arch/s390/include/asm/tlb.h |   8 +
>>   include/asm-generic/tlb.h   |  44 ++++++
>>   include/linux/mm_types.h    |  58 +++++++
>>   mm/memory.c                 |   3 +-
>>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>   5 files changed, 408 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>> index e95b2c8081eb..3f681f63390f
>> --- a/arch/s390/include/asm/tlb.h
>> +++ b/arch/s390/include/asm/tlb.h
>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>                  struct page *page, bool delay_rmap, int page_size);
>>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>                  struct page *page, unsigned int nr_pages, bool delay_rmap);
>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>> +               swp_entry_t entry, int nr);
>>
>>   #define tlb_flush tlb_flush
>>   #define pte_free_tlb pte_free_tlb
>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>          return false;
>>   }
>>
>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>> +               swp_entry_t entry, int nr)
>> +{
>> +       return false;
>> +}
>> +
>>   static inline void tlb_flush(struct mmu_gather *tlb)
>>   {
>>          __tlb_flush_mm_lazy(tlb->mm);
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 709830274b75..8b4d516b35b8
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
>>   static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
>>   #endif
>>
>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>> +struct mmu_swap_batch {
>> +       struct mmu_swap_batch *next;
>> +       unsigned int nr;
>> +       unsigned int max;
>> +       encoded_swpentry_t encoded_entrys[];
>> +};
>> +
>> +#define MAX_SWAP_GATHER_BATCH  \
>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
>> +
>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
>> +
>> +struct mmu_swap_gather {
>> +       /*
>> +        * the asynchronous kworker to batch
>> +        * release swap entries
>> +        */
>> +       struct work_struct free_work;
>> +
>> +       /* batch cache swap entries */
>> +       unsigned int batch_count;
>> +       struct mmu_swap_batch *active;
>> +       struct mmu_swap_batch local;
>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
>> +};
>> +
>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>> +               swp_entry_t entry, int nr);
>> +#endif
>> +
>>   /*
>>    * struct mmu_gather is an opaque type used by the mm code for passing around
>>    * any data needed by arch specific code for tlb_remove_page.
>> @@ -343,6 +374,18 @@ struct mmu_gather {
>>          unsigned int            vma_exec : 1;
>>          unsigned int            vma_huge : 1;
>>          unsigned int            vma_pfn  : 1;
>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>> +       /*
>> +        * Two states of releasing swap entries
>> +        * asynchronously:
>> +        * swp_freeable - have opportunity to
>> +        * release asynchronously future
>> +        * swp_freeing - be releasing asynchronously.
>> +        */
>> +       unsigned int            swp_freeable : 1;
>> +       unsigned int            swp_freeing : 1;
>> +       unsigned int            swp_disable : 1;
>> +#endif
>>
>>          unsigned int            batch_count;
>>
>> @@ -354,6 +397,7 @@ struct mmu_gather {
>>   #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>>          unsigned int page_size;
>>   #endif
>> +       struct mmu_swap_gather *swp;
>>   #endif
>>   };
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 165c58b12ccc..2f66303f1519
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -283,6 +283,64 @@ typedef struct {
>>          unsigned long val;
>>   } swp_entry_t;
>>
>> +/*
>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
>> + *
>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
>> + * bit indicating extra context-dependent information. Only used in swp_entry
>> + * asynchronous release path by mmu_swap_gather.
>> + */
>> +typedef struct {
>> +       unsigned long val;
>> +} encoded_swpentry_t;
>> +
>> +/*
>> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
>> + * total number of consecutive swap entries associated with the same folio. If this
>> + * bit is not set, "nr" is implicitly 1.
>> + *
>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
>> + * Bit63 can be used here.
>> + */
>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
>> +
>> +static __always_inline encoded_swpentry_t
>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
>> +{
>> +       encoded_swpentry_t ret;
>> +
>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>> +       ret.val = flags | entry.val;
>> +       return ret;
>> +}
>> +
>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
>> +{
>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>> +}
>> +
>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
>> +{
>> +       swp_entry_t ret;
>> +
>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>> +       return ret;
>> +}
>> +
>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
>> +{
>> +       encoded_swpentry_t ret;
>> +
>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>> +       ret.val = nr;
>> +       return ret;
>> +}
>> +
>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
>> +{
>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
>> +}
>> +
>>   /**
>>    * struct folio - Represents a contiguous set of bytes.
>>    * @flags: Identical to the page flags.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d6a9dcddaca4..023a8adcb67c
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                          if (!should_zap_cows(details))
>>                                  continue;
>>                          rss[MM_SWAPENTS] -= nr;
>> -                       free_swap_and_cache_nr(entry, nr);
>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
>> +                               free_swap_and_cache_nr(entry, nr);
>>                  } else if (is_migration_entry(entry)) {
>>                          folio = pfn_swap_entry_folio(entry);
>>                          if (!should_zap_folio(details, folio))
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99b3e9408aa0..33dc9d1faff9
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -9,11 +9,303 @@
>>   #include <linux/smp.h>
>>   #include <linux/swap.h>
>>   #include <linux/rmap.h>
>> +#include <linux/oom.h>
>>
>>   #include <asm/pgalloc.h>
>>   #include <asm/tlb.h>
>>
>>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
>> +/*
>> + * The swp_entry asynchronous release mechanism for multiple processes with
>> + * independent mm exiting simultaneously.
>> + *
>> + * During the multiple exiting processes releasing their own mm simultaneously,
>> + * the swap entries in the exiting processes are handled by isolating, caching
>> + * and handing over to an asynchronous kworker to complete the release.
>> + *
>> + * The conditions for the exiting process entering the swp_entry asynchronous
>> + * release path:
>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
>> + *    to alloc struct mmu_swap_gather frequently.
>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
> Hi Zhiguo,
>
> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
> batched swap entry freeing, even with one process, could be a
> bottleneck for a single
> process based on the data from this patch:
>
> mm: attempt to batch free swap entries for zap_pte_range()
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
> "munmap bandwidth becomes 3X faster."
>
> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
Hi Barry,

Thanks for your comments.

The reason for NR_MIN_EXITING_PROCESSES = 2 is that we previously ran
a continuous multi-app startup performance test on Android with
NR_MIN_EXITING_PROCESSES = 1, and the results showed that the startup
time deteriorated slightly. However, the patch logic used in that test
differed from the version I submitted to the community, so the
regression may have been caused by other issues in the old patch.

However, we have run the relevant memory performance tests on the
patches submitted to the community (NR_MIN_EXITING_PROCESSES = 2), and
the results are better than before the modification. The patches have
already been used in multiple projects.
For example:
Test the total time and subjective fluency of launching 30 Android
apps continuously for two rounds.
Test machine: 16GB RAM
|        | time(s) | Frame-dropped rate(%) |
| before | 230.76  |         0.54          |
| after  | 225.23  |         0.74          |
We can see that the patch improves the startup time by 5.53s and the
frame-dropped rate by 0.2%, with a better subjective smoothness
experience.

Perhaps the patches submitted to the community would also improve
continuous multi-app startup performance on Android with
NR_MIN_EXITING_PROCESSES = 1. If necessary, I will run the relevant
tests to verify this in the future.
>
>> + *
>> + * Since the time for determining the number of exiting processes is dynamic,
>> + * the exiting process may start to enter the swp_entry asynchronous release
>> + * at the beginning or middle stage of the exiting process's swp_entry release
>> + * path.
>> + *
>> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
>> + * swap entries in this exiting process need to be fully released by asynchronous
>> + * kworker theoretically.
> Freeing a slot can indeed release memory from `zRAM`, potentially returning
> it to the system for allocation. Your patch frees swap slots asynchronously;
> I assume this doesn’t slow down the memory freeing process for `zRAM`, or
> could it even slow down the freeing of `zRAM` memory? Freeing compressed
> memory might not be as crucial compared to freeing uncompressed memory with
> present PTEs?
Yes, freeing uncompressed memory with present PTEs is more important
than freeing compressed zRAM memory.

I suspect that multiple exiting processes releasing swap entries
simultaneously may cause contention on swap_info->lock in
swapcache_free_entries(), reducing the efficiency of releasing swap
entries. Using the asynchronous kworker avoids this issue, although
the improvement from this alone may be minor.

The freeing of zRAM memory does not slow down. In traces of the camera
startup scenario, we observed that the asynchronous kworker releases
all swap entries before the camera preview starts, whereas without the
asynchronous kworker the exiting processes only completed after the
camera preview had started.
>
>> + *
>> + * The function of the swp_entry asynchronous release:
>> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
>> + *    running simultaneously.
>> + * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
>> + *    instead of multiple exiting processes parallel execution.
>> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
>> + */
>> +
>> +/*
>> + * The min number of exiting processes required for swp_entry asynchronous release
>> + */
>> +#define NR_MIN_EXITING_PROCESSES 2
>> +
>> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
>> +static struct kmem_cache *swap_gather_cachep;
>> +static struct workqueue_struct *swapfree_wq;
>> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
>> +
>> +static int __init tlb_swap_async_free_setup(void)
>> +{
>> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
>> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
>> +       if (!swapfree_wq)
>> +               goto fail;
>> +
>> +       swap_gather_cachep = kmem_cache_create("swap_gather",
>> +               sizeof(struct mmu_swap_gather),
>> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
>> +               NULL);
>> +       if (!swap_gather_cachep)
>> +               goto kcache_fail;
>> +
>> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
>> +       return 0;
>> +
>> +kcache_fail:
>> +       destroy_workqueue(swapfree_wq);
>> +fail:
>> +       return -ENOMEM;
>> +}
>> +postcore_initcall(tlb_swap_async_free_setup);
>> +
>> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
>> +{
>> +       struct mmu_swap_batch *swap_batch, *next;
>> +
>> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
>> +               next = swap_batch->next;
>> +               free_page((unsigned long)swap_batch);
>> +       }
>> +       swap_gather->local.next = NULL;
>> +       kmem_cache_free(swap_gather_cachep, swap_gather);
>> +}
>> +
>> +static void tlb_swap_async_free_work(struct work_struct *w)
>> +{
>> +       int i, nr_multi, nr_free;
>> +       swp_entry_t start_entry;
>> +       struct mmu_swap_batch *swap_batch;
>> +       struct mmu_swap_gather *swap_gather = container_of(w,
>> +               struct mmu_swap_gather, free_work);
>> +
>> +       /* Release swap entries cached in mmu_swap_batch. */
>> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
>> +           swap_batch = swap_batch->next) {
>> +               nr_free = 0;
>> +               for (i = 0; i < swap_batch->nr; i++) {
>> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
>> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
>> +                               free_swap_and_cache_nr(start_entry, nr_multi);
>> +                               nr_free += 2;
>> +                       } else {
>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>> +                               free_swap_and_cache_nr(start_entry, 1);
>> +                               nr_free++;
>> +                       }
>> +               }
>> +               swap_batch->nr -= nr_free;
>> +               VM_BUG_ON(swap_batch->nr);
>> +       }
>> +       __tlb_swap_gather_free(swap_gather);
>> +}
>> +
>> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
>> +{
>> +       /*
>> +        * Only the exiting processes with the MM_SWAPENTS counter >=
>> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
>> +        * entries by asynchronous kworker.
>> +        */
>> +       if (!task_is_dying() ||
>> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
>> +               return true;
>> +
>> +       atomic_inc(&nr_exiting_processes);
>> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
>> +               tlb->swp_freeable = 1;
>> +       else
>> +               tlb->swp_freeing = 1;
>> +
>> +       return false;
>> +}
>> +
>> +/**
>> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
>> + * for swp_entry tear-down.
>> + * @tlb: the mmu_swap_gather structure belongs to tlb
>> + */
>> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
>> +{
>> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
>> +       if (unlikely(!tlb->swp))
>> +               return false;
>> +
>> +       tlb->swp->local.next  = NULL;
>> +       tlb->swp->local.nr    = 0;
>> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
>> +
>> +       tlb->swp->active      = &tlb->swp->local;
>> +       tlb->swp->batch_count = 0;
>> +
>> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
>> +       return true;
>> +}
>> +
>> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
>> +{
>> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
>> +               return;
>> +
>> +       tlb->swp = NULL;
>> +       tlb->swp_freeable = 0;
>> +       tlb->swp_freeing = 0;
>> +       tlb->swp_disable = 0;
>> +
>> +       if (__tlb_swap_gather_mmu_check(tlb))
>> +               return;
>> +
>> +       /*
>> +        * If the exiting process meets the conditions of
>> +        * swp_entry asynchronous release, an mmu_swap_gather
>> +        * structure will be initialized.
>> +        */
>> +       if (tlb->swp_freeing)
>> +               __tlb_swap_gather_init(tlb);
>> +}
>> +
>> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
>> +{
>> +       queue_work(swapfree_wq, &tlb->swp->free_work);
>> +       tlb->swp = NULL;
>> +       if (!finish)
>> +               __tlb_swap_gather_init(tlb);
>> +}
>> +
>> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
>> +{
>> +       struct mmu_swap_batch *swap_batch;
>> +
>> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
>> +               goto free;
>> +
>> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
>> +       if (unlikely(!swap_batch))
>> +               goto free;
>> +
>> +       swap_batch->next = NULL;
>> +       swap_batch->nr   = 0;
>> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
>> +
>> +       tlb->swp->active->next = swap_batch;
>> +       tlb->swp->active = swap_batch;
>> +       tlb->swp->batch_count++;
>> +       return true;
>> +free:
>> +       /* batch move to wq */
>> +       __tlb_swap_gather_queuework(tlb, false);
>> +       return false;
>> +}
>> +
>> +/**
>> + * __tlb_remove_swap_entries - the swap entries in exiting process are
>> + * isolated, batch cached in struct mmu_swap_batch.
>> + * @tlb: the current mmu_gather
>> + * @entry: swp_entry to be isolated and cached
>> + * @nr: the number of consecutive entries starting from entry parameter.
>> + */
>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>> +                            swp_entry_t entry, int nr)
>> +{
>> +       struct mmu_swap_batch *swap_batch;
>> +       unsigned long flags = 0;
>> +       bool ret = false;
>> +
>> +       if (tlb->swp_disable)
>> +               return ret;
>> +
>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>> +               return ret;
>> +
>> +       if (tlb->swp_freeable) {
>> +               if (atomic_read(&nr_exiting_processes) <
>> +                   NR_MIN_EXITING_PROCESSES)
>> +                       return ret;
>> +               /*
>> +                * If the current number of exiting processes
>> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
>> +                * process with swp_freeable state will enter
>> +                * swp_freeing state to start releasing its
>> +                * remaining swap entries by the asynchronous
>> +                * kworker.
>> +                */
>> +               tlb->swp_freeable = 0;
>> +               tlb->swp_freeing = 1;
>> +       }
>> +
>> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
>> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
>> +               return ret;
>> +
>> +       swap_batch = tlb->swp->active;
>> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
>> +               __tlb_swap_gather_queuework(tlb, false);
>> +               return ret;
>> +       }
>> +
>> +       if (likely(nr == 1)) {
>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>> +       } else {
>> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
>> +       }
>> +       ret = true;
>> +
>> +       if (swap_batch->nr >= swap_batch->max - 1) {
>> +               if (!__tlb_swap_next_batch(tlb))
>> +                       goto exit;
>> +               swap_batch = tlb->swp->active;
>> +       }
>> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
>> +exit:
>> +       return ret;
>> +}
>> +
>> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
>> +{
>> +       if (tlb->swp_disable)
>> +               return;
>> +
>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>> +               return;
>> +
>> +       if (tlb->swp_freeable) {
>> +               tlb->swp_freeable = 0;
>> +               VM_BUG_ON(tlb->swp_freeing);
>> +               goto exit;
>> +       }
>> +       tlb->swp_freeing = 0;
>> +       if (unlikely(!tlb->swp))
>> +               goto exit;
>> +
>> +       __tlb_swap_gather_queuework(tlb, true);
>> +exit:
>> +       atomic_dec(&nr_exiting_processes);
>> +}
>>
>>   static bool tlb_next_batch(struct mmu_gather *tlb)
>>   {
>> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>>          tlb->local.max  = ARRAY_SIZE(tlb->__pages);
>>          tlb->active     = &tlb->local;
>>          tlb->batch_count = 0;
>> +
>> +       tlb->swp_disable = 1;
>> +       __tlb_swap_gather_mmu(tlb);
>>   #endif
>>          tlb->delayed_rmap = 0;
>>
>> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
>>
>>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>          tlb_batch_list_free(tlb);
>> +       __tlb_batch_swap_finish(tlb);
>>   #endif
>>          dec_tlb_flush_pending(tlb->mm);
>>   }
>> --
>> 2.39.0
>>
> Thanks
> Barry
Thanks
Zhiguo
zhiguojiang Sept. 4, 2024, 11:38 a.m. UTC | #3
On 2024/9/4 19:26, zhiguojiang wrote:
>
>
> On 2024/9/4 17:16, Barry Song wrote:
>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>> One of the main reasons for the prolonged exit of the process with
>>> independent mm is the time-consuming release of its swap entries.
>>> The proportion of swap memory occupied by the process increases over
>>> time due to high memory pressure triggering to reclaim anonymous folio
>>> into swapspace, e.g., in Android devices, we found this proportion can
>>> reach 60% or more after a period of time. Additionally, the relatively
>>> lengthy path for releasing swap entries further contributes to the
>>> longer time required to release swap entries.
>>>
>>> Testing Platform: 8GB RAM
>>> Testing procedure:
>>> After booting up, start 15 processes first, and then observe the
>>> physical memory size occupied by the last launched process at different
>>> time points.
>>> Example: The process launched last: com.qiyi.video
>>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
>>> -------------------------------------------------------------------
>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>>> Note: min - minute.
>>>
>>> When there are multiple processes with independent mm and the high
>>> memory pressure in system, if the large memory required process is
>>> launched at this time, system will is likely to trigger the instantaneous
>>> killing of many processes with independent mm. Due to multiple exiting
>>> processes occupying multiple CPU core resources for concurrent execution,
>>> leading to some issues such as the current non-exiting and important
>>> processes lagging.
>>>
>>> To solve this problem, we have introduced the multiple exiting process
>>> asynchronous swap entries release mechanism, which isolates and caches
>>> swap entries occupied by multiple exiting processes, and hands them over
>>> to an asynchronous kworker to complete the release. This allows the
>>> exiting processes to complete quickly and release CPU resources. We have
>>> validated this modification on the Android products and achieved the
>>> expected benefits.
>>>
>>> Testing Platform: 8GB RAM
>>> Testing procedure:
>>> After restarting the machine, start 15 app processes first, and then
>>> start the camera app processes, we monitor the cold start and preview
>>> time datas of the camera app processes.
>>>
>>> Test datas of camera processes cold start time (unit: millisecond):
>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>
>>> Test datas of camera processes preview time (unit: millisecond):
>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>
>>> Base on the average of the six sets of test datas above, we can see 
>>> that
>>> the benefit datas of the modified patch:
>>> 1. The cold start time of camera app processes has reduced by about 
>>> 20%.
>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>
>>> It offers several benefits:
>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>     processes running simultaneously.
>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>     kworker instead of multiple exiting processes parallel execution.
>>> 3. Release pte_present memory occupied by exiting processes more
>>>     efficiently.
>>>
>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>> ---
>>>   arch/s390/include/asm/tlb.h |   8 +
>>>   include/asm-generic/tlb.h   |  44 ++++++
>>>   include/linux/mm_types.h    |  58 +++++++
>>>   mm/memory.c                 |   3 +-
>>>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>   5 files changed, 408 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>> index e95b2c8081eb..3f681f63390f
>>> --- a/arch/s390/include/asm/tlb.h
>>> +++ b/arch/s390/include/asm/tlb.h
>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct 
>>> mmu_gather *tlb,
>>>                  struct page *page, bool delay_rmap, int page_size);
>>>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>                  struct page *page, unsigned int nr_pages, bool 
>>> delay_rmap);
>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>> +               swp_entry_t entry, int nr);
>>>
>>>   #define tlb_flush tlb_flush
>>>   #define pte_free_tlb pte_free_tlb
>>> @@ -69,6 +71,12 @@ static inline bool 
>>> __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>          return false;
>>>   }
>>>
>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>> +               swp_entry_t entry, int nr)
>>> +{
>>> +       return false;
>>> +}
>>> +
>>>   static inline void tlb_flush(struct mmu_gather *tlb)
>>>   {
>>>          __tlb_flush_mm_lazy(tlb->mm);
>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>>> index 709830274b75..8b4d516b35b8
>>> --- a/include/asm-generic/tlb.h
>>> +++ b/include/asm-generic/tlb.h
>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather 
>>> *tlb, struct vm_area_struct *vma);
>>>   static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct 
>>> vm_area_struct *vma) { }
>>>   #endif
>>>
>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>> +struct mmu_swap_batch {
>>> +       struct mmu_swap_batch *next;
>>> +       unsigned int nr;
>>> +       unsigned int max;
>>> +       encoded_swpentry_t encoded_entrys[];
>>> +};
>>> +
>>> +#define MAX_SWAP_GATHER_BATCH  \
>>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
>>> +
>>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
>>> +
>>> +struct mmu_swap_gather {
>>> +       /*
>>> +        * the asynchronous kworker to batch
>>> +        * release swap entries
>>> +        */
>>> +       struct work_struct free_work;
>>> +
>>> +       /* batch cache swap entries */
>>> +       unsigned int batch_count;
>>> +       struct mmu_swap_batch *active;
>>> +       struct mmu_swap_batch local;
>>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
>>> +};
>>> +
>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>> +               swp_entry_t entry, int nr);
>>> +#endif
>>> +
>>>   /*
>>>    * struct mmu_gather is an opaque type used by the mm code for 
>>> passing around
>>>    * any data needed by arch specific code for tlb_remove_page.
>>> @@ -343,6 +374,18 @@ struct mmu_gather {
>>>          unsigned int            vma_exec : 1;
>>>          unsigned int            vma_huge : 1;
>>>          unsigned int            vma_pfn  : 1;
>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>> +       /*
>>> +        * Two states of releasing swap entries
>>> +        * asynchronously:
>>> +        * swp_freeable - have opportunity to
>>> +        * release asynchronously future
>>> +        * swp_freeing - be releasing asynchronously.
>>> +        */
>>> +       unsigned int            swp_freeable : 1;
>>> +       unsigned int            swp_freeing : 1;
>>> +       unsigned int            swp_disable : 1;
>>> +#endif
>>>
>>>          unsigned int            batch_count;
>>>
>>> @@ -354,6 +397,7 @@ struct mmu_gather {
>>>   #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>>>          unsigned int page_size;
>>>   #endif
>>> +       struct mmu_swap_gather *swp;
>>>   #endif
>>>   };
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 165c58b12ccc..2f66303f1519
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -283,6 +283,64 @@ typedef struct {
>>>          unsigned long val;
>>>   } swp_entry_t;
>>>
>>> +/*
>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
>>> + *
>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the 
>>> highest
>>> + * bit indicating extra context-dependent information. Only used in 
>>> swp_entry
>>> + * asynchronous release path by mmu_swap_gather.
>>> + */
>>> +typedef struct {
>>> +       unsigned long val;
>>> +} encoded_swpentry_t;
>>> +
>>> +/*
>>> + * The next item in an encoded_swpentry_t array is the "nr" 
>>> argument, specifying the
>>> + * total number of consecutive swap entries associated with the 
>>> same folio. If this
>>> + * bit is not set, "nr" is implicitly 1.
>>> + *
>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, 
>>> swp_type bits: 58 ~ 62.
>>> + * Bit63 can be used here.
>>> + */
>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
>>> +
>>> +static __always_inline encoded_swpentry_t
>>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
>>> +{
>>> +       encoded_swpentry_t ret;
>>> +
>>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>> +       ret.val = flags | entry.val;
>>> +       return ret;
>>> +}
>>> +
>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
>>> +{
>>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>> +}
>>> +
>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
>>> +{
>>> +       swp_entry_t ret;
>>> +
>>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>> +       return ret;
>>> +}
>>> +
>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
>>> +{
>>> +       encoded_swpentry_t ret;
>>> +
>>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>> +       ret.val = nr;
>>> +       return ret;
>>> +}
>>> +
>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
>>> +{
>>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
>>> +}
>>> +
>>>   /**
>>>    * struct folio - Represents a contiguous set of bytes.
>>>    * @flags: Identical to the page flags.
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index d6a9dcddaca4..023a8adcb67c
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct 
>>> mmu_gather *tlb,
>>>                          if (!should_zap_cows(details))
>>>                                  continue;
>>>                          rss[MM_SWAPENTS] -= nr;
>>> -                       free_swap_and_cache_nr(entry, nr);
>>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
>>> +                               free_swap_and_cache_nr(entry, nr);
>>>                  } else if (is_migration_entry(entry)) {
>>>                          folio = pfn_swap_entry_folio(entry);
>>>                          if (!should_zap_folio(details, folio))
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index 99b3e9408aa0..33dc9d1faff9
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -9,11 +9,303 @@
>>>   #include <linux/smp.h>
>>>   #include <linux/swap.h>
>>>   #include <linux/rmap.h>
>>> +#include <linux/oom.h>
>>>
>>>   #include <asm/pgalloc.h>
>>>   #include <asm/tlb.h>
>>>
>>>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>> +/*
>>> + * The swp_entry asynchronous release mechanism for multiple processes with
>>> + * independent mm exiting simultaneously.
>>> + *
>>> + * During the multiple exiting processes releasing their own mm simultaneously,
>>> + * the swap entries in the exiting processes are handled by isolating, caching
>>> + * and handing over to an asynchronous kworker to complete the release.
>>> + *
>>> + * The conditions for the exiting process entering the swp_entry asynchronous
>>> + * release path:
>>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
>>> + *    to alloc struct mmu_swap_gather frequently.
>>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
>> Hi Zhiguo,
>>
>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It 
>> seems that
>> batched swap entry freeing, even with one process, could be a
>> bottleneck for a single
>> process based on the data from this patch:
>>
>> mm: attempt to batch free swap entries for zap_pte_range()
>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee 
>>
>> "munmap bandwidth becomes 3X faster."
>>
>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
> Hi Barry,
>
> Thanks for your comments.
>
> The reason for NR_MIN_EXITING_PROCESSES = 2 is that we previously ran
> a continuous multi-app startup performance test on Android with
> NR_MIN_EXITING_PROCESSES = 1, and the results showed that the startup
> time deteriorated slightly. However, the patch logic used in that test
> differed from the version I submitted to the community, so the
> regression may have been caused by other issues in the old patch.
>
> That said, we have run the relevant memory performance tests on the
> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
> the results are better than before the modification. The patches have
> been used in multiple projects.
> For example:
> Test the time consumption and subjective fluency experience of
> launching 30 Android apps continuously for two rounds.
> Test machine: RAM 16GB
> |        | time(s) | Frame-dropped rate(%) |
> | before | 230.76  |         0.54         |
> | after  | 225.23  |         0.74         |
Sorry, here is the corrected data.
Test machine: RAM 16GB
|        | time(s) | Frame-dropped rate(%) |
| before | 230.76  |         0.74         |
| after  | 225.23  |         0.54         |
> We can see that the patch improves the startup time by 5.53s, reduces the
> frame-dropped rate by 0.2%, and gives a better subjective smoothness experience.
>
> Perhaps the patches submitted to the community also improve the
> continuous multi-app startup performance on Android in the case of
> NR_MIN_EXITING_PROCESSES=1. If necessary, I will run the relevant
> tests to verify this in the future.
>>
>>> + *
>>> + * Since the time for determining the number of exiting processes 
>>> is dynamic,
>>> + * the exiting process may start to enter the swp_entry 
>>> asynchronous release
>>> + * at the beginning or middle stage of the exiting process's 
>>> swp_entry release
>>> + * path.
>>> + *
>>> + * Once an exiting process enters the swp_entry asynchronous 
>>> release, all remaining
>>> + * swap entries in this exiting process need to be fully released 
>>> by asynchronous
>>> + * kworker theoretically.
>> Freeing a slot can indeed release memory from `zRAM`, potentially 
>> returning
>> it to the system for allocation. Your patch frees swap slots 
>> asynchronously;
>> I assume this doesn’t slow down the memory freeing process for 
>> `zRAM`, or
>> could it even slow down the freeing of `zRAM` memory? Freeing compressed
>> memory might not be as crucial compared to freeing uncompressed 
>> memory with
>> present PTEs?
> Yes, freeing uncompressed memory with present PTEs is more important
> compared to freeing compressed 'zRAM' memory.
>
> I suspect that multiple exiting processes releasing swap entries
> simultaneously may cause contention on swap_info->lock in
> swapcache_free_entries(), reducing the efficiency of releasing swap
> entries. With the asynchronous kworker this contention is avoided,
> although the improvement may be minor.
>
> The freeing of zRAM memory does not slow down. We have examined traces
> of the camera startup scene and found that the asynchronous kworker
> releases all swap entries before the camera preview starts. Without
> the asynchronous kworker, the exiting processes only completed after
> the camera preview had started.
>>
>>> + *
>>> + * The function of the swp_entry asynchronous release:
>>> + * 1. Alleviate the high system cpu load caused by multiple exiting 
>>> processes
>>> + *    running simultaneously.
>>> + * 2. Reduce lock competition in swap entry free path by an 
>>> asynchronous kworker
>>> + *    instead of multiple exiting processes parallel execution.
>>> + * 3. Release pte_present memory occupied by exiting processes more 
>>> efficiently.
>>> + */
>>> +
>>> +/*
>>> + * The min number of exiting processes required for swp_entry 
>>> asynchronous release
>>> + */
>>> +#define NR_MIN_EXITING_PROCESSES 2
>>> +
>>> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
>>> +static struct kmem_cache *swap_gather_cachep;
>>> +static struct workqueue_struct *swapfree_wq;
>>> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
>>> +
>>> +static int __init tlb_swap_async_free_setup(void)
>>> +{
>>> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
>>> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
>>> +       if (!swapfree_wq)
>>> +               goto fail;
>>> +
>>> +       swap_gather_cachep = kmem_cache_create("swap_gather",
>>> +               sizeof(struct mmu_swap_gather),
>>> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
>>> +               NULL);
>>> +       if (!swap_gather_cachep)
>>> +               goto kcache_fail;
>>> +
>>> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
>>> +       return 0;
>>> +
>>> +kcache_fail:
>>> +       destroy_workqueue(swapfree_wq);
>>> +fail:
>>> +       return -ENOMEM;
>>> +}
>>> +postcore_initcall(tlb_swap_async_free_setup);
>>> +
>>> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
>>> +{
>>> +       struct mmu_swap_batch *swap_batch, *next;
>>> +
>>> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
>>> +               next = swap_batch->next;
>>> +               free_page((unsigned long)swap_batch);
>>> +       }
>>> +       swap_gather->local.next = NULL;
>>> +       kmem_cache_free(swap_gather_cachep, swap_gather);
>>> +}
>>> +
>>> +static void tlb_swap_async_free_work(struct work_struct *w)
>>> +{
>>> +       int i, nr_multi, nr_free;
>>> +       swp_entry_t start_entry;
>>> +       struct mmu_swap_batch *swap_batch;
>>> +       struct mmu_swap_gather *swap_gather = container_of(w,
>>> +               struct mmu_swap_gather, free_work);
>>> +
>>> +       /* Release swap entries cached in mmu_swap_batch. */
>>> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
>>> +           swap_batch = swap_batch->next) {
>>> +               nr_free = 0;
>>> +               for (i = 0; i < swap_batch->nr; i++) {
>>> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
>>> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>>> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
>>> +                               free_swap_and_cache_nr(start_entry, nr_multi);
>>> +                               nr_free += 2;
>>> +                       } else {
>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>>> +                               free_swap_and_cache_nr(start_entry, 1);
>>> +                               nr_free++;
>>> +                       }
>>> +               }
>>> +               swap_batch->nr -= nr_free;
>>> +               VM_BUG_ON(swap_batch->nr);
>>> +       }
>>> +       __tlb_swap_gather_free(swap_gather);
>>> +}
>>> +
>>> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
>>> +{
>>> +       /*
>>> +        * Only the exiting processes with the MM_SWAPENTS counter >=
>>> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
>>> +        * entries by asynchronous kworker.
>>> +        */
>>> +       if (!task_is_dying() ||
>>> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
>>> +               return true;
>>> +
>>> +       atomic_inc(&nr_exiting_processes);
>>> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
>>> +               tlb->swp_freeable = 1;
>>> +       else
>>> +               tlb->swp_freeing = 1;
>>> +
>>> +       return false;
>>> +}
>>> +
>>> +/**
>>> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
>>> + * for swp_entry tear-down.
>>> + * @tlb: the mmu_swap_gather structure belongs to tlb
>>> + */
>>> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
>>> +{
>>> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
>>> +       if (unlikely(!tlb->swp))
>>> +               return false;
>>> +
>>> +       tlb->swp->local.next  = NULL;
>>> +       tlb->swp->local.nr    = 0;
>>> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
>>> +
>>> +       tlb->swp->active      = &tlb->swp->local;
>>> +       tlb->swp->batch_count = 0;
>>> +
>>> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
>>> +       return true;
>>> +}
>>> +
>>> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
>>> +{
>>> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
>>> +               return;
>>> +
>>> +       tlb->swp = NULL;
>>> +       tlb->swp_freeable = 0;
>>> +       tlb->swp_freeing = 0;
>>> +       tlb->swp_disable = 0;
>>> +
>>> +       if (__tlb_swap_gather_mmu_check(tlb))
>>> +               return;
>>> +
>>> +       /*
>>> +        * If the exiting process meets the conditions of
>>> +        * swp_entry asynchronous release, an mmu_swap_gather
>>> +        * structure will be initialized.
>>> +        */
>>> +       if (tlb->swp_freeing)
>>> +               __tlb_swap_gather_init(tlb);
>>> +}
>>> +
>>> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
>>> +{
>>> +       queue_work(swapfree_wq, &tlb->swp->free_work);
>>> +       tlb->swp = NULL;
>>> +       if (!finish)
>>> +               __tlb_swap_gather_init(tlb);
>>> +}
>>> +
>>> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
>>> +{
>>> +       struct mmu_swap_batch *swap_batch;
>>> +
>>> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
>>> +               goto free;
>>> +
>>> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
>>> +       if (unlikely(!swap_batch))
>>> +               goto free;
>>> +
>>> +       swap_batch->next = NULL;
>>> +       swap_batch->nr   = 0;
>>> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
>>> +
>>> +       tlb->swp->active->next = swap_batch;
>>> +       tlb->swp->active = swap_batch;
>>> +       tlb->swp->batch_count++;
>>> +       return true;
>>> +free:
>>> +       /* batch move to wq */
>>> +       __tlb_swap_gather_queuework(tlb, false);
>>> +       return false;
>>> +}
>>> +
>>> +/**
>>> + * __tlb_remove_swap_entries - the swap entries in exiting process are
>>> + * isolated, batch cached in struct mmu_swap_batch.
>>> + * @tlb: the current mmu_gather
>>> + * @entry: swp_entry to be isolated and cached
>>> + * @nr: the number of consecutive entries starting from entry parameter.
>>> + */
>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>> +                            swp_entry_t entry, int nr)
>>> +{
>>> +       struct mmu_swap_batch *swap_batch;
>>> +       unsigned long flags = 0;
>>> +       bool ret = false;
>>> +
>>> +       if (tlb->swp_disable)
>>> +               return ret;
>>> +
>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>>> +               return ret;
>>> +
>>> +       if (tlb->swp_freeable) {
>>> +               if (atomic_read(&nr_exiting_processes) <
>>> +                   NR_MIN_EXITING_PROCESSES)
>>> +                       return ret;
>>> +               /*
>>> +                * If the current number of exiting processes
>>> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
>>> +                * process with swp_freeable state will enter
>>> +                * swp_freeing state to start releasing its
>>> +                * remaining swap entries by the asynchronous
>>> +                * kworker.
>>> +                */
>>> +               tlb->swp_freeable = 0;
>>> +               tlb->swp_freeing = 1;
>>> +       }
>>> +
>>> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
>>> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
>>> +               return ret;
>>> +
>>> +       swap_batch = tlb->swp->active;
>>> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
>>> +               __tlb_swap_gather_queuework(tlb, false);
>>> +               return ret;
>>> +       }
>>> +
>>> +       if (likely(nr == 1)) {
>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>>> +       } else {
>>> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
>>> +       }
>>> +       ret = true;
>>> +
>>> +       if (swap_batch->nr >= swap_batch->max - 1) {
>>> +               if (!__tlb_swap_next_batch(tlb))
>>> +                       goto exit;
>>> +               swap_batch = tlb->swp->active;
>>> +       }
>>> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
>>> +exit:
>>> +       return ret;
>>> +}
>>> +
>>> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
>>> +{
>>> +       if (tlb->swp_disable)
>>> +               return;
>>> +
>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>>> +               return;
>>> +
>>> +       if (tlb->swp_freeable) {
>>> +               tlb->swp_freeable = 0;
>>> +               VM_BUG_ON(tlb->swp_freeing);
>>> +               goto exit;
>>> +       }
>>> +       tlb->swp_freeing = 0;
>>> +       if (unlikely(!tlb->swp))
>>> +               goto exit;
>>> +
>>> +       __tlb_swap_gather_queuework(tlb, true);
>>> +exit:
>>> +       atomic_dec(&nr_exiting_processes);
>>> +}
>>>
>>>   static bool tlb_next_batch(struct mmu_gather *tlb)
>>>   {
>>> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather 
>>> *tlb, struct mm_struct *mm,
>>>          tlb->local.max  = ARRAY_SIZE(tlb->__pages);
>>>          tlb->active     = &tlb->local;
>>>          tlb->batch_count = 0;
>>> +
>>> +       tlb->swp_disable = 1;
>>> +       __tlb_swap_gather_mmu(tlb);
>>>   #endif
>>>          tlb->delayed_rmap = 0;
>>>
>>> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
>>>
>>>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>          tlb_batch_list_free(tlb);
>>> +       __tlb_batch_swap_finish(tlb);
>>>   #endif
>>>          dec_tlb_flush_pending(tlb->mm);
>>>   }
>>> -- 
>>> 2.39.0
>>>
>> Thanks
>> Barry
> Thanks
> Zhiguo
>
Barry Song Sept. 9, 2024, 1:59 a.m. UTC | #4
On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> On 2024/9/4 17:16, Barry Song wrote:
> > On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >> One of the main reasons for the prolonged exit of the process with
> >> independent mm is the time-consuming release of its swap entries.
> >> The proportion of swap memory occupied by the process increases over
> >> time due to high memory pressure triggering to reclaim anonymous folio
> >> into swapspace, e.g., in Android devices, we found this proportion can
> >> reach 60% or more after a period of time. Additionally, the relatively
> >> lengthy path for releasing swap entries further contributes to the
> >> longer time required to release swap entries.
> >>
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After booting up, start 15 processes first, and then observe the
> >> physical memory size occupied by the last launched process at different
> >> time points.
> >> Example: The process launched last: com.qiyi.video
> >> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
> >> -------------------------------------------------------------------
> >> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> >> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> >> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> >> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> >> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> >> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> >> Note: min - minute.
> >>
> >> When there are multiple processes with independent mm and the high
> >> memory pressure in system, if the large memory required process is
> >> launched at this time, system will is likely to trigger the instantaneous
> >> killing of many processes with independent mm. Due to multiple exiting
> >> processes occupying multiple CPU core resources for concurrent execution,
> >> leading to some issues such as the current non-exiting and important
> >> processes lagging.
> >>
> >> To solve this problem, we have introduced the multiple exiting process
> >> asynchronous swap entries release mechanism, which isolates and caches
> >> swap entries occupied by multiple exiting processes, and hands them over
> >> to an asynchronous kworker to complete the release. This allows the
> >> exiting processes to complete quickly and release CPU resources. We have
> >> validated this modification on the Android products and achieved the
> >> expected benefits.
> >>
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After restarting the machine, start 15 app processes first, and then
> >> start the camera app processes, we monitor the cold start and preview
> >> time datas of the camera app processes.
> >>
> >> Test datas of camera processes cold start time (unit: millisecond):
> >> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> >> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
> >>
> >> Test datas of camera processes preview time (unit: millisecond):
> >> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> >> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
> >>
> >> Base on the average of the six sets of test datas above, we can see that
> >> the benefit datas of the modified patch:
> >> 1. The cold start time of camera app processes has reduced by about 20%.
> >> 2. The preview time of camera app processes has reduced by about 42%.
> >>
> >> It offers several benefits:
> >> 1. Alleviate the high system cpu loading caused by multiple exiting
> >>     processes running simultaneously.
> >> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>     kworker instead of multiple exiting processes parallel execution.
> >> 3. Release pte_present memory occupied by exiting processes more
> >>     efficiently.
> >>
> >> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> >> ---
> >>   arch/s390/include/asm/tlb.h |   8 +
> >>   include/asm-generic/tlb.h   |  44 ++++++
> >>   include/linux/mm_types.h    |  58 +++++++
> >>   mm/memory.c                 |   3 +-
> >>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
> >>   5 files changed, 408 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> >> index e95b2c8081eb..3f681f63390f
> >> --- a/arch/s390/include/asm/tlb.h
> >> +++ b/arch/s390/include/asm/tlb.h
> >> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> >>                  struct page *page, bool delay_rmap, int page_size);
> >>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>                  struct page *page, unsigned int nr_pages, bool delay_rmap);
> >> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >> +               swp_entry_t entry, int nr);
> >>
> >>   #define tlb_flush tlb_flush
> >>   #define pte_free_tlb pte_free_tlb
> >> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>          return false;
> >>   }
> >>
> >> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >> +               swp_entry_t entry, int nr)
> >> +{
> >> +       return false;
> >> +}
> >> +
> >>   static inline void tlb_flush(struct mmu_gather *tlb)
> >>   {
> >>          __tlb_flush_mm_lazy(tlb->mm);
> >> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> >> index 709830274b75..8b4d516b35b8
> >> --- a/include/asm-generic/tlb.h
> >> +++ b/include/asm-generic/tlb.h
> >> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
> >>   static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
> >>   #endif
> >>
> >> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >> +struct mmu_swap_batch {
> >> +       struct mmu_swap_batch *next;
> >> +       unsigned int nr;
> >> +       unsigned int max;
> >> +       encoded_swpentry_t encoded_entrys[];
> >> +};
> >> +
> >> +#define MAX_SWAP_GATHER_BATCH  \
> >> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
> >> +
> >> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
> >> +
> >> +struct mmu_swap_gather {
> >> +       /*
> >> +        * the asynchronous kworker to batch
> >> +        * release swap entries
> >> +        */
> >> +       struct work_struct free_work;
> >> +
> >> +       /* batch cache swap entries */
> >> +       unsigned int batch_count;
> >> +       struct mmu_swap_batch *active;
> >> +       struct mmu_swap_batch local;
> >> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
> >> +};
> >> +
> >> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >> +               swp_entry_t entry, int nr);
> >> +#endif
> >> +
> >>   /*
> >>    * struct mmu_gather is an opaque type used by the mm code for passing around
> >>    * any data needed by arch specific code for tlb_remove_page.
> >> @@ -343,6 +374,18 @@ struct mmu_gather {
> >>          unsigned int            vma_exec : 1;
> >>          unsigned int            vma_huge : 1;
> >>          unsigned int            vma_pfn  : 1;
> >> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >> +       /*
> >> +        * Two states of releasing swap entries
> >> +        * asynchronously:
> >> +        * swp_freeable - have opportunity to
> >> +        * release asynchronously future
> >> +        * swp_freeing - be releasing asynchronously.
> >> +        */
> >> +       unsigned int            swp_freeable : 1;
> >> +       unsigned int            swp_freeing : 1;
> >> +       unsigned int            swp_disable : 1;
> >> +#endif
> >>
> >>          unsigned int            batch_count;
> >>
> >> @@ -354,6 +397,7 @@ struct mmu_gather {
> >>   #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
> >>          unsigned int page_size;
> >>   #endif
> >> +       struct mmu_swap_gather *swp;
> >>   #endif
> >>   };
> >>
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 165c58b12ccc..2f66303f1519
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -283,6 +283,64 @@ typedef struct {
> >>          unsigned long val;
> >>   } swp_entry_t;
> >>
> >> +/*
> >> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
> >> + *
> >> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
> >> + * bit indicating extra context-dependent information. Only used in swp_entry
> >> + * asynchronous release path by mmu_swap_gather.
> >> + */
> >> +typedef struct {
> >> +       unsigned long val;
> >> +} encoded_swpentry_t;
> >> +
> >> +/*
> >> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
> >> + * total number of consecutive swap entries associated with the same folio. If this
> >> + * bit is not set, "nr" is implicitly 1.
> >> + *
> >> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
> >> + * Bit63 can be used here.
> >> + */
> >> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
> >> +
> >> +static __always_inline encoded_swpentry_t
> >> +encode_swpentry(swp_entry_t entry, unsigned long flags)
> >> +{
> >> +       encoded_swpentry_t ret;
> >> +
> >> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >> +       ret.val = flags | entry.val;
> >> +       return ret;
> >> +}
> >> +
> >> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
> >> +{
> >> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >> +}
> >> +
> >> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
> >> +{
> >> +       swp_entry_t ret;
> >> +
> >> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >> +       return ret;
> >> +}
> >> +
> >> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
> >> +{
> >> +       encoded_swpentry_t ret;
> >> +
> >> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >> +       ret.val = nr;
> >> +       return ret;
> >> +}
> >> +
> >> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
> >> +{
> >> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
> >> +}
> >> +
> >>   /**
> >>    * struct folio - Represents a contiguous set of bytes.
> >>    * @flags: Identical to the page flags.
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index d6a9dcddaca4..023a8adcb67c
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >>                          if (!should_zap_cows(details))
> >>                                  continue;
> >>                          rss[MM_SWAPENTS] -= nr;
> >> -                       free_swap_and_cache_nr(entry, nr);
> >> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
> >> +                               free_swap_and_cache_nr(entry, nr);
> >>                  } else if (is_migration_entry(entry)) {
> >>                          folio = pfn_swap_entry_folio(entry);
> >>                          if (!should_zap_folio(details, folio))
> >> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >> index 99b3e9408aa0..33dc9d1faff9
> >> --- a/mm/mmu_gather.c
> >> +++ b/mm/mmu_gather.c
> >> @@ -9,11 +9,303 @@
> >>   #include <linux/smp.h>
> >>   #include <linux/swap.h>
> >>   #include <linux/rmap.h>
> >> +#include <linux/oom.h>
> >>
> >>   #include <asm/pgalloc.h>
> >>   #include <asm/tlb.h>
> >>
> >>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
> >> +/*
> >> + * The swp_entry asynchronous release mechanism for multiple processes with
> >> + * independent mm exiting simultaneously.
> >> + *
> >> + * During the multiple exiting processes releasing their own mm simultaneously,
> >> + * the swap entries in the exiting processes are handled by isolating, caching
> >> + * and handing over to an asynchronous kworker to complete the release.
> >> + *
> >> + * The conditions for the exiting process entering the swp_entry asynchronous
> >> + * release path:
> >> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
> >> + *    to alloc struct mmu_swap_gather frequently.
> >> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
> > Hi Zhiguo,
> >
> > I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
> > batched swap entry freeing, even with one process, could be a
> > bottleneck for a single
> > process based on the data from this patch:
> >
> > mm: attempt to batch free swap entries for zap_pte_range()
> > https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
> > "munmap bandwidth becomes 3X faster."
> >
> > So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
> Hi Barry,
>
> Thanks for your comments.
>
> The reason for NR_MIN_EXITING_PROCESSES = 2 is that we previously ran
> a continuous multi-app startup performance test on Android with
> NR_MIN_EXITING_PROCESSES = 1, and the results showed that the startup
> time deteriorated slightly. However, the patch logic used in that test
> differed from the version I submitted to the community, so the
> regression may have been caused by other issues in the old patch.
>
> That said, we have run the relevant memory performance tests on the
> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
> the results are better than before the modification. The patches have
> been used in multiple projects.
> For example:
> Test the time consumption and subjective fluency experience of
> launching 30 Android apps continuously for two rounds.
> Test machine: RAM 16GB
> |        | time(s) | Frame-dropped rate(%) |
> | before | 230.76  |         0.54         |
> | after  | 225.23  |         0.74         |
> We can see that the patch improves the startup time by 5.53s, reduces the
> frame-dropped rate by 0.2%, and gives a better subjective smoothness experience.
>
> Perhaps the patches submitted to the community also improve the
> continuous multi-app startup performance on Android in the case of
> NR_MIN_EXITING_PROCESSES=1. If necessary, I will run the relevant
> tests to verify this in the future.

Using a fixed value like 2 feels more like a workaround than a solid solution.
It would be better if we could eliminate this hack.

Additionally, this type of asynchronous reclamation might struggle to scale
effectively, particularly on NUMA systems with many CPU cores.

Many kernel threads are per-node, like kswapd. For instance, if we have 100
threads running on 100 CPUs executing zap_pte_range(), your approach, which
relies on a single async thread to reclaim swap entries, might lead to
performance regressions.

We might need to consider a more adaptable approach that can evaluate the
machine's topology and dynamically determine the appropriate number of async
threads, rather than hard-coding it to just one. Otherwise, there could be
ongoing concerns about whether this solution is truly applicable to all
systems.
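
To make that idea concrete, here is a minimal sketch of what a per-node
variant might look like. It keeps the patch's idea of an unbound,
high-priority swap-free workqueue, but allocates one per NUMA node and
queues the gathered entries on the node the exiting task is running on.
The function names, the single max_active value and the omitted error
unwinding are illustrative assumptions, not part of the posted series:

static struct workqueue_struct **swapfree_wqs;

static int __init tlb_swap_async_free_setup_pernode(void)
{
	int nid;

	/* one unbound, high-priority swap-free workqueue per NUMA node */
	swapfree_wqs = kcalloc(nr_node_ids, sizeof(*swapfree_wqs), GFP_KERNEL);
	if (!swapfree_wqs)
		return -ENOMEM;

	for_each_node(nid) {
		swapfree_wqs[nid] = alloc_workqueue("smfree_wq_%d",
				WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, nid);
		if (!swapfree_wqs[nid])
			return -ENOMEM;	/* cleanup omitted in this sketch */
	}
	return 0;
}

/* queue the gathered swap entries on the node the exiting task runs on */
static void tlb_swap_gather_queue_pernode(struct mmu_swap_gather *swp)
{
	queue_work(swapfree_wqs[numa_node_id()], &swp->free_work);
}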

Alternatively, we might be able to develop a method to speed up batched
freeing in a synchronous manner after collecting the mmu_swap_batch.
mmu_gather isn't async, but it can still speed up tlb flush, right?
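
For reference, the synchronous variant could be as small as draining the
gathered batch in task context at tlb_finish_mmu() time instead of queueing
it, which keeps the batching benefit without an extra thread. This is only a
sketch on top of the helpers introduced by the patch, not something that has
been posted or measured:

/* Hypothetical synchronous counterpart of __tlb_swap_gather_queuework(). */
static void __tlb_batch_swap_drain(struct mmu_gather *tlb)
{
	if (!tlb->swp)
		return;

	/*
	 * Run the same walk over the cached mmu_swap_batch entries that the
	 * kworker would run, but directly in the exiting task's context.
	 * tlb_swap_async_free_work() also frees the mmu_swap_gather itself.
	 */
	tlb_swap_async_free_work(&tlb->swp->free_work);
	tlb->swp = NULL;
}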

For phones with just 8 CPU cores, I definitely like your patch. However,
since we're aiming for something that affects all systems, the situation
might be more complex.

> >
> >> + *
> >> + * Since the time for determining the number of exiting processes is dynamic,
> >> + * the exiting process may start to enter the swp_entry asynchronous release
> >> + * at the beginning or middle stage of the exiting process's swp_entry release
> >> + * path.
> >> + *
> >> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
> >> + * swap entries in this exiting process need to be fully released by asynchronous
> >> + * kworker theoretically.
> > Freeing a slot can indeed release memory from `zRAM`, potentially returning
> > it to the system for allocation. Your patch frees swap slots asynchronously;
> > I assume this doesn’t slow down the memory freeing process for `zRAM`, or
> > could it even slow down the freeing of `zRAM` memory? Freeing compressed
> > memory might not be as crucial compared to freeing uncompressed memory with
> > present PTEs?
> Yes, freeing uncompressed memory with present PTEs is more important
> compared to freeing compressed 'zRAM' memory.
>
> I suspect that multiple exiting processes releasing swap entries
> simultaneously may cause contention on swap_info->lock in
> swapcache_free_entries(), reducing the efficiency of releasing swap
> entries. With the asynchronous kworker this contention is avoided,
> although the improvement may be minor.
>
> The freeing of zRAM memory does not slow down. We have examined traces
> of the camera startup scene and found that the asynchronous kworker
> releases all swap entries before the camera preview starts. Without
> the asynchronous kworker, the exiting processes only completed after
> the camera preview had started.
> >
> >> + *
> >> + * The function of the swp_entry asynchronous release:
> >> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
> >> + *    running simultaneously.
> >> + * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
> >> + *    instead of multiple exiting processes parallel execution.
> >> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
> >> + */
> >> +
> >> +/*
> >> + * The min number of exiting processes required for swp_entry asynchronous release
> >> + */
> >> +#define NR_MIN_EXITING_PROCESSES 2
> >> +
> >> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
> >> +static struct kmem_cache *swap_gather_cachep;
> >> +static struct workqueue_struct *swapfree_wq;
> >> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
> >> +
> >> +static int __init tlb_swap_async_free_setup(void)
> >> +{
> >> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
> >> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> >> +       if (!swapfree_wq)
> >> +               goto fail;
> >> +
> >> +       swap_gather_cachep = kmem_cache_create("swap_gather",
> >> +               sizeof(struct mmu_swap_gather),
> >> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
> >> +               NULL);
> >> +       if (!swap_gather_cachep)
> >> +               goto kcache_fail;
> >> +
> >> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
> >> +       return 0;
> >> +
> >> +kcache_fail:
> >> +       destroy_workqueue(swapfree_wq);
> >> +fail:
> >> +       return -ENOMEM;
> >> +}
> >> +postcore_initcall(tlb_swap_async_free_setup);
> >> +
> >> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
> >> +{
> >> +       struct mmu_swap_batch *swap_batch, *next;
> >> +
> >> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
> >> +               next = swap_batch->next;
> >> +               free_page((unsigned long)swap_batch);
> >> +       }
> >> +       swap_gather->local.next = NULL;
> >> +       kmem_cache_free(swap_gather_cachep, swap_gather);
> >> +}
> >> +
> >> +static void tlb_swap_async_free_work(struct work_struct *w)
> >> +{
> >> +       int i, nr_multi, nr_free;
> >> +       swp_entry_t start_entry;
> >> +       struct mmu_swap_batch *swap_batch;
> >> +       struct mmu_swap_gather *swap_gather = container_of(w,
> >> +               struct mmu_swap_gather, free_work);
> >> +
> >> +       /* Release swap entries cached in mmu_swap_batch. */
> >> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
> >> +           swap_batch = swap_batch->next) {
> >> +               nr_free = 0;
> >> +               for (i = 0; i < swap_batch->nr; i++) {
> >> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
> >> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
> >> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> >> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
> >> +                               free_swap_and_cache_nr(start_entry, nr_multi);
> >> +                               nr_free += 2;
> >> +                       } else {
> >> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> >> +                               free_swap_and_cache_nr(start_entry, 1);
> >> +                               nr_free++;
> >> +                       }
> >> +               }
> >> +               swap_batch->nr -= nr_free;
> >> +               VM_BUG_ON(swap_batch->nr);
> >> +       }
> >> +       __tlb_swap_gather_free(swap_gather);
> >> +}
> >> +
> >> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
> >> +{
> >> +       /*
> >> +        * Only the exiting processes with the MM_SWAPENTS counter >=
> >> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
> >> +        * entries by asynchronous kworker.
> >> +        */
> >> +       if (!task_is_dying() ||
> >> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
> >> +               return true;
> >> +
> >> +       atomic_inc(&nr_exiting_processes);
> >> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
> >> +               tlb->swp_freeable = 1;
> >> +       else
> >> +               tlb->swp_freeing = 1;
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +/**
> >> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
> >> + * for swp_entry tear-down.
> >> + * @tlb: the mmu_swap_gather structure belongs to tlb
> >> + */
> >> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
> >> +{
> >> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
> >> +       if (unlikely(!tlb->swp))
> >> +               return false;
> >> +
> >> +       tlb->swp->local.next  = NULL;
> >> +       tlb->swp->local.nr    = 0;
> >> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
> >> +
> >> +       tlb->swp->active      = &tlb->swp->local;
> >> +       tlb->swp->batch_count = 0;
> >> +
> >> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
> >> +       return true;
> >> +}
> >> +
> >> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
> >> +{
> >> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
> >> +               return;
> >> +
> >> +       tlb->swp = NULL;
> >> +       tlb->swp_freeable = 0;
> >> +       tlb->swp_freeing = 0;
> >> +       tlb->swp_disable = 0;
> >> +
> >> +       if (__tlb_swap_gather_mmu_check(tlb))
> >> +               return;
> >> +
> >> +       /*
> >> +        * If the exiting process meets the conditions of
> >> +        * swp_entry asynchronous release, an mmu_swap_gather
> >> +        * structure will be initialized.
> >> +        */
> >> +       if (tlb->swp_freeing)
> >> +               __tlb_swap_gather_init(tlb);
> >> +}
> >> +
> >> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
> >> +{
> >> +       queue_work(swapfree_wq, &tlb->swp->free_work);
> >> +       tlb->swp = NULL;
> >> +       if (!finish)
> >> +               __tlb_swap_gather_init(tlb);
> >> +}
> >> +
> >> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
> >> +{
> >> +       struct mmu_swap_batch *swap_batch;
> >> +
> >> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
> >> +               goto free;
> >> +
> >> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
> >> +       if (unlikely(!swap_batch))
> >> +               goto free;
> >> +
> >> +       swap_batch->next = NULL;
> >> +       swap_batch->nr   = 0;
> >> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
> >> +
> >> +       tlb->swp->active->next = swap_batch;
> >> +       tlb->swp->active = swap_batch;
> >> +       tlb->swp->batch_count++;
> >> +       return true;
> >> +free:
> >> +       /* batch move to wq */
> >> +       __tlb_swap_gather_queuework(tlb, false);
> >> +       return false;
> >> +}
> >> +
> >> +/**
> >> + * __tlb_remove_swap_entries - the swap entries in exiting process are
> >> + * isolated, batch cached in struct mmu_swap_batch.
> >> + * @tlb: the current mmu_gather
> >> + * @entry: swp_entry to be isolated and cached
> >> + * @nr: the number of consecutive entries starting from entry parameter.
> >> + */
> >> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >> +                            swp_entry_t entry, int nr)
> >> +{
> >> +       struct mmu_swap_batch *swap_batch;
> >> +       unsigned long flags = 0;
> >> +       bool ret = false;
> >> +
> >> +       if (tlb->swp_disable)
> >> +               return ret;
> >> +
> >> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> >> +               return ret;
> >> +
> >> +       if (tlb->swp_freeable) {
> >> +               if (atomic_read(&nr_exiting_processes) <
> >> +                   NR_MIN_EXITING_PROCESSES)
> >> +                       return ret;
> >> +               /*
> >> +                * If the current number of exiting processes
> >> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
> >> +                * process with swp_freeable state will enter
> >> +                * swp_freeing state to start releasing its
> >> +                * remaining swap entries by the asynchronous
> >> +                * kworker.
> >> +                */
> >> +               tlb->swp_freeable = 0;
> >> +               tlb->swp_freeing = 1;
> >> +       }
> >> +
> >> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
> >> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
> >> +               return ret;
> >> +
> >> +       swap_batch = tlb->swp->active;
> >> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
> >> +               __tlb_swap_gather_queuework(tlb, false);
> >> +               return ret;
> >> +       }
> >> +
> >> +       if (likely(nr == 1)) {
> >> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> >> +       } else {
> >> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
> >> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> >> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
> >> +       }
> >> +       ret = true;
> >> +
> >> +       if (swap_batch->nr >= swap_batch->max - 1) {
> >> +               if (!__tlb_swap_next_batch(tlb))
> >> +                       goto exit;
> >> +               swap_batch = tlb->swp->active;
> >> +       }
> >> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
> >> +exit:
> >> +       return ret;
> >> +}
> >> +
> >> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
> >> +{
> >> +       if (tlb->swp_disable)
> >> +               return;
> >> +
> >> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> >> +               return;
> >> +
> >> +       if (tlb->swp_freeable) {
> >> +               tlb->swp_freeable = 0;
> >> +               VM_BUG_ON(tlb->swp_freeing);
> >> +               goto exit;
> >> +       }
> >> +       tlb->swp_freeing = 0;
> >> +       if (unlikely(!tlb->swp))
> >> +               goto exit;
> >> +
> >> +       __tlb_swap_gather_queuework(tlb, true);
> >> +exit:
> >> +       atomic_dec(&nr_exiting_processes);
> >> +}
> >>
> >>   static bool tlb_next_batch(struct mmu_gather *tlb)
> >>   {
> >> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
> >>          tlb->local.max  = ARRAY_SIZE(tlb->__pages);
> >>          tlb->active     = &tlb->local;
> >>          tlb->batch_count = 0;
> >> +
> >> +       tlb->swp_disable = 1;
> >> +       __tlb_swap_gather_mmu(tlb);
> >>   #endif
> >>          tlb->delayed_rmap = 0;
> >>
> >> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
> >>
> >>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>          tlb_batch_list_free(tlb);
> >> +       __tlb_batch_swap_finish(tlb);
> >>   #endif
> >>          dec_tlb_flush_pending(tlb->mm);
> >>   }
> >> --
> >> 2.39.0
> >>

Thanks
Barry
David Hildenbrand Sept. 9, 2024, 6:49 a.m. UTC | #5
On 05.08.24 17:36, Zhiguo Jiang wrote:
> One of the main reasons for the prolonged exit of the process with
> independent mm is the time-consuming release of its swap entries.
> The proportion of swap memory occupied by the process increases over
> time due to high memory pressure triggering to reclaim anonymous folio
> into swapspace, e.g., in Android devices, we found this proportion can
> reach 60% or more after a period of time. Additionally, the relatively
> lengthy path for releasing swap entries further contributes to the
> longer time required to release swap entries.
> 
> Testing Platform: 8GB RAM
> Testing procedure:
> After booting up, start 15 processes first, and then observe the
> physical memory size occupied by the last launched process at different
> time points.
> Example: The process launched last: com.qiyi.video
> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
> -------------------------------------------------------------------
> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> Note: min - minute.
> 
> When there are multiple processes with independent mm and the high
> memory pressure in system, if the large memory required process is
> launched at this time, system will is likely to trigger the instantaneous
> killing of many processes with independent mm. Due to multiple exiting
> processes occupying multiple CPU core resources for concurrent execution,
> leading to some issues such as the current non-exiting and important
> processes lagging.
> 
> To solve this problem, we have introduced the multiple exiting process
> asynchronous swap entries release mechanism, which isolates and caches
> swap entries occupied by multiple exiting processes, and hands them over
> to an asynchronous kworker to complete the release. This allows the
> exiting processes to complete quickly and release CPU resources. We have
> validated this modification on the Android products and achieved the
> expected benefits.
> 
> Testing Platform: 8GB RAM
> Testing procedure:
> After restarting the machine, start 15 app processes first, and then
> start the camera app processes, we monitor the cold start and preview
> time datas of the camera app processes.
> 
> Test datas of camera processes cold start time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
> 
> Test datas of camera processes preview time (unit: millisecond):
> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
> 
> Base on the average of the six sets of test datas above, we can see that
> the benefit datas of the modified patch:
> 1. The cold start time of camera app processes has reduced by about 20%.
> 2. The preview time of camera app processes has reduced by about 42%.
> 
> It offers several benefits:
> 1. Alleviate the high system cpu loading caused by multiple exiting
>     processes running simultaneously.
> 2. Reduce lock competition in swap entry free path by an asynchronous
>     kworker instead of multiple exiting processes parallel execution.
> 3. Release pte_present memory occupied by exiting processes more
>     efficiently.
> 
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> ---
>   arch/s390/include/asm/tlb.h |   8 +
>   include/asm-generic/tlb.h   |  44 ++++++
>   include/linux/mm_types.h    |  58 +++++++
>   mm/memory.c                 |   3 +-
>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>   5 files changed, 408 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index e95b2c8081eb..3f681f63390f
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>   		struct page *page, bool delay_rmap, int page_size);
>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>   		struct page *page, unsigned int nr_pages, bool delay_rmap);
> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +		swp_entry_t entry, int nr);


The problem I am having is that swap entries don't have any intersection 
with the TLB. It sounds like we're squeezing something into an existing 
concept (MMU gather) that just doesn't belong in there.
zhiguojiang Sept. 9, 2024, 2:39 p.m. UTC | #6
On 2024/9/9 9:59, Barry Song wrote:
> On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> On 2024/9/4 17:16, Barry Song wrote:
>>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>> One of the main reasons for the prolonged exit of the process with
>>>> independent mm is the time-consuming release of its swap entries.
>>>> The proportion of swap memory occupied by the process increases over
>>>> time due to high memory pressure triggering to reclaim anonymous folio
>>>> into swapspace, e.g., in Android devices, we found this proportion can
>>>> reach 60% or more after a period of time. Additionally, the relatively
>>>> lengthy path for releasing swap entries further contributes to the
>>>> longer time required to release swap entries.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at different
>>>> time points.
>>>> Example: The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>>>> Note: min - minute.
>>>>
>>>> When there are multiple processes with independent mm and the high
>>>> memory pressure in system, if the large memory required process is
>>>> launched at this time, system will is likely to trigger the instantaneous
>>>> killing of many processes with independent mm. Due to multiple exiting
>>>> processes occupying multiple CPU core resources for concurrent execution,
>>>> leading to some issues such as the current non-exiting and important
>>>> processes lagging.
>>>>
>>>> To solve this problem, we have introduced the multiple exiting process
>>>> asynchronous swap entries release mechanism, which isolates and caches
>>>> swap entries occupied by multiple exiting processes, and hands them over
>>>> to an asynchronous kworker to complete the release. This allows the
>>>> exiting processes to complete quickly and release CPU resources. We have
>>>> validated this modification on the Android products and achieved the
>>>> expected benefits.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After restarting the machine, start 15 app processes first, and then
>>>> start the camera app processes, we monitor the cold start and preview
>>>> time datas of the camera app processes.
>>>>
>>>> Test datas of camera processes cold start time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>>
>>>> Test datas of camera processes preview time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>>
>>>> Base on the average of the six sets of test datas above, we can see that
>>>> the benefit datas of the modified patch:
>>>> 1. The cold start time of camera app processes has reduced by about 20%.
>>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>>
>>>> It offers several benefits:
>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>>      processes running simultaneously.
>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>      kworker instead of multiple exiting processes parallel execution.
>>>> 3. Release pte_present memory occupied by exiting processes more
>>>>      efficiently.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>> ---
>>>>    arch/s390/include/asm/tlb.h |   8 +
>>>>    include/asm-generic/tlb.h   |  44 ++++++
>>>>    include/linux/mm_types.h    |  58 +++++++
>>>>    mm/memory.c                 |   3 +-
>>>>    mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>>    5 files changed, 408 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>> index e95b2c8081eb..3f681f63390f
>>>> --- a/arch/s390/include/asm/tlb.h
>>>> +++ b/arch/s390/include/asm/tlb.h
>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>>>                   struct page *page, bool delay_rmap, int page_size);
>>>>    static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>                   struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +               swp_entry_t entry, int nr);
>>>>
>>>>    #define tlb_flush tlb_flush
>>>>    #define pte_free_tlb pte_free_tlb
>>>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>           return false;
>>>>    }
>>>>
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +               swp_entry_t entry, int nr)
>>>> +{
>>>> +       return false;
>>>> +}
>>>> +
>>>>    static inline void tlb_flush(struct mmu_gather *tlb)
>>>>    {
>>>>           __tlb_flush_mm_lazy(tlb->mm);
>>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>>>> index 709830274b75..8b4d516b35b8
>>>> --- a/include/asm-generic/tlb.h
>>>> +++ b/include/asm-generic/tlb.h
>>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
>>>>    static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
>>>>    #endif
>>>>
>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>> +struct mmu_swap_batch {
>>>> +       struct mmu_swap_batch *next;
>>>> +       unsigned int nr;
>>>> +       unsigned int max;
>>>> +       encoded_swpentry_t encoded_entrys[];
>>>> +};
>>>> +
>>>> +#define MAX_SWAP_GATHER_BATCH  \
>>>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
>>>> +
>>>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
>>>> +
>>>> +struct mmu_swap_gather {
>>>> +       /*
>>>> +        * the asynchronous kworker to batch
>>>> +        * release swap entries
>>>> +        */
>>>> +       struct work_struct free_work;
>>>> +
>>>> +       /* batch cache swap entries */
>>>> +       unsigned int batch_count;
>>>> +       struct mmu_swap_batch *active;
>>>> +       struct mmu_swap_batch local;
>>>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
>>>> +};
>>>> +
>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +               swp_entry_t entry, int nr);
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * struct mmu_gather is an opaque type used by the mm code for passing around
>>>>     * any data needed by arch specific code for tlb_remove_page.
>>>> @@ -343,6 +374,18 @@ struct mmu_gather {
>>>>           unsigned int            vma_exec : 1;
>>>>           unsigned int            vma_huge : 1;
>>>>           unsigned int            vma_pfn  : 1;
>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>> +       /*
>>>> +        * Two states of releasing swap entries
>>>> +        * asynchronously:
>>>> +        * swp_freeable - have opportunity to
>>>> +        * release asynchronously future
>>>> +        * swp_freeing - be releasing asynchronously.
>>>> +        */
>>>> +       unsigned int            swp_freeable : 1;
>>>> +       unsigned int            swp_freeing : 1;
>>>> +       unsigned int            swp_disable : 1;
>>>> +#endif
>>>>
>>>>           unsigned int            batch_count;
>>>>
>>>> @@ -354,6 +397,7 @@ struct mmu_gather {
>>>>    #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>>>>           unsigned int page_size;
>>>>    #endif
>>>> +       struct mmu_swap_gather *swp;
>>>>    #endif
>>>>    };
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 165c58b12ccc..2f66303f1519
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -283,6 +283,64 @@ typedef struct {
>>>>           unsigned long val;
>>>>    } swp_entry_t;
>>>>
>>>> +/*
>>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
>>>> + *
>>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
>>>> + * bit indicating extra context-dependent information. Only used in swp_entry
>>>> + * asynchronous release path by mmu_swap_gather.
>>>> + */
>>>> +typedef struct {
>>>> +       unsigned long val;
>>>> +} encoded_swpentry_t;
>>>> +
>>>> +/*
>>>> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
>>>> + * total number of consecutive swap entries associated with the same folio. If this
>>>> + * bit is not set, "nr" is implicitly 1.
>>>> + *
>>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
>>>> + * Bit63 can be used here.
>>>> + */
>>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
>>>> +
>>>> +static __always_inline encoded_swpentry_t
>>>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
>>>> +{
>>>> +       encoded_swpentry_t ret;
>>>> +
>>>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>>> +       ret.val = flags | entry.val;
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
>>>> +{
>>>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>>> +}
>>>> +
>>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
>>>> +{
>>>> +       swp_entry_t ret;
>>>> +
>>>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
>>>> +{
>>>> +       encoded_swpentry_t ret;
>>>> +
>>>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>>> +       ret.val = nr;
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
>>>> +{
>>>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
>>>> +}
>>>> +
>>>>    /**
>>>>     * struct folio - Represents a contiguous set of bytes.
>>>>     * @flags: Identical to the page flags.
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index d6a9dcddaca4..023a8adcb67c
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>                           if (!should_zap_cows(details))
>>>>                                   continue;
>>>>                           rss[MM_SWAPENTS] -= nr;
>>>> -                       free_swap_and_cache_nr(entry, nr);
>>>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
>>>> +                               free_swap_and_cache_nr(entry, nr);
>>>>                   } else if (is_migration_entry(entry)) {
>>>>                           folio = pfn_swap_entry_folio(entry);
>>>>                           if (!should_zap_folio(details, folio))
>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>> index 99b3e9408aa0..33dc9d1faff9
>>>> --- a/mm/mmu_gather.c
>>>> +++ b/mm/mmu_gather.c
>>>> @@ -9,11 +9,303 @@
>>>>    #include <linux/smp.h>
>>>>    #include <linux/swap.h>
>>>>    #include <linux/rmap.h>
>>>> +#include <linux/oom.h>
>>>>
>>>>    #include <asm/pgalloc.h>
>>>>    #include <asm/tlb.h>
>>>>
>>>>    #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>> +/*
>>>> + * The swp_entry asynchronous release mechanism for multiple processes with
>>>> + * independent mm exiting simultaneously.
>>>> + *
>>>> + * During the multiple exiting processes releasing their own mm simultaneously,
>>>> + * the swap entries in the exiting processes are handled by isolating, caching
>>>> + * and handing over to an asynchronous kworker to complete the release.
>>>> + *
>>>> + * The conditions for the exiting process entering the swp_entry asynchronous
>>>> + * release path:
>>>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
>>>> + *    to alloc struct mmu_swap_gather frequently.
>>>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
>>> Hi Zhiguo,
>>>
>>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
>>> batched swap entry freeing, even with one process, could be a
>>> bottleneck for a single
>>> process based on the data from this patch:
>>>
>>> mm: attempt to batch free swap entries for zap_pte_range()
>>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
>>> "munmap bandwidth becomes 3X faster."
>>>
>>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
>> Hi Barry,
>>
>> Thanks for your comments.
>>
>> The reason for NR_MIN_EXITING_PROCESSES = 2 is that previously we
>> conducted the multiple android apps continuous startup performance
>> test on the case of NR_MIN_EXITING_PROCESSES = 1, and the results
>> showed that the startup time had deteriorated slightly. However,
>> the patch's logic in this test was different from the one I submitted
>> to the community, and it may be due to some other issues with the
>> previous old patch.
>>
>> However, we have conducted relevant memory performance tests on this
>> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
>> the results are better than before the modification. The patches have
>> been used on multiple projects.
>> For example:
>> Test the time consumption and subjective fluency experience of
>> launching 30 android apps continuously for two rounds.
>> Test machine: RAM 16GB
>> |        | time(s) | Frame-droped rate(%) |
>> | before | 230.76  |         0.54         |
>> | after  | 225.23  |         0.74         |
>> We can see that the patch has been optimized 5.53s for startup time and
>> 0.2% frame-droped rate and better subjective smoothness experience.
>>
>> Perhaps the patches submitted to the community has also improved the
>> multiple android apps continuous startup performance in the case
>> of NR_MIN_EXITING_PROCESSES=1. If necessary, I will conduct relevant
>> tests to verify this situation in the future.
Thanks Barry for your valuable suggestions.
> Using a fixed value like 2 feels more like a workaround than a solid solution.
> It would be better if we could eliminate this hack.
OK, I will conduct more tests to try to resolve this parameter issue.
>
> Additionally, this type of asynchronous reclamation might struggle to scale
> effectively, particularly on NUMA systems with many CPU cores.
>
> Many kernel threads are per-node, like kswapd. For instance, if we have 100
> threads running on 100 CPUs executing zap_pte_range(), your approach, which
> relies on a single async thread to reclaim swap entries, might lead to
> performance regressions.
>
> We might need to consider a more adaptable approach that can evaluate the
> machine's topology and dynamically determine the appropriate number of
> async threads, rather than hard-coding it to just one. Otherwise, there
> could be ongoing concerns about whether this solution is truly applicable
> to all systems.
Could we dynamically determine the number of asynchronous kworkers to
create based on the number of exiting processes at a given moment, for
example, one asynchronous kworker for every 8 exiting processes? See the
table and the rough sketch below.

| The number of exiting processes | The number of asynchronous kworkers |
|             [1, 8]              |                  1                  |
|             [9, 16]             |                  2                  |
|              ...                |                 ...                 |
|        [8*N+1, 8*(N+1)]         |                 N+1                 |
(N >= 0)
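
For illustration, a minimal sketch of that scaling, assuming a hypothetical
nr_swapfree_workers() helper built on the patch's existing
nr_exiting_processes counter (the helper and constant names below are
illustrative, not part of this patch):

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/minmax.h>

/*
 * Illustrative only: scale the number of asynchronous free workers with
 * the number of exiting processes, one worker per 8 exiting processes.
 */
#define EXITING_PROCESSES_PER_WORKER	8

static unsigned int nr_swapfree_workers(void)
{
	unsigned int nr_exiting = atomic_read(&nr_exiting_processes);

	/* Keep at least one worker once any process is exiting. */
	return DIV_ROUND_UP(max(nr_exiting, 1U), EXITING_PROCESSES_PER_WORKER);
}

The result could then be used, for example, to decide how many work items
the gathered batches are split across instead of queueing everything on a
single worker.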
> Alternatively, we might be able to develop a method to speed up batched
> freeing in a synchronous manner after collecting the mmu_swap_batch.
> mmu_gather isn't async, but it can still speed up tlb flush, right?
A synchronous approach may provide some optimization, but it does not
appear to significantly reduce the CPU load generated by the multiple
exiting processes. In addition, David pointed out in his latest comment
that swap entries seem unrelated to the TLB.
> For phones with just 8 CPU cores, I definitely like your patch. However,
> since we're aiming for something which can affect all systems, the
> situation might be more complex.
Yes, this patch still needs further improvement to adapt to all systems.
>>>> + *
>>>> + * Since the time for determining the number of exiting processes is dynamic,
>>>> + * the exiting process may start to enter the swp_entry asynchronous release
>>>> + * at the beginning or middle stage of the exiting process's swp_entry release
>>>> + * path.
>>>> + *
>>>> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
>>>> + * swap entries in this exiting process need to be fully released by asynchronous
>>>> + * kworker theoretically.
>>> Freeing a slot can indeed release memory from `zRAM`, potentially returning
>>> it to the system for allocation. Your patch frees swap slots asynchronously;
>>> I assume this doesn’t slow down the memory freeing process for `zRAM`, or
>>> could it even slow down the freeing of `zRAM` memory? Freeing compressed
>>> memory might not be as crucial compared to freeing uncompressed memory with
>>> present PTEs?
>> Yes, freeing uncompressed memory with present PTEs is more important
>> compared to freeing compressed 'zRAM' memory.
>>
>> I guess that the multiple exiting processes releasing swap entries
>> simultaneously may result in the swap_info->lock competition pressure
>> in swapcache_free_entries(), affecting the efficiency of releasing swap
>> entries. However, if the asynchronous kworker is used, this issue can
>> be avoided, and perhaps the improvement is minor.
>>
>> The freeing of zRAM memory does not slow down. We have observed traces
>> in the camera startup scene and found that the asynchronous kworker
>> can release all swap entries before entering the camera preview.
>> Compared to not using the asynchronous kworker, the exiting processes
>> completed after entering the camera preview.
>>>> + *
>>>> + * The function of the swp_entry asynchronous release:
>>>> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
>>>> + *    running simultaneously.
>>>> + * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
>>>> + *    instead of multiple exiting processes parallel execution.
>>>> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
>>>> + */
>>>> +
>>>> +/*
>>>> + * The min number of exiting processes required for swp_entry asynchronous release
>>>> + */
>>>> +#define NR_MIN_EXITING_PROCESSES 2
>>>> +
>>>> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
>>>> +static struct kmem_cache *swap_gather_cachep;
>>>> +static struct workqueue_struct *swapfree_wq;
>>>> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
>>>> +
>>>> +static int __init tlb_swap_async_free_setup(void)
>>>> +{
>>>> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
>>>> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
>>>> +       if (!swapfree_wq)
>>>> +               goto fail;
>>>> +
>>>> +       swap_gather_cachep = kmem_cache_create("swap_gather",
>>>> +               sizeof(struct mmu_swap_gather),
>>>> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
>>>> +               NULL);
>>>> +       if (!swap_gather_cachep)
>>>> +               goto kcache_fail;
>>>> +
>>>> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
>>>> +       return 0;
>>>> +
>>>> +kcache_fail:
>>>> +       destroy_workqueue(swapfree_wq);
>>>> +fail:
>>>> +       return -ENOMEM;
>>>> +}
>>>> +postcore_initcall(tlb_swap_async_free_setup);
>>>> +
>>>> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
>>>> +{
>>>> +       struct mmu_swap_batch *swap_batch, *next;
>>>> +
>>>> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
>>>> +               next = swap_batch->next;
>>>> +               free_page((unsigned long)swap_batch);
>>>> +       }
>>>> +       swap_gather->local.next = NULL;
>>>> +       kmem_cache_free(swap_gather_cachep, swap_gather);
>>>> +}
>>>> +
>>>> +static void tlb_swap_async_free_work(struct work_struct *w)
>>>> +{
>>>> +       int i, nr_multi, nr_free;
>>>> +       swp_entry_t start_entry;
>>>> +       struct mmu_swap_batch *swap_batch;
>>>> +       struct mmu_swap_gather *swap_gather = container_of(w,
>>>> +               struct mmu_swap_gather, free_work);
>>>> +
>>>> +       /* Release swap entries cached in mmu_swap_batch. */
>>>> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
>>>> +           swap_batch = swap_batch->next) {
>>>> +               nr_free = 0;
>>>> +               for (i = 0; i < swap_batch->nr; i++) {
>>>> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
>>>> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
>>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>>>> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
>>>> +                               free_swap_and_cache_nr(start_entry, nr_multi);
>>>> +                               nr_free += 2;
>>>> +                       } else {
>>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
>>>> +                               free_swap_and_cache_nr(start_entry, 1);
>>>> +                               nr_free++;
>>>> +                       }
>>>> +               }
>>>> +               swap_batch->nr -= nr_free;
>>>> +               VM_BUG_ON(swap_batch->nr);
>>>> +       }
>>>> +       __tlb_swap_gather_free(swap_gather);
>>>> +}
>>>> +
>>>> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
>>>> +{
>>>> +       /*
>>>> +        * Only the exiting processes with the MM_SWAPENTS counter >=
>>>> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
>>>> +        * entries by asynchronous kworker.
>>>> +        */
>>>> +       if (!task_is_dying() ||
>>>> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
>>>> +               return true;
>>>> +
>>>> +       atomic_inc(&nr_exiting_processes);
>>>> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
>>>> +               tlb->swp_freeable = 1;
>>>> +       else
>>>> +               tlb->swp_freeing = 1;
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +/**
>>>> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
>>>> + * for swp_entry tear-down.
>>>> + * @tlb: the mmu_swap_gather structure belongs to tlb
>>>> + */
>>>> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
>>>> +{
>>>> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
>>>> +       if (unlikely(!tlb->swp))
>>>> +               return false;
>>>> +
>>>> +       tlb->swp->local.next  = NULL;
>>>> +       tlb->swp->local.nr    = 0;
>>>> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
>>>> +
>>>> +       tlb->swp->active      = &tlb->swp->local;
>>>> +       tlb->swp->batch_count = 0;
>>>> +
>>>> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
>>>> +       return true;
>>>> +}
>>>> +
>>>> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
>>>> +{
>>>> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
>>>> +               return;
>>>> +
>>>> +       tlb->swp = NULL;
>>>> +       tlb->swp_freeable = 0;
>>>> +       tlb->swp_freeing = 0;
>>>> +       tlb->swp_disable = 0;
>>>> +
>>>> +       if (__tlb_swap_gather_mmu_check(tlb))
>>>> +               return;
>>>> +
>>>> +       /*
>>>> +        * If the exiting process meets the conditions of
>>>> +        * swp_entry asynchronous release, an mmu_swap_gather
>>>> +        * structure will be initialized.
>>>> +        */
>>>> +       if (tlb->swp_freeing)
>>>> +               __tlb_swap_gather_init(tlb);
>>>> +}
>>>> +
>>>> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
>>>> +{
>>>> +       queue_work(swapfree_wq, &tlb->swp->free_work);
>>>> +       tlb->swp = NULL;
>>>> +       if (!finish)
>>>> +               __tlb_swap_gather_init(tlb);
>>>> +}
>>>> +
>>>> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
>>>> +{
>>>> +       struct mmu_swap_batch *swap_batch;
>>>> +
>>>> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
>>>> +               goto free;
>>>> +
>>>> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
>>>> +       if (unlikely(!swap_batch))
>>>> +               goto free;
>>>> +
>>>> +       swap_batch->next = NULL;
>>>> +       swap_batch->nr   = 0;
>>>> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
>>>> +
>>>> +       tlb->swp->active->next = swap_batch;
>>>> +       tlb->swp->active = swap_batch;
>>>> +       tlb->swp->batch_count++;
>>>> +       return true;
>>>> +free:
>>>> +       /* batch move to wq */
>>>> +       __tlb_swap_gather_queuework(tlb, false);
>>>> +       return false;
>>>> +}
>>>> +
>>>> +/**
>>>> + * __tlb_remove_swap_entries - the swap entries in exiting process are
>>>> + * isolated, batch cached in struct mmu_swap_batch.
>>>> + * @tlb: the current mmu_gather
>>>> + * @entry: swp_entry to be isolated and cached
>>>> + * @nr: the number of consecutive entries starting from entry parameter.
>>>> + */
>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +                            swp_entry_t entry, int nr)
>>>> +{
>>>> +       struct mmu_swap_batch *swap_batch;
>>>> +       unsigned long flags = 0;
>>>> +       bool ret = false;
>>>> +
>>>> +       if (tlb->swp_disable)
>>>> +               return ret;
>>>> +
>>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>>>> +               return ret;
>>>> +
>>>> +       if (tlb->swp_freeable) {
>>>> +               if (atomic_read(&nr_exiting_processes) <
>>>> +                   NR_MIN_EXITING_PROCESSES)
>>>> +                       return ret;
>>>> +               /*
>>>> +                * If the current number of exiting processes
>>>> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
>>>> +                * process with swp_freeable state will enter
>>>> +                * swp_freeing state to start releasing its
>>>> +                * remaining swap entries by the asynchronous
>>>> +                * kworker.
>>>> +                */
>>>> +               tlb->swp_freeable = 0;
>>>> +               tlb->swp_freeing = 1;
>>>> +       }
>>>> +
>>>> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
>>>> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
>>>> +               return ret;
>>>> +
>>>> +       swap_batch = tlb->swp->active;
>>>> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
>>>> +               __tlb_swap_gather_queuework(tlb, false);
>>>> +               return ret;
>>>> +       }
>>>> +
>>>> +       if (likely(nr == 1)) {
>>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>>>> +       } else {
>>>> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
>>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
>>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
>>>> +       }
>>>> +       ret = true;
>>>> +
>>>> +       if (swap_batch->nr >= swap_batch->max - 1) {
>>>> +               if (!__tlb_swap_next_batch(tlb))
>>>> +                       goto exit;
>>>> +               swap_batch = tlb->swp->active;
>>>> +       }
>>>> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
>>>> +exit:
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
>>>> +{
>>>> +       if (tlb->swp_disable)
>>>> +               return;
>>>> +
>>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
>>>> +               return;
>>>> +
>>>> +       if (tlb->swp_freeable) {
>>>> +               tlb->swp_freeable = 0;
>>>> +               VM_BUG_ON(tlb->swp_freeing);
>>>> +               goto exit;
>>>> +       }
>>>> +       tlb->swp_freeing = 0;
>>>> +       if (unlikely(!tlb->swp))
>>>> +               goto exit;
>>>> +
>>>> +       __tlb_swap_gather_queuework(tlb, true);
>>>> +exit:
>>>> +       atomic_dec(&nr_exiting_processes);
>>>> +}
>>>>
>>>>    static bool tlb_next_batch(struct mmu_gather *tlb)
>>>>    {
>>>> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>>>>           tlb->local.max  = ARRAY_SIZE(tlb->__pages);
>>>>           tlb->active     = &tlb->local;
>>>>           tlb->batch_count = 0;
>>>> +
>>>> +       tlb->swp_disable = 1;
>>>> +       __tlb_swap_gather_mmu(tlb);
>>>>    #endif
>>>>           tlb->delayed_rmap = 0;
>>>>
>>>> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
>>>>
>>>>    #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>>           tlb_batch_list_free(tlb);
>>>> +       __tlb_batch_swap_finish(tlb);
>>>>    #endif
>>>>           dec_tlb_flush_pending(tlb->mm);
>>>>    }
>>>> --
>>>> 2.39.0
>>>>
> Thanks
> Barry
Thanks
Zhiguo
zhiguojiang Sept. 9, 2024, 2:43 p.m. UTC | #7
On 2024/9/9 14:49, David Hildenbrand wrote:
> On 05.08.24 17:36, Zhiguo Jiang wrote:
>> One of the main reasons for the prolonged exit of the process with
>> independent mm is the time-consuming release of its swap entries.
>> The proportion of swap memory occupied by the process increases over
>> time due to high memory pressure triggering to reclaim anonymous folio
>> into swapspace, e.g., in Android devices, we found this proportion can
>> reach 60% or more after a period of time. Additionally, the relatively
>> lengthy path for releasing swap entries further contributes to the
>> longer time required to release swap entries.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After booting up, start 15 processes first, and then observe the
>> physical memory size occupied by the last launched process at different
>> time points.
>> Example: The process launched last: com.qiyi.video
>> |  memory type  |  0min  |  1min  |   5min  |   10min  | 15min  |
>> -------------------------------------------------------------------
>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 | 199748  |
>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 | 67660  |
>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 | 131136  |
>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  | 952  |
>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 | 366488  |
>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% | 64.72%  |
>> Note: min - minute.
>>
>> When there are multiple processes with independent mm and the high
>> memory pressure in system, if the large memory required process is
>> launched at this time, system will is likely to trigger the 
>> instantaneous
>> killing of many processes with independent mm. Due to multiple exiting
>> processes occupying multiple CPU core resources for concurrent 
>> execution,
>> leading to some issues such as the current non-exiting and important
>> processes lagging.
>>
>> To solve this problem, we have introduced the multiple exiting process
>> asynchronous swap entries release mechanism, which isolates and caches
>> swap entries occupied by multiple exiting processes, and hands them over
>> to an asynchronous kworker to complete the release. This allows the
>> exiting processes to complete quickly and release CPU resources. We have
>> validated this modification on the Android products and achieved the
>> expected benefits.
>>
>> Testing Platform: 8GB RAM
>> Testing procedure:
>> After restarting the machine, start 15 app processes first, and then
>> start the camera app processes, we monitor the cold start and preview
>> time datas of the camera app processes.
>>
>> Test datas of camera processes cold start time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>
>> Test datas of camera processes preview time (unit: millisecond):
>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>
>> Base on the average of the six sets of test datas above, we can see that
>> the benefit datas of the modified patch:
>> 1. The cold start time of camera app processes has reduced by about 20%.
>> 2. The preview time of camera app processes has reduced by about 42%.
>>
>> It offers several benefits:
>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>     processes running simultaneously.
>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>     kworker instead of multiple exiting processes parallel execution.
>> 3. Release pte_present memory occupied by exiting processes more
>>     efficiently.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>> ---
>>   arch/s390/include/asm/tlb.h |   8 +
>>   include/asm-generic/tlb.h   |  44 ++++++
>>   include/linux/mm_types.h    |  58 +++++++
>>   mm/memory.c                 |   3 +-
>>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>   5 files changed, 408 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>> index e95b2c8081eb..3f681f63390f
>> --- a/arch/s390/include/asm/tlb.h
>> +++ b/arch/s390/include/asm/tlb.h
>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct 
>> mmu_gather *tlb,
>>           struct page *page, bool delay_rmap, int page_size);
>>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>           struct page *page, unsigned int nr_pages, bool delay_rmap);
>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>> +        swp_entry_t entry, int nr);
>
>
> The problem I am having is that swap entries don't have any 
> intersection with the TLB. It sounds like we're squeezing something 
> into an existing concept (MMU gather) that just doesn't belong in there.
I modeled this on the TLB batch-release (mmu_gather) mechanism; perhaps a
new, dedicated structure should be created to implement this feature
instead, as sketched below.
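
As a rough, purely illustrative sketch of that direction (hypothetical
names, not part of this patch), the gather could live next to struct
mmu_gather instead of inside it:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/workqueue.h>

/*
 * Hypothetical: a swap-entry gather decoupled from struct mmu_gather.
 * zap_pte_range() would fill it and queue it to the async worker when
 * the exiting mm is torn down.
 */
struct swap_gather {
	struct work_struct	free_work;	/* async release work    */
	unsigned int		nr;		/* cached entries        */
	unsigned int		max;
	swp_entry_t		entries[];	/* gathered swap entries */
};

static void swap_gather_free_work(struct work_struct *w)
{
	struct swap_gather *sg = container_of(w, struct swap_gather, free_work);
	unsigned int i;

	/* Release the cached entries outside the exiting task's context. */
	for (i = 0; i < sg->nr; i++)
		free_swap_and_cache_nr(sg->entries[i], 1);
	kfree(sg);
}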

Thanks
Zhiguo
Barry Song Sept. 10, 2024, 4:18 a.m. UTC | #8
On Tue, Sep 10, 2024 at 2:39 AM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> On 2024/9/9 9:59, Barry Song wrote:
> > On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang <justinjiang@vivo.com> wrote:
> >>
> >>
> >> On 2024/9/4 17:16, Barry Song wrote:
> >>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>> One of the main reasons for the prolonged exit of the process with
> >>>> independent mm is the time-consuming release of its swap entries.
> >>>> The proportion of swap memory occupied by the process increases over
> >>>> time due to high memory pressure triggering to reclaim anonymous folio
> >>>> into swapspace, e.g., in Android devices, we found this proportion can
> >>>> reach 60% or more after a period of time. Additionally, the relatively
> >>>> lengthy path for releasing swap entries further contributes to the
> >>>> longer time required to release swap entries.
> >>>>
> >>>> Testing Platform: 8GB RAM
> >>>> Testing procedure:
> >>>> After booting up, start 15 processes first, and then observe the
> >>>> physical memory size occupied by the last launched process at different
> >>>> time points.
> >>>> Example: The process launched last: com.qiyi.video
> >>>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
> >>>> -------------------------------------------------------------------
> >>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> >>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> >>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> >>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> >>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> >>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> >>>> Note: min - minute.
> >>>>
> >>>> When there are multiple processes with independent mm and the high
> >>>> memory pressure in system, if the large memory required process is
> >>>> launched at this time, system will is likely to trigger the instantaneous
> >>>> killing of many processes with independent mm. Due to multiple exiting
> >>>> processes occupying multiple CPU core resources for concurrent execution,
> >>>> leading to some issues such as the current non-exiting and important
> >>>> processes lagging.
> >>>>
> >>>> To solve this problem, we have introduced the multiple exiting process
> >>>> asynchronous swap entries release mechanism, which isolates and caches
> >>>> swap entries occupied by multiple exiting processes, and hands them over
> >>>> to an asynchronous kworker to complete the release. This allows the
> >>>> exiting processes to complete quickly and release CPU resources. We have
> >>>> validated this modification on the Android products and achieved the
> >>>> expected benefits.
> >>>>
> >>>> Testing Platform: 8GB RAM
> >>>> Testing procedure:
> >>>> After restarting the machine, start 15 app processes first, and then
> >>>> start the camera app processes, we monitor the cold start and preview
> >>>> time datas of the camera app processes.
> >>>>
> >>>> Test datas of camera processes cold start time (unit: millisecond):
> >>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> >>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
> >>>>
> >>>> Test datas of camera processes preview time (unit: millisecond):
> >>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> >>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
> >>>>
> >>>> Base on the average of the six sets of test datas above, we can see that
> >>>> the benefit datas of the modified patch:
> >>>> 1. The cold start time of camera app processes has reduced by about 20%.
> >>>> 2. The preview time of camera app processes has reduced by about 42%.
> >>>>
> >>>> It offers several benefits:
> >>>> 1. Alleviate the high system cpu loading caused by multiple exiting
> >>>>      processes running simultaneously.
> >>>> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>>>      kworker instead of multiple exiting processes parallel execution.
> >>>> 3. Release pte_present memory occupied by exiting processes more
> >>>>      efficiently.
> >>>>
> >>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> >>>> ---
> >>>>    arch/s390/include/asm/tlb.h |   8 +
> >>>>    include/asm-generic/tlb.h   |  44 ++++++
> >>>>    include/linux/mm_types.h    |  58 +++++++
> >>>>    mm/memory.c                 |   3 +-
> >>>>    mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
> >>>>    5 files changed, 408 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> >>>> index e95b2c8081eb..3f681f63390f
> >>>> --- a/arch/s390/include/asm/tlb.h
> >>>> +++ b/arch/s390/include/asm/tlb.h
> >>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> >>>>                   struct page *page, bool delay_rmap, int page_size);
> >>>>    static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>                   struct page *page, unsigned int nr_pages, bool delay_rmap);
> >>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>> +               swp_entry_t entry, int nr);
> >>>>
> >>>>    #define tlb_flush tlb_flush
> >>>>    #define pte_free_tlb pte_free_tlb
> >>>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>           return false;
> >>>>    }
> >>>>
> >>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>> +               swp_entry_t entry, int nr)
> >>>> +{
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>>    static inline void tlb_flush(struct mmu_gather *tlb)
> >>>>    {
> >>>>           __tlb_flush_mm_lazy(tlb->mm);
> >>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> >>>> index 709830274b75..8b4d516b35b8
> >>>> --- a/include/asm-generic/tlb.h
> >>>> +++ b/include/asm-generic/tlb.h
> >>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
> >>>>    static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
> >>>>    #endif
> >>>>
> >>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>> +struct mmu_swap_batch {
> >>>> +       struct mmu_swap_batch *next;
> >>>> +       unsigned int nr;
> >>>> +       unsigned int max;
> >>>> +       encoded_swpentry_t encoded_entrys[];
> >>>> +};
> >>>> +
> >>>> +#define MAX_SWAP_GATHER_BATCH  \
> >>>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
> >>>> +
> >>>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
> >>>> +
> >>>> +struct mmu_swap_gather {
> >>>> +       /*
> >>>> +        * the asynchronous kworker to batch
> >>>> +        * release swap entries
> >>>> +        */
> >>>> +       struct work_struct free_work;
> >>>> +
> >>>> +       /* batch cache swap entries */
> >>>> +       unsigned int batch_count;
> >>>> +       struct mmu_swap_batch *active;
> >>>> +       struct mmu_swap_batch local;
> >>>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
> >>>> +};
> >>>> +
> >>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>> +               swp_entry_t entry, int nr);
> >>>> +#endif
> >>>> +
> >>>>    /*
> >>>>     * struct mmu_gather is an opaque type used by the mm code for passing around
> >>>>     * any data needed by arch specific code for tlb_remove_page.
> >>>> @@ -343,6 +374,18 @@ struct mmu_gather {
> >>>>           unsigned int            vma_exec : 1;
> >>>>           unsigned int            vma_huge : 1;
> >>>>           unsigned int            vma_pfn  : 1;
> >>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>> +       /*
> >>>> +        * Two states of releasing swap entries
> >>>> +        * asynchronously:
> >>>> +        * swp_freeable - have opportunity to
> >>>> +        * release asynchronously future
> >>>> +        * swp_freeing - be releasing asynchronously.
> >>>> +        */
> >>>> +       unsigned int            swp_freeable : 1;
> >>>> +       unsigned int            swp_freeing : 1;
> >>>> +       unsigned int            swp_disable : 1;
> >>>> +#endif
> >>>>
> >>>>           unsigned int            batch_count;
> >>>>
> >>>> @@ -354,6 +397,7 @@ struct mmu_gather {
> >>>>    #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
> >>>>           unsigned int page_size;
> >>>>    #endif
> >>>> +       struct mmu_swap_gather *swp;
> >>>>    #endif
> >>>>    };
> >>>>
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index 165c58b12ccc..2f66303f1519
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -283,6 +283,64 @@ typedef struct {
> >>>>           unsigned long val;
> >>>>    } swp_entry_t;
> >>>>
> >>>> +/*
> >>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
> >>>> + *
> >>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
> >>>> + * bit indicating extra context-dependent information. Only used in swp_entry
> >>>> + * asynchronous release path by mmu_swap_gather.
> >>>> + */
> >>>> +typedef struct {
> >>>> +       unsigned long val;
> >>>> +} encoded_swpentry_t;
> >>>> +
> >>>> +/*
> >>>> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
> >>>> + * total number of consecutive swap entries associated with the same folio. If this
> >>>> + * bit is not set, "nr" is implicitly 1.
> >>>> + *
> >>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
> >>>> + * Bit63 can be used here.
> >>>> + */
> >>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
> >>>> +
> >>>> +static __always_inline encoded_swpentry_t
> >>>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
> >>>> +{
> >>>> +       encoded_swpentry_t ret;
> >>>> +
> >>>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >>>> +       ret.val = flags | entry.val;
> >>>> +       return ret;
> >>>> +}
> >>>> +
> >>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
> >>>> +{
> >>>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >>>> +}
> >>>> +
> >>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
> >>>> +{
> >>>> +       swp_entry_t ret;
> >>>> +
> >>>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >>>> +       return ret;
> >>>> +}
> >>>> +
> >>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
> >>>> +{
> >>>> +       encoded_swpentry_t ret;
> >>>> +
> >>>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >>>> +       ret.val = nr;
> >>>> +       return ret;
> >>>> +}
> >>>> +
> >>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
> >>>> +{
> >>>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
> >>>> +}
> >>>> +
> >>>>    /**
> >>>>     * struct folio - Represents a contiguous set of bytes.
> >>>>     * @flags: Identical to the page flags.
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index d6a9dcddaca4..023a8adcb67c
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >>>>                           if (!should_zap_cows(details))
> >>>>                                   continue;
> >>>>                           rss[MM_SWAPENTS] -= nr;
> >>>> -                       free_swap_and_cache_nr(entry, nr);
> >>>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
> >>>> +                               free_swap_and_cache_nr(entry, nr);
> >>>>                   } else if (is_migration_entry(entry)) {
> >>>>                           folio = pfn_swap_entry_folio(entry);
> >>>>                           if (!should_zap_folio(details, folio))
> >>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >>>> index 99b3e9408aa0..33dc9d1faff9
> >>>> --- a/mm/mmu_gather.c
> >>>> +++ b/mm/mmu_gather.c
> >>>> @@ -9,11 +9,303 @@
> >>>>    #include <linux/smp.h>
> >>>>    #include <linux/swap.h>
> >>>>    #include <linux/rmap.h>
> >>>> +#include <linux/oom.h>
> >>>>
> >>>>    #include <asm/pgalloc.h>
> >>>>    #include <asm/tlb.h>
> >>>>
> >>>>    #ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>> +/*
> >>>> + * The swp_entry asynchronous release mechanism for multiple processes with
> >>>> + * independent mm exiting simultaneously.
> >>>> + *
> >>>> + * During the multiple exiting processes releasing their own mm simultaneously,
> >>>> + * the swap entries in the exiting processes are handled by isolating, caching
> >>>> + * and handing over to an asynchronous kworker to complete the release.
> >>>> + *
> >>>> + * The conditions for the exiting process entering the swp_entry asynchronous
> >>>> + * release path:
> >>>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
> >>>> + *    to alloc struct mmu_swap_gather frequently.
> >>>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
> >>> Hi Zhiguo,
> >>>
> >>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
> >>> batched swap entry freeing, even with one process, could be a
> >>> bottleneck for a single
> >>> process based on the data from this patch:
> >>>
> >>> mm: attempt to batch free swap entries for zap_pte_range()
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
> >>> "munmap bandwidth becomes 3X faster."
> >>>
> >>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
> >> Hi Barry,
> >>
> >> Thanks for your comments.
> >>
> >> The reason for NR_MIN_EXITING_PROCESSES = 2 is that previously we
> >> conducted the multiple android apps continuous startup performance
> >> test on the case of NR_MIN_EXITING_PROCESSES = 1, and the results
> >> showed that the startup time had deteriorated slightly. However,
> >> the patch's logic in this test was different from the one I submitted
> >> to the community, and it may be due to some other issues with the
> >> previous old patch.
> >>
> >> However, we have conducted relevant memory performance tests on this
> >> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
> >> the results are better than before the modification. The patches have
> >> been used on multiple projects.
> >> For example:
> >> Test the time consumption and subjective fluency experience of
> >> launching 30 android apps continuously for two rounds.
> >> Test machine: RAM 16GB
> >> |        | time(s) | Frame-droped rate(%) |
> >> | before | 230.76  |         0.54         |
> >> | after  | 225.23  |         0.74         |
> >> We can see that the patch has been optimized 5.53s for startup time and
> >> 0.2% frame-droped rate and better subjective smoothness experience.
> >>
> >> Perhaps the patches submitted to the community has also improved the
> >> multiple android apps continuous startup performance in the case
> >> of NR_MIN_EXITING_PROCESSES=1. If necessary, I will conduct relevant
> >> tests to verify this situation in the future.
> Thanks Barry for your valuable suggestions.
> > Using a fixed value like 2 feels more like a workaround than a solid solution.
> > It would be better if we could eliminate this hack.
> Ok, I will conduct more tests for tring to solve this parameter issue.
> >
> > Additionally, this type of asynchronous reclamation might struggle to scale
> > effectively, particularly on NUMA systems with many CPU cores.
> >
> > Many kernel threads are per-node, like kswapd. For instance, if we have 100
> > threads running on 100 CPUs executing zap_pte_range(), your approach, which
> > relies on a single async thread to reclaim swap entries, might lead to
> > performance
> > regressions.
> >
> > We might need to consider a more adaptable approach that can evaluate
> > the machine's
> > topology and dynamically determine the appropriate number of async
> > threads, rather
> > than hard-coding it to just one. Otherwise, there could be ongoing
> > concerns about
> > whether this solution is truly applicable to all systems.
> Can we dynamically determine the number of asynchronous kworkers to be
> created based on the number of exiting processes at a certain moment,
> for example, every 8 exiting processes corresponds to one asynchronous
> kworker?
>
> | The number of exiting processes | The number of asynchronous kworkers |
> |          [1, 8]                 | 1                    |
> |          [9, 16]                | 2                    |
> |          ...                    | ...                   |
> |      [8*N+1, 8*(N+1)]           | N+1                   |
> N >= 0

I'm not entirely sure. Another potential approach could be to use a
dedicated thread for each NUMA node, but we would need data to confirm
if this is the right solution.
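
For illustration only, one way to sketch that per-node idea with standard
workqueue APIs (the names and setup below are hypothetical, not something
proposed in this patch):

#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

/* Hypothetical: one unbound async free workqueue per NUMA node. */
static struct workqueue_struct **swapfree_wqs;

static int __init swapfree_wq_per_node_init(void)
{
	int nid;

	swapfree_wqs = kcalloc(nr_node_ids, sizeof(*swapfree_wqs), GFP_KERNEL);
	if (!swapfree_wqs)
		return -ENOMEM;

	for_each_node(nid) {
		swapfree_wqs[nid] = alloc_workqueue("smfree_wq_n%d",
				WQ_UNBOUND | WQ_MEM_RECLAIM, 1, nid);
		if (!swapfree_wqs[nid])
			return -ENOMEM;	/* error unwinding omitted for brevity */
	}
	return 0;
}

/* Queue a gathered batch on the worker of the caller's node. */
static void swapfree_queue_local(struct work_struct *work)
{
	queue_work(swapfree_wqs[numa_node_id()], work);
}

A lighter variant might keep the single unbound workqueue and use
queue_work_node() to hint the node of the exiting task, but either way the
scalability question raised here would need data.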

> > Alternatively, we might be able to develop a method to speed up
> > batched freeing in a
> > synchronous manner after collecting the mmu_swap_batch. mmu_gather isn't
> > async, but it can still speed up tlb flush, right?
> The synchronous manner may have some optimization effect, but it seems
> that the optimization effect on the CPUs load occupied by multiple
> exiting processes is not significant. In addition, David stated in the
> latest comment that swap entry seems unrelated to tlb.
> > For phones with just 8 CPU cores, I definitely like your patch.
> > However, since we're
> > aiming for something which can affect all systems, the situation might be more
> > complex.
> Yes, this pacth still needs further improvement to adapt to all systems.

If we want to keep things simple, another approach might be to profile the
hotspots within swap_free_nr(), identifying the most time-consuming parts,
and then optimize them based on the data we collect from your
mmu_swap_batch. Then we don't have to make things async. Have you ever
profiled the hotspots?

> >>>> + *
> >>>> + * Since the time for determining the number of exiting processes is dynamic,
> >>>> + * the exiting process may start to enter the swp_entry asynchronous release
> >>>> + * at the beginning or middle stage of the exiting process's swp_entry release
> >>>> + * path.
> >>>> + *
> >>>> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
> >>>> + * swap entries in this exiting process need to be fully released by asynchronous
> >>>> + * kworker theoretically.
> >>> Freeing a slot can indeed release memory from `zRAM`, potentially returning
> >>> it to the system for allocation. Your patch frees swap slots asynchronously;
> >>> I assume this doesn’t slow down the memory freeing process for `zRAM`, or
> >>> could it even slow down the freeing of `zRAM` memory? Freeing compressed
> >>> memory might not be as crucial compared to freeing uncompressed memory with
> >>> present PTEs?
> >> Yes, freeing uncompressed memory with present PTEs is more important
> >> compared to freeing compressed 'zRAM' memory.
> >>
> >> I guess that the multiple exiting processes releasing swap entries
> >> simultaneously may result in the swap_info->lock competition pressure
> >> in swapcache_free_entries(), affecting the efficiency of releasing swap
> >> entries. However, if the asynchronous kworker is used, this issue can
> >> be avoided, and perhaps the improvement is minor.
> >>
> >> The freeing of zRAM memory does not slow down. We have observed traces
> >> in the camera startup scene and found that the asynchronous kworker
> >> can release all swap entries before entering the camera preview.
> >> Compared to not using the asynchronous kworker, the exiting processes
> >> completed after entering the camera preview.
> >>>> + *
> >>>> + * The function of the swp_entry asynchronous release:
> >>>> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
> >>>> + *    running simultaneously.
> >>>> + * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
> >>>> + *    instead of multiple exiting processes parallel execution.
> >>>> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
> >>>> + */
> >>>> +
> >>>> +/*
> >>>> + * The min number of exiting processes required for swp_entry asynchronous release
> >>>> + */
> >>>> +#define NR_MIN_EXITING_PROCESSES 2
> >>>> +
> >>>> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
> >>>> +static struct kmem_cache *swap_gather_cachep;
> >>>> +static struct workqueue_struct *swapfree_wq;
> >>>> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
> >>>> +
> >>>> +static int __init tlb_swap_async_free_setup(void)
> >>>> +{
> >>>> +       swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
> >>>> +               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> >>>> +       if (!swapfree_wq)
> >>>> +               goto fail;
> >>>> +
> >>>> +       swap_gather_cachep = kmem_cache_create("swap_gather",
> >>>> +               sizeof(struct mmu_swap_gather),
> >>>> +               0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
> >>>> +               NULL);
> >>>> +       if (!swap_gather_cachep)
> >>>> +               goto kcache_fail;
> >>>> +
> >>>> +       static_branch_disable(&tlb_swap_asyncfree_disabled);
> >>>> +       return 0;
> >>>> +
> >>>> +kcache_fail:
> >>>> +       destroy_workqueue(swapfree_wq);
> >>>> +fail:
> >>>> +       return -ENOMEM;
> >>>> +}
> >>>> +postcore_initcall(tlb_swap_async_free_setup);
> >>>> +
> >>>> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
> >>>> +{
> >>>> +       struct mmu_swap_batch *swap_batch, *next;
> >>>> +
> >>>> +       for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
> >>>> +               next = swap_batch->next;
> >>>> +               free_page((unsigned long)swap_batch);
> >>>> +       }
> >>>> +       swap_gather->local.next = NULL;
> >>>> +       kmem_cache_free(swap_gather_cachep, swap_gather);
> >>>> +}
> >>>> +
> >>>> +static void tlb_swap_async_free_work(struct work_struct *w)
> >>>> +{
> >>>> +       int i, nr_multi, nr_free;
> >>>> +       swp_entry_t start_entry;
> >>>> +       struct mmu_swap_batch *swap_batch;
> >>>> +       struct mmu_swap_gather *swap_gather = container_of(w,
> >>>> +               struct mmu_swap_gather, free_work);
> >>>> +
> >>>> +       /* Release swap entries cached in mmu_swap_batch. */
> >>>> +       for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
> >>>> +           swap_batch = swap_batch->next) {
> >>>> +               nr_free = 0;
> >>>> +               for (i = 0; i < swap_batch->nr; i++) {
> >>>> +                       if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
> >>>> +                           ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
> >>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> >>>> +                               nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
> >>>> +                               free_swap_and_cache_nr(start_entry, nr_multi);
> >>>> +                               nr_free += 2;
> >>>> +                       } else {
> >>>> +                               start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> >>>> +                               free_swap_and_cache_nr(start_entry, 1);
> >>>> +                               nr_free++;
> >>>> +                       }
> >>>> +               }
> >>>> +               swap_batch->nr -= nr_free;
> >>>> +               VM_BUG_ON(swap_batch->nr);
> >>>> +       }
> >>>> +       __tlb_swap_gather_free(swap_gather);
> >>>> +}
> >>>> +
> >>>> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
> >>>> +{
> >>>> +       /*
> >>>> +        * Only the exiting processes with the MM_SWAPENTS counter >=
> >>>> +        * SWAP_CLUSTER_MAX have the opportunity to release their swap
> >>>> +        * entries by asynchronous kworker.
> >>>> +        */
> >>>> +       if (!task_is_dying() ||
> >>>> +           get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
> >>>> +               return true;
> >>>> +
> >>>> +       atomic_inc(&nr_exiting_processes);
> >>>> +       if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
> >>>> +               tlb->swp_freeable = 1;
> >>>> +       else
> >>>> +               tlb->swp_freeing = 1;
> >>>> +
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
> >>>> + * for swp_entry tear-down.
> >>>> + * @tlb: the mmu_swap_gather structure belongs to tlb
> >>>> + */
> >>>> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
> >>>> +{
> >>>> +       tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
> >>>> +       if (unlikely(!tlb->swp))
> >>>> +               return false;
> >>>> +
> >>>> +       tlb->swp->local.next  = NULL;
> >>>> +       tlb->swp->local.nr    = 0;
> >>>> +       tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
> >>>> +
> >>>> +       tlb->swp->active      = &tlb->swp->local;
> >>>> +       tlb->swp->batch_count = 0;
> >>>> +
> >>>> +       INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
> >>>> +       return true;
> >>>> +}
> >>>> +
> >>>> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
> >>>> +{
> >>>> +       if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
> >>>> +               return;
> >>>> +
> >>>> +       tlb->swp = NULL;
> >>>> +       tlb->swp_freeable = 0;
> >>>> +       tlb->swp_freeing = 0;
> >>>> +       tlb->swp_disable = 0;
> >>>> +
> >>>> +       if (__tlb_swap_gather_mmu_check(tlb))
> >>>> +               return;
> >>>> +
> >>>> +       /*
> >>>> +        * If the exiting process meets the conditions of
> >>>> +        * swp_entry asynchronous release, an mmu_swap_gather
> >>>> +        * structure will be initialized.
> >>>> +        */
> >>>> +       if (tlb->swp_freeing)
> >>>> +               __tlb_swap_gather_init(tlb);
> >>>> +}
> >>>> +
> >>>> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
> >>>> +{
> >>>> +       queue_work(swapfree_wq, &tlb->swp->free_work);
> >>>> +       tlb->swp = NULL;
> >>>> +       if (!finish)
> >>>> +               __tlb_swap_gather_init(tlb);
> >>>> +}
> >>>> +
> >>>> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
> >>>> +{
> >>>> +       struct mmu_swap_batch *swap_batch;
> >>>> +
> >>>> +       if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
> >>>> +               goto free;
> >>>> +
> >>>> +       swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
> >>>> +       if (unlikely(!swap_batch))
> >>>> +               goto free;
> >>>> +
> >>>> +       swap_batch->next = NULL;
> >>>> +       swap_batch->nr   = 0;
> >>>> +       swap_batch->max  = MAX_SWAP_GATHER_BATCH;
> >>>> +
> >>>> +       tlb->swp->active->next = swap_batch;
> >>>> +       tlb->swp->active = swap_batch;
> >>>> +       tlb->swp->batch_count++;
> >>>> +       return true;
> >>>> +free:
> >>>> +       /* batch move to wq */
> >>>> +       __tlb_swap_gather_queuework(tlb, false);
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * __tlb_remove_swap_entries - the swap entries in exiting process are
> >>>> + * isolated, batch cached in struct mmu_swap_batch.
> >>>> + * @tlb: the current mmu_gather
> >>>> + * @entry: swp_entry to be isolated and cached
> >>>> + * @nr: the number of consecutive entries starting from entry parameter.
> >>>> + */
> >>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>> +                            swp_entry_t entry, int nr)
> >>>> +{
> >>>> +       struct mmu_swap_batch *swap_batch;
> >>>> +       unsigned long flags = 0;
> >>>> +       bool ret = false;
> >>>> +
> >>>> +       if (tlb->swp_disable)
> >>>> +               return ret;
> >>>> +
> >>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> >>>> +               return ret;
> >>>> +
> >>>> +       if (tlb->swp_freeable) {
> >>>> +               if (atomic_read(&nr_exiting_processes) <
> >>>> +                   NR_MIN_EXITING_PROCESSES)
> >>>> +                       return ret;
> >>>> +               /*
> >>>> +                * If the current number of exiting processes
> >>>> +                * is >= NR_MIN_EXITING_PROCESSES, the exiting
> >>>> +                * process with swp_freeable state will enter
> >>>> +                * swp_freeing state to start releasing its
> >>>> +                * remaining swap entries by the asynchronous
> >>>> +                * kworker.
> >>>> +                */
> >>>> +               tlb->swp_freeable = 0;
> >>>> +               tlb->swp_freeing = 1;
> >>>> +       }
> >>>> +
> >>>> +       VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
> >>>> +       if (!tlb->swp && !__tlb_swap_gather_init(tlb))
> >>>> +               return ret;
> >>>> +
> >>>> +       swap_batch = tlb->swp->active;
> >>>> +       if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
> >>>> +               __tlb_swap_gather_queuework(tlb, false);
> >>>> +               return ret;
> >>>> +       }
> >>>> +
> >>>> +       if (likely(nr == 1)) {
> >>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> >>>> +       } else {
> >>>> +               flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
> >>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> >>>> +               swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
> >>>> +       }
> >>>> +       ret = true;
> >>>> +
> >>>> +       if (swap_batch->nr >= swap_batch->max - 1) {
> >>>> +               if (!__tlb_swap_next_batch(tlb))
> >>>> +                       goto exit;
> >>>> +               swap_batch = tlb->swp->active;
> >>>> +       }
> >>>> +       VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
> >>>> +exit:
> >>>> +       return ret;
> >>>> +}
> >>>> +
> >>>> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
> >>>> +{
> >>>> +       if (tlb->swp_disable)
> >>>> +               return;
> >>>> +
> >>>> +       if (!tlb->swp_freeable && !tlb->swp_freeing)
> >>>> +               return;
> >>>> +
> >>>> +       if (tlb->swp_freeable) {
> >>>> +               tlb->swp_freeable = 0;
> >>>> +               VM_BUG_ON(tlb->swp_freeing);
> >>>> +               goto exit;
> >>>> +       }
> >>>> +       tlb->swp_freeing = 0;
> >>>> +       if (unlikely(!tlb->swp))
> >>>> +               goto exit;
> >>>> +
> >>>> +       __tlb_swap_gather_queuework(tlb, true);
> >>>> +exit:
> >>>> +       atomic_dec(&nr_exiting_processes);
> >>>> +}
> >>>>
> >>>>    static bool tlb_next_batch(struct mmu_gather *tlb)
> >>>>    {
> >>>> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
> >>>>           tlb->local.max  = ARRAY_SIZE(tlb->__pages);
> >>>>           tlb->active     = &tlb->local;
> >>>>           tlb->batch_count = 0;
> >>>> +
> >>>> +       tlb->swp_disable = 1;
> >>>> +       __tlb_swap_gather_mmu(tlb);
> >>>>    #endif
> >>>>           tlb->delayed_rmap = 0;
> >>>>
> >>>> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
> >>>>
> >>>>    #ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>>           tlb_batch_list_free(tlb);
> >>>> +       __tlb_batch_swap_finish(tlb);
> >>>>    #endif
> >>>>           dec_tlb_flush_pending(tlb->mm);
> >>>>    }
> >>>> --
> >>>> 2.39.0
> >>>>
> > Thanks
> > Barry
> Thanks
> Zhiguo
>

Thanks
Barry
Barry Song Sept. 10, 2024, 4:22 a.m. UTC | #9
On Tue, Sep 10, 2024 at 2:44 AM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> > On 2024/9/9 14:49, David Hildenbrand wrote:
> > On 05.08.24 17:36, Zhiguo Jiang wrote:
> >> One of the main reasons for the prolonged exit of the process with
> >> independent mm is the time-consuming release of its swap entries.
> >> The proportion of swap memory occupied by the process increases over
> >> time due to high memory pressure triggering to reclaim anonymous folio
> >> into swapspace, e.g., in Android devices, we found this proportion can
> >> reach 60% or more after a period of time. Additionally, the relatively
> >> lengthy path for releasing swap entries further contributes to the
> >> longer time required to release swap entries.
> >>
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After booting up, start 15 processes first, and then observe the
> >> physical memory size occupied by the last launched process at different
> >> time points.
> >> Example: The process launched last: com.qiyi.video
> >> |  memory type  |  0min  |  1min  |   5min  |   10min  | 15min  |
> >> -------------------------------------------------------------------
> >> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 | 199748  |
> >> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 | 67660  |
> >> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 | 131136  |
> >> |  RssShmem(KB) |   1048 |    984 |     952 |     952  | 952  |
> >> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 | 366488  |
> >> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% | 64.72%  |
> >> Note: min - minute.
> >>
> >> When there are multiple processes with independent mm and the high
> >> memory pressure in system, if the large memory required process is
> >> launched at this time, system will is likely to trigger the
> >> instantaneous
> >> killing of many processes with independent mm. Due to multiple exiting
> >> processes occupying multiple CPU core resources for concurrent
> >> execution,
> >> leading to some issues such as the current non-exiting and important
> >> processes lagging.
> >>
> >> To solve this problem, we have introduced the multiple exiting process
> >> asynchronous swap entries release mechanism, which isolates and caches
> >> swap entries occupied by multiple exiting processes, and hands them over
> >> to an asynchronous kworker to complete the release. This allows the
> >> exiting processes to complete quickly and release CPU resources. We have
> >> validated this modification on the Android products and achieved the
> >> expected benefits.
> >>
> >> Testing Platform: 8GB RAM
> >> Testing procedure:
> >> After restarting the machine, start 15 app processes first, and then
> >> start the camera app processes, we monitor the cold start and preview
> >> time datas of the camera app processes.
> >>
> >> Test datas of camera processes cold start time (unit: millisecond):
> >> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> >> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
> >>
> >> Test datas of camera processes preview time (unit: millisecond):
> >> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> >> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
> >>
> >> Base on the average of the six sets of test datas above, we can see that
> >> the benefit datas of the modified patch:
> >> 1. The cold start time of camera app processes has reduced by about 20%.
> >> 2. The preview time of camera app processes has reduced by about 42%.
> >>
> >> It offers several benefits:
> >> 1. Alleviate the high system cpu loading caused by multiple exiting
> >>     processes running simultaneously.
> >> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>     kworker instead of multiple exiting processes parallel execution.
> >> 3. Release pte_present memory occupied by exiting processes more
> >>     efficiently.
> >>
> >> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> >> ---
> >>   arch/s390/include/asm/tlb.h |   8 +
> >>   include/asm-generic/tlb.h   |  44 ++++++
> >>   include/linux/mm_types.h    |  58 +++++++
> >>   mm/memory.c                 |   3 +-
> >>   mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
> >>   5 files changed, 408 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> >> index e95b2c8081eb..3f681f63390f
> >> --- a/arch/s390/include/asm/tlb.h
> >> +++ b/arch/s390/include/asm/tlb.h
> >> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct
> >> mmu_gather *tlb,
> >>           struct page *page, bool delay_rmap, int page_size);
> >>   static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>           struct page *page, unsigned int nr_pages, bool delay_rmap);
> >> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >> +        swp_entry_t entry, int nr);
> >
> >
> > The problem I am having is that swap entries don't have any
> > intersection with the TLB. It sounds like we're squeezing something
> > into an existing concept (MMU gather) that just doesn't belong in there.
> I referred to the mechanism of batch release in tlb, and perhaps a new
> structure needs to be created to implement this feature.

We already use swap_slots_cache to collect multiple swap entries and free
them in batches. Would it be better to incorporate our new logic there? That
might be a much smaller change and wouldn't need to touch zap_pte_range().
For example, when the slot caches are almost full, wake up the async thread
to free them? Or do you think that cache->free_lock is also a contended lock?
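
To make that concrete, here is a rough sketch of the direction (only
slots_ret, n_ret, free_lock, swp_slots, SWAP_SLOTS_CACHE_SIZE and
swapcache_free_entries() come from mm/swap_slots.c; the swap_free_wq
workqueue, the deferred_free_work member and both function names are
invented for this sketch): once a CPU's return cache fills up, hand the
whole batch to a worker instead of calling swapcache_free_entries() inline.

#include <linux/spinlock.h>
#include <linux/swap_slots.h>
#include <linux/workqueue.h>

static struct workqueue_struct *swap_free_wq;	/* hypothetical */

/*
 * Hypothetical worker: flush one CPU's cached to-be-freed entries.
 * Assumes a new 'struct work_struct deferred_free_work' member were
 * added to struct swap_slots_cache.
 */
static void deferred_free_fn(struct work_struct *work)
{
	struct swap_slots_cache *cache =
		container_of(work, struct swap_slots_cache, deferred_free_work);

	spin_lock_irq(&cache->free_lock);
	swapcache_free_entries(cache->slots_ret, cache->n_ret);
	cache->n_ret = 0;
	spin_unlock_irq(&cache->free_lock);
}

/* Variation on free_swap_slot(): defer the flush instead of doing it inline. */
static void cache_or_defer_free_slot(swp_entry_t entry)
{
	struct swap_slots_cache *cache = raw_cpu_ptr(&swp_slots);

	spin_lock_irq(&cache->free_lock);
	if (cache->n_ret < SWAP_SLOTS_CACHE_SIZE) {
		cache->slots_ret[cache->n_ret++] = entry;
		if (cache->n_ret == SWAP_SLOTS_CACHE_SIZE)
			queue_work(swap_free_wq, &cache->deferred_free_work);
		spin_unlock_irq(&cache->free_lock);
		return;
	}
	spin_unlock_irq(&cache->free_lock);

	/* Cache still full because the worker has not run yet. */
	swapcache_free_entries(&entry, 1);
}

The sketch deliberately skips the fallback paths the real free_swap_slot()
has (e.g. when the cache is not usable), so it only shows where an
asynchronous hand-off could hook in.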

>
> Thanks
> Zhiguo
>

Thanks
Barry
David Hildenbrand Sept. 10, 2024, 8:52 a.m. UTC | #10
On 10.09.24 06:22, Barry Song wrote:
> On Tue, Sep 10, 2024 at 2:44 AM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>>
>> On 2024/9/9 14:49, David Hildenbrand wrote:
>>> On 05.08.24 17:36, Zhiguo Jiang wrote:
>>>> One of the main reasons for the prolonged exit of the process with
>>>> independent mm is the time-consuming release of its swap entries.
>>>> The proportion of swap memory occupied by the process increases over
>>>> time due to high memory pressure triggering to reclaim anonymous folio
>>>> into swapspace, e.g., in Android devices, we found this proportion can
>>>> reach 60% or more after a period of time. Additionally, the relatively
>>>> lengthy path for releasing swap entries further contributes to the
>>>> longer time required to release swap entries.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at different
>>>> time points.
>>>> Example: The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  | 15min  |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 | 199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 | 67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 | 131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  | 952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 | 366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% | 64.72%  |
>>>> Note: min - minute.
>>>>
>>>> When there are multiple processes with independent mm and the high
>>>> memory pressure in system, if the large memory required process is
>>>> launched at this time, system will is likely to trigger the
>>>> instantaneous
>>>> killing of many processes with independent mm. Due to multiple exiting
>>>> processes occupying multiple CPU core resources for concurrent
>>>> execution,
>>>> leading to some issues such as the current non-exiting and important
>>>> processes lagging.
>>>>
>>>> To solve this problem, we have introduced the multiple exiting process
>>>> asynchronous swap entries release mechanism, which isolates and caches
>>>> swap entries occupied by multiple exiting processes, and hands them over
>>>> to an asynchronous kworker to complete the release. This allows the
>>>> exiting processes to complete quickly and release CPU resources. We have
>>>> validated this modification on the Android products and achieved the
>>>> expected benefits.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After restarting the machine, start 15 app processes first, and then
>>>> start the camera app processes, we monitor the cold start and preview
>>>> time datas of the camera app processes.
>>>>
>>>> Test datas of camera processes cold start time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>>
>>>> Test datas of camera processes preview time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>>
>>>> Base on the average of the six sets of test datas above, we can see that
>>>> the benefit datas of the modified patch:
>>>> 1. The cold start time of camera app processes has reduced by about 20%.
>>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>>
>>>> It offers several benefits:
>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>>      processes running simultaneously.
>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>      kworker instead of multiple exiting processes parallel execution.
>>>> 3. Release pte_present memory occupied by exiting processes more
>>>>      efficiently.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>> ---
>>>>    arch/s390/include/asm/tlb.h |   8 +
>>>>    include/asm-generic/tlb.h   |  44 ++++++
>>>>    include/linux/mm_types.h    |  58 +++++++
>>>>    mm/memory.c                 |   3 +-
>>>>    mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>>    5 files changed, 408 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>> index e95b2c8081eb..3f681f63390f
>>>> --- a/arch/s390/include/asm/tlb.h
>>>> +++ b/arch/s390/include/asm/tlb.h
>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct
>>>> mmu_gather *tlb,
>>>>            struct page *page, bool delay_rmap, int page_size);
>>>>    static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>            struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +        swp_entry_t entry, int nr);
>>>
>>>
>>> The problem I am having is that swap entries don't have any
>>> intersection with the TLB. It sounds like we're squeezing something
>>> into an existing concept (MMU gather) that just doesn't belong in there.
>> I referred to the mechanism of batch release in tlb, and perhaps a new
>> structure needs to be created to implement this feature.
> 
> We already use swap_slots_cache to collect multiple swap entries and
> free them in
> batches. Would it be better to incorporate our new logic there?

Yes, very likely.

> might
> be much less
> change and don't need to touch zap_pte_range() ?  for example, while slot_caches
> are almost full, wake up the async thread to free? 

FWIW, the async part over-complicates the whole thing. I agree that if 
possible, we should try to avoid it; maybe it's not easily possible.
zhiguojiang Sept. 10, 2024, 9:22 a.m. UTC | #11
On 2024/9/10 12:18, Barry Song wrote:
> On Tue, Sep 10, 2024 at 2:39 AM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> On 2024/9/9 9:59, Barry Song wrote:
>>> On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang <justinjiang@vivo.com> wrote:
>>>>
>>>> On 2024/9/4 17:16, Barry Song wrote:
>>>>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
>>>>>> One of the main reasons for the prolonged exit of the process with
>>>>>> independent mm is the time-consuming release of its swap entries.
>>>>>> The proportion of swap memory occupied by the process increases over
>>>>>> time due to high memory pressure triggering to reclaim anonymous folio
>>>>>> into swapspace, e.g., in Android devices, we found this proportion can
>>>>>> reach 60% or more after a period of time. Additionally, the relatively
>>>>>> lengthy path for releasing swap entries further contributes to the
>>>>>> longer time required to release swap entries.
>>>>>>
>>>>>> Testing Platform: 8GB RAM
>>>>>> Testing procedure:
>>>>>> After booting up, start 15 processes first, and then observe the
>>>>>> physical memory size occupied by the last launched process at different
>>>>>> time points.
>>>>>> Example: The process launched last: com.qiyi.video
>>>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
>>>>>> -------------------------------------------------------------------
>>>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
>>>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
>>>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
>>>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
>>>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
>>>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
>>>>>> Note: min - minute.
>>>>>>
>>>>>> When there are multiple processes with independent mm and the high
>>>>>> memory pressure in system, if the large memory required process is
>>>>>> launched at this time, system will is likely to trigger the instantaneous
>>>>>> killing of many processes with independent mm. Due to multiple exiting
>>>>>> processes occupying multiple CPU core resources for concurrent execution,
>>>>>> leading to some issues such as the current non-exiting and important
>>>>>> processes lagging.
>>>>>>
>>>>>> To solve this problem, we have introduced the multiple exiting process
>>>>>> asynchronous swap entries release mechanism, which isolates and caches
>>>>>> swap entries occupied by multiple exiting processes, and hands them over
>>>>>> to an asynchronous kworker to complete the release. This allows the
>>>>>> exiting processes to complete quickly and release CPU resources. We have
>>>>>> validated this modification on the Android products and achieved the
>>>>>> expected benefits.
>>>>>>
>>>>>> Testing Platform: 8GB RAM
>>>>>> Testing procedure:
>>>>>> After restarting the machine, start 15 app processes first, and then
>>>>>> start the camera app processes, we monitor the cold start and preview
>>>>>> time datas of the camera app processes.
>>>>>>
>>>>>> Test datas of camera processes cold start time (unit: millisecond):
>>>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>>>>
>>>>>> Test datas of camera processes preview time (unit: millisecond):
>>>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>>>>
>>>>>> Base on the average of the six sets of test datas above, we can see that
>>>>>> the benefit datas of the modified patch:
>>>>>> 1. The cold start time of camera app processes has reduced by about 20%.
>>>>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>>>>
>>>>>> It offers several benefits:
>>>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>>>>       processes running simultaneously.
>>>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>>>       kworker instead of multiple exiting processes parallel execution.
>>>>>> 3. Release pte_present memory occupied by exiting processes more
>>>>>>       efficiently.
>>>>>>
>>>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>>>> ---
>>>>>>     arch/s390/include/asm/tlb.h |   8 +
>>>>>>     include/asm-generic/tlb.h   |  44 ++++++
>>>>>>     include/linux/mm_types.h    |  58 +++++++
>>>>>>     mm/memory.c                 |   3 +-
>>>>>>     mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>>>>     5 files changed, 408 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>>>> index e95b2c8081eb..3f681f63390f
>>>>>> --- a/arch/s390/include/asm/tlb.h
>>>>>> +++ b/arch/s390/include/asm/tlb.h
>>>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>>>>>                    struct page *page, bool delay_rmap, int page_size);
>>>>>>     static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>>>                    struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>>>> +               swp_entry_t entry, int nr);
>>>>>>
>>>>>>     #define tlb_flush tlb_flush
>>>>>>     #define pte_free_tlb pte_free_tlb
>>>>>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>>>            return false;
>>>>>>     }
>>>>>>
>>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>>>> +               swp_entry_t entry, int nr)
>>>>>> +{
>>>>>> +       return false;
>>>>>> +}
>>>>>> +
>>>>>>     static inline void tlb_flush(struct mmu_gather *tlb)
>>>>>>     {
>>>>>>            __tlb_flush_mm_lazy(tlb->mm);
>>>>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>>>>>> index 709830274b75..8b4d516b35b8
>>>>>> --- a/include/asm-generic/tlb.h
>>>>>> +++ b/include/asm-generic/tlb.h
>>>>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
>>>>>>     static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
>>>>>>     #endif
>>>>>>
>>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>>>> +struct mmu_swap_batch {
>>>>>> +       struct mmu_swap_batch *next;
>>>>>> +       unsigned int nr;
>>>>>> +       unsigned int max;
>>>>>> +       encoded_swpentry_t encoded_entrys[];
>>>>>> +};
>>>>>> +
>>>>>> +#define MAX_SWAP_GATHER_BATCH  \
>>>>>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
>>>>>> +
>>>>>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
>>>>>> +
>>>>>> +struct mmu_swap_gather {
>>>>>> +       /*
>>>>>> +        * the asynchronous kworker to batch
>>>>>> +        * release swap entries
>>>>>> +        */
>>>>>> +       struct work_struct free_work;
>>>>>> +
>>>>>> +       /* batch cache swap entries */
>>>>>> +       unsigned int batch_count;
>>>>>> +       struct mmu_swap_batch *active;
>>>>>> +       struct mmu_swap_batch local;
>>>>>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
>>>>>> +};
>>>>>> +
>>>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>>>> +               swp_entry_t entry, int nr);
>>>>>> +#endif
>>>>>> +
>>>>>>     /*
>>>>>>      * struct mmu_gather is an opaque type used by the mm code for passing around
>>>>>>      * any data needed by arch specific code for tlb_remove_page.
>>>>>> @@ -343,6 +374,18 @@ struct mmu_gather {
>>>>>>            unsigned int            vma_exec : 1;
>>>>>>            unsigned int            vma_huge : 1;
>>>>>>            unsigned int            vma_pfn  : 1;
>>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>>>> +       /*
>>>>>> +        * Two states of releasing swap entries
>>>>>> +        * asynchronously:
>>>>>> +        * swp_freeable - have opportunity to
>>>>>> +        * release asynchronously future
>>>>>> +        * swp_freeing - be releasing asynchronously.
>>>>>> +        */
>>>>>> +       unsigned int            swp_freeable : 1;
>>>>>> +       unsigned int            swp_freeing : 1;
>>>>>> +       unsigned int            swp_disable : 1;
>>>>>> +#endif
>>>>>>
>>>>>>            unsigned int            batch_count;
>>>>>>
>>>>>> @@ -354,6 +397,7 @@ struct mmu_gather {
>>>>>>     #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>>>>>>            unsigned int page_size;
>>>>>>     #endif
>>>>>> +       struct mmu_swap_gather *swp;
>>>>>>     #endif
>>>>>>     };
>>>>>>
>>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>>> index 165c58b12ccc..2f66303f1519
>>>>>> --- a/include/linux/mm_types.h
>>>>>> +++ b/include/linux/mm_types.h
>>>>>> @@ -283,6 +283,64 @@ typedef struct {
>>>>>>            unsigned long val;
>>>>>>     } swp_entry_t;
>>>>>>
>>>>>> +/*
>>>>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
>>>>>> + *
>>>>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
>>>>>> + * bit indicating extra context-dependent information. Only used in swp_entry
>>>>>> + * asynchronous release path by mmu_swap_gather.
>>>>>> + */
>>>>>> +typedef struct {
>>>>>> +       unsigned long val;
>>>>>> +} encoded_swpentry_t;
>>>>>> +
>>>>>> +/*
>>>>>> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
>>>>>> + * total number of consecutive swap entries associated with the same folio. If this
>>>>>> + * bit is not set, "nr" is implicitly 1.
>>>>>> + *
>>>>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
>>>>>> + * Bit63 can be used here.
>>>>>> + */
>>>>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
>>>>>> +
>>>>>> +static __always_inline encoded_swpentry_t
>>>>>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
>>>>>> +{
>>>>>> +       encoded_swpentry_t ret;
>>>>>> +
>>>>>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>>>>> +       ret.val = flags | entry.val;
>>>>>> +       return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
>>>>>> +{
>>>>>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>>>>> +}
>>>>>> +
>>>>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
>>>>>> +{
>>>>>> +       swp_entry_t ret;
>>>>>> +
>>>>>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
>>>>>> +       return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
>>>>>> +{
>>>>>> +       encoded_swpentry_t ret;
>>>>>> +
>>>>>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
>>>>>> +       ret.val = nr;
>>>>>> +       return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
>>>>>> +{
>>>>>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
>>>>>> +}
>>>>>> +
>>>>>>     /**
>>>>>>      * struct folio - Represents a contiguous set of bytes.
>>>>>>      * @flags: Identical to the page flags.
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index d6a9dcddaca4..023a8adcb67c
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>>>                            if (!should_zap_cows(details))
>>>>>>                                    continue;
>>>>>>                            rss[MM_SWAPENTS] -= nr;
>>>>>> -                       free_swap_and_cache_nr(entry, nr);
>>>>>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
>>>>>> +                               free_swap_and_cache_nr(entry, nr);
>>>>>>                    } else if (is_migration_entry(entry)) {
>>>>>>                            folio = pfn_swap_entry_folio(entry);
>>>>>>                            if (!should_zap_folio(details, folio))
>>>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>>>> index 99b3e9408aa0..33dc9d1faff9
>>>>>> --- a/mm/mmu_gather.c
>>>>>> +++ b/mm/mmu_gather.c
>>>>>> @@ -9,11 +9,303 @@
>>>>>>     #include <linux/smp.h>
>>>>>>     #include <linux/swap.h>
>>>>>>     #include <linux/rmap.h>
>>>>>> +#include <linux/oom.h>
>>>>>>
>>>>>>     #include <asm/pgalloc.h>
>>>>>>     #include <asm/tlb.h>
>>>>>>
>>>>>>     #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>>>> +/*
>>>>>> + * The swp_entry asynchronous release mechanism for multiple processes with
>>>>>> + * independent mm exiting simultaneously.
>>>>>> + *
>>>>>> + * During the multiple exiting processes releasing their own mm simultaneously,
>>>>>> + * the swap entries in the exiting processes are handled by isolating, caching
>>>>>> + * and handing over to an asynchronous kworker to complete the release.
>>>>>> + *
>>>>>> + * The conditions for the exiting process entering the swp_entry asynchronous
>>>>>> + * release path:
>>>>>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
>>>>>> + *    to alloc struct mmu_swap_gather frequently.
>>>>>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
>>>>> Hi Zhiguo,
>>>>>
>>>>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems that
>>>>> batched swap entry freeing, even with one process, could be a
>>>>> bottleneck for a single
>>>>> process based on the data from this patch:
>>>>>
>>>>> mm: attempt to batch free swap entries for zap_pte_range()
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
>>>>> "munmap bandwidth becomes 3X faster."
>>>>>
>>>>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
>>>> Hi Barry,
>>>>
>>>> Thanks for your comments.
>>>>
>>>> The reason for NR_MIN_EXITING_PROCESSES = 2 is that previously we
>>>> conducted the multiple android apps continuous startup performance
>>>> test on the case of NR_MIN_EXITING_PROCESSES = 1, and the results
>>>> showed that the startup time had deteriorated slightly. However,
>>>> the patch's logic in this test was different from the one I submitted
>>>> to the community, and it may be due to some other issues with the
>>>> previous old patch.
>>>>
>>>> However, we have conducted relevant memory performance tests on this
>>>> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
>>>> the results are better than before the modification. The patches have
>>>> been used on multiple projects.
>>>> For example:
>>>> Test the time consumption and subjective fluency experience of
>>>> launching 30 android apps continuously for two rounds.
>>>> Test machine: RAM 16GB
>>>> |        | time(s) | Frame-droped rate(%) |
>>>> | before | 230.76  |         0.54         |
>>>> | after  | 225.23  |         0.74         |
>>>> We can see that the patch has been optimized 5.53s for startup time and
>>>> 0.2% frame-droped rate and better subjective smoothness experience.
>>>>
>>>> Perhaps the patches submitted to the community has also improved the
>>>> multiple android apps continuous startup performance in the case
>>>> of NR_MIN_EXITING_PROCESSES=1. If necessary, I will conduct relevant
>>>> tests to verify this situation in the future.
>> Thanks Barry for your valuable suggestions.
>>> Using a fixed value like 2 feels more like a workaround than a solid solution.
>>> It would be better if we could eliminate this hack.
>> OK, I will conduct more tests to try to resolve this parameter issue.
>>> Additionally, this type of asynchronous reclamation might struggle to scale
>>> effectively, particularly on NUMA systems with many CPU cores.
>>>
>>> Many kernel threads are per-node, like kswapd. For instance, if we have 100
>>> threads running on 100 CPUs executing zap_pte_range(), your approach, which
>>> relies on a single async thread to reclaim swap entries, might lead to
>>> performance
>>> regressions.
>>>
>>> We might need to consider a more adaptable approach that can evaluate
>>> the machine's
>>> topology and dynamically determine the appropriate number of async
>>> threads, rather
>>> than hard-coding it to just one. Otherwise, there could be ongoing
>>> concerns about
>>> whether this solution is truly applicable to all systems.
>> Can we dynamically determine the number of asynchronous kworkers to be
>> created based on the number of exiting processes at a certain moment,
>> for example, every 8 exiting processes corresponds to one asynchronous
>> kworker?
>>
>> | The number of exiting processes | The number of asynchronous kworkers |
>> |             [1, 8]              |                  1                  |
>> |             [9, 16]             |                  2                  |
>> |               ...               |                 ...                 |
>> |        [8*N+1, 8*(N+1)]         |                 N+1                 |
>> (N >= 0)
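
Expressed in code, the mapping in the table above is just a rounded-up
division; the helper below is illustrative only and says nothing yet about
how those extra kworkers would be created or fed:

/*
 * Illustrative: number of async free kworkers wanted for a given
 * number of exiting processes, one kworker per 8 exiting processes.
 */
static unsigned int nr_swapfree_workers(unsigned int nr_exiting)
{
	return nr_exiting ? DIV_ROUND_UP(nr_exiting, 8) : 0;
}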
> I'm not entirely sure. Another potential approach could be to use a
> dedicated thread
> for each NUMA node, but we would need data to confirm if this is the
> right solution.
I feel that this may be a feasible async approach: the async kworker of
each NUMA node would be responsible for collecting and releasing the swap
entries mapped by the exiting processes running on that node's local CPUs.
I am not sure how many CPUs each NUMA node may contain, and I currently
do not have a NUMA environment to verify this with testing.
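
As a first experiment in that direction, a sketch on top of this patch
(swapfree_wq, tlb->swp, free_work and __tlb_swap_gather_init() come from
the patch above; queue_work_node() and numa_node_id() are existing kernel
helpers) could simply queue each gather on the node the exiting task is
running on:

#include <linux/topology.h>
#include <linux/workqueue.h>

/*
 * Sketch: prefer a kworker on the local NUMA node rather than funnelling
 * every exiting process in the system through one kworker.
 */
static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
{
	queue_work_node(numa_node_id(), swapfree_wq, &tlb->swp->free_work);
	tlb->swp = NULL;
	if (!finish)
		__tlb_swap_gather_init(tlb);
}

For the nodes to actually free in parallel, swapfree_wq would also need to
be created with max_active greater than the current 1, and whether that
scales on large NUMA machines is exactly the data that is still missing.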
>>> Alternatively, we might be able to develop a method to speed up
>>> batched freeing in a
>>> synchronous manner after collecting the mmu_swap_batch. mmu_gather isn't
>>> async, but it can still speed up tlb flush, right?
>> A synchronous manner may have some optimization effect, but it does not
>> seem to significantly reduce the CPU load taken up by multiple exiting
>> processes. In addition, David stated in his latest comment that swap
>> entries seem to be unrelated to the TLB.
>>> For phones with just 8 CPU cores, I definitely like your patch.
>>> However, since we're
>>> aiming for something which can affect all systems, the situation might be more
>>> complex.
>> Yes, this patch still needs further improvement to adapt to all systems.
> If we want to keep things simple, another approach might be to profile the
> hotspots within swap_free_nr(), identifying the most time-consuming parts,
> and then optimize them based on the data we collect from your
> mmu_swap_batch. Then we don't have to make things async. Have you ever
> profiled the hotspots?
The hotspot optimization seems to have already been covered by
__swap_entries_free() in the patches you submitted, which releases
contiguous swap entries at once. Moreover, there may be duplication
between the mmu_swap_batch and swap_slots_cache.
zhiguojiang Sept. 10, 2024, 9:27 a.m. UTC | #12
On 2024/9/10 12:22, Barry Song wrote:
> On Tue, Sep 10, 2024 at 2:44 AM zhiguojiang <justinjiang@vivo.com> wrote:
>>
>>
>> On 2024/9/9 14:49, David Hildenbrand wrote:
>>> On 05.08.24 17:36, Zhiguo Jiang wrote:
>>>> One of the main reasons for the prolonged exit of the process with
>>>> independent mm is the time-consuming release of its swap entries.
>>>> The proportion of swap memory occupied by the process increases over
>>>> time due to high memory pressure triggering to reclaim anonymous folio
>>>> into swapspace, e.g., in Android devices, we found this proportion can
>>>> reach 60% or more after a period of time. Additionally, the relatively
>>>> lengthy path for releasing swap entries further contributes to the
>>>> longer time required to release swap entries.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at different
>>>> time points.
>>>> Example: The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  | 15min  |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 | 199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 | 67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 | 131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  | 952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 | 366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% | 64.72%  |
>>>> Note: min - minute.
>>>>
>>>> When there are multiple processes with independent mm and the high
>>>> memory pressure in system, if the large memory required process is
>>>> launched at this time, system will is likely to trigger the
>>>> instantaneous
>>>> killing of many processes with independent mm. Due to multiple exiting
>>>> processes occupying multiple CPU core resources for concurrent
>>>> execution,
>>>> leading to some issues such as the current non-exiting and important
>>>> processes lagging.
>>>>
>>>> To solve this problem, we have introduced the multiple exiting process
>>>> asynchronous swap entries release mechanism, which isolates and caches
>>>> swap entries occupied by multiple exiting processes, and hands them over
>>>> to an asynchronous kworker to complete the release. This allows the
>>>> exiting processes to complete quickly and release CPU resources. We have
>>>> validated this modification on the Android products and achieved the
>>>> expected benefits.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After restarting the machine, start 15 app processes first, and then
>>>> start the camera app processes, we monitor the cold start and preview
>>>> time datas of the camera app processes.
>>>>
>>>> Test datas of camera processes cold start time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>>
>>>> Test datas of camera processes preview time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>>
>>>> Base on the average of the six sets of test datas above, we can see that
>>>> the benefit datas of the modified patch:
>>>> 1. The cold start time of camera app processes has reduced by about 20%.
>>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>>
>>>> It offers several benefits:
>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>>      processes running simultaneously.
>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>      kworker instead of multiple exiting processes parallel execution.
>>>> 3. Release pte_present memory occupied by exiting processes more
>>>>      efficiently.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>> ---
>>>>    arch/s390/include/asm/tlb.h |   8 +
>>>>    include/asm-generic/tlb.h   |  44 ++++++
>>>>    include/linux/mm_types.h    |  58 +++++++
>>>>    mm/memory.c                 |   3 +-
>>>>    mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>>    5 files changed, 408 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>> index e95b2c8081eb..3f681f63390f
>>>> --- a/arch/s390/include/asm/tlb.h
>>>> +++ b/arch/s390/include/asm/tlb.h
>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct
>>>> mmu_gather *tlb,
>>>>            struct page *page, bool delay_rmap, int page_size);
>>>>    static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>            struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +        swp_entry_t entry, int nr);
>>>
>>> The problem I am having is that swap entries don't have any
>>> intersection with the TLB. It sounds like we're squeezing something
>>> into an existing concept (MMU gather) that just doesn't belong in there.
>> I referred to the mechanism of batch release in tlb, and perhaps a new
>> structure needs to be created to implement this feature.
> We already use swap_slots_cache to collect multiple swap entries and
> free them in
> batches. Would it be better to incorporate our new logic there? might
> be much less
> change and don't need to touch zap_pte_range() ?  for example, while slot_caches
> are almost full, wake up the async thread to free? Or, do you think
> that cache->free_lock
> is also a contended lock?
There should be no need to add the mmu_swap_batch logic to the synchronous
release path, and doing so could lead to duplicated logic. The number of
async threads needed for releasing slot caches also seems hard to determine
on architectures with a large number of CPUs. The cache->free_lock is a
per-CPU spinlock, which disables preemption, so I guess there should be no
lock contention issue with it.

I understand that you want to incorporate this feature in a synchronous
manner, and I think that may be a bit difficult.
>
>> Thanks
>> Zhiguo
>>
> Thanks
> Barry
Thanks
Zhiguo
Barry Song Sept. 10, 2024, 10:11 a.m. UTC | #13
On Tue, Sep 10, 2024 at 9:22 PM zhiguojiang <justinjiang@vivo.com> wrote:
>
>
>
> > On 2024/9/10 12:18, Barry Song wrote:
> > On Tue, Sep 10, 2024 at 2:39 AM zhiguojiang <justinjiang@vivo.com> wrote:
> >>
> >>
> >> On 2024/9/9 9:59, Barry Song wrote:
> >>> On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang <justinjiang@vivo.com> wrote:
> >>>>
> >>>> On 2024/9/4 17:16, Barry Song wrote:
> >>>>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang <justinjiang@vivo.com> wrote:
> >>>>>> One of the main reasons for the prolonged exit of the process with
> >>>>>> independent mm is the time-consuming release of its swap entries.
> >>>>>> The proportion of swap memory occupied by the process increases over
> >>>>>> time due to high memory pressure triggering to reclaim anonymous folio
> >>>>>> into swapspace, e.g., in Android devices, we found this proportion can
> >>>>>> reach 60% or more after a period of time. Additionally, the relatively
> >>>>>> lengthy path for releasing swap entries further contributes to the
> >>>>>> longer time required to release swap entries.
> >>>>>>
> >>>>>> Testing Platform: 8GB RAM
> >>>>>> Testing procedure:
> >>>>>> After booting up, start 15 processes first, and then observe the
> >>>>>> physical memory size occupied by the last launched process at different
> >>>>>> time points.
> >>>>>> Example: The process launched last: com.qiyi.video
> >>>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  |   15min  |
> >>>>>> -------------------------------------------------------------------
> >>>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 |  199748  |
> >>>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 |   67660  |
> >>>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 |  131136  |
> >>>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  |     952  |
> >>>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 |  366488  |
> >>>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% |  64.72%  |
> >>>>>> Note: min - minute.
> >>>>>>
> >>>>>> When there are multiple processes with independent mm and the high
> >>>>>> memory pressure in system, if the large memory required process is
> >>>>>> launched at this time, system will is likely to trigger the instantaneous
> >>>>>> killing of many processes with independent mm. Due to multiple exiting
> >>>>>> processes occupying multiple CPU core resources for concurrent execution,
> >>>>>> leading to some issues such as the current non-exiting and important
> >>>>>> processes lagging.
> >>>>>>
> >>>>>> To solve this problem, we have introduced the multiple exiting process
> >>>>>> asynchronous swap entries release mechanism, which isolates and caches
> >>>>>> swap entries occupied by multiple exiting processes, and hands them over
> >>>>>> to an asynchronous kworker to complete the release. This allows the
> >>>>>> exiting processes to complete quickly and release CPU resources. We have
> >>>>>> validated this modification on the Android products and achieved the
> >>>>>> expected benefits.
> >>>>>>
> >>>>>> Testing Platform: 8GB RAM
> >>>>>> Testing procedure:
> >>>>>> After restarting the machine, start 15 app processes first, and then
> >>>>>> start the camera app processes, we monitor the cold start and preview
> >>>>>> time datas of the camera app processes.
> >>>>>>
> >>>>>> Test datas of camera processes cold start time (unit: millisecond):
> >>>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >>>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
> >>>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
> >>>>>>
> >>>>>> Test datas of camera processes preview time (unit: millisecond):
> >>>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
> >>>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
> >>>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
> >>>>>>
> >>>>>> Base on the average of the six sets of test datas above, we can see that
> >>>>>> the benefit datas of the modified patch:
> >>>>>> 1. The cold start time of camera app processes has reduced by about 20%.
> >>>>>> 2. The preview time of camera app processes has reduced by about 42%.
> >>>>>>
> >>>>>> It offers several benefits:
> >>>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
> >>>>>>       processes running simultaneously.
> >>>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
> >>>>>>       kworker instead of multiple exiting processes parallel execution.
> >>>>>> 3. Release pte_present memory occupied by exiting processes more
> >>>>>>       efficiently.
> >>>>>>
> >>>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> >>>>>> ---
> >>>>>>     arch/s390/include/asm/tlb.h |   8 +
> >>>>>>     include/asm-generic/tlb.h   |  44 ++++++
> >>>>>>     include/linux/mm_types.h    |  58 +++++++
> >>>>>>     mm/memory.c                 |   3 +-
> >>>>>>     mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
> >>>>>>     5 files changed, 408 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> >>>>>> index e95b2c8081eb..3f681f63390f
> >>>>>> --- a/arch/s390/include/asm/tlb.h
> >>>>>> +++ b/arch/s390/include/asm/tlb.h
> >>>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> >>>>>>                    struct page *page, bool delay_rmap, int page_size);
> >>>>>>     static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>>>                    struct page *page, unsigned int nr_pages, bool delay_rmap);
> >>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>>>> +               swp_entry_t entry, int nr);
> >>>>>>
> >>>>>>     #define tlb_flush tlb_flush
> >>>>>>     #define pte_free_tlb pte_free_tlb
> >>>>>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>>>            return false;
> >>>>>>     }
> >>>>>>
> >>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>>>> +               swp_entry_t entry, int nr)
> >>>>>> +{
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>>     static inline void tlb_flush(struct mmu_gather *tlb)
> >>>>>>     {
> >>>>>>            __tlb_flush_mm_lazy(tlb->mm);
> >>>>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> >>>>>> index 709830274b75..8b4d516b35b8
> >>>>>> --- a/include/asm-generic/tlb.h
> >>>>>> +++ b/include/asm-generic/tlb.h
> >>>>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
> >>>>>>     static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
> >>>>>>     #endif
> >>>>>>
> >>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>>>> +struct mmu_swap_batch {
> >>>>>> +       struct mmu_swap_batch *next;
> >>>>>> +       unsigned int nr;
> >>>>>> +       unsigned int max;
> >>>>>> +       encoded_swpentry_t encoded_entrys[];
> >>>>>> +};
> >>>>>> +
> >>>>>> +#define MAX_SWAP_GATHER_BATCH  \
> >>>>>> +       ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
> >>>>>> +
> >>>>>> +#define MAX_SWAP_GATHER_BATCH_COUNT    (10000UL / MAX_SWAP_GATHER_BATCH)
> >>>>>> +
> >>>>>> +struct mmu_swap_gather {
> >>>>>> +       /*
> >>>>>> +        * the asynchronous kworker to batch
> >>>>>> +        * release swap entries
> >>>>>> +        */
> >>>>>> +       struct work_struct free_work;
> >>>>>> +
> >>>>>> +       /* batch cache swap entries */
> >>>>>> +       unsigned int batch_count;
> >>>>>> +       struct mmu_swap_batch *active;
> >>>>>> +       struct mmu_swap_batch local;
> >>>>>> +       encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
> >>>>>> +};
> >>>>>> +
> >>>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>>>> +               swp_entry_t entry, int nr);
> >>>>>> +#endif
> >>>>>> +
> >>>>>>     /*
> >>>>>>      * struct mmu_gather is an opaque type used by the mm code for passing around
> >>>>>>      * any data needed by arch specific code for tlb_remove_page.
> >>>>>> @@ -343,6 +374,18 @@ struct mmu_gather {
> >>>>>>            unsigned int            vma_exec : 1;
> >>>>>>            unsigned int            vma_huge : 1;
> >>>>>>            unsigned int            vma_pfn  : 1;
> >>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>>>> +       /*
> >>>>>> +        * Two states of releasing swap entries
> >>>>>> +        * asynchronously:
> >>>>>> +        * swp_freeable - have opportunity to
> >>>>>> +        * release asynchronously future
> >>>>>> +        * swp_freeing - be releasing asynchronously.
> >>>>>> +        */
> >>>>>> +       unsigned int            swp_freeable : 1;
> >>>>>> +       unsigned int            swp_freeing : 1;
> >>>>>> +       unsigned int            swp_disable : 1;
> >>>>>> +#endif
> >>>>>>
> >>>>>>            unsigned int            batch_count;
> >>>>>>
> >>>>>> @@ -354,6 +397,7 @@ struct mmu_gather {
> >>>>>>     #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
> >>>>>>            unsigned int page_size;
> >>>>>>     #endif
> >>>>>> +       struct mmu_swap_gather *swp;
> >>>>>>     #endif
> >>>>>>     };
> >>>>>>
> >>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>>>> index 165c58b12ccc..2f66303f1519
> >>>>>> --- a/include/linux/mm_types.h
> >>>>>> +++ b/include/linux/mm_types.h
> >>>>>> @@ -283,6 +283,64 @@ typedef struct {
> >>>>>>            unsigned long val;
> >>>>>>     } swp_entry_t;
> >>>>>>
> >>>>>> +/*
> >>>>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
> >>>>>> + *
> >>>>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its the highest
> >>>>>> + * bit indicating extra context-dependent information. Only used in swp_entry
> >>>>>> + * asynchronous release path by mmu_swap_gather.
> >>>>>> + */
> >>>>>> +typedef struct {
> >>>>>> +       unsigned long val;
> >>>>>> +} encoded_swpentry_t;
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * The next item in an encoded_swpentry_t array is the "nr" argument, specifying the
> >>>>>> + * total number of consecutive swap entries associated with the same folio. If this
> >>>>>> + * bit is not set, "nr" is implicitly 1.
> >>>>>> + *
> >>>>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
> >>>>>> + * Bit63 can be used here.
> >>>>>> + */
> >>>>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
> >>>>>> +
> >>>>>> +static __always_inline encoded_swpentry_t
> >>>>>> +encode_swpentry(swp_entry_t entry, unsigned long flags)
> >>>>>> +{
> >>>>>> +       encoded_swpentry_t ret;
> >>>>>> +
> >>>>>> +       VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >>>>>> +       ret.val = flags | entry.val;
> >>>>>> +       return ret;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
> >>>>>> +{
> >>>>>> +       return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
> >>>>>> +{
> >>>>>> +       swp_entry_t ret;
> >>>>>> +
> >>>>>> +       ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> >>>>>> +       return ret;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
> >>>>>> +{
> >>>>>> +       encoded_swpentry_t ret;
> >>>>>> +
> >>>>>> +       VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> >>>>>> +       ret.val = nr;
> >>>>>> +       return ret;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
> >>>>>> +{
> >>>>>> +       return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
> >>>>>> +}
> >>>>>> +
> >>>>>>     /**
> >>>>>>      * struct folio - Represents a contiguous set of bytes.
> >>>>>>      * @flags: Identical to the page flags.
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index d6a9dcddaca4..023a8adcb67c
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >>>>>>                            if (!should_zap_cows(details))
> >>>>>>                                    continue;
> >>>>>>                            rss[MM_SWAPENTS] -= nr;
> >>>>>> -                       free_swap_and_cache_nr(entry, nr);
> >>>>>> +                       if (!__tlb_remove_swap_entries(tlb, entry, nr))
> >>>>>> +                               free_swap_and_cache_nr(entry, nr);
> >>>>>>                    } else if (is_migration_entry(entry)) {
> >>>>>>                            folio = pfn_swap_entry_folio(entry);
> >>>>>>                            if (!should_zap_folio(details, folio))
> >>>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >>>>>> index 99b3e9408aa0..33dc9d1faff9
> >>>>>> --- a/mm/mmu_gather.c
> >>>>>> +++ b/mm/mmu_gather.c
> >>>>>> @@ -9,11 +9,303 @@
> >>>>>>     #include <linux/smp.h>
> >>>>>>     #include <linux/swap.h>
> >>>>>>     #include <linux/rmap.h>
> >>>>>> +#include <linux/oom.h>
> >>>>>>
> >>>>>>     #include <asm/pgalloc.h>
> >>>>>>     #include <asm/tlb.h>
> >>>>>>
> >>>>>>     #ifndef CONFIG_MMU_GATHER_NO_GATHER
> >>>>>> +/*
> >>>>>> + * The swp_entry asynchronous release mechanism for multiple processes with
> >>>>>> + * independent mm exiting simultaneously.
> >>>>>> + *
> >>>>>> + * During the multiple exiting processes releasing their own mm simultaneously,
> >>>>>> + * the swap entries in the exiting processes are handled by isolating, caching
> >>>>>> + * and handing over to an asynchronous kworker to complete the release.
> >>>>>> + *
> >>>>>> + * The conditions for the exiting process entering the swp_entry asynchronous
> >>>>>> + * release path:
> >>>>>> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, avoiding
> >>>>>> + *    to alloc struct mmu_swap_gather frequently.
> >>>>>> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
> >>>>> Hi Zhiguo,
> >>>>>
> >>>>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. It seems
> >>>>> that batched swap entry freeing, even with one process, could be a
> >>>>> bottleneck for a single process based on the data from this patch:
> >>>>>
> >>>>> mm: attempt to batch free swap entries for zap_pte_range()
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
> >>>>> "munmap bandwidth becomes 3X faster."
> >>>>>
> >>>>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
> >>>> Hi Barry,
> >>>>
> >>>> Thanks for your comments.
> >>>>
> >>>> The reason for NR_MIN_EXITING_PROCESSES = 2 is that we previously ran
> >>>> the continuous multi-app startup performance test on Android with
> >>>> NR_MIN_EXITING_PROCESSES = 1, and the results showed that the startup
> >>>> time deteriorated slightly. However, the patch logic used in that test
> >>>> was different from the version I submitted to the community, so the
> >>>> regression may have been caused by other issues in the previous old
> >>>> patch.
> >>>>
> >>>> However, we have run the relevant memory performance tests on the
> >>>> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
> >>>> the results are better than before the modification. The patches have
> >>>> been used on multiple projects.
> >>>> For example:
> >>>> Test the time consumption and the subjective fluency experience of
> >>>> launching 30 Android apps continuously for two rounds.
> >>>> Test machine: RAM 16GB
> >>>> |        | time(s) | Frame-dropped rate(%) |
> >>>> | before | 230.76  |         0.54          |
> >>>> | after  | 225.23  |         0.74          |
> >>>> We can see that the patch improves the startup time by 5.53 s, at the
> >>>> cost of a 0.2% higher frame-dropped rate, and gives a better subjective
> >>>> smoothness experience.
> >>>>
> >>>> Perhaps the patches submitted to the community would also improve the
> >>>> continuous multi-app startup performance on Android in the case of
> >>>> NR_MIN_EXITING_PROCESSES=1. If necessary, I will conduct the relevant
> >>>> tests to verify this in the future.
> >> Thanks Barry for your valuable suggestions.
> >>> Using a fixed value like 2 feels more like a workaround than a solid solution.
> >>> It would be better if we could eliminate this hack.
>> Ok, I will conduct more tests to try to resolve this parameter issue.
> >>> Additionally, this type of asynchronous reclamation might struggle to scale
> >>> effectively, particularly on NUMA systems with many CPU cores.
> >>>
> >>> Many kernel threads are per-node, like kswapd. For instance, if we have
> >>> 100 threads running on 100 CPUs executing zap_pte_range(), your approach,
> >>> which relies on a single async thread to reclaim swap entries, might lead
> >>> to performance regressions.
> >>>
> >>> We might need to consider a more adaptable approach that can evaluate the
> >>> machine's topology and dynamically determine the appropriate number of
> >>> async threads, rather than hard-coding it to just one. Otherwise, there
> >>> could be ongoing concerns about whether this solution is truly applicable
> >>> to all systems.
>> Can we dynamically determine the number of asynchronous kworkers to be
>> created based on the number of exiting processes at a given moment, for
>> example one asynchronous kworker for every 8 exiting processes, as in
>> the table and the sketch below?
> >>
>> | The number of exiting processes | The number of asynchronous kworkers |
>> |             [1, 8]              |                  1                  |
>> |             [9, 16]             |                  2                  |
>> |               ...               |                 ...                 |
>> |        [8*N+1, 8*(N+1)]         |                 N+1                 |
> >> N >= 0
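
For illustration only, a minimal sketch of the mapping in the table above;
nr_async_swapfree_workers() is a hypothetical helper, not part of the patch,
and DIV_ROUND_UP() is the existing kernel macro from linux/math.h.

#include <linux/math.h>

/*
 * Hypothetical sketch: map the current number of exiting processes to a
 * number of asynchronous kworkers, one kworker per 8 exiting processes.
 */
static unsigned int nr_async_swapfree_workers(unsigned int nr_exiting)
{
	/* [1, 8] -> 1, [9, 16] -> 2, ..., [8*N+1, 8*(N+1)] -> N+1 */
	return nr_exiting ? DIV_ROUND_UP(nr_exiting, 8) : 0;
}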
> > I'm not entirely sure. Another potential approach could be to use a
> > dedicated thread for each NUMA node, but we would need data to confirm if
> > this is the right solution.
> I feel this may be a feasible async approach, where the async kworker
> of each NUMA node is responsible for maintaining and releasing the swap
> entries mapped by the exiting processes running on that node's CPUs
> (a rough sketch follows below). I am not sure how many CPUs each NUMA
> node can contain, and I currently do not have a NUMA environment to
> verify this by testing.
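
As a rough sketch only of the per-node idea, not part of the patch: the
unbound swapfree_wq created by this series could have the gather's free_work
queued on the exiting task's NUMA node via the existing queue_work_node()
helper; queue_swap_free_on_local_node() below is hypothetical.

#include <linux/topology.h>
#include <linux/workqueue.h>

/*
 * Hypothetical sketch: queue the swap-entry free work of an exiting task on
 * its local NUMA node, so each node's unbound workers handle the swap
 * entries of tasks that ran on that node.
 */
static void queue_swap_free_on_local_node(struct workqueue_struct *wq,
					  struct mmu_swap_gather *swap_gather)
{
	/* queue_work_node() prefers a worker running on the given node. */
	queue_work_node(numa_node_id(), wq, &swap_gather->free_work);
}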

Yes, I get it. As someone working on phones, we don’t have NUMA machines,
but our code still impacts them. That’s what makes things so tricky:-)

> >>> Alternatively, we might be able to develop a method to speed up batched
> >>> freeing in a synchronous manner after collecting the mmu_swap_batch.
> >>> mmu_gather isn't async, but it can still speed up tlb flush, right?
>> The synchronous approach may have some optimization effect, but its
>> effect on the CPU load taken up by multiple exiting processes does not
>> seem significant. In addition, David stated in his latest comment that
>> swap entries seem unrelated to the TLB.
> >>> For phones with just 8 CPU cores, I definitely like your patch. However,
> >>> since we're aiming for something which can affect all systems, the
> >>> situation might be more complex.
>> Yes, this patch still needs further improvement to adapt to all systems.
> > If we want to keep things simple, another approach might be to profile the
> > hotspots within swap_free_nr(), identifying the most time-consuming parts,
> > and then optimize them based on the data we collect from your
> > mmu_swap_batch. Then we don't have to make things async. Have you ever
> > profiled the hotspots?
> The hotspot optimization seems to have already been done in
> __swap_entries_free() in the patches you submitted, which releases
> consecutive swap entries at once. Moreover, there may be duplication
> between mmu_swap_batch and the swap slots cache.

Patch

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index e95b2c8081eb..3f681f63390f
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,6 +28,8 @@  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 		struct page *page, bool delay_rmap, int page_size);
 static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 		struct page *page, unsigned int nr_pages, bool delay_rmap);
+static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -69,6 +71,12 @@  static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
 	return false;
 }
 
+static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr)
+{
+	return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b75..8b4d516b35b8
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -294,6 +294,37 @@  extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
 static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
 #endif
 
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
+struct mmu_swap_batch {
+	struct mmu_swap_batch *next;
+	unsigned int nr;
+	unsigned int max;
+	encoded_swpentry_t encoded_entrys[];
+};
+
+#define MAX_SWAP_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
+
+#define MAX_SWAP_GATHER_BATCH_COUNT	(10000UL / MAX_SWAP_GATHER_BATCH)
+
+struct mmu_swap_gather {
+	/*
+	 * the asynchronous kworker to batch
+	 * release swap entries
+	 */
+	struct work_struct free_work;
+
+	/* batch cache swap entries */
+	unsigned int batch_count;
+	struct mmu_swap_batch *active;
+	struct mmu_swap_batch local;
+	encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
+};
+
+bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+		swp_entry_t entry, int nr);
+#endif
+
 /*
  * struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
@@ -343,6 +374,18 @@  struct mmu_gather {
 	unsigned int		vma_exec : 1;
 	unsigned int		vma_huge : 1;
 	unsigned int		vma_pfn  : 1;
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
+	/*
+	 * Two states of releasing swap entries
+	 * asynchronously:
+	 * swp_freeable - may release swap entries
+	 * asynchronously in the future.
+	 * swp_freeing - is releasing asynchronously.
+	 */
+	unsigned int		swp_freeable : 1;
+	unsigned int		swp_freeing : 1;
+	unsigned int		swp_disable : 1;
+#endif
 
 	unsigned int		batch_count;
 
@@ -354,6 +397,7 @@  struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+	struct mmu_swap_gather *swp;
 #endif
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 165c58b12ccc..2f66303f1519
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -283,6 +283,64 @@  typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/*
+ * encoded_swpentry_t - a type marking the encoded swp_entry_t.
+ *
+ * An 'encoded_swpentry_t' represents a 'swp_entry_t' with its highest
+ * bit indicating extra context-dependent information. Only used in the
+ * swp_entry asynchronous release path by mmu_swap_gather.
+ */
+typedef struct {
+	unsigned long val;
+} encoded_swpentry_t;
+
+/*
+ * When this bit is set, the next item in an encoded_swpentry_t array is the "nr"
+ * value, specifying the total number of consecutive swap entries associated with
+ * the same folio. If this bit is not set, "nr" is implicitly 1.
+ *
+ * Refer to include/asm/pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
+ * Bit 63 can be used here.
+ */
+#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LONG - 1))
+
+static __always_inline encoded_swpentry_t
+encode_swpentry(swp_entry_t entry, unsigned long flags)
+{
+	encoded_swpentry_t ret;
+
+	VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
+	ret.val = flags | entry.val;
+	return ret;
+}
+
+static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
+{
+	return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
+}
+
+static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
+{
+	swp_entry_t ret;
+
+	ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
+	return ret;
+}
+
+static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
+{
+	encoded_swpentry_t ret;
+
+	VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
+	ret.val = nr;
+	return ret;
+}
+
+static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
+{
+	return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
diff --git a/mm/memory.c b/mm/memory.c
index d6a9dcddaca4..023a8adcb67c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1650,7 +1650,8 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (!should_zap_cows(details))
 				continue;
 			rss[MM_SWAPENTS] -= nr;
-			free_swap_and_cache_nr(entry, nr);
+			if (!__tlb_remove_swap_entries(tlb, entry, nr))
+				free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
 			if (!should_zap_folio(details, folio))
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0..33dc9d1faff9
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,11 +9,303 @@ 
 #include <linux/smp.h>
 #include <linux/swap.h>
 #include <linux/rmap.h>
+#include <linux/oom.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
+/*
+ * The swp_entry asynchronous release mechanism for multiple processes with
+ * independent mm exiting simultaneously.
+ *
+ * During the multiple exiting processes releasing their own mm simultaneously,
+ * the swap entries in the exiting processes are handled by isolating, caching
+ * and handing over to an asynchronous kworker to complete the release.
+ *
+ * The conditions for the exiting process entering the swp_entry asynchronous
+ * release path:
+ * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, to avoid
+ *    allocating struct mmu_swap_gather frequently.
+ * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.
+ *
+ * Since the number of exiting processes is evaluated dynamically, an exiting
+ * process may start to enter the swp_entry asynchronous release either at
+ * the beginning or in the middle of its own swp_entry release
+ * path.
+ *
+ * Once an exiting process has entered the swp_entry asynchronous release, all
+ * of its remaining swap entries should, in principle, be fully released by the
+ * asynchronous kworker.
+ *
+ * The function of the swp_entry asynchronous release:
+ * 1. Alleviate the high system cpu load caused by multiple exiting processes
+ *    running simultaneously.
+ * 2. Reduce lock competition in swap entry free path by an asynchronous kworker
+ *    instead of multiple exiting processes parallel execution.
+ * 3. Release pte_present memory occupied by exiting processes more efficiently.
+ */
+
+/*
+ * The min number of exiting processes required for swp_entry asynchronous release
+ */
+#define NR_MIN_EXITING_PROCESSES 2
+
+static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
+static struct kmem_cache *swap_gather_cachep;
+static struct workqueue_struct *swapfree_wq;
+static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
+
+static int __init tlb_swap_async_free_setup(void)
+{
+	swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
+		WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
+	if (!swapfree_wq)
+		goto fail;
+
+	swap_gather_cachep = kmem_cache_create("swap_gather",
+		sizeof(struct mmu_swap_gather),
+		0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
+		NULL);
+	if (!swap_gather_cachep)
+		goto kcache_fail;
+
+	static_branch_disable(&tlb_swap_asyncfree_disabled);
+	return 0;
+
+kcache_fail:
+	destroy_workqueue(swapfree_wq);
+fail:
+	return -ENOMEM;
+}
+postcore_initcall(tlb_swap_async_free_setup);
+
+static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
+{
+	struct mmu_swap_batch *swap_batch, *next;
+
+	for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
+		next = swap_batch->next;
+		free_page((unsigned long)swap_batch);
+	}
+	swap_gather->local.next = NULL;
+	kmem_cache_free(swap_gather_cachep, swap_gather);
+}
+
+static void tlb_swap_async_free_work(struct work_struct *w)
+{
+	int i, nr_multi, nr_free;
+	swp_entry_t start_entry;
+	struct mmu_swap_batch *swap_batch;
+	struct mmu_swap_gather *swap_gather = container_of(w,
+		struct mmu_swap_gather, free_work);
+
+	/* Release swap entries cached in mmu_swap_batch. */
+	for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
+	    swap_batch = swap_batch->next) {
+		nr_free = 0;
+		for (i = 0; i < swap_batch->nr; i++) {
+			if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
+			    ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
+				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
+				nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
+				free_swap_and_cache_nr(start_entry, nr_multi);
+				nr_free += 2;
+			} else {
+				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
+				free_swap_and_cache_nr(start_entry, 1);
+				nr_free++;
+			}
+		}
+		swap_batch->nr -= nr_free;
+		VM_BUG_ON(swap_batch->nr);
+	}
+	__tlb_swap_gather_free(swap_gather);
+}
+
+static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
+{
+	/*
+	 * Only the exiting processes with the MM_SWAPENTS counter >=
+	 * SWAP_CLUSTER_MAX have the opportunity to release their swap
+	 * entries by asynchronous kworker.
+	 */
+	if (!task_is_dying() ||
+	    get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
+		return true;
+
+	atomic_inc(&nr_exiting_processes);
+	if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
+		tlb->swp_freeable = 1;
+	else
+		tlb->swp_freeing = 1;
+
+	return false;
+}
+
+/**
+ * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
+ * for swp_entry tear-down.
+ * @tlb: the mmu_gather that the mmu_swap_gather structure belongs to
+ */
+static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
+{
+	tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
+	if (unlikely(!tlb->swp))
+		return false;
+
+	tlb->swp->local.next  = NULL;
+	tlb->swp->local.nr    = 0;
+	tlb->swp->local.max   = ARRAY_SIZE(tlb->swp->__encoded_entrys);
+
+	tlb->swp->active      = &tlb->swp->local;
+	tlb->swp->batch_count = 0;
+
+	INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
+	return true;
+}
+
+static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
+{
+	if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
+		return;
+
+	tlb->swp = NULL;
+	tlb->swp_freeable = 0;
+	tlb->swp_freeing = 0;
+	tlb->swp_disable = 0;
+
+	if (__tlb_swap_gather_mmu_check(tlb))
+		return;
+
+	/*
+	 * If the exiting process meets the conditions of
+	 * swp_entry asynchronous release, an mmu_swap_gather
+	 * structure will be initialized.
+	 */
+	if (tlb->swp_freeing)
+		__tlb_swap_gather_init(tlb);
+}
+
+static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
+{
+	queue_work(swapfree_wq, &tlb->swp->free_work);
+	tlb->swp = NULL;
+	if (!finish)
+		__tlb_swap_gather_init(tlb);
+}
+
+static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
+{
+	struct mmu_swap_batch *swap_batch;
+
+	if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
+		goto free;
+
+	swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
+	if (unlikely(!swap_batch))
+		goto free;
+
+	swap_batch->next = NULL;
+	swap_batch->nr   = 0;
+	swap_batch->max  = MAX_SWAP_GATHER_BATCH;
+
+	tlb->swp->active->next = swap_batch;
+	tlb->swp->active = swap_batch;
+	tlb->swp->batch_count++;
+	return true;
+free:
+	/* batch move to wq */
+	__tlb_swap_gather_queuework(tlb, false);
+	return false;
+}
+
+/**
+ * __tlb_remove_swap_entries - isolate the swap entries of an exiting
+ * process and batch-cache them in struct mmu_swap_batch.
+ * @tlb: the current mmu_gather
+ * @entry: the first swp_entry to be isolated and cached
+ * @nr: the number of consecutive entries starting from @entry.
+ */
+bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
+			     swp_entry_t entry, int nr)
+{
+	struct mmu_swap_batch *swap_batch;
+	unsigned long flags = 0;
+	bool ret = false;
+
+	if (tlb->swp_disable)
+		return ret;
+
+	if (!tlb->swp_freeable && !tlb->swp_freeing)
+		return ret;
+
+	if (tlb->swp_freeable) {
+		if (atomic_read(&nr_exiting_processes) <
+		    NR_MIN_EXITING_PROCESSES)
+			return ret;
+		/*
+		 * If the current number of exiting processes
+		 * is >= NR_MIN_EXITING_PROCESSES, the exiting
+		 * process with swp_freeable state will enter
+		 * swp_freeing state to start releasing its
+		 * remaining swap entries by the asynchronous
+		 * kworker.
+		 */
+		tlb->swp_freeable = 0;
+		tlb->swp_freeing = 1;
+	}
+
+	VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
+	if (!tlb->swp && !__tlb_swap_gather_init(tlb))
+		return ret;
+
+	swap_batch = tlb->swp->active;
+	if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
+		__tlb_swap_gather_queuework(tlb, false);
+		return ret;
+	}
+
+	if (likely(nr == 1)) {
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
+	} else {
+		flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
+		swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
+	}
+	ret = true;
+
+	if (swap_batch->nr >= swap_batch->max - 1) {
+		if (!__tlb_swap_next_batch(tlb))
+			goto exit;
+		swap_batch = tlb->swp->active;
+	}
+	VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
+exit:
+	return ret;
+}
+
+static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
+{
+	if (tlb->swp_disable)
+		return;
+
+	if (!tlb->swp_freeable && !tlb->swp_freeing)
+		return;
+
+	if (tlb->swp_freeable) {
+		tlb->swp_freeable = 0;
+		VM_BUG_ON(tlb->swp_freeing);
+		goto exit;
+	}
+	tlb->swp_freeing = 0;
+	if (unlikely(!tlb->swp))
+		goto exit;
+
+	__tlb_swap_gather_queuework(tlb, true);
+exit:
+	atomic_dec(&nr_exiting_processes);
+}
 
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
@@ -386,6 +678,9 @@  static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
+
+	tlb->swp_disable = 1;
+	__tlb_swap_gather_mmu(tlb);
 #endif
 	tlb->delayed_rmap = 0;
 
@@ -466,6 +761,7 @@  void tlb_finish_mmu(struct mmu_gather *tlb)
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb_batch_list_free(tlb);
+	__tlb_batch_swap_finish(tlb);
 #endif
 	dec_tlb_flush_pending(tlb->mm);
 }
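
For reference, a minimal illustration (not part of the patch) of how the
helpers above pack a run of consecutive swap entries into two
encoded_swpentry_t slots and later decode them, mirroring what
__tlb_remove_swap_entries() and tlb_swap_async_free_work() do; the function
name example_encode_decode() is made up for this sketch.

/*
 * Illustrative sketch only: encode a run of 16 consecutive swap entries
 * starting at 'entry' into two slots, then decode and free the whole run.
 */
static void example_encode_decode(swp_entry_t entry)
{
	encoded_swpentry_t slots[2];

	slots[0] = encode_swpentry(entry, ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
	slots[1] = encode_nr_swpentrys(16);

	if (encoded_swpentry_flags(slots[0]) & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) {
		swp_entry_t start = encoded_swpentry_data(slots[0]);
		unsigned long nr = encoded_nr_swpentrys(slots[1]);

		free_swap_and_cache_nr(start, nr);
	}
}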