
mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist

Message ID 20240701142046.6050-1-laoar.shao@gmail.com (mailing list archive)
State New
Series mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist

Commit Message

Yafang Shao July 1, 2024, 2:20 p.m. UTC
Currently, we're encountering latency spikes in our container environment
when a specific container with multiple Python-based tasks exits. These
tasks may hold the zone->lock for an extended period, significantly
impacting latency for other containers attempting to allocate memory.

As a workaround, we've found that minimizing the pagelist size, such as
setting it to 4 times the batch size, can help mitigate these spikes.
However, managing vm.percpu_pagelist_high_fraction across a large fleet of
servers poses challenges due to variations in CPU counts, NUMA nodes, and
physical memory capacities.

To enhance practicality, we propose allowing the setting of -1 for
vm.percpu_pagelist_high_fraction to designate a minimum pagelist size.

Furthermore, considering the challenges associated with utilizing
vm.percpu_pagelist_high_fraction, it would be beneficial to introduce a
more intuitive parameter, vm.percpu_pagelist_high_size, that would permit
direct specification of the pagelist size as a multiple of the batch size.
This methodology would mirror the functionality of vm.dirty_ratio and
vm.dirty_bytes, providing users with greater flexibility and control.
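
As an illustration only (not implemented by this patch), such a knob could
be wired into the pcp sizing along these lines; the helper name and the
reuse of the existing four-times-batch floor are assumptions:

static int zone_highsize_from_multiple(int batch, int high_size_multiple)
{
	/*
	 * A vm.percpu_pagelist_high_size knob would express pcp->high
	 * directly as a multiple of the batch size.  Keep the existing
	 * floor of four times the batch size.
	 */
	if (high_size_multiple < 4)
		high_size_multiple = 4;

	return batch * high_size_multiple;
}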

We have discussed the possibility of introducing multiple small zones to
mitigate the contention on the zone->lock [0], but this approach is likely
to require a longer-term implementation effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/admin-guide/sysctl/vm.rst | 4 ++++
 mm/page_alloc.c                         | 8 ++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Andrew Morton July 2, 2024, 2:51 a.m. UTC | #1
On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:

> Currently, we're encountering latency spikes in our container environment
> when a specific container with multiple Python-based tasks exits. These
> tasks may hold the zone->lock for an extended period, significantly
> impacting latency for other containers attempting to allocate memory.

Is this locking issue well understood?  Is anyone working on it?  A
reasonably detailed description of the issue and a description of any
ongoing work would be helpful here.

> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
>  page lists. A user can specify a number like 100 to allocate 1/100th of
>  each zone between per-cpu lists.
>  
> +The minimum number of pages that can be stored in per-CPU page lists is
> +four times the batch value. By writing '-1' to this sysctl, you can set
> +this minimum value.

I suggest we also describe why an operator would want to set this, and
the expected effects of that action.

>  The batch value of each per-cpu page list remains the same regardless of
>  the value of the high fraction so allocation latencies are unaffected.
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..e7313f9d704b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
>  	int nr_split_cpus;
>  	unsigned long total_pages;
>  
> +	/* Setting -1 to set the minimum pagelist size, four times the batch size */

Some old-timers still use 80-column xterms ;)

> +	if (high_fraction == -1)
> +		return batch << 2;
> +
>  	if (!high_fraction) {
>  		/*
>  		 * By default, the high value of the pcp is based on the zone
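
For illustration only (not part of the posted patch), the new hunk could be
reflowed to stay within 80 columns while keeping the same behaviour:

	/*
	 * Writing -1 selects the minimum pagelist size, which is four
	 * times the batch size.
	 */
	if (high_fraction == -1)
		return batch << 2;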
Yafang Shao July 2, 2024, 6:37 a.m. UTC | #2
On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
>
> Is this locking issue well understood?  Is anyone working on it?  A
> reasonably detailed description of the issue and a description of any
> ongoing work would be helpful here.

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. The
workload is organized as separate processes rather than threads due
to the Python Global Interpreter Lock (GIL) being a bottleneck in a
multi-threaded setup. Upon the exit of these containers, other
containers hosted on the same machine experience significant latency
spikes.

Our investigation using perf tracing revealed that the root cause of
these spikes is the simultaneous execution of exit_mmap() by each of
the exiting processes. This concurrent access to the zone->lock
results in contention, which becomes a hotspot and negatively impacts
performance. The perf results clearly indicate this contention as a
primary contributor to the observed latency issues.

+   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
-   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
   - 76.97% exit_mmap
      - 58.58% unmap_vmas
         - 58.55% unmap_single_vma
            - unmap_page_range
               - 58.32% zap_pte_range
                  - 42.88% tlb_flush_mmu
                     - 42.76% free_pages_and_swap_cache
                        - 41.22% release_pages
                           - 33.29% free_unref_page_list
                              - 32.37% free_unref_page_commit
                                 - 31.64% free_pcppages_bulk
                                    + 28.65% _raw_spin_lock
                                      1.28% __list_del_entry_valid
                           + 3.25% folio_lruvec_lock_irqsave
                           + 0.75% __mem_cgroup_uncharge_list
                             0.60% __mod_lruvec_state
                          1.07% free_swap_cache
                  + 11.69% page_remove_rmap
                    0.64% __mod_lruvec_page_state
      - 17.34% remove_vma
         - 17.25% vm_area_free
            - 17.23% kmem_cache_free
               - 17.15% __slab_free
                  - 14.56% discard_slab
                       free_slab
                       __free_slab
                       __free_pages
                     - free_unref_page
                        - 13.50% free_unref_page_commit
                           - free_pcppages_bulk
                              + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint we can find the detailed stack:

          <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
           <...>-1540432 [224] d..3. 618048.023887: <stack trace>
 => free_pcppages_bulk
 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_flush_mmu
 => zap_pte_range
 => unmap_page_range
 => unmap_single_vma
 => unmap_vmas
 => exit_mmap
 => mmput
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => exit_to_user_mode_prepare
 => syscall_exit_to_user_mode
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues are equipped with impressive
hardware specifications, including 256 CPUs and 1TB of memory, all
within a single NUMA node. The zoneinfo is as follows:

Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
...
...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63

...

The high value (6392) is around 100 times the batch size (63).

We also traced the latency associated with the free_pcppages_bulk()
function during the container exit process:

19:48:54
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 148      |*****************                       |
       512 -> 1023       : 334      |****************************************|
      1024 -> 2047       : 33       |***                                     |
      2048 -> 4095       : 5        |                                        |
      4096 -> 8191       : 7        |                                        |
      8192 -> 16383      : 12       |*                                       |
     16384 -> 32767      : 30       |***                                     |
     32768 -> 65535      : 21       |**                                      |
     65536 -> 131071     : 15       |*                                       |
    131072 -> 262143     : 27       |***                                     |
    262144 -> 524287     : 84       |**********                              |
    524288 -> 1048575    : 203      |************************                |
   1048576 -> 2097151    : 284      |**********************************      |
   2097152 -> 4194303    : 327      |*************************************** |
   4194304 -> 8388607    : 215      |*************************               |
   8388608 -> 16777215   : 116      |*************                           |
  16777216 -> 33554431   : 47       |*****                                   |
  33554432 -> 67108863   : 8        |                                        |
  67108864 -> 134217727  : 3        |                                        |

avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920

The latency can reach tens of milliseconds.

By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
minimum pagelist high at 4 times the batch size, we were able to
significantly reduce the latency associated with the
free_pcppages_bulk() function during container exits:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 120      |                                        |
       256 -> 511        : 365      |*                                       |
       512 -> 1023       : 201      |                                        |
      1024 -> 2047       : 103      |                                        |
      2048 -> 4095       : 84       |                                        |
      4096 -> 8191       : 87       |                                        |
      8192 -> 16383      : 4777     |**************                          |
     16384 -> 32767      : 10572    |*******************************         |
     32768 -> 65535      : 13544    |****************************************|
     65536 -> 131071     : 12723    |*************************************   |
    131072 -> 262143     : 8604     |*************************               |
    262144 -> 524287     : 3659     |**********                              |
    524288 -> 1048575    : 921      |**                                      |
   1048576 -> 2097151    : 122      |                                        |
   2097152 -> 4194303    : 5        |                                        |

avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925

After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
knob to set the minimum pagelist high at a level that effectively
mitigated latency issues, we observed that other containers were no
longer experiencing similar complaints. As a result, we decided to
implement this tuning as a permanent workaround and have deployed it
across all clusters of servers where these containers may run.

>
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
> >  page lists. A user can specify a number like 100 to allocate 1/100th of
> >  each zone between per-cpu lists.
> >
> > +The minimum number of pages that can be stored in per-CPU page lists is
> > +four times the batch value. By writing '-1' to this sysctl, you can set
> > +this minimum value.
>
> I suggest we also describe why an operator would want to set this, and
> the expected effects of that action.

will improve it.

>
> >  The batch value of each per-cpu page list remains the same regardless of
> >  the value of the high fraction so allocation latencies are unaffected.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..e7313f9d704b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
> >       int nr_split_cpus;
> >       unsigned long total_pages;
> >
> > +     /* Setting -1 to set the minimum pagelist size, four times the batch size */
>
> Some old-timers still use 80-column xterms ;)

will change it.


Regards
Yafang
Huang, Ying July 2, 2024, 7:23 a.m. UTC | #3
Hi, Yafang,

Yafang Shao <laoar.shao@gmail.com> writes:

> Currently, we're encountering latency spikes in our container environment
> when a specific container with multiple Python-based tasks exits.

Can you show some data?  On which kind of machine, how long is the
latency?

> These
> tasks may hold the zone->lock for an extended period, significantly
> impacting latency for other containers attempting to allocate memory.

So, the allocation latency is influenced, not application exit latency?
Could you measure the run time of free_pcppages_bulk()? This can be done
via the ftrace function_graph tracer.  We want to check whether this is a
common issue.

In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
avoid too long latency"), we have measured the allocation/free latency
for different CONFIG_PCP_BATCH_SCALE_MAX.  The target in the commit is
to control the latency <= 100us.
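
A simplified sketch of the idea behind that knob (not the exact
mm/page_alloc.c code): the number of pages freed under a single zone->lock
hold is capped at batch << CONFIG_PCP_BATCH_SCALE_MAX rather than growing
toward pcp->high:

static int pcp_free_batch(int batch, int free_factor)
{
	/* free_factor grows while the pcp is being drained rapidly. */
	if (free_factor > CONFIG_PCP_BATCH_SCALE_MAX)
		free_factor = CONFIG_PCP_BATCH_SCALE_MAX;

	return batch << free_factor;
}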

> As a workaround, we've found that minimizing the pagelist size, such as
> setting it to 4 times the batch size, can help mitigate these spikes.
> However, managing vm.percpu_pagelist_high_fraction across a large fleet of
> servers poses challenges due to variations in CPU counts, NUMA nodes, and
> physical memory capacities.
>
> To enhance practicality, we propose allowing the setting of -1 for
> vm.percpu_pagelist_high_fraction to designate a minimum pagelist size.

If it is really necessary, can we just use a large enough number for
vm.percpu_pagelist_high_fraction?  For example, (1 << 30)?

> Furthermore, considering the challenges associated with utilizing
> vm.percpu_pagelist_high_fraction, it would be beneficial to introduce a
> more intuitive parameter, vm.percpu_pagelist_high_size, that would permit
> direct specification of the pagelist size as a multiple of the batch size.
> This methodology would mirror the functionality of vm.dirty_ratio and
> vm.dirty_bytes, providing users with greater flexibility and control.
>
> We have discussed the possibility of introducing multiple small zones to
> mitigate the contention on the zone->lock[0], but this approach is likely
> to require a longer-term implementation effort.
>
> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>

[snip]

--
Best Regards,
Huang, Ying
Huang, Ying July 2, 2024, 9:08 a.m. UTC | #4
Yafang Shao <laoar.shao@gmail.com> writes:

> On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> > Currently, we're encountering latency spikes in our container environment
>> > when a specific container with multiple Python-based tasks exits. These
>> > tasks may hold the zone->lock for an extended period, significantly
>> > impacting latency for other containers attempting to allocate memory.
>>
>> Is this locking issue well understood?  Is anyone working on it?  A
>> reasonably detailed description of the issue and a description of any
>> ongoing work would be helpful here.
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Our investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
> +   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
> -   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
>    - 76.97% exit_mmap
>       - 58.58% unmap_vmas
>          - 58.55% unmap_single_vma
>             - unmap_page_range
>                - 58.32% zap_pte_range
>                   - 42.88% tlb_flush_mmu
>                      - 42.76% free_pages_and_swap_cache
>                         - 41.22% release_pages
>                            - 33.29% free_unref_page_list
>                               - 32.37% free_unref_page_commit
>                                  - 31.64% free_pcppages_bulk
>                                     + 28.65% _raw_spin_lock
>                                       1.28% __list_del_entry_valid
>                            + 3.25% folio_lruvec_lock_irqsave
>                            + 0.75% __mem_cgroup_uncharge_list
>                              0.60% __mod_lruvec_state
>                           1.07% free_swap_cache
>                   + 11.69% page_remove_rmap
>                     0.64% __mod_lruvec_page_state
>       - 17.34% remove_vma
>          - 17.25% vm_area_free
>             - 17.23% kmem_cache_free
>                - 17.15% __slab_free
>                   - 14.56% discard_slab
>                        free_slab
>                        __free_slab
>                        __free_pages
>                      - free_unref_page
>                         - 13.50% free_unref_page_commit
>                            - free_pcppages_bulk
>                               + 13.44% _raw_spin_lock
>
> By enabling the mm_page_pcpu_drain tracepoint we can find the detailed stack:
>
>           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>  => free_pcppages_bulk
>  => free_unref_page_commit
>  => free_unref_page_list
>  => release_pages
>  => free_pages_and_swap_cache
>  => tlb_flush_mmu
>  => zap_pte_range
>  => unmap_page_range
>  => unmap_single_vma
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => get_signal
>  => arch_do_signal_or_restart
>  => exit_to_user_mode_prepare
>  => syscall_exit_to_user_mode
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
>
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The high is around 100 times the batch size.
>
> We also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
> 19:48:54
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>
> The latency can reach tens of milliseconds.
>
> By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> minimum pagelist high at 4 times the batch size, we were able to
> significantly reduce the latency associated with the
> free_pcppages_bulk() function during container exits.:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>
> After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> knob to set the minimum pagelist high at a level that effectively
> mitigated latency issues, we observed that other containers were no
> longer experiencing similar complaints. As a result, we decided to
> implement this tuning as a permanent workaround and have deployed it
> across all clusters of servers where these containers may be deployed.

Thanks for your detailed data.

IIUC, the latency of free_pcppages_bulk() during process exiting
shouldn't be a problem?  Because users care more about the total time of
process exiting, that is, throughput.  And I suspect that the zone->lock
contention and page allocating/freeing throughput will be worse with
your configuration?

But the latency of free_pcppages_bulk() and page allocation in other
processes is a problem.  And your configuration can help it.

Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
you have a normal PCP size (high) but smaller PCP batch.  I guess that
may help both latency and throughput in your system.  Could you give it
a try?

[snip]

--
Best Regards,
Huang, Ying
Yafang Shao July 2, 2024, 12:07 p.m. UTC | #5
On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> > Currently, we're encountering latency spikes in our container environment
> >> > when a specific container with multiple Python-based tasks exits. These
> >> > tasks may hold the zone->lock for an extended period, significantly
> >> > impacting latency for other containers attempting to allocate memory.
> >>
> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> reasonably detailed description of the issue and a description of any
> >> ongoing work would be helpful here.
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Our investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
> >    - 76.97% exit_mmap
> >       - 58.58% unmap_vmas
> >          - 58.55% unmap_single_vma
> >             - unmap_page_range
> >                - 58.32% zap_pte_range
> >                   - 42.88% tlb_flush_mmu
> >                      - 42.76% free_pages_and_swap_cache
> >                         - 41.22% release_pages
> >                            - 33.29% free_unref_page_list
> >                               - 32.37% free_unref_page_commit
> >                                  - 31.64% free_pcppages_bulk
> >                                     + 28.65% _raw_spin_lock
> >                                       1.28% __list_del_entry_valid
> >                            + 3.25% folio_lruvec_lock_irqsave
> >                            + 0.75% __mem_cgroup_uncharge_list
> >                              0.60% __mod_lruvec_state
> >                           1.07% free_swap_cache
> >                   + 11.69% page_remove_rmap
> >                     0.64% __mod_lruvec_page_state
> >       - 17.34% remove_vma
> >          - 17.25% vm_area_free
> >             - 17.23% kmem_cache_free
> >                - 17.15% __slab_free
> >                   - 14.56% discard_slab
> >                        free_slab
> >                        __free_slab
> >                        __free_pages
> >                      - free_unref_page
> >                         - 13.50% free_unref_page_commit
> >                            - free_pcppages_bulk
> >                               + 13.44% _raw_spin_lock
> >
> > By enabling the mm_page_pcpu_drain tracepoint we can find the detailed stack:
> >
> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >  => free_pcppages_bulk
> >  => free_unref_page_commit
> >  => free_unref_page_list
> >  => release_pages
> >  => free_pages_and_swap_cache
> >  => tlb_flush_mmu
> >  => zap_pte_range
> >  => unmap_page_range
> >  => unmap_single_vma
> >  => unmap_vmas
> >  => exit_mmap
> >  => mmput
> >  => do_exit
> >  => do_group_exit
> >  => get_signal
> >  => arch_do_signal_or_restart
> >  => exit_to_user_mode_prepare
> >  => syscall_exit_to_user_mode
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues are equipped with impressive
> > hardware specifications, including 256 CPUs and 1TB of memory, all
> > within a single NUMA node. The zoneinfo is as follows,
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The high is around 100 times the batch size.
> >
> > We also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> > 19:48:54
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >
> > The latency can reach tens of milliseconds.
> >
> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> > minimum pagelist high at 4 times the batch size, we were able to
> > significantly reduce the latency associated with the
> > free_pcppages_bulk() function during container exits.:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >
> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> > knob to set the minimum pagelist high at a level that effectively
> > mitigated latency issues, we observed that other containers were no
> > longer experiencing similar complaints. As a result, we decided to
> > implement this tuning as a permanent workaround and have deployed it
> > across all clusters of servers where these containers may be deployed.
>
> Thanks for your detailed data.
>
> IIUC, the latency of free_pcppages_bulk() during process exiting
> shouldn't be a problem?

Right. The problem arises when the process holds the lock for too
long, causing other processes that are attempting to allocate memory
to experience delays or wait times.

> Because users care more about the total time of
> process exiting, that is, throughput.  And I suspect that the zone->lock
> contention and page allocating/freeing throughput will be worse with
> your configuration?

While reducing throughput may not be a significant concern due to the
minimal difference, the potential for latency spikes, a crucial metric
for assessing system stability, is of greater concern to users. Higher
latency can lead to request errors, impacting the user experience.
Therefore, maintaining stability, even at the cost of slightly lower
throughput, is preferable to experiencing higher throughput with
unstable performance.

>
> But the latency of free_pcppages_bulk() and page allocation in other
> processes is a problem.  And your configuration can help it.
>
> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> may help both latency and throughput in your system.  Could you give it
> a try?

Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
configuration option. However, I've observed your recent improvements
to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
restrict the pcp batch scale factor to avoid too long latency"), which
has prompted me to experiment with manually setting the
pcp->free_factor to zero. While this adjustment provided some
improvement, the results were not as significant as I had hoped.

BTW, perhaps we should consider the implementation of a sysctl knob as
an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
to more easily adjust it.
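
A minimal sketch of what such a knob might look like (the name, bounds and
placement here are purely illustrative, not an existing interface); it would
still need to be registered under the vm sysctl table and consumed where the
batch scaling is computed:

static int pcp_batch_scale_max = 5;	/* CONFIG_PCP_BATCH_SCALE_MAX default */
static int pcp_batch_scale_min;		/* 0 */
static int pcp_batch_scale_limit = 6;	/* Kconfig maximum */

static struct ctl_table pcp_batch_sysctl[] = {
	{
		.procname	= "percpu_pagelist_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &pcp_batch_scale_min,
		.extra2		= &pcp_batch_scale_limit,
	},
};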

Below is my reply to your question in the other thread:

> Could you measure the run time of free_pcppages_bulk(), this can be done
> via ftrace function_graph tracer.  We want to check whether this is a
> common issue.

I believe this is not a common issue, as we have only observed
latency spikes under this specific workload.

> If it is really necessary, can we just use a large enough number for
> vm.percpu_pagelist_high_fraction?  For example, (1 << 30)?

Currently, we are setting the value to 0x7fffffff, which can be
confusing for others due to its arbitrary nature. Given that the
minimum high size is a special value, specifically 4 times the batch
size, I believe it would be more beneficial to introduce a dedicated
sysctl value that clearly represents this setting. This will not only
make the configuration more intuitive for users, but also provide a
clear, documented reference for the future.
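
For context, the reason any sufficiently large fraction degenerates to the
minimum is the existing floor in zone_highsize(); a simplified sketch of
that sizing logic (assuming the current mm/page_alloc.c structure, helper
name invented):

static int high_from_fraction(struct zone *zone, int nr_split_cpus,
			      int batch, int high_fraction)
{
	/* Fraction-based sizing: a huge fraction drives this toward zero. */
	unsigned long total_pages = zone_managed_pages(zone) / high_fraction;
	unsigned long high = total_pages / nr_split_cpus;

	/* The existing floor: at least four times the batch size. */
	return max(high, (unsigned long)batch << 2);
}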
Huang, Ying July 3, 2024, 1:55 a.m. UTC | #6
Yafang Shao <laoar.shao@gmail.com> writes:

> On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>
>> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>> >>
>> >> > Currently, we're encountering latency spikes in our container environment
>> >> > when a specific container with multiple Python-based tasks exits. These
>> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> > impacting latency for other containers attempting to allocate memory.
>> >>
>> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> reasonably detailed description of the issue and a description of any
>> >> ongoing work would be helpful here.
>> >
>> > In our containerized environment, we have a specific type of container
>> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> > processes are organized as separate processes rather than threads due
>> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> > multi-threaded setup. Upon the exit of these containers, other
>> > containers hosted on the same machine experience significant latency
>> > spikes.
>> >
>> > Our investigation using perf tracing revealed that the root cause of
>> > these spikes is the simultaneous execution of exit_mmap() by each of
>> > the exiting processes. This concurrent access to the zone->lock
>> > results in contention, which becomes a hotspot and negatively impacts
>> > performance. The perf results clearly indicate this contention as a
>> > primary contributor to the observed latency issues.
>> >
>> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
>> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
>> >    - 76.97% exit_mmap
>> >       - 58.58% unmap_vmas
>> >          - 58.55% unmap_single_vma
>> >             - unmap_page_range
>> >                - 58.32% zap_pte_range
>> >                   - 42.88% tlb_flush_mmu
>> >                      - 42.76% free_pages_and_swap_cache
>> >                         - 41.22% release_pages
>> >                            - 33.29% free_unref_page_list
>> >                               - 32.37% free_unref_page_commit
>> >                                  - 31.64% free_pcppages_bulk
>> >                                     + 28.65% _raw_spin_lock
>> >                                       1.28% __list_del_entry_valid
>> >                            + 3.25% folio_lruvec_lock_irqsave
>> >                            + 0.75% __mem_cgroup_uncharge_list
>> >                              0.60% __mod_lruvec_state
>> >                           1.07% free_swap_cache
>> >                   + 11.69% page_remove_rmap
>> >                     0.64% __mod_lruvec_page_state
>> >       - 17.34% remove_vma
>> >          - 17.25% vm_area_free
>> >             - 17.23% kmem_cache_free
>> >                - 17.15% __slab_free
>> >                   - 14.56% discard_slab
>> >                        free_slab
>> >                        __free_slab
>> >                        __free_pages
>> >                      - free_unref_page
>> >                         - 13.50% free_unref_page_commit
>> >                            - free_pcppages_bulk
>> >                               + 13.44% _raw_spin_lock
>> >
>> > By enabling the mm_page_pcpu_drain tracepoint we can find the detailed stack:
>> >
>> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >  => free_pcppages_bulk
>> >  => free_unref_page_commit
>> >  => free_unref_page_list
>> >  => release_pages
>> >  => free_pages_and_swap_cache
>> >  => tlb_flush_mmu
>> >  => zap_pte_range
>> >  => unmap_page_range
>> >  => unmap_single_vma
>> >  => unmap_vmas
>> >  => exit_mmap
>> >  => mmput
>> >  => do_exit
>> >  => do_group_exit
>> >  => get_signal
>> >  => arch_do_signal_or_restart
>> >  => exit_to_user_mode_prepare
>> >  => syscall_exit_to_user_mode
>> >  => do_syscall_64
>> >  => entry_SYSCALL_64_after_hwframe
>> >
>> > The servers experiencing these issues are equipped with impressive
>> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> > within a single NUMA node. The zoneinfo is as follows,
>> >
>> > Node 0, zone   Normal
>> >   pages free     144465775
>> >         boost    0
>> >         min      1309270
>> >         low      1636587
>> >         high     1963904
>> >         spanned  564133888
>> >         present  296747008
>> >         managed  291974346
>> >         cma      0
>> >         protection: (0, 0, 0, 0)
>> > ...
>> > ...
>> >   pagesets
>> >     cpu: 0
>> >               count: 2217
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 1
>> >               count: 4510
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 2
>> >               count: 3059
>> >               high:  6392
>> >               batch: 63
>> >
>> > ...
>> >
>> > The high is around 100 times the batch size.
>> >
>> > We also traced the latency associated with the free_pcppages_bulk()
>> > function during the container exit process:
>> >
>> > 19:48:54
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 148      |*****************                       |
>> >        512 -> 1023       : 334      |****************************************|
>> >       1024 -> 2047       : 33       |***                                     |
>> >       2048 -> 4095       : 5        |                                        |
>> >       4096 -> 8191       : 7        |                                        |
>> >       8192 -> 16383      : 12       |*                                       |
>> >      16384 -> 32767      : 30       |***                                     |
>> >      32768 -> 65535      : 21       |**                                      |
>> >      65536 -> 131071     : 15       |*                                       |
>> >     131072 -> 262143     : 27       |***                                     |
>> >     262144 -> 524287     : 84       |**********                              |
>> >     524288 -> 1048575    : 203      |************************                |
>> >    1048576 -> 2097151    : 284      |**********************************      |
>> >    2097152 -> 4194303    : 327      |*************************************** |
>> >    4194304 -> 8388607    : 215      |*************************               |
>> >    8388608 -> 16777215   : 116      |*************                           |
>> >   16777216 -> 33554431   : 47       |*****                                   |
>> >   33554432 -> 67108863   : 8        |                                        |
>> >   67108864 -> 134217727  : 3        |                                        |
>> >
>> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >
>> > The latency can reach tens of milliseconds.
>> >
>> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> > minimum pagelist high at 4 times the batch size, we were able to
>> > significantly reduce the latency associated with the
>> > free_pcppages_bulk() function during container exits.:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 120      |                                        |
>> >        256 -> 511        : 365      |*                                       |
>> >        512 -> 1023       : 201      |                                        |
>> >       1024 -> 2047       : 103      |                                        |
>> >       2048 -> 4095       : 84       |                                        |
>> >       4096 -> 8191       : 87       |                                        |
>> >       8192 -> 16383      : 4777     |**************                          |
>> >      16384 -> 32767      : 10572    |*******************************         |
>> >      32768 -> 65535      : 13544    |****************************************|
>> >      65536 -> 131071     : 12723    |*************************************   |
>> >     131072 -> 262143     : 8604     |*************************               |
>> >     262144 -> 524287     : 3659     |**********                              |
>> >     524288 -> 1048575    : 921      |**                                      |
>> >    1048576 -> 2097151    : 122      |                                        |
>> >    2097152 -> 4194303    : 5        |                                        |
>> >
>> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >
>> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> > knob to set the minimum pagelist high at a level that effectively
>> > mitigated latency issues, we observed that other containers were no
>> > longer experiencing similar complaints. As a result, we decided to
>> > implement this tuning as a permanent workaround and have deployed it
>> > across all clusters of servers where these containers may be deployed.
>>
>> Thanks for your detailed data.
>>
>> IIUC, the latency of free_pcppages_bulk() during process exiting
>> shouldn't be a problem?
>
> Right. The problem arises when the process holds the lock for too
> long, causing other processes that are attempting to allocate memory
> to experience delays or wait times.
>
>> Because users care more about the total time of
>> process exiting, that is, throughput.  And I suspect that the zone->lock
>> contention and page allocating/freeing throughput will be worse with
>> your configuration?
>
> While reducing throughput may not be a significant concern due to the
> minimal difference, the potential for latency spikes, a crucial metric
> for assessing system stability, is of greater concern to users. Higher
> latency can lead to request errors, impacting the user experience.
> Therefore, maintaining stability, even at the cost of slightly lower
> throughput, is preferable to experiencing higher throughput with
> unstable performance.
>
>>
>> But the latency of free_pcppages_bulk() and page allocation in other
>> processes is a problem.  And your configuration can help it.
>>
>> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> may help both latency and throughput in your system.  Could you give it
>> a try?
>
> Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> configuration option. However, I've observed your recent improvements
> to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> restrict the pcp batch scale factor to avoid too long latency"), which
> has prompted me to experiment with manually setting the
> pcp->free_factor to zero. While this adjustment provided some
> improvement, the results were not as significant as I had hoped.
>
> BTW, perhaps we should consider the implementation of a sysctl knob as
> an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> to more easily adjust it.

If you cannot test upstream behavior, it's hard to make changes to
upstream.  Could you find a way to do that?

IIUC, PCP high will not influence allocate/free latency, PCP batch will.
Your configuration will influence PCP batch via configuring PCP high.
So, it may be reasonable to find a way to adjust PCP batch directly.
But, we need practical requirements and test methods first.
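
For illustration only (this is not part of the series and the names are
made up), a runtime knob for the batch scale factor, as suggested above,
could be wired up roughly like the sketch below; the value would then
have to be read wherever the kernel currently uses
CONFIG_PCP_BATCH_SCALE_MAX:

/* Hypothetical sketch, assuming <linux/sysctl.h> and <linux/init.h>. */
static int pcp_batch_scale_max = 5;     /* assumed Kconfig default */
static int pcp_batch_scale_limit = 6;

static struct ctl_table pcp_batch_table[] = {
        {
                .procname       = "percpu_pagelist_batch_scale_max",
                .data           = &pcp_batch_scale_max,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = &pcp_batch_scale_limit,
        },
        { }     /* sentinel, needed on older kernels */
};

static int __init pcp_batch_sysctl_init(void)
{
        register_sysctl("vm", pcp_batch_table);
        return 0;
}
late_initcall(pcp_batch_sysctl_init);

Whether such a knob is justified is exactly the practical-requirements
question, though.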

[snip]

--
Best Regards,
Huang, Ying
Yafang Shao July 3, 2024, 2:13 a.m. UTC | #7
On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
[snip]
> >>
> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> processes is a problem.  And your configuration can help it.
> >>
> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> >> may help both latency and throughput in your system.  Could you give it
> >> a try?
> >
> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> > configuration option. However, I've observed your recent improvements
> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> > restrict the pcp batch scale factor to avoid too long latency"), which
> > has prompted me to experiment with manually setting the
> > pcp->free_factor to zero. While this adjustment provided some
> > improvement, the results were not as significant as I had hoped.
> >
> > BTW, perhaps we should consider the implementation of a sysctl knob as
> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> > to more easily adjust it.
>
> If you cannot test upstream behavior, it's hard to make changes to
> upstream.  Could you find a way to do that?

I'm afraid I can't run an upstream kernel in our production environment :(
Lots of code changes have to be made.

>
> IIUC, PCP high will not influence allocate/free latency, PCP batch will.

It seems incorrect.
Look at the code in free_unref_page_commit():

    if (pcp->count >= high) {
        free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
                                          pcp, pindex);
    }

And nr_pcp_free() :
    min_nr_free = batch;
    max_nr_free = high - batch;

    batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
    return batch;

The 'batch' is not a fixed value but changes dynamically, isn't it?
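
To make that concrete, here is a small userspace model of the clamp
above (an illustration only, not the kernel code itself), using the
numbers from this report: batch = 63, and high = 6392 by default versus
252 when the 4 * batch minimum is applied:

#include <stdio.h>

/* Model of the nr_pcp_free() clamp quoted above. */
static int model_nr_pcp_free(int free_count, int batch, int high)
{
        int min_nr_free = batch;        /* lower bound: one batch        */
        int max_nr_free = high - batch; /* upper bound: high minus batch */

        if (free_count < min_nr_free)
                return min_nr_free;
        if (free_count > max_nr_free)
                return max_nr_free;
        return free_count;
}

int main(void)
{
        int batch = 63;
        int long_free_run = 1000000;    /* a long streak of frees */

        printf("high=6392: up to %d pages per bulk free\n",
               model_nr_pcp_free(long_free_run, batch, 6392));  /* 6329 */
        printf("high=252:  up to %d pages per bulk free\n",
               model_nr_pcp_free(long_free_run, batch, 252));   /* 189 */
        return 0;
}

So with the default high a single zone->lock hold may free thousands of
pages, while with the 4 * batch minimum it is capped below two hundred,
which matches the latency histograms earlier in the thread.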

> Your configuration will influence PCP batch via configuring PCP high.
> So, it may be reasonable to find a way to adjust PCP batch directly.
> But, we need practical requirements and test methods first.
>
Huang, Ying July 3, 2024, 3:21 a.m. UTC | #8
Yafang Shao <laoar.shao@gmail.com> writes:

[snip]
>>
>> If you cannot test upstream behavior, it's hard to make changes to
>> upstream.  Could you find a way to do that?
>
> I'm afraid I can't run an upstream kernel in our production environment :(
> Lots of code changes have to be made.

Understood.  Can you find a way to test the upstream behavior, even if
not the upstream kernel exactly?  Or test the upstream kernel in a
similar, though not identical, production environment?

>> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
>
> It seems incorrect.
> Look at the code in free_unref_page_commit():
>
>     if (pcp->count >= high) {
>         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>                                           pcp, pindex);
>     }
>
> And nr_pcp_free() :
>     min_nr_free = batch;
>     max_nr_free = high - batch;
>
>     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
>     return batch;
>
> The 'batch' is not a fixed value but changes dynamically, isn't it?

Sorry, my words were confusing.  For 'batch', I mean the value of the
"count" parameter of free_pcppages_bulk() actually.  For example, if we
change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.
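
To put rough numbers on that, assuming the Kconfig default scale factor
of 5 and the batch of 63 shown in the zoneinfo above:

    batch << CONFIG_PCP_BATCH_SCALE_MAX = 63 << 5 = 2016 pages per
    free_pcppages_bulk() call at the default, versus
    63 << 0 = 63 pages per call with the scale factor set to 0.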

>> Your configuration will influence PCP batch via configuring PCP high.
>> So, it may be reasonable to find a way to adjust PCP batch directly.
>> But, we need practical requirements and test methods first.
>>

--
Best Regards,
Huang, Ying
Yafang Shao July 3, 2024, 3:44 a.m. UTC | #9
On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
[snip]
> >>
> >> If you cannot test upstream behavior, it's hard to make changes to
> >> upstream.  Could you find a way to do that?
> >
> > I'm afraid I can't run an upstream kernel in our production environment :(
> > Lots of code changes have to be made.
>
> Understood.  Can you find a way to test the upstream behavior, even if
> not the upstream kernel exactly?  Or test the upstream kernel in a
> similar, though not identical, production environment?

I'm willing to give it a try, but it may take some time to achieve the
desired results.

>
> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
> >
> > It seems incorrect.
> > Look at the code in free_unref_page_commit():
> >
> >     if (pcp->count >= high) {
> >         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> >                                           pcp, pindex);
> >     }
> >
> > And nr_pcp_free() :
> >     min_nr_free = batch;
> >     max_nr_free = high - batch;
> >
> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
> >     return batch;
> >
> > The 'batch' is not a fixed value but changes dynamically, isn't it?
>
> Sorry, my words were confusing.  For 'batch', I mean the value of the
> "count" parameter of free_pcppages_bulk() actually.  For example, if we
> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.

If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
that pcp->free_count should not exceed (63 << 0), right? (Suppose 63 is
the default batch size.)
However, at worst pcp->free_count can reach (62 + (1 << MAX_ORDER)).
Is that expected?
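
As a concrete example of the overshoot (numbers assumed: batch = 63,
CONFIG_PCP_BATCH_SCALE_MAX = 0, and a single order-9 free, the largest
order a PCP list typically holds):

    cap        = 63 << 0       = 63
    free_count = 62                     (62 < 63, so the check passes)
    free_count = 62 + (1 << 9) = 574    (far above the intended cap)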

Perhaps we should make the change below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7313f9d704b..8c52a30201d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
        } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
                pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
        }
-       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
                pcp->free_count += (1 << order);
+               if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
+                       pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
+       }
        high = nr_pcp_high(pcp, zone, batch, free_high);
        if (pcp->count >= high) {
                free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
Huang, Ying July 3, 2024, 5:34 a.m. UTC | #10
Yafang Shao <laoar.shao@gmail.com> writes:

> On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >> >> >>
>> >> >> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>> >> >> >>
>> >> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >> >>
>> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> >> reasonably detailed description of the issue and a description of any
>> >> >> >> ongoing work would be helpful here.
>> >> >> >
>> >> >> > In our containerized environment, we have a specific type of container
>> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> > processes are organized as separate processes rather than threads due
>> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> > containers hosted on the same machine experience significant latency
>> >> >> > spikes.
>> >> >> >
>> >> >> > Our investigation using perf tracing revealed that the root cause of
>> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> > primary contributor to the observed latency issues.
>> >> >> >
>> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]
>> >> >> >            [k] mmput                                   ▒
>> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]
>> >> >> >            [k] exit_mmap                               ▒
>> >> >> >    - 76.97% exit_mmap
>> >> >> >                                                        ▒
>> >> >> >       - 58.58% unmap_vmas
>> >> >> >                                                        ▒
>> >> >> >          - 58.55% unmap_single_vma
>> >> >> >                                                        ▒
>> >> >> >             - unmap_page_range
>> >> >> >                                                        ▒
>> >> >> >                - 58.32% zap_pte_range
>> >> >> >                                                        ▒
>> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >                                                        ▒
>> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >                                                        ▒
>> >> >> >                         - 41.22% release_pages
>> >> >> >                                                        ▒
>> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >                                                        ▒
>> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >                                                        ▒
>> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >                                                        ▒
>> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >                                                        ▒
>> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >                                                        ▒
>> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >                                                        ▒
>> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >                                                        ▒
>> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >                                                        ▒
>> >> >> >                           1.07% free_swap_cache
>> >> >> >                                                        ▒
>> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >                                                        ▒
>> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >       - 17.34% remove_vma
>> >> >> >                                                        ▒
>> >> >> >          - 17.25% vm_area_free
>> >> >> >                                                        ▒
>> >> >> >             - 17.23% kmem_cache_free
>> >> >> >                                                        ▒
>> >> >> >                - 17.15% __slab_free
>> >> >> >                                                        ▒
>> >> >> >                   - 14.56% discard_slab
>> >> >> >                                                        ▒
>> >> >> >                        free_slab
>> >> >> >                                                        ▒
>> >> >> >                        __free_slab
>> >> >> >                                                        ▒
>> >> >> >                        __free_pages
>> >> >> >                                                        ▒
>> >> >> >                      - free_unref_page
>> >> >> >                                                        ▒
>> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >                                                        ▒
>> >> >> >                            - free_pcppages_bulk
>> >> >> >                                                        ▒
>> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >
>> >> >> > By enabling the mm_page_pcpu_drain() we can find the detailed stack:
>> >> >> >
>> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> > e=1
>> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >  => free_pcppages_bulk
>> >> >> >  => free_unref_page_commit
>> >> >> >  => free_unref_page_list
>> >> >> >  => release_pages
>> >> >> >  => free_pages_and_swap_cache
>> >> >> >  => tlb_flush_mmu
>> >> >> >  => zap_pte_range
>> >> >> >  => unmap_page_range
>> >> >> >  => unmap_single_vma
>> >> >> >  => unmap_vmas
>> >> >> >  => exit_mmap
>> >> >> >  => mmput
>> >> >> >  => do_exit
>> >> >> >  => do_group_exit
>> >> >> >  => get_signal
>> >> >> >  => arch_do_signal_or_restart
>> >> >> >  => exit_to_user_mode_prepare
>> >> >> >  => syscall_exit_to_user_mode
>> >> >> >  => do_syscall_64
>> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >
>> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >
>> >> >> > Node 0, zone   Normal
>> >> >> >   pages free     144465775
>> >> >> >         boost    0
>> >> >> >         min      1309270
>> >> >> >         low      1636587
>> >> >> >         high     1963904
>> >> >> >         spanned  564133888
>> >> >> >         present  296747008
>> >> >> >         managed  291974346
>> >> >> >         cma      0
>> >> >> >         protection: (0, 0, 0, 0)
>> >> >> > ...
>> >> >> > ...
>> >> >> >   pagesets
>> >> >> >     cpu: 0
>> >> >> >               count: 2217
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 1
>> >> >> >               count: 4510
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 2
>> >> >> >               count: 3059
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >
>> >> >> > ...
>> >> >> >
>> >> >> > The high is around 100 times the batch size.
>> >> >> >
>> >> >> > We also traced the latency associated with the free_pcppages_bulk()
>> >> >> > function during the container exit process:
>> >> >> >
>> >> >> > 19:48:54
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >
>> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >
>> >> >> > The latency can reach tens of milliseconds.
>> >> >> >
>> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> >> > significantly reduce the latency associated with the
>> >> >> > free_pcppages_bulk() function during container exits.:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >
>> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >
>> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> >> > knob to set the minimum pagelist high at a level that effectively
>> >> >> > mitigated latency issues, we observed that other containers were no
>> >> >> > longer experiencing similar complaints. As a result, we decided to
>> >> >> > implement this tuning as a permanent workaround and have deployed it
>> >> >> > across all clusters of servers where these containers may be deployed.
>> >> >>
>> >> >> Thanks for your detailed data.
>> >> >>
>> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> shouldn't be a problem?
>> >> >
>> >> > Right. The problem arises when the process holds the lock for too
>> >> > long, causing other processes that are attempting to allocate memory
>> >> > to experience delays or wait times.
>> >> >
>> >> >> Because users care more about the total time of
>> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> your configuration?
>> >> >
>> >> > While reducing throughput may not be a significant concern due to the
>> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> > for assessing system stability, is of greater concern to users. Higher
>> >> > latency can lead to request errors, impacting the user experience.
>> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> > throughput, is preferable to experiencing higher throughput with
>> >> > unstable performance.
>> >> >
>> >> >>
>> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> processes is a problem.  And your configuration can help it.
>> >> >>
>> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> >> may help both latency and throughput in your system.  Could you give it
>> >> >> a try?
>> >> >
>> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> > configuration option. However, I've observed your recent improvements
>> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> > has prompted me to experiment with manually setting the
>> >> > pcp->free_factor to zero. While this adjustment provided some
>> >> > improvement, the results were not as significant as I had hoped.
>> >> >
>> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> > to more easily adjust it.
>> >>
>> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> upstream.  Could you find a way to do that?
>> >
>> > I'm afraid I can't run an upstream kernel in our production environment :(
>> > Lots of code changes have to be made.
>>
>> Understand.  Can you find a way to test upstream behavior, not upstream
>> kernel exactly?  Or test the upstream kernel but in a similar but not
>> exactly production environment.
>
> I'm willing to give it a try, but it may take some time to achieve the
> > desired results.

Thanks!

>>
>> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
>> >
>> > It seems incorrect.
>> > Looks at the code in free_unref_page_commit() :
>> >
>> >     if (pcp->count >= high) {
>> >         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>> >                                           pcp, pindex);
>> >     }
>> >
>> > And nr_pcp_free() :
>> >     min_nr_free = batch;
>> >     max_nr_free = high - batch;
>> >
>> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
>> >     return batch;
>> >
>> > The 'batch' is not a fixed value but changed dynamically, isn't it ?
>>
>> Sorry, my words were confusing.  For 'batch', I mean the value of the
>> "count" parameter of free_pcppages_bulk() actually.  For example, if we
>> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.
>
> If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
> that pcp->free_count should not exceed (63 << 0), right? (assuming 63
> is the default batch size)
> However, at worst, pcp->free_count can reach (62 + (1 << MAX_ORDER));
> is that expected?
>
> Perhaps we should make the change below?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e7313f9d704b..8c52a30201d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>  		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>  	}
> -	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
>  		pcp->free_count += (1 << order);
> +		if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
> +			pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
> +	}
>  	high = nr_pcp_high(pcp, zone, batch, free_high);
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),

Or

        pcp->free_count = min(pcp->free_count + (1 << order),
                              batch << CONFIG_PCP_BATCH_SCALE_MAX);
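
To make the difference concrete, here is a toy userspace model (an
illustration only, not kernel code; the batch size, scale factor and
the order-9 free below are my assumptions) comparing the current
pre-increment check with the clamped update:

#include <stdio.h>

#define BATCH		63
#define SCALE_MAX	0
#define CAP		(BATCH << SCALE_MAX)

/* current logic: only the value *before* the increment is checked */
static unsigned int update_with_branch(unsigned int free_count,
				       unsigned int order)
{
	if (free_count < CAP)
		free_count += 1u << order;
	return free_count;
}

/* proposed one-liner: add, then clamp to the cap */
static unsigned int update_with_clamp(unsigned int free_count,
				      unsigned int order)
{
	unsigned int v = free_count + (1u << order);

	return v < CAP ? v : CAP;
}

int main(void)
{
	/* free_count just under the cap, then one order-9 (512-page) free */
	printf("branch: %u\n", update_with_branch(62, 9));	/* 574 */
	printf("clamp:  %u\n", update_with_clamp(62, 9));	/* 63 */
	return 0;
}

With the branch-only check, the free that crosses the cap can overshoot
it by up to (1 << order) - 1 pages; the clamped form never exceeds it.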

--
Best Regards,
Huang, Ying
Yafang Shao July 4, 2024, 1:27 p.m. UTC | #11
On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> [...]
>
> >>
> >> Understand.  Can you find a way to test upstream behavior, not upstream
> >> kernel exactly?  Or test the upstream kernel but in a similar but not
> >> exactly production environment.
> >
> > I'm willing to give it a try, but it may take some time to achieve the
> > desired results.
>
> Thanks!

After I backported the series "mm: PCP high auto-tuning", which
consists of 9 patches in total, to our 6.1.y stable kernel and
deployed it to our production environment, I observed a significant
reduction in latency. The results are as follows:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 11       |                                        |
      4096 -> 8191       : 3        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 2        |                                        |
     32768 -> 65535      : 7        |                                        |
     65536 -> 131071     : 198      |*********                               |
    131072 -> 262143     : 530      |************************                |
    262144 -> 524287     : 824      |**************************************  |
    524288 -> 1048575    : 852      |****************************************|
   1048576 -> 2097151    : 714      |*********************************       |
   2097152 -> 4194303    : 389      |******************                      |
   4194304 -> 8388607    : 143      |******                                  |
   8388608 -> 16777215   : 29       |*                                       |
  16777216 -> 33554431   : 1        |                                        |

avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708

Compared to the previous data, the maximum latency has been reduced to
less than 30ms.

Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
latency was further reduced to less than 2ms.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 5063     |*****                                   |
      4096 -> 8191       : 31226    |********************************        |
      8192 -> 16383      : 37606    |*************************************** |
     16384 -> 32767      : 38359    |****************************************|
     32768 -> 65535      : 30652    |*******************************         |
     65536 -> 131071     : 18714    |*******************                     |
    131072 -> 262143     : 7968     |********                                |
    262144 -> 524287     : 1996     |**                                      |
    524288 -> 1048575    : 302      |                                        |
   1048576 -> 2097151    : 19       |                                        |

avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031

After multiple trials, I observed no significant variation from one
attempt to the next.

Therefore, we decided to backport your improvements to our local
kernel. Additionally, I propose introducing a new sysctl knob,
vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
to easily tune the setting based on their specific workloads.
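
For illustration only, a minimal sketch of what such a knob could look
like (the name, default, bounds and registration below are assumptions
on my part, not the actual patch):

#include <linux/init.h>
#include <linux/sysctl.h>

/* hypothetical runtime replacement for CONFIG_PCP_BATCH_SCALE_MAX */
int pcp_batch_scale_max = 5;			/* assumed default */

static int pcp_batch_scale_lo;			/* 0 */
static int pcp_batch_scale_hi = 6;		/* assumed upper bound */

static struct ctl_table pcp_batch_sysctl_table[] = {
	{
		.procname	= "pcp_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &pcp_batch_scale_lo,
		.extra2		= &pcp_batch_scale_hi,
	},
	{ }
};

static int __init pcp_batch_sysctl_init(void)
{
	register_sysctl("vm", pcp_batch_sysctl_table);
	return 0;
}
late_initcall(pcp_batch_sysctl_init);

The freeing path would then read pcp_batch_scale_max where it currently
uses the compile-time constant, and something like
'echo 0 > /proc/sys/vm/pcp_batch_scale_max' would apply the setting at
runtime.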
Huang, Ying July 5, 2024, 1:28 a.m. UTC | #12
Yafang Shao <laoar.shao@gmail.com> writes:

> On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > [...]
>> >> >> >>
>> >> >> >> Thanks for your detailed data.
>> >> >> >>
>> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> >> shouldn't be a problem?
>> >> >> >
>> >> >> > Right. The problem arises when the process holds the lock for too
>> >> >> > long, causing other processes that are attempting to allocate memory
>> >> >> > to experience delays or wait times.
>> >> >> >
>> >> >> >> Because users care more about the total time of
>> >> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> >> your configuration?
>> >> >> >
>> >> >> > While reducing throughput may not be a significant concern due to the
>> >> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> >> > for assessing system stability, is of greater concern to users. Higher
>> >> >> > latency can lead to request errors, impacting the user experience.
>> >> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> >> > throughput, is preferable to experiencing higher throughput with
>> >> >> > unstable performance.
>> >> >> >
>> >> >> >>
>> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> >> processes is a problem.  And your configuration can help it.
>> >> >> >>
>> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> >> >> may help both latency and throughput in your system.  Could you give it
>> >> >> >> a try?
>> >> >> >
>> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> >> > configuration option. However, I've observed your recent improvements
>> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> >> > has prompted me to experiment with manually setting the
>> >> >> > pcp->free_factor to zero. While this adjustment provided some
>> >> >> > improvement, the results were not as significant as I had hoped.
>> >> >> >
>> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> >> > to more easily adjust it.
>> >> >>
>> >> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> >> upstream.  Could you find a way to do that?
>> >> >
>> >> > I'm afraid I can't run an upstream kernel in our production environment :(
>> >> > Lots of code changes have to be made.
>> >>
> >> >> Understood.  Can you find a way to test the upstream behavior, if not the
> >> >> upstream kernel exactly?  Or test the upstream kernel in an environment
> >> >> that is similar to, though not exactly, production.
>> >
>> > I'm willing to give it a try, but it may take some time to achieve the
>> > desired results.
>>
>> Thanks!
>
> After I backported the series "mm: PCP high auto-tuning," which
> consists of a total of 9 patches, to our 6.1.y stable kernel and
> > deployed it to our production environment, I observed a significant
> reduction in latency. The results are as follows:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 11       |                                        |
>       4096 -> 8191       : 3        |                                        |
>       8192 -> 16383      : 1        |                                        |
>      16384 -> 32767      : 2        |                                        |
>      32768 -> 65535      : 7        |                                        |
>      65536 -> 131071     : 198      |*********                               |
>     131072 -> 262143     : 530      |************************                |
>     262144 -> 524287     : 824      |**************************************  |
>     524288 -> 1048575    : 852      |****************************************|
>    1048576 -> 2097151    : 714      |*********************************       |
>    2097152 -> 4194303    : 389      |******************                      |
>    4194304 -> 8388607    : 143      |******                                  |
>    8388608 -> 16777215   : 29       |*                                       |
>   16777216 -> 33554431   : 1        |                                        |
>
> avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
>
> Compared to the previous data, the maximum latency has been reduced to
> less than 30ms.

That series can reduce the allocation/freeing from/to the buddy system,
thus reducing the lock contention.

> Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
> to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
> vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
> latency was further reduced to less than 2ms.
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 36       |                                        |
>       2048 -> 4095       : 5063     |*****                                   |
>       4096 -> 8191       : 31226    |********************************        |
>       8192 -> 16383      : 37606    |*************************************** |
>      16384 -> 32767      : 38359    |****************************************|
>      32768 -> 65535      : 30652    |*******************************         |
>      65536 -> 131071     : 18714    |*******************                     |
>     131072 -> 262143     : 7968     |********                                |
>     262144 -> 524287     : 1996     |**                                      |
>     524288 -> 1048575    : 302      |                                        |
>    1048576 -> 2097151    : 19       |                                        |
>
> avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
>
> After multiple trials, I observed no significant differences between
> attempts.

The test results look good.

> Therefore, we decided to backport your improvements to our local
> kernel. Additionally, I propose introducing a new sysctl knob,
> vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
> to easily tune the setting based on their specific workloads.

The downside is that the pcp->high decaying (in decay_pcp_high()) will
be slower.  That is, it will take longer for idle pages to be freed from
PCP to buddy.  One possible solution is to keep the decaying page
number, but use a loop as follows to control latency.

while (count < decay_number) {
        spin_lock();
        free_pcppages_bulk(, batch, );
        spin_unlock();
        count += batch;
        if (count < decay_number)
                cond_resched();
}
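
The same pattern can be sketched as a self-contained userspace program
(plain C with pthreads).  It is only an analogy for the loop above, not
kernel code: the batch of 63 and the 6392 total come from the zoneinfo
earlier in the thread, and every other name in it is made up.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define BATCH 63                /* pages freed per lock hold (illustrative) */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for free_pcppages_bulk(): pretend to free 'nr' pages. */
static void free_batch(int nr)
{
        static volatile long freed;

        for (int i = 0; i < nr; i++)
                freed++;
}

/*
 * Drain 'to_drain' pages without ever holding the lock for more than
 * one batch.  Dropping the lock and yielding between batches bounds
 * the latency seen by other threads contending on zone_lock.
 */
static void drain_in_batches(int to_drain)
{
        int count = 0;

        while (count < to_drain) {
                int nr = to_drain - count < BATCH ? to_drain - count : BATCH;

                pthread_mutex_lock(&zone_lock);
                free_batch(nr);
                pthread_mutex_unlock(&zone_lock);

                count += nr;
                if (count < to_drain)
                        sched_yield();  /* userspace stand-in for cond_resched() */
        }
}

int main(void)
{
        drain_in_batches(6392);         /* roughly one pcp->high worth of pages */
        printf("drained\n");
        return 0;
}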

--
Best Regards,
Huang, Ying
Yafang Shao July 5, 2024, 3:03 a.m. UTC | #13
On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >> >> >> >> >>
> >> >> >> >> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >> >> >>
> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> >> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> >> >> ongoing work would be helpful here.
> >> >> >> >> >
> >> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> >> > spikes.
> >> >> >> >> >
> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >> >
> >> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
> >> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >> >             - unmap_page_range
> >> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >> >                         - 41.22% release_pages
> >> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >> >       - 17.34% remove_vma
> >> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >> >                - 17.15% __slab_free
> >> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >> >                        free_slab
> >> >> >> >> >                        __free_slab
> >> >> >> >> >                        __free_pages
> >> >> >> >> >                      - free_unref_page
> >> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >> >> >
> >> >> >> >> > By enabling the mm_page_pcpu_drain() we can find the detailed stack:
> >> >> >> >> >
> >> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >> >> >  => free_pcppages_bulk
> >> >> >> >> >  => free_unref_page_commit
> >> >> >> >> >  => free_unref_page_list
> >> >> >> >> >  => release_pages
> >> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >> >  => tlb_flush_mmu
> >> >> >> >> >  => zap_pte_range
> >> >> >> >> >  => unmap_page_range
> >> >> >> >> >  => unmap_single_vma
> >> >> >> >> >  => unmap_vmas
> >> >> >> >> >  => exit_mmap
> >> >> >> >> >  => mmput
> >> >> >> >> >  => do_exit
> >> >> >> >> >  => do_group_exit
> >> >> >> >> >  => get_signal
> >> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >> >  => do_syscall_64
> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >> >
> >> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >> >> >
> >> >> >> >> > Node 0, zone   Normal
> >> >> >> >> >   pages free     144465775
> >> >> >> >> >         boost    0
> >> >> >> >> >         min      1309270
> >> >> >> >> >         low      1636587
> >> >> >> >> >         high     1963904
> >> >> >> >> >         spanned  564133888
> >> >> >> >> >         present  296747008
> >> >> >> >> >         managed  291974346
> >> >> >> >> >         cma      0
> >> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> >> > ...
> >> >> >> >> > ...
> >> >> >> >> >   pagesets
> >> >> >> >> >     cpu: 0
> >> >> >> >> >               count: 2217
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 1
> >> >> >> >> >               count: 4510
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 2
> >> >> >> >> >               count: 3059
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >
> >> >> >> >> > ...
> >> >> >> >> >
> >> >> >> >> > The high is around 100 times the batch size.
> >> >> >> >> >
> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> >> >> > function during the container exit process:
> >> >> >> >> >
> >> >> >> >> > 19:48:54
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >> >> >
> >> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >> >
> [...]
>
> The downside is that the pcp->high decaying (in decay_pcp_high()) will
> be slower.  That is, it will take longer for idle pages to be freed from
> PCP to buddy.  One possible solution is to keep the decaying page
> number, but use a loop as follows to control latency.
>
> while (count < decay_number) {
>         spin_lock();
>         free_pcppages_bulk(, batch, );
>         spin_unlock();
>         count += batch;
>         if (count < decay_number)
>                 cond_resched();
> }

I will try it with this additional change.
Thanks for your suggestion.

IIUC, the additional change should be as follows?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
        int high_min, to_drain, batch;
-       int todo = 0;
+       int todo = 0, count = 0;

        high_min = READ_ONCE(pcp->high_min);
        batch = READ_ONCE(pcp->batch);
@@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
         * control latency.  This caps pcp->high decrement too.
         */
        if (pcp->high > high_min) {
-               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
+               pcp->high = max3(pcp->count - (batch << 5),
                                 pcp->high - (pcp->high >> 3), high_min);
                if (pcp->high > high_min)
                        todo++;
        }

        to_drain = pcp->count - pcp->high;
-       if (to_drain > 0) {
+       while (count < to_drain) {
                spin_lock(&pcp->lock);
-               free_pcppages_bulk(zone, to_drain, pcp, 0);
+               free_pcppages_bulk(zone, batch, pcp, 0);
                spin_unlock(&pcp->lock);
+               count += batch;
                todo++;
+               cond_resched();
        }
Huang, Ying July 5, 2024, 5:31 a.m. UTC | #14
Yafang Shao <laoar.shao@gmail.com> writes:

> On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >> >> >> >>
>> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> >> >> >> reasonably detailed description of the issue and a description of any
>> >> >> >> >> >> ongoing work would be helpful here.
>> >> >> >> >> >
>> >> >> >> >> > In our containerized environment, we have a specific type of container
>> >> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> >> >> > processes are organized as separate processes rather than threads due
>> >> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> >> >> > containers hosted on the same machine experience significant latency
>> >> >> >> >> > spikes.
>> >> >> >> >> >
>> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of
>> >> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> >> >> > primary contributor to the observed latency issues.
>> >> >> >> >> >
>> >> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]
>> >> >> >> >> >            [k] mmput                                   ▒
>> >> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]
>> >> >> >> >> >            [k] exit_mmap                               ▒
>> >> >> >> >> >    - 76.97% exit_mmap
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >       - 58.58% unmap_vmas
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >          - 58.55% unmap_single_vma
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >             - unmap_page_range
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                - 58.32% zap_pte_range
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                         - 41.22% release_pages
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                           1.07% free_swap_cache
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >> >> >       - 17.34% remove_vma
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >          - 17.25% vm_area_free
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >             - 17.23% kmem_cache_free
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                - 17.15% __slab_free
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                   - 14.56% discard_slab
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                        free_slab
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                        __free_slab
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                        __free_pages
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                      - free_unref_page
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                            - free_pcppages_bulk
>> >> >> >> >> >                                                        ▒
>> >> >> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >> >> >
>> >> >> >> >> > By enabling the mm_page_pcpu_drain() we can find the detailed stack:
>> >> >> >> >> >
>> >> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>> >> >> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> >> >> > e=1
>> >> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >> >  => free_unref_page_commit
>> >> >> >> >> >  => free_unref_page_list
>> >> >> >> >> >  => release_pages
>> >> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >> >  => zap_pte_range
>> >> >> >> >> >  => unmap_page_range
>> >> >> >> >> >  => unmap_single_vma
>> >> >> >> >> >  => unmap_vmas
>> >> >> >> >> >  => exit_mmap
>> >> >> >> >> >  => mmput
>> >> >> >> >> >  => do_exit
>> >> >> >> >> >  => do_group_exit
>> >> >> >> >> >  => get_signal
>> >> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >> >  => do_syscall_64
>> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >> >
>> >> >> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >> >> >
>> >> >> >> >> > Node 0, zone   Normal
>> >> >> >> >> >   pages free     144465775
>> >> >> >> >> >         boost    0
>> >> >> >> >> >         min      1309270
>> >> >> >> >> >         low      1636587
>> >> >> >> >> >         high     1963904
>> >> >> >> >> >         spanned  564133888
>> >> >> >> >> >         present  296747008
>> >> >> >> >> >         managed  291974346
>> >> >> >> >> >         cma      0
>> >> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> >> > ...
>> >> >> >> >> > ...
>> >> >> >> >> >   pagesets
>> >> >> >> >> >     cpu: 0
>> >> >> >> >> >               count: 2217
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 1
>> >> >> >> >> >               count: 4510
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 2
>> >> >> >> >> >               count: 3059
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >
>> >> >> >> >> > ...
>> >> >> >> >> >
>> >> >> >> >> > The high is around 100 times the batch size.
>> >> >> >> >> >
>> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
>> >> >> >> >> > function during the container exit process:
>> >> >> >> >> >
>> >> >> >> >> > 19:48:54
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >> >> >
>> >> >> >> >> > The latency can reach tens of milliseconds.
>> >> >> >> >> >
>> >> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> >> >> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> >> >> >> > significantly reduce the latency associated with the
>> >> >> >> >> > free_pcppages_bulk() function during container exits.:
>> >> >> >> >> >
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >> >> >
>> >> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> >> >> >> > knob to set the minimum pagelist high at a level that effectively
>> >> >> >> >> > mitigated latency issues, we observed that other containers were no
>> >> >> >> >> > longer experiencing similar complaints. As a result, we decided to
>> >> >> >> >> > implement this tuning as a permanent workaround and have deployed it
>> >> >> >> >> > across all clusters of servers where these containers may be deployed.
>> >> >> >> >>
>> >> >> >> >> Thanks for your detailed data.
>> >> >> >> >>
>> >> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> >> >> shouldn't be a problem?
>> >> >> >> >
>> >> >> >> > Right. The problem arises when the process holds the lock for too
>> >> >> >> > long, causing other processes that are attempting to allocate memory
>> >> >> >> > to experience delays or wait times.
>> >> >> >> >
>> >> >> >> >> Because users care more about the total time of
>> >> >> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> >> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> >> >> your configuration?
>> >> >> >> >
>> >> >> >> > While reducing throughput may not be a significant concern due to the
>> >> >> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> >> >> > for assessing system stability, is of greater concern to users. Higher
>> >> >> >> > latency can lead to request errors, impacting the user experience.
>> >> >> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> >> >> > throughput, is preferable to experiencing higher throughput with
>> >> >> >> > unstable performance.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> >> >> processes is a problem.  And your configuration can help it.
>> >> >> >> >>
>> >> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> >> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> >> >> >> may help both latency and throughput in your system.  Could you give it
>> >> >> >> >> a try?
>> >> >> >> >
>> >> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> >> >> > configuration option. However, I've observed your recent improvements
>> >> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> >> >> > has prompted me to experiment with manually setting the
>> >> >> >> > pcp->free_factor to zero. While this adjustment provided some
>> >> >> >> > improvement, the results were not as significant as I had hoped.
>> >> >> >> >
>> >> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> >> >> > to more easily adjust it.
>> >> >> >>
>> >> >> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> >> >> upstream.  Could you find a way to do that?
>> >> >> >
>> >> >> > I'm afraid I can't run an upstream kernel in our production environment :(
>> >> >> > Lots of code changes have to be made.
>> >> >>
>> >> >> Understand.  Can you find a way to test upstream behavior, not upstream
>> >> >> kernel exactly?  Or test the upstream kernel but in a similar but not
>> >> >> exactly production environment.
>> >> >
>> >> > I'm willing to give it a try, but it may take some time to achieve the
>> >> > desired results..
>> >>
>> >> Thanks!
>> >
>> > After I backported the series "mm: PCP high auto-tuning," which
>> > consists of a total of 9 patches, to our 6.1.y stable kernel and
>> > deployed it to our production envrionment, I observed a significant
>> > reduction in latency. The results are as follows:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 2        |                                        |
>> >       2048 -> 4095       : 11       |                                        |
>> >       4096 -> 8191       : 3        |                                        |
>> >       8192 -> 16383      : 1        |                                        |
>> >      16384 -> 32767      : 2        |                                        |
>> >      32768 -> 65535      : 7        |                                        |
>> >      65536 -> 131071     : 198      |*********                               |
>> >     131072 -> 262143     : 530      |************************                |
>> >     262144 -> 524287     : 824      |**************************************  |
>> >     524288 -> 1048575    : 852      |****************************************|
>> >    1048576 -> 2097151    : 714      |*********************************       |
>> >    2097152 -> 4194303    : 389      |******************                      |
>> >    4194304 -> 8388607    : 143      |******                                  |
>> >    8388608 -> 16777215   : 29       |*                                       |
>> >   16777216 -> 33554431   : 1        |                                        |
>> >
>> > avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
>> >
>> > Compared to the previous data, the maximum latency has been reduced to
>> > less than 30ms.
>>
>> That series can reduce the allocation/freeing from/to the buddy system,
>> thus reduce the lock contention.
>>
>> > Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
>> > to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
>> > vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
>> > latency was further reduced to less than 2ms.
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 36       |                                        |
>> >       2048 -> 4095       : 5063     |*****                                   |
>> >       4096 -> 8191       : 31226    |********************************        |
>> >       8192 -> 16383      : 37606    |*************************************** |
>> >      16384 -> 32767      : 38359    |****************************************|
>> >      32768 -> 65535      : 30652    |*******************************         |
>> >      65536 -> 131071     : 18714    |*******************                     |
>> >     131072 -> 262143     : 7968     |********                                |
>> >     262144 -> 524287     : 1996     |**                                      |
>> >     524288 -> 1048575    : 302      |                                        |
>> >    1048576 -> 2097151    : 19       |                                        |
>> >
>> > avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
>> >
>> > After multiple trials, I observed no significant differences between
>> > attempts.
>>
>> The test results look good.
>>
>> > Therefore, we decided to backport your improvements to our local
>> > kernel. Additionally, I propose introducing a new sysctl knob,
>> > vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
>> > to easily tune the setting based on their specific workloads.
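
For illustration only, a small sketch of what such a knob would bound, assuming
(as the name suggests) that it caps how many pages can move between the per-CPU
lists and the buddy free lists under a single zone->lock hold, roughly
batch << scale; the batch value of 64 below is made up for the example.

/*
 * Illustration only: show how the worst-case number of pages handled
 * under one zone->lock hold grows with the batch scale.
 */
#include <stdio.h>

int main(void)
{
	int batch = 64;		/* per-zone in reality; fixed here for the example */
	int scale;

	for (scale = 0; scale <= 5; scale++)
		printf("scale %d: up to %4d pages per zone->lock hold\n",
		       scale, batch << scale);
	return 0;
}

At scale 0 the worst case is a single batch, which is why the tail latency
above drops so sharply, at the cost of taking the lock more often to move the
same number of pages -- the downside discussed below.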
>>
>> The downside is that the pcp->high decaying (in decay_pcp_high()) will
>> be slower.  That is, it will take longer for idle pages to be freed from
>> PCP to buddy.  One possible solution is to keep the number of pages
>> decayed the same, but use a loop as follows to control latency.
>>
>> /* Free one batch per zone->lock hold; reschedule between batches. */
>> while (count < decay_number) {
>>         spin_lock();
>>         free_pcppages_bulk(, batch, );
>>         spin_unlock();
>>         count += batch;
>>         if (count < decay_number)
>>                 cond_resched();
>> }
>
> I will try it with this additional change.
> Thanks for your suggestion.
>
> IIUC, the additional change should be as follows?
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone,
> unsigned int order,
>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>         int high_min, to_drain, batch;
> -       int todo = 0;
> +       int todo = 0, count = 0;
>
>         high_min = READ_ONCE(pcp->high_min);
>         batch = READ_ONCE(pcp->batch);
> @@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct
> per_cpu_pages *pcp)
>          * control latency.  This caps pcp->high decrement too.
>          */
>         if (pcp->high > high_min) {
> -               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
> +               pcp->high = max3(pcp->count - (batch << 5),

Please avoid using magic numbers if possible.  Otherwise this looks good
to me.  Thanks!

>                                  pcp->high - (pcp->high >> 3), high_min);
>                 if (pcp->high > high_min)
>                         todo++;
>         }
>
>         to_drain = pcp->count - pcp->high;
> -       if (to_drain > 0) {
> +       while (count < to_drain) {
>                 spin_lock(&pcp->lock);
> -               free_pcppages_bulk(zone, to_drain, pcp, 0);
> +               free_pcppages_bulk(zone, batch, pcp, 0);
>                 spin_unlock(&pcp->lock);
> +               count += batch;
>                 todo++;
> +               cond_resched();
>         }

--
Best Regards,
Huang, Ying
Mel Gorman July 5, 2024, 1:09 p.m. UTC | #15
On Mon, Jul 01, 2024 at 07:51:43PM -0700, Andrew Morton wrote:
> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> 
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
> 
> Is this locking issue well understood? 

I cannot comment about others but I believe this problem to be
well-understood. The zone->lock is an incredibly large lock at this point
protecting an unbounded amount of data. As time goes by, it's just getting
worse and it was terrible even a few years ago, let alone now.

> Is anyone working on it? 

Not that I'm aware of, but given how little attention I've paid to linux-mm
in the last few years, that's not saying much.

The main problem is that it's hard to solve quickly, as splitting that
lock is possible but not trivial.  I am mildly concerned that more and
more people are looking for ways of getting around zone->lock contention
using the PCP allocator. I believe that to be a losing battle even though
I added THP to the PCP caching myself. Now we have dynamic resizing which
works ok but piling on top of it are file-backed THPs and THPs smaller than
MAX_ORDER, folios in general etc. Dealing with that within PCP has limits and
adding more sysctls to deal with corner cases is a band-aid that most users
probably will miss. Working around all the zone->lock issues in PCP just
delays the inevitable as PCP doesn't play well with overall availability
(e.g. high order pages free but on a remote CPU), fragmentation control
(frag fallback because free pages of the desired type are on a remote CPU) or scaling
(because ultimately it can still contend on zone->lock). IIUC, pcp lists
were originally about preserving cache hotness with zone->lock contention
reduction as a bonus, but now it's a band-aid trying to deal with
zone->lock covering massive amounts of memory.

Eventually the work will have to be put into splitting zone lock using
something akin to memory arenas and moving away from zone_id to identify
what range of free lists a particular page belongs to.

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..1f535d022cda 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -856,6 +856,10 @@  on per-cpu page lists. This entry only changes the value of hot per-cpu
 page lists. A user can specify a number like 100 to allocate 1/100th of
 each zone between per-cpu lists.
 
+The minimum number of pages that can be stored in per-CPU page lists is
+four times the batch value. By writing '-1' to this sysctl, you can set
+this minimum value.
+
 The batch value of each per-cpu page list remains the same regardless of
 the value of the high fraction so allocation latencies are unaffected.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..e7313f9d704b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5486,6 +5486,10 @@  static int zone_highsize(struct zone *zone, int batch, int cpu_online,
 	int nr_split_cpus;
 	unsigned long total_pages;
 
+	/* Setting -1 selects the minimum pagelist size: four times the batch size */
+	if (high_fraction == -1)
+		return batch << 2;
+
 	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
@@ -6192,7 +6196,8 @@  static int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
 
 	/* Sanity checking to avoid pcp imbalance */
 	if (percpu_pagelist_high_fraction &&
-	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
+	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION &&
+	    percpu_pagelist_high_fraction != -1) {
 		percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction;
 		ret = -EINVAL;
 		goto out;
@@ -6241,7 +6246,6 @@  static struct ctl_table page_alloc_sysctl_table[] = {
 		.maxlen		= sizeof(percpu_pagelist_high_fraction),
 		.mode		= 0644,
 		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
-		.extra1		= SYSCTL_ZERO,
 	},
 	{
 		.procname	= "lowmem_reserve_ratio",
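
A possible usage sketch, assuming the patch above is applied (without it the
write is rejected by the SYSCTL_ZERO lower bound); it writes -1 to the sysctl,
reads the value back, and needs root:

/* Needs root; this is the standard /proc/sys path of vm.percpu_pagelist_high_fraction. */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/vm/percpu_pagelist_high_fraction";
	char buf[32];
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* -1 selects the minimum pagelist size, i.e. four times the batch. */
	fprintf(f, "-1\n");
	fclose(f);

	/* Read the value back to confirm the write was accepted. */
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("percpu_pagelist_high_fraction = %s", buf);
	if (f)
		fclose(f);
	return 0;
}

The equivalent from the shell would be
sysctl -w vm.percpu_pagelist_high_fraction=-1.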