[0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

Message ID: 20240707094956.94654-1-laoar.shao@gmail.com

Message

Yafang Shao July 7, 2024, 9:49 a.m. UTC
Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
processes are organized as separate processes rather than threads due
to the Python Global Interpreter Lock (GIL) being a bottleneck in a
multi-threaded setup. Upon the exit of these containers, other
containers hosted on the same machine experience significant latency
spikes.

Investigation
=============

My investigation with perf tracing revealed that the root cause of these
spikes is the exiting processes all running exit_mmap() at the same time.
The concurrent page freeing contends on zone->lock, which becomes a hotspot
and degrades performance. The perf profile below clearly shows this
contention as a primary contributor to the observed latency.

+   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
-   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
   - 76.97% exit_mmap
      - 58.58% unmap_vmas
         - 58.55% unmap_single_vma
            - unmap_page_range
               - 58.32% zap_pte_range
                  - 42.88% tlb_flush_mmu
                     - 42.76% free_pages_and_swap_cache
                        - 41.22% release_pages
                           - 33.29% free_unref_page_list
                              - 32.37% free_unref_page_commit
                                 - 31.64% free_pcppages_bulk
                                    + 28.65% _raw_spin_lock
                                      1.28% __list_del_entry_valid
                           + 3.25% folio_lruvec_lock_irqsave
                           + 0.75% __mem_cgroup_uncharge_list
                             0.60% __mod_lruvec_state
                          1.07% free_swap_cache
                  + 11.69% page_remove_rmap
                    0.64% __mod_lruvec_page_state
      - 17.34% remove_vma
         - 17.25% vm_area_free
            - 17.23% kmem_cache_free
               - 17.15% __slab_free
                  - 14.56% discard_slab
                       free_slab
                       __free_slab
                       __free_pages
                     - free_unref_page
                        - 13.50% free_unref_page_commit
                           - free_pcppages_bulk
                              + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint, we can identify the pages
being drained; the majority of them are regular order-0 user pages.

          <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
           <...>-1540432 [224] d..3. 618048.023887: <stack trace>
 => free_pcppages_bulk
 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_flush_mmu
 => zap_pte_range
 => unmap_page_range
 => unmap_single_vma
 => unmap_vmas
 => exit_mmap
 => mmput
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => exit_to_user_mode_prepare
 => syscall_exit_to_user_mode
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues have 256 CPUs and 1TB of memory
within a single NUMA node. The zoneinfo is as follows:

Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63

...

The pcp high (6392) is roughly 100 times the batch size (63).

I also traced the latency associated with the free_pcppages_bulk()
function during the container exit process:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 148      |*****************                       |
       512 -> 1023       : 334      |****************************************|
      1024 -> 2047       : 33       |***                                     |
      2048 -> 4095       : 5        |                                        |
      4096 -> 8191       : 7        |                                        |
      8192 -> 16383      : 12       |*                                       |
     16384 -> 32767      : 30       |***                                     |
     32768 -> 65535      : 21       |**                                      |
     65536 -> 131071     : 15       |*                                       |
    131072 -> 262143     : 27       |***                                     |
    262144 -> 524287     : 84       |**********                              |
    524288 -> 1048575    : 203      |************************                |
   1048576 -> 2097151    : 284      |**********************************      |
   2097152 -> 4194303    : 327      |*************************************** |
   4194304 -> 8388607    : 215      |*************************               |
   8388608 -> 16777215   : 116      |*************                           |
  16777216 -> 33554431   : 47       |*****                                   |
  33554432 -> 67108863   : 8        |                                        |
  67108864 -> 134217727  : 3        |                                        |

The latency can reach tens of milliseconds.
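
As a rough sanity check against the zoneinfo above (my own back-of-the-
envelope arithmetic, not part of the trace data):

    pcp high per CPU:   6392 pages * 4KB ~= 25MB
    observed pcp count: 2217 / 4510 / 3059 pages on the sampled CPUs

A single bulk drain can therefore hand thousands of order-0 pages back to
the buddy allocator under zone->lock, while up to 17 sibling processes
exiting on other CPUs contend for the same lock, which is consistent with
the tens-of-milliseconds tail above.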

Experimenting
=============

vm.percpu_pagelist_high_fraction
--------------------------------

The kernel currently deployed in our production environment is stable
6.1.y, so my initial strategy was to tune the
vm.percpu_pagelist_high_fraction parameter. Increasing
vm.percpu_pagelist_high_fraction reduces the batch size used when draining
pages, which in turn substantially reduces latency. After setting the
sysctl value to 0x7fffffff, I observed a notable improvement in latency.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 120      |                                        |
       256 -> 511        : 365      |*                                       |
       512 -> 1023       : 201      |                                        |
      1024 -> 2047       : 103      |                                        |
      2048 -> 4095       : 84       |                                        |
      4096 -> 8191       : 87       |                                        |
      8192 -> 16383      : 4777     |**************                          |
     16384 -> 32767      : 10572    |*******************************         |
     32768 -> 65535      : 13544    |****************************************|
     65536 -> 131071     : 12723    |*************************************   |
    131072 -> 262143     : 8604     |*************************               |
    262144 -> 524287     : 3659     |**********                              |
    524288 -> 1048575    : 921      |**                                      |
   1048576 -> 2097151    : 122      |                                        |
   2097152 -> 4194303    : 5        |                                        |

However, increasing vm.percpu_pagelist_high_fraction also shrinks the pcp
high watermark, down to a minimum of four times the batch size. While this
could in theory hurt throughput, as highlighted by Ying[0], we have not
observed any significant throughput difference in our production
environment after making this change.
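
For context, the way the fraction translates into a per-CPU high watermark
is roughly the following. This is a simplified sketch of the zone_highsize()
logic in mm/page_alloc.c, written from the behaviour described above rather
than copied from the upstream code:

static int zone_highsize_sketch(struct zone *zone, int batch, int high_fraction)
{
	unsigned long total_pages;
	unsigned int nr_local_cpus;
	int high;

	if (high_fraction)
		/* vm.percpu_pagelist_high_fraction: explicit share of the zone. */
		total_pages = zone_managed_pages(zone) / high_fraction;
	else
		/* Default: derived from the low watermark (auto-tuned). */
		total_pages = low_wmark_pages(zone);

	/* Split the budget across the CPUs local to this zone's node. */
	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
	high = total_pages / nr_local_cpus;

	/* pcp->high never drops below four batches. */
	return max(high, 4 * batch);
}

With vm.percpu_pagelist_high_fraction set to 0x7fffffff the division drives
total_pages towards zero, so the 4 * batch floor is all that remains, which
is exactly the trade-off discussed above.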

Backporting the series "mm: PCP high auto-tuning"
-------------------------------------------------

My second attempt was to backport the series "mm: PCP high auto-tuning"[1],
which comprises nine patches, into our 6.1.y stable kernel. After deploying
it in our production environment, I observed a pronounced reduction in
latency. The results are shown below:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 11       |                                        |
      4096 -> 8191       : 3        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 2        |                                        |
     32768 -> 65535      : 7        |                                        |
     65536 -> 131071     : 198      |*********                               |
    131072 -> 262143     : 530      |************************                |
    262144 -> 524287     : 824      |**************************************  |
    524288 -> 1048575    : 852      |****************************************|
   1048576 -> 2097151    : 714      |*********************************       |
   2097152 -> 4194303    : 389      |******************                      |
   4194304 -> 8388607    : 143      |******                                  |
   8388608 -> 16777215   : 29       |*                                       |
  16777216 -> 33554431   : 1        |                                        |

Compared with the previous data, the maximum latency has been reduced to
roughly 30ms.

Adjusting the CONFIG_PCP_BATCH_SCALE_MAX
----------------------------------------

Following Ying's suggestion, lowering CONFIG_PCP_BATCH_SCALE_MAX can reduce
the PCP free batch size without shrinking the PCP high watermark, which
should mitigate the latency spikes without adversely affecting throughput.
My third attempt therefore focused on this configuration option.
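
The effect of this knob can be summarized as bounding the number of pages
freed per zone->lock acquisition. The sketch below is a simplified
illustration of that bound, not the actual nr_pcp_free() /
free_unref_page_commit() code, which also factors in pcp->free_count and
the high watermark:

static int nr_pcp_free_sketch(struct per_cpu_pages *pcp, int batch,
			      int batch_scale_max)
{
	/* Upper bound on pages returned to the buddy allocator per lock hold. */
	int max_free = batch << batch_scale_max;

	/*
	 * With the values observed on this machine:
	 *   batch_scale_max = 5 (default): 63 << 5 = 2016 pages per hold
	 *   batch_scale_max = 0:           63 pages per hold
	 */
	return min(pcp->count, max_free);
}

free_pcppages_bulk() then walks that many pages while holding zone->lock,
so lowering the scale factor directly shortens the worst-case lock hold
time, at the cost of taking the lock more often.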

To make this easier to adjust, I replaced CONFIG_PCP_BATCH_SCALE_MAX with a
new sysctl knob named vm.pcp_batch_scale_max. Tuning vm.pcp_batch_scale_max
down from its default value of 5 to 0 further reduced the maximum latency,
to about 2ms:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 5063     |*****                                   |
      4096 -> 8191       : 31226    |********************************        |
      8192 -> 16383      : 37606    |*************************************** |
     16384 -> 32767      : 38359    |****************************************|
     32768 -> 65535      : 30652    |*******************************         |
     65536 -> 131071     : 18714    |*******************                     |
    131072 -> 262143     : 7968     |********                                |
    262144 -> 524287     : 1996     |**                                      |
    524288 -> 1048575    : 302      |                                        |
   1048576 -> 2097151    : 19       |                                        |

After multiple trials, I observed no significant variation between runs.

The Proposal
============

This series contains two minor refinements to the PCP high watermark
auto-tuning mechanism, along with a new sysctl knob that serves as a more
practical alternative to the previous Kconfig-based configuration.
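
For illustration, a minimal sketch of how such a vm. sysctl knob is
typically wired up is shown below. This is not the actual patch (patch 3/3
in this series is authoritative); the 0..6 bounds are assumed to match the
range of the old Kconfig option:

#include <linux/init.h>
#include <linux/sysctl.h>

static int pcp_batch_scale_max = 5;		/* old CONFIG_PCP_BATCH_SCALE_MAX default */
static int pcp_batch_scale_min_bound;		/* 0, assumed lower bound */
static int pcp_batch_scale_max_bound = 6;	/* assumed upper bound */

static struct ctl_table pcp_batch_scale_sysctl[] = {
	{
		.procname	= "pcp_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &pcp_batch_scale_min_bound,
		.extra2		= &pcp_batch_scale_max_bound,
	},
	{ }
};

static int __init pcp_batch_scale_sysctl_init(void)
{
	/* Exposed as /proc/sys/vm/pcp_batch_scale_max. */
	register_sysctl_init("vm", pcp_batch_scale_sysctl);
	return 0;
}
late_initcall(pcp_batch_scale_sysctl_init);

An administrator can then lower the batch scale at runtime, e.g. with
"sysctl -w vm.pcp_batch_scale_max=0", instead of rebuilding the kernel to
change CONFIG_PCP_BATCH_SCALE_MAX.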

Future improvement to zone->lock
================================

To fully mitigate the zone->lock contention, several approaches have been
suggested. One is to split large zones into multiple smaller zones, as
suggested by Matthew[2]; another is to split the zone->lock using a
mechanism similar to memory arenas and to move away from relying solely on
zone_id to identify the range of free lists a particular page belongs
to[3]. However, these solutions will likely require a more extended
development effort.

Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]

Changes:
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
 include/linux/sysctl.h                  |  1 +
 kernel/sysctl.c                         |  2 +-
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 38 ++++++++++++++++++-------
 5 files changed, 45 insertions(+), 22 deletions(-)

Comments

Huang, Ying July 10, 2024, 3 a.m. UTC | #1
Yafang Shao <laoar.shao@gmail.com> writes:

> Background
> ==========
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Investigation
> =============
>
> My investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
> +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>    - 76.97% exit_mmap
>       - 58.58% unmap_vmas
>          - 58.55% unmap_single_vma
>             - unmap_page_range
>                - 58.32% zap_pte_range
>                   - 42.88% tlb_flush_mmu
>                      - 42.76% free_pages_and_swap_cache
>                         - 41.22% release_pages
>                            - 33.29% free_unref_page_list
>                               - 32.37% free_unref_page_commit
>                                  - 31.64% free_pcppages_bulk
>                                     + 28.65% _raw_spin_lock
>                                       1.28% __list_del_entry_valid
>                            + 3.25% folio_lruvec_lock_irqsave
>                            + 0.75% __mem_cgroup_uncharge_list
>                              0.60% __mod_lruvec_state
>                           1.07% free_swap_cache
>                   + 11.69% page_remove_rmap
>                     0.64% __mod_lruvec_page_state
>       - 17.34% remove_vma
>          - 17.25% vm_area_free
>             - 17.23% kmem_cache_free
>                - 17.15% __slab_free
>                   - 14.56% discard_slab
>                        free_slab
>                        __free_slab
>                        __free_pages
>                      - free_unref_page
>                         - 13.50% free_unref_page_commit
>                            - free_pcppages_bulk
>                               + 13.44% _raw_spin_lock

I don't think your change will reduce zone->lock contention cycles.  So,
I don't find the value of the above data.

> By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> with the majority of them being regular order-0 user pages.
>
>           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> e=1
>            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>  => free_pcppages_bulk
>  => free_unref_page_commit
>  => free_unref_page_list
>  => release_pages
>  => free_pages_and_swap_cache
>  => tlb_flush_mmu
>  => zap_pte_range
>  => unmap_page_range
>  => unmap_single_vma
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => get_signal
>  => arch_do_signal_or_restart
>  => exit_to_user_mode_prepare
>  => syscall_exit_to_user_mode
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
>
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The pcp high is around 100 times the batch size.
>
> I also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |*************************************** |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> The latency can reach tens of milliseconds.
>
> Experimenting
> =============
>
> vm.percpu_pagelist_high_fraction
> --------------------------------
>
> The kernel version currently deployed in our production environment is the
> stable 6.1.y, and my initial strategy involves optimizing the

IMHO, we should focus on upstream activity in the cover letter and patch
description.  And I don't think that it's necessary to describe the
alternative solution with too much details.

> vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> page draining, which subsequently leads to a substantial reduction in
> latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> improvement in latency.
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> pcp high watermark size to a minimum of four times the batch size. While
> this could theoretically affect throughput, as highlighted by Ying[0], we
> have yet to observe any significant difference in throughput within our
> production environment after implementing this change.
>
> Backporting the series "mm: PCP high auto-tuning"
> -------------------------------------------------

Again, not upstream activity.  We can describe the upstream behavior
directly.

> My second endeavor was to backport the series titled
> "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> into our 6.1.y stable kernel version. Subsequent to its deployment in our
> production environment, I noted a pronounced reduction in latency. The
> observed outcomes are as enumerated below:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 11       |                                        |
>       4096 -> 8191       : 3        |                                        |
>       8192 -> 16383      : 1        |                                        |
>      16384 -> 32767      : 2        |                                        |
>      32768 -> 65535      : 7        |                                        |
>      65536 -> 131071     : 198      |*********                               |
>     131072 -> 262143     : 530      |************************                |
>     262144 -> 524287     : 824      |**************************************  |
>     524288 -> 1048575    : 852      |****************************************|
>    1048576 -> 2097151    : 714      |*********************************       |
>    2097152 -> 4194303    : 389      |******************                      |
>    4194304 -> 8388607    : 143      |******                                  |
>    8388608 -> 16777215   : 29       |*                                       |
>   16777216 -> 33554431   : 1        |                                        |
>
> Compared to the previous data, the maximum latency has been reduced to
> less than 30ms.

People don't care too much about page freeing latency during processes
exiting.  Instead, they care more about the process exiting time, that
is, throughput.  So, it's better to show the page allocation latency
which is affected by the simultaneous processes exiting.

> Adjusting the CONFIG_PCP_BATCH_SCALE_MAX
> ----------------------------------------
>
> Upon Ying's suggestion, adjusting the CONFIG_PCP_BATCH_SCALE_MAX can
> potentially reduce the PCP batch size without compromising the PCP high
> watermark size. This approach could mitigate latency spikes without
> adversely affecting throughput. Consequently, my third attempt focused on
> modifying this configuration.
>
> To facilitate easier adjustments, I replaced CONFIG_PCP_BATCH_SCALE_MAX
> with a new sysctl knob named vm.pcp_batch_scale_max. By fine-tuning
> vm.pcp_batch_scale_max from its default value of 5 down to 0, I achieved a
> further reduction in the maximum latency, which was lowered to less than
> 2ms:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 36       |                                        |
>       2048 -> 4095       : 5063     |*****                                   |
>       4096 -> 8191       : 31226    |********************************        |
>       8192 -> 16383      : 37606    |*************************************** |
>      16384 -> 32767      : 38359    |****************************************|
>      32768 -> 65535      : 30652    |*******************************         |
>      65536 -> 131071     : 18714    |*******************                     |
>     131072 -> 262143     : 7968     |********                                |
>     262144 -> 524287     : 1996     |**                                      |
>     524288 -> 1048575    : 302      |                                        |
>    1048576 -> 2097151    : 19       |                                        |
>
> After multiple trials, I observed no significant differences between
> each attempt.
>
> The Proposal
> ============
>
> This series encompasses two minor refinements to the PCP high watermark
> auto-tuning mechanism, along with the introduction of a new sysctl knob
> that serves as a more practical alternative to the previous configuration
> method.
>
> Future improvement to zone->lock
> ================================
>
> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multi
> smaller zones, as suggested by Matthew[2], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[3]. However, implementing these solutions is
> likely to necessitate a more extended development effort.
>
> Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
> Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
>
> Changes:
> - mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
>   https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/
>
> Yafang Shao (3):
>   mm/page_alloc: A minor fix to the calculation of pcp->free_count
>   mm/page_alloc: Avoid changing pcp->high decaying when adjusting
>     CONFIG_PCP_BATCH_SCALE_MAX
>   mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
>
>  Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -------
>  mm/page_alloc.c                         | 38 ++++++++++++++++++-------
>  5 files changed, 45 insertions(+), 22 deletions(-)

--
Best Regards,
Huang, Ying
Yafang Shao July 11, 2024, 2:25 a.m. UTC | #2
On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > Background
> > ==========
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Investigation
> > =============
> >
> > My investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >    - 76.97% exit_mmap
> >       - 58.58% unmap_vmas
> >          - 58.55% unmap_single_vma
> >             - unmap_page_range
> >                - 58.32% zap_pte_range
> >                   - 42.88% tlb_flush_mmu
> >                      - 42.76% free_pages_and_swap_cache
> >                         - 41.22% release_pages
> >                            - 33.29% free_unref_page_list
> >                               - 32.37% free_unref_page_commit
> >                                  - 31.64% free_pcppages_bulk
> >                                     + 28.65% _raw_spin_lock
> >                                       1.28% __list_del_entry_valid
> >                            + 3.25% folio_lruvec_lock_irqsave
> >                            + 0.75% __mem_cgroup_uncharge_list
> >                              0.60% __mod_lruvec_state
> >                           1.07% free_swap_cache
> >                   + 11.69% page_remove_rmap
> >                     0.64% __mod_lruvec_page_state
> >       - 17.34% remove_vma
> >          - 17.25% vm_area_free
> >             - 17.23% kmem_cache_free
> >                - 17.15% __slab_free
> >                   - 14.56% discard_slab
> >                        free_slab
> >                        __free_slab
> >                        __free_pages
> >                      - free_unref_page
> >                         - 13.50% free_unref_page_commit
> >                            - free_pcppages_bulk
> >                               + 13.44% _raw_spin_lock
>
> I don't think your change will reduce zone->lock contention cycles.  So,
> I don't find the value of the above data.
>
> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> > with the majority of them being regular order-0 user pages.
> >
> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> > e=1
> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >  => free_pcppages_bulk
> >  => free_unref_page_commit
> >  => free_unref_page_list
> >  => release_pages
> >  => free_pages_and_swap_cache
> >  => tlb_flush_mmu
> >  => zap_pte_range
> >  => unmap_page_range
> >  => unmap_single_vma
> >  => unmap_vmas
> >  => exit_mmap
> >  => mmput
> >  => do_exit
> >  => do_group_exit
> >  => get_signal
> >  => arch_do_signal_or_restart
> >  => exit_to_user_mode_prepare
> >  => syscall_exit_to_user_mode
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues are equipped with impressive
> > hardware specifications, including 256 CPUs and 1TB of memory, all
> > within a single NUMA node. The zoneinfo is as follows,
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The pcp high is around 100 times the batch size.
> >
> > I also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > The latency can reach tens of milliseconds.
> >
> > Experimenting
> > =============
> >
> > vm.percpu_pagelist_high_fraction
> > --------------------------------
> >
> > The kernel version currently deployed in our production environment is the
> > stable 6.1.y, and my initial strategy involves optimizing the
>
> IMHO, we should focus on upstream activity in the cover letter and patch
> description.  And I don't think that it's necessary to describe the
> alternative solution with too much details.
>
> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> > page draining, which subsequently leads to a substantial reduction in
> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> > improvement in latency.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> > pcp high watermark size to a minimum of four times the batch size. While
> > this could theoretically affect throughput, as highlighted by Ying[0], we
> > have yet to observe any significant difference in throughput within our
> > production environment after implementing this change.
> >
> > Backporting the series "mm: PCP high auto-tuning"
> > -------------------------------------------------
>
> Again, not upstream activity.  We can describe the upstream behavior
> directly.

Andrew has requested that I provide a more comprehensive analysis of
this issue, and in response, I have endeavored to outline all the
pertinent details in a thorough and detailed manner.

>
> > My second endeavor was to backport the series titled
> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> > production environment, I noted a pronounced reduction in latency. The
> > observed outcomes are as enumerated below:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> People don't care too much about page freeing latency during processes
> exiting.  Instead, they care more about the process exiting time, that
> is, throughput.  So, it's better to show the page allocation latency
> which is affected by the simultaneous processes exiting.

I'm confused also. Is this issue really hard to understand?
Huang, Ying July 11, 2024, 6:38 a.m. UTC | #3
Yafang Shao <laoar.shao@gmail.com> writes:

> On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > Background
>> > ==========
>> >
>> > In our containerized environment, we have a specific type of container
>> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> > processes are organized as separate processes rather than threads due
>> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> > multi-threaded setup. Upon the exit of these containers, other
>> > containers hosted on the same machine experience significant latency
>> > spikes.
>> >
>> > Investigation
>> > =============
>> >
>> > My investigation using perf tracing revealed that the root cause of
>> > these spikes is the simultaneous execution of exit_mmap() by each of
>> > the exiting processes. This concurrent access to the zone->lock
>> > results in contention, which becomes a hotspot and negatively impacts
>> > performance. The perf results clearly indicate this contention as a
>> > primary contributor to the observed latency issues.
>> >
>> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >    - 76.97% exit_mmap
>> >       - 58.58% unmap_vmas
>> >          - 58.55% unmap_single_vma
>> >             - unmap_page_range
>> >                - 58.32% zap_pte_range
>> >                   - 42.88% tlb_flush_mmu
>> >                      - 42.76% free_pages_and_swap_cache
>> >                         - 41.22% release_pages
>> >                            - 33.29% free_unref_page_list
>> >                               - 32.37% free_unref_page_commit
>> >                                  - 31.64% free_pcppages_bulk
>> >                                     + 28.65% _raw_spin_lock
>> >                                       1.28% __list_del_entry_valid
>> >                            + 3.25% folio_lruvec_lock_irqsave
>> >                            + 0.75% __mem_cgroup_uncharge_list
>> >                              0.60% __mod_lruvec_state
>> >                           1.07% free_swap_cache
>> >                   + 11.69% page_remove_rmap
>> >                     0.64% __mod_lruvec_page_state
>> >       - 17.34% remove_vma
>> >          - 17.25% vm_area_free
>> >             - 17.23% kmem_cache_free
>> >                - 17.15% __slab_free
>> >                   - 14.56% discard_slab
>> >                        free_slab
>> >                        __free_slab
>> >                        __free_pages
>> >                      - free_unref_page
>> >                         - 13.50% free_unref_page_commit
>> >                            - free_pcppages_bulk
>> >                               + 13.44% _raw_spin_lock
>>
>> I don't think your change will reduce zone->lock contention cycles.  So,
>> I don't find the value of the above data.
>>
>> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> > with the majority of them being regular order-0 user pages.
>> >
>> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> > e=1
>> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >  => free_pcppages_bulk
>> >  => free_unref_page_commit
>> >  => free_unref_page_list
>> >  => release_pages
>> >  => free_pages_and_swap_cache
>> >  => tlb_flush_mmu
>> >  => zap_pte_range
>> >  => unmap_page_range
>> >  => unmap_single_vma
>> >  => unmap_vmas
>> >  => exit_mmap
>> >  => mmput
>> >  => do_exit
>> >  => do_group_exit
>> >  => get_signal
>> >  => arch_do_signal_or_restart
>> >  => exit_to_user_mode_prepare
>> >  => syscall_exit_to_user_mode
>> >  => do_syscall_64
>> >  => entry_SYSCALL_64_after_hwframe
>> >
>> > The servers experiencing these issues are equipped with impressive
>> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> > within a single NUMA node. The zoneinfo is as follows,
>> >
>> > Node 0, zone   Normal
>> >   pages free     144465775
>> >         boost    0
>> >         min      1309270
>> >         low      1636587
>> >         high     1963904
>> >         spanned  564133888
>> >         present  296747008
>> >         managed  291974346
>> >         cma      0
>> >         protection: (0, 0, 0, 0)
>> > ...
>> >   pagesets
>> >     cpu: 0
>> >               count: 2217
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 1
>> >               count: 4510
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 2
>> >               count: 3059
>> >               high:  6392
>> >               batch: 63
>> >
>> > ...
>> >
>> > The pcp high is around 100 times the batch size.
>> >
>> > I also traced the latency associated with the free_pcppages_bulk()
>> > function during the container exit process:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 148      |*****************                       |
>> >        512 -> 1023       : 334      |****************************************|
>> >       1024 -> 2047       : 33       |***                                     |
>> >       2048 -> 4095       : 5        |                                        |
>> >       4096 -> 8191       : 7        |                                        |
>> >       8192 -> 16383      : 12       |*                                       |
>> >      16384 -> 32767      : 30       |***                                     |
>> >      32768 -> 65535      : 21       |**                                      |
>> >      65536 -> 131071     : 15       |*                                       |
>> >     131072 -> 262143     : 27       |***                                     |
>> >     262144 -> 524287     : 84       |**********                              |
>> >     524288 -> 1048575    : 203      |************************                |
>> >    1048576 -> 2097151    : 284      |**********************************      |
>> >    2097152 -> 4194303    : 327      |*************************************** |
>> >    4194304 -> 8388607    : 215      |*************************               |
>> >    8388608 -> 16777215   : 116      |*************                           |
>> >   16777216 -> 33554431   : 47       |*****                                   |
>> >   33554432 -> 67108863   : 8        |                                        |
>> >   67108864 -> 134217727  : 3        |                                        |
>> >
>> > The latency can reach tens of milliseconds.
>> >
>> > Experimenting
>> > =============
>> >
>> > vm.percpu_pagelist_high_fraction
>> > --------------------------------
>> >
>> > The kernel version currently deployed in our production environment is the
>> > stable 6.1.y, and my initial strategy involves optimizing the
>>
>> IMHO, we should focus on upstream activity in the cover letter and patch
>> description.  And I don't think that it's necessary to describe the
>> alternative solution with too much details.
>>
>> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> > page draining, which subsequently leads to a substantial reduction in
>> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> > improvement in latency.
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 120      |                                        |
>> >        256 -> 511        : 365      |*                                       |
>> >        512 -> 1023       : 201      |                                        |
>> >       1024 -> 2047       : 103      |                                        |
>> >       2048 -> 4095       : 84       |                                        |
>> >       4096 -> 8191       : 87       |                                        |
>> >       8192 -> 16383      : 4777     |**************                          |
>> >      16384 -> 32767      : 10572    |*******************************         |
>> >      32768 -> 65535      : 13544    |****************************************|
>> >      65536 -> 131071     : 12723    |*************************************   |
>> >     131072 -> 262143     : 8604     |*************************               |
>> >     262144 -> 524287     : 3659     |**********                              |
>> >     524288 -> 1048575    : 921      |**                                      |
>> >    1048576 -> 2097151    : 122      |                                        |
>> >    2097152 -> 4194303    : 5        |                                        |
>> >
>> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> > pcp high watermark size to a minimum of four times the batch size. While
>> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> > have yet to observe any significant difference in throughput within our
>> > production environment after implementing this change.
>> >
>> > Backporting the series "mm: PCP high auto-tuning"
>> > -------------------------------------------------
>>
>> Again, not upstream activity.  We can describe the upstream behavior
>> directly.
>
> Andrew has requested that I provide a more comprehensive analysis of
> this issue, and in response, I have endeavored to outline all the
> pertinent details in a thorough and detailed manner.

IMHO, the upstream activity can provide a comprehensive analysis of the
issue too.  And your patch has changed a lot since the first version.
It's better to describe your current version.

>>
>> > My second endeavor was to backport the series titled
>> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> > production environment, I noted a pronounced reduction in latency. The
>> > observed outcomes are as enumerated below:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 2        |                                        |
>> >       2048 -> 4095       : 11       |                                        |
>> >       4096 -> 8191       : 3        |                                        |
>> >       8192 -> 16383      : 1        |                                        |
>> >      16384 -> 32767      : 2        |                                        |
>> >      32768 -> 65535      : 7        |                                        |
>> >      65536 -> 131071     : 198      |*********                               |
>> >     131072 -> 262143     : 530      |************************                |
>> >     262144 -> 524287     : 824      |**************************************  |
>> >     524288 -> 1048575    : 852      |****************************************|
>> >    1048576 -> 2097151    : 714      |*********************************       |
>> >    2097152 -> 4194303    : 389      |******************                      |
>> >    4194304 -> 8388607    : 143      |******                                  |
>> >    8388608 -> 16777215   : 29       |*                                       |
>> >   16777216 -> 33554431   : 1        |                                        |
>> >
>> > Compared to the previous data, the maximum latency has been reduced to
>> > less than 30ms.
>>
>> People don't care too much about page freeing latency during processes
>> exiting.  Instead, they care more about the process exiting time, that
>> is, throughput.  So, it's better to show the page allocation latency
>> which is affected by the simultaneous processes exiting.
>
> I'm confused also. Is this issue really hard to understand ?

IMHO, it's better to prove the issue directly.  If you cannot prove it
directly, you can try an alternative approach and describe why.

--
Best Regards,
Huang, Ying
Yafang Shao July 11, 2024, 7:21 a.m. UTC | #4
On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > Background
> >> > ==========
> >> >
> >> > In our containerized environment, we have a specific type of container
> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> > processes are organized as separate processes rather than threads due
> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> > multi-threaded setup. Upon the exit of these containers, other
> >> > containers hosted on the same machine experience significant latency
> >> > spikes.
> >> >
> >> > Investigation
> >> > =============
> >> >
> >> > My investigation using perf tracing revealed that the root cause of
> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> > the exiting processes. This concurrent access to the zone->lock
> >> > results in contention, which becomes a hotspot and negatively impacts
> >> > performance. The perf results clearly indicate this contention as a
> >> > primary contributor to the observed latency issues.
> >> >
> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >    - 76.97% exit_mmap
> >> >       - 58.58% unmap_vmas
> >> >          - 58.55% unmap_single_vma
> >> >             - unmap_page_range
> >> >                - 58.32% zap_pte_range
> >> >                   - 42.88% tlb_flush_mmu
> >> >                      - 42.76% free_pages_and_swap_cache
> >> >                         - 41.22% release_pages
> >> >                            - 33.29% free_unref_page_list
> >> >                               - 32.37% free_unref_page_commit
> >> >                                  - 31.64% free_pcppages_bulk
> >> >                                     + 28.65% _raw_spin_lock
> >> >                                       1.28% __list_del_entry_valid
> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >                              0.60% __mod_lruvec_state
> >> >                           1.07% free_swap_cache
> >> >                   + 11.69% page_remove_rmap
> >> >                     0.64% __mod_lruvec_page_state
> >> >       - 17.34% remove_vma
> >> >          - 17.25% vm_area_free
> >> >             - 17.23% kmem_cache_free
> >> >                - 17.15% __slab_free
> >> >                   - 14.56% discard_slab
> >> >                        free_slab
> >> >                        __free_slab
> >> >                        __free_pages
> >> >                      - free_unref_page
> >> >                         - 13.50% free_unref_page_commit
> >> >                            - free_pcppages_bulk
> >> >                               + 13.44% _raw_spin_lock
> >>
> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> I don't find the value of the above data.
> >>
> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> > with the majority of them being regular order-0 user pages.
> >> >
> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> > e=1
> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >  => free_pcppages_bulk
> >> >  => free_unref_page_commit
> >> >  => free_unref_page_list
> >> >  => release_pages
> >> >  => free_pages_and_swap_cache
> >> >  => tlb_flush_mmu
> >> >  => zap_pte_range
> >> >  => unmap_page_range
> >> >  => unmap_single_vma
> >> >  => unmap_vmas
> >> >  => exit_mmap
> >> >  => mmput
> >> >  => do_exit
> >> >  => do_group_exit
> >> >  => get_signal
> >> >  => arch_do_signal_or_restart
> >> >  => exit_to_user_mode_prepare
> >> >  => syscall_exit_to_user_mode
> >> >  => do_syscall_64
> >> >  => entry_SYSCALL_64_after_hwframe
> >> >
> >> > The servers experiencing these issues are equipped with impressive
> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >
> >> > Node 0, zone   Normal
> >> >   pages free     144465775
> >> >         boost    0
> >> >         min      1309270
> >> >         low      1636587
> >> >         high     1963904
> >> >         spanned  564133888
> >> >         present  296747008
> >> >         managed  291974346
> >> >         cma      0
> >> >         protection: (0, 0, 0, 0)
> >> > ...
> >> >   pagesets
> >> >     cpu: 0
> >> >               count: 2217
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 1
> >> >               count: 4510
> >> >               high:  6392
> >> >               batch: 63
> >> >   vm stats threshold: 125
> >> >     cpu: 2
> >> >               count: 3059
> >> >               high:  6392
> >> >               batch: 63
> >> >
> >> > ...
> >> >
> >> > The pcp high is around 100 times the batch size.
> >> >
> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> > function during the container exit process:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 148      |*****************                       |
> >> >        512 -> 1023       : 334      |****************************************|
> >> >       1024 -> 2047       : 33       |***                                     |
> >> >       2048 -> 4095       : 5        |                                        |
> >> >       4096 -> 8191       : 7        |                                        |
> >> >       8192 -> 16383      : 12       |*                                       |
> >> >      16384 -> 32767      : 30       |***                                     |
> >> >      32768 -> 65535      : 21       |**                                      |
> >> >      65536 -> 131071     : 15       |*                                       |
> >> >     131072 -> 262143     : 27       |***                                     |
> >> >     262144 -> 524287     : 84       |**********                              |
> >> >     524288 -> 1048575    : 203      |************************                |
> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >
> >> > The latency can reach tens of milliseconds.
> >> >
> >> > Experimenting
> >> > =============
> >> >
> >> > vm.percpu_pagelist_high_fraction
> >> > --------------------------------
> >> >
> >> > The kernel version currently deployed in our production environment is the
> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >>
> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> description.  And I don't think that it's necessary to describe the
> >> alternative solution with too much details.
> >>
> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> > page draining, which subsequently leads to a substantial reduction in
> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> > improvement in latency.
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 120      |                                        |
> >> >        256 -> 511        : 365      |*                                       |
> >> >        512 -> 1023       : 201      |                                        |
> >> >       1024 -> 2047       : 103      |                                        |
> >> >       2048 -> 4095       : 84       |                                        |
> >> >       4096 -> 8191       : 87       |                                        |
> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >
> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> > have yet to observe any significant difference in throughput within our
> >> > production environment after implementing this change.
> >> >
> >> > Backporting the series "mm: PCP high auto-tuning"
> >> > -------------------------------------------------
> >>
> >> Again, not upstream activity.  We can describe the upstream behavior
> >> directly.
> >
> > Andrew has requested that I provide a more comprehensive analysis of
> > this issue, and in response, I have endeavored to outline all the
> > pertinent details in a thorough and detailed manner.
>
> IMHO, upstream activity can provide comprehensive analysis of the issue
> too.  And, your patch has changed much from the first version.  It's
> better to describe your current version.

After backporting the PCP auto-tuning series to our 6.1.y branch, the
PCP code is almost identical to the upstream kernel. I have documented
the detailed data showing the effect of the backport, which gives a
clear picture of the results. However, I am unable to run the upstream
kernel directly in our production environment due to practical
constraints.

>
> >>
> >> > My second endeavor was to backport the series titled
> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> > production environment, I noted a pronounced reduction in latency. The
> >> > observed outcomes are as enumerated below:
> >> >
> >> >      nsecs               : count     distribution
> >> >          0 -> 1          : 0        |                                        |
> >> >          2 -> 3          : 0        |                                        |
> >> >          4 -> 7          : 0        |                                        |
> >> >          8 -> 15         : 0        |                                        |
> >> >         16 -> 31         : 0        |                                        |
> >> >         32 -> 63         : 0        |                                        |
> >> >         64 -> 127        : 0        |                                        |
> >> >        128 -> 255        : 0        |                                        |
> >> >        256 -> 511        : 0        |                                        |
> >> >        512 -> 1023       : 0        |                                        |
> >> >       1024 -> 2047       : 2        |                                        |
> >> >       2048 -> 4095       : 11       |                                        |
> >> >       4096 -> 8191       : 3        |                                        |
> >> >       8192 -> 16383      : 1        |                                        |
> >> >      16384 -> 32767      : 2        |                                        |
> >> >      32768 -> 65535      : 7        |                                        |
> >> >      65536 -> 131071     : 198      |*********                               |
> >> >     131072 -> 262143     : 530      |************************                |
> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >
> >> > Compared to the previous data, the maximum latency has been reduced to
> >> > less than 30ms.
> >>
> >> People don't care too much about page freeing latency during processes
> >> exiting.  Instead, they care more about the process exiting time, that
> >> is, throughput.  So, it's better to show the page allocation latency
> >> which is affected by the simultaneous processes exiting.
> >
> > I'm confused also. Is this issue really hard to understand ?
>
> IMHO, it's better to prove the issue directly.  If you cannot prove it
> directly, you can try alternative one and describe why.

Not all data can be verified straightforwardly or effortlessly. The
primary concern is the zone->lock contention, which means measuring the
latency it introduces. free_pcppages_bulk() is where that contention
shows up, so I chose to measure the latency of free_pcppages_bulk()
directly.

The reason I did not measure allocation latency is that it would
require finding a victim workload willing to endure the potential
delays, and no one was interested in volunteering. In contrast,
measuring free_pcppages_bulk() latency only requires identifying and
experimenting with the workload that causes the delays, which makes it
a more feasible approach.
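
For reference, the "nsecs : count distribution" histograms above are in
the format produced by bcc's funclatency tool. The following is a
minimal sketch of how such a distribution can be collected for
free_pcppages_bulk(), assuming bcc is installed and the symbol is
available for kprobes; it is not necessarily the exact script used to
produce the data above.

#!/usr/bin/env python3
# Sketch: log2 histogram of free_pcppages_bulk() latency, in the style
# of bcc's funclatency.  Assumes bcc is installed and the kernel symbol
# free_pcppages_bulk can be kprobed.
from bcc import BPF
import time

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);      /* tid -> entry timestamp (ns) */
BPF_HISTOGRAM(dist);            /* log2 buckets of latency in ns */

int trace_entry(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;               /* missed the entry probe */
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="free_pcppages_bulk", fn_name="trace_entry")
b.attach_kretprobe(event="free_pcppages_bulk", fn_name="trace_return")

print("Tracing free_pcppages_bulk()... Ctrl-C to print the histogram.")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("nsecs")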
Huang, Ying July 11, 2024, 8:36 a.m. UTC | #5
Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > Background
>> >> > ==========
>> >> >
>> >> > In our containerized environment, we have a specific type of container
>> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> > processes are organized as separate processes rather than threads due
>> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> > containers hosted on the same machine experience significant latency
>> >> > spikes.
>> >> >
>> >> > Investigation
>> >> > =============
>> >> >
>> >> > My investigation using perf tracing revealed that the root cause of
>> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> > the exiting processes. This concurrent access to the zone->lock
>> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> > performance. The perf results clearly indicate this contention as a
>> >> > primary contributor to the observed latency issues.
>> >> >
>> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >    - 76.97% exit_mmap
>> >> >       - 58.58% unmap_vmas
>> >> >          - 58.55% unmap_single_vma
>> >> >             - unmap_page_range
>> >> >                - 58.32% zap_pte_range
>> >> >                   - 42.88% tlb_flush_mmu
>> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >                         - 41.22% release_pages
>> >> >                            - 33.29% free_unref_page_list
>> >> >                               - 32.37% free_unref_page_commit
>> >> >                                  - 31.64% free_pcppages_bulk
>> >> >                                     + 28.65% _raw_spin_lock
>> >> >                                       1.28% __list_del_entry_valid
>> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >                              0.60% __mod_lruvec_state
>> >> >                           1.07% free_swap_cache
>> >> >                   + 11.69% page_remove_rmap
>> >> >                     0.64% __mod_lruvec_page_state
>> >> >       - 17.34% remove_vma
>> >> >          - 17.25% vm_area_free
>> >> >             - 17.23% kmem_cache_free
>> >> >                - 17.15% __slab_free
>> >> >                   - 14.56% discard_slab
>> >> >                        free_slab
>> >> >                        __free_slab
>> >> >                        __free_pages
>> >> >                      - free_unref_page
>> >> >                         - 13.50% free_unref_page_commit
>> >> >                            - free_pcppages_bulk
>> >> >                               + 13.44% _raw_spin_lock
>> >>
>> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> I don't find the value of the above data.
>> >>
>> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> > with the majority of them being regular order-0 user pages.
>> >> >
>> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> > e=1
>> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >  => free_pcppages_bulk
>> >> >  => free_unref_page_commit
>> >> >  => free_unref_page_list
>> >> >  => release_pages
>> >> >  => free_pages_and_swap_cache
>> >> >  => tlb_flush_mmu
>> >> >  => zap_pte_range
>> >> >  => unmap_page_range
>> >> >  => unmap_single_vma
>> >> >  => unmap_vmas
>> >> >  => exit_mmap
>> >> >  => mmput
>> >> >  => do_exit
>> >> >  => do_group_exit
>> >> >  => get_signal
>> >> >  => arch_do_signal_or_restart
>> >> >  => exit_to_user_mode_prepare
>> >> >  => syscall_exit_to_user_mode
>> >> >  => do_syscall_64
>> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >
>> >> > The servers experiencing these issues are equipped with impressive
>> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >
>> >> > Node 0, zone   Normal
>> >> >   pages free     144465775
>> >> >         boost    0
>> >> >         min      1309270
>> >> >         low      1636587
>> >> >         high     1963904
>> >> >         spanned  564133888
>> >> >         present  296747008
>> >> >         managed  291974346
>> >> >         cma      0
>> >> >         protection: (0, 0, 0, 0)
>> >> > ...
>> >> >   pagesets
>> >> >     cpu: 0
>> >> >               count: 2217
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >   vm stats threshold: 125
>> >> >     cpu: 1
>> >> >               count: 4510
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >   vm stats threshold: 125
>> >> >     cpu: 2
>> >> >               count: 3059
>> >> >               high:  6392
>> >> >               batch: 63
>> >> >
>> >> > ...
>> >> >
>> >> > The pcp high is around 100 times the batch size.
>> >> >
>> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> > function during the container exit process:
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 148      |*****************                       |
>> >> >        512 -> 1023       : 334      |****************************************|
>> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >       2048 -> 4095       : 5        |                                        |
>> >> >       4096 -> 8191       : 7        |                                        |
>> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >
>> >> > The latency can reach tens of milliseconds.
>> >> >
>> >> > Experimenting
>> >> > =============
>> >> >
>> >> > vm.percpu_pagelist_high_fraction
>> >> > --------------------------------
>> >> >
>> >> > The kernel version currently deployed in our production environment is the
>> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >>
>> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> description.  And I don't think that it's necessary to describe the
>> >> alternative solution with too much details.
>> >>
>> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> > page draining, which subsequently leads to a substantial reduction in
>> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> > improvement in latency.
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 120      |                                        |
>> >> >        256 -> 511        : 365      |*                                       |
>> >> >        512 -> 1023       : 201      |                                        |
>> >> >       1024 -> 2047       : 103      |                                        |
>> >> >       2048 -> 4095       : 84       |                                        |
>> >> >       4096 -> 8191       : 87       |                                        |
>> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >
>> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> > have yet to observe any significant difference in throughput within our
>> >> > production environment after implementing this change.
>> >> >
>> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> > -------------------------------------------------
>> >>
>> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> directly.
>> >
>> > Andrew has requested that I provide a more comprehensive analysis of
>> > this issue, and in response, I have endeavored to outline all the
>> > pertinent details in a thorough and detailed manner.
>>
>> IMHO, upstream activity can provide comprehensive analysis of the issue
>> too.  And, your patch has changed much from the first version.  It's
>> better to describe your current version.
>
> After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> code is almost the same with the upstream kernel wrt the pcp. I have
> thoroughly documented the detailed data showcasing the changes in the
> backported version, providing a clear picture of the results. However,
> it's crucial to note that I am unable to directly run the upstream
> kernel on our production environment due to practical constraints.

IMHO, the patch is for the upstream kernel, not some downstream kernel,
so the focus should be on the upstream activity: the issue in the
upstream kernel and how to resolve it.  The production environment test
results can be used to support the upstream change.

>> >> > My second endeavor was to backport the series titled
>> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> > production environment, I noted a pronounced reduction in latency. The
>> >> > observed outcomes are as enumerated below:
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 0        |                                        |
>> >> >        512 -> 1023       : 0        |                                        |
>> >> >       1024 -> 2047       : 2        |                                        |
>> >> >       2048 -> 4095       : 11       |                                        |
>> >> >       4096 -> 8191       : 3        |                                        |
>> >> >       8192 -> 16383      : 1        |                                        |
>> >> >      16384 -> 32767      : 2        |                                        |
>> >> >      32768 -> 65535      : 7        |                                        |
>> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >     131072 -> 262143     : 530      |************************                |
>> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >
>> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> > less than 30ms.
>> >>
>> >> People don't care too much about page freeing latency during processes
>> >> exiting.  Instead, they care more about the process exiting time, that
>> >> is, throughput.  So, it's better to show the page allocation latency
>> >> which is affected by the simultaneous processes exiting.
>> >
>> > I'm confused also. Is this issue really hard to understand ?
>>
>> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> directly, you can try alternative one and describe why.
>
> Not all data can be verified straightforwardly or effortlessly. The
> primary focus lies in the zone->lock contention, which necessitates
> measuring the latency it incurs. To accomplish this, the
> free_pcppages_bulk() function serves as an effective tool for
> evaluation. Therefore, I have opted to specifically measure the
> latency associated with free_pcppages_bulk().
>
> The rationale behind not measuring allocation latency is due to the
> necessity of finding a willing participant to endure potential delays,
> a task that proved unsuccessful as no one expressed interest. In
> contrast, assessing free_pcppages_bulk()'s latency solely requires
> identifying and experimenting with the source causing the delays,
> making it a more feasible approach.

Can you run a benchmark program that does a fair amount of memory
allocation yourself to test it?
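
For example, a minimal allocation-latency microbenchmark along the
following lines could be run on the same machine while the containers
are exiting. This is only a sketch (the chunk size, sample count, and
reported percentiles are arbitrary), not an existing tool from this
thread.

#!/usr/bin/env python3
# Sketch of a page-allocation latency microbenchmark: fault in anonymous
# memory chunk by chunk and report per-chunk latency percentiles.
import mmap
import time

PAGE = mmap.PAGESIZE
CHUNK_PAGES = 1024          # pages faulted in per sample (~4MB with 4K pages)
SAMPLES = 1000

lat_us = []
for _ in range(SAMPLES):
    m = mmap.mmap(-1, CHUNK_PAGES * PAGE)       # anonymous mapping
    t0 = time.perf_counter_ns()
    for off in range(0, CHUNK_PAGES * PAGE, PAGE):
        m[off] = 1          # touch one byte per page -> fault -> allocation
    lat_us.append((time.perf_counter_ns() - t0) / 1000.0)
    m.close()               # munmap returns the pages

lat_us.sort()
print("p50 %.0f us  p99 %.0f us  max %.0f us" %
      (lat_us[len(lat_us) // 2], lat_us[int(len(lat_us) * 0.99)], lat_us[-1]))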

--
Best Regards,
Huang, Ying
Yafang Shao July 11, 2024, 9:40 a.m. UTC | #6
On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > Background
> >> >> > ==========
> >> >> >
> >> >> > In our containerized environment, we have a specific type of container
> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> > processes are organized as separate processes rather than threads due
> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> > containers hosted on the same machine experience significant latency
> >> >> > spikes.
> >> >> >
> >> >> > Investigation
> >> >> > =============
> >> >> >
> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> > primary contributor to the observed latency issues.
> >> >> >
> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >> >    - 76.97% exit_mmap
> >> >> >       - 58.58% unmap_vmas
> >> >> >          - 58.55% unmap_single_vma
> >> >> >             - unmap_page_range
> >> >> >                - 58.32% zap_pte_range
> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >                         - 41.22% release_pages
> >> >> >                            - 33.29% free_unref_page_list
> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >                              0.60% __mod_lruvec_state
> >> >> >                           1.07% free_swap_cache
> >> >> >                   + 11.69% page_remove_rmap
> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >       - 17.34% remove_vma
> >> >> >          - 17.25% vm_area_free
> >> >> >             - 17.23% kmem_cache_free
> >> >> >                - 17.15% __slab_free
> >> >> >                   - 14.56% discard_slab
> >> >> >                        free_slab
> >> >> >                        __free_slab
> >> >> >                        __free_pages
> >> >> >                      - free_unref_page
> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >                            - free_pcppages_bulk
> >> >> >                               + 13.44% _raw_spin_lock
> >> >>
> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> >> I don't find the value of the above data.
> >> >>
> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> >> > with the majority of them being regular order-0 user pages.
> >> >> >
> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> >> > e=1
> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >  => free_pcppages_bulk
> >> >> >  => free_unref_page_commit
> >> >> >  => free_unref_page_list
> >> >> >  => release_pages
> >> >> >  => free_pages_and_swap_cache
> >> >> >  => tlb_flush_mmu
> >> >> >  => zap_pte_range
> >> >> >  => unmap_page_range
> >> >> >  => unmap_single_vma
> >> >> >  => unmap_vmas
> >> >> >  => exit_mmap
> >> >> >  => mmput
> >> >> >  => do_exit
> >> >> >  => do_group_exit
> >> >> >  => get_signal
> >> >> >  => arch_do_signal_or_restart
> >> >> >  => exit_to_user_mode_prepare
> >> >> >  => syscall_exit_to_user_mode
> >> >> >  => do_syscall_64
> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >
> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >
> >> >> > Node 0, zone   Normal
> >> >> >   pages free     144465775
> >> >> >         boost    0
> >> >> >         min      1309270
> >> >> >         low      1636587
> >> >> >         high     1963904
> >> >> >         spanned  564133888
> >> >> >         present  296747008
> >> >> >         managed  291974346
> >> >> >         cma      0
> >> >> >         protection: (0, 0, 0, 0)
> >> >> > ...
> >> >> >   pagesets
> >> >> >     cpu: 0
> >> >> >               count: 2217
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 1
> >> >> >               count: 4510
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 2
> >> >> >               count: 3059
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >
> >> >> > ...
> >> >> >
> >> >> > The pcp high is around 100 times the batch size.
> >> >> >
> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> > function during the container exit process:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >
> >> >> > The latency can reach tens of milliseconds.
> >> >> >
> >> >> > Experimenting
> >> >> > =============
> >> >> >
> >> >> > vm.percpu_pagelist_high_fraction
> >> >> > --------------------------------
> >> >> >
> >> >> > The kernel version currently deployed in our production environment is the
> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >> >>
> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> description.  And I don't think that it's necessary to describe the
> >> >> alternative solution with too much details.
> >> >>
> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> >> > improvement in latency.
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >
> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> >> > have yet to observe any significant difference in throughput within our
> >> >> > production environment after implementing this change.
> >> >> >
> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> > -------------------------------------------------
> >> >>
> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> directly.
> >> >
> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> > this issue, and in response, I have endeavored to outline all the
> >> > pertinent details in a thorough and detailed manner.
> >>
> >> IMHO, upstream activity can provide comprehensive analysis of the issue
> >> too.  And, your patch has changed much from the first version.  It's
> >> better to describe your current version.
> >
> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> > code is almost the same with the upstream kernel wrt the pcp. I have
> > thoroughly documented the detailed data showcasing the changes in the
> > backported version, providing a clear picture of the results. However,
> > it's crucial to note that I am unable to directly run the upstream
> > kernel on our production environment due to practical constraints.
>
> IMHO, the patch is for upstream kernel, not some downstream kernel, so
> focus should be the upstream activity.  The issue of the upstream
> kernel, and how to resolve it.  The production environment test results
> can be used to support the upstream change.

The only difference in the PCP code between 6.1.y and the upstream
kernel is the set of changes you made, and those changes have now been
backported. Given that, what else do you expect me to do?

>
> >> >> > My second endeavor was to backport the series titled
> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> > observed outcomes are as enumerated below:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 0        |                                        |
> >> >> >       1024 -> 2047       : 2        |                                        |
> >> >> >       2048 -> 4095       : 11       |                                        |
> >> >> >       4096 -> 8191       : 3        |                                        |
> >> >> >       8192 -> 16383      : 1        |                                        |
> >> >> >      16384 -> 32767      : 2        |                                        |
> >> >> >      32768 -> 65535      : 7        |                                        |
> >> >> >      65536 -> 131071     : 198      |*********                               |
> >> >> >     131072 -> 262143     : 530      |************************                |
> >> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >> >
> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> > less than 30ms.
> >> >>
> >> >> People don't care too much about page freeing latency during processes
> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> which is affected by the simultaneous processes exiting.
> >> >
> >> > I'm confused also. Is this issue really hard to understand ?
> >>
> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> directly, you can try alternative one and describe why.
> >
> > Not all data can be verified straightforwardly or effortlessly. The
> > primary focus lies in the zone->lock contention, which necessitates
> > measuring the latency it incurs. To accomplish this, the
> > free_pcppages_bulk() function serves as an effective tool for
> > evaluation. Therefore, I have opted to specifically measure the
> > latency associated with free_pcppages_bulk().
> >
> > The rationale behind not measuring allocation latency is due to the
> > necessity of finding a willing participant to endure potential delays,
> > a task that proved unsuccessful as no one expressed interest. In
> > contrast, assessing free_pcppages_bulk()'s latency solely requires
> > identifying and experimenting with the source causing the delays,
> > making it a more feasible approach.
>
> Can you run a benchmark program that do quite some memory allocation by
> yourself to test it?

I can give it a try.
However, is that the key point here?  Why can't the lock contention be
measured from the freeing path?
Huang, Ying July 11, 2024, 11:03 a.m. UTC | #7
Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > Background
>> >> >> > ==========
>> >> >> >
>> >> >> > In our containerized environment, we have a specific type of container
>> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> > processes are organized as separate processes rather than threads due
>> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> > containers hosted on the same machine experience significant latency
>> >> >> > spikes.
>> >> >> >
>> >> >> > Investigation
>> >> >> > =============
>> >> >> >
>> >> >> > My investigation using perf tracing revealed that the root cause of
>> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> > primary contributor to the observed latency issues.
>> >> >> >
>> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >> >    - 76.97% exit_mmap
>> >> >> >       - 58.58% unmap_vmas
>> >> >> >          - 58.55% unmap_single_vma
>> >> >> >             - unmap_page_range
>> >> >> >                - 58.32% zap_pte_range
>> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >                         - 41.22% release_pages
>> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >                           1.07% free_swap_cache
>> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >       - 17.34% remove_vma
>> >> >> >          - 17.25% vm_area_free
>> >> >> >             - 17.23% kmem_cache_free
>> >> >> >                - 17.15% __slab_free
>> >> >> >                   - 14.56% discard_slab
>> >> >> >                        free_slab
>> >> >> >                        __free_slab
>> >> >> >                        __free_pages
>> >> >> >                      - free_unref_page
>> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >                            - free_pcppages_bulk
>> >> >> >                               + 13.44% _raw_spin_lock
>> >> >>
>> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> >> I don't find the value of the above data.
>> >> >>
>> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> >> > with the majority of them being regular order-0 user pages.
>> >> >> >
>> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> > e=1
>> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >  => free_pcppages_bulk
>> >> >> >  => free_unref_page_commit
>> >> >> >  => free_unref_page_list
>> >> >> >  => release_pages
>> >> >> >  => free_pages_and_swap_cache
>> >> >> >  => tlb_flush_mmu
>> >> >> >  => zap_pte_range
>> >> >> >  => unmap_page_range
>> >> >> >  => unmap_single_vma
>> >> >> >  => unmap_vmas
>> >> >> >  => exit_mmap
>> >> >> >  => mmput
>> >> >> >  => do_exit
>> >> >> >  => do_group_exit
>> >> >> >  => get_signal
>> >> >> >  => arch_do_signal_or_restart
>> >> >> >  => exit_to_user_mode_prepare
>> >> >> >  => syscall_exit_to_user_mode
>> >> >> >  => do_syscall_64
>> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >
>> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >
>> >> >> > Node 0, zone   Normal
>> >> >> >   pages free     144465775
>> >> >> >         boost    0
>> >> >> >         min      1309270
>> >> >> >         low      1636587
>> >> >> >         high     1963904
>> >> >> >         spanned  564133888
>> >> >> >         present  296747008
>> >> >> >         managed  291974346
>> >> >> >         cma      0
>> >> >> >         protection: (0, 0, 0, 0)
>> >> >> > ...
>> >> >> >   pagesets
>> >> >> >     cpu: 0
>> >> >> >               count: 2217
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 1
>> >> >> >               count: 4510
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 2
>> >> >> >               count: 3059
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >
>> >> >> > ...
>> >> >> >
>> >> >> > The pcp high is around 100 times the batch size.
>> >> >> >
>> >> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> >> > function during the container exit process:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >
>> >> >> > The latency can reach tens of milliseconds.
>> >> >> >
>> >> >> > Experimenting
>> >> >> > =============
>> >> >> >
>> >> >> > vm.percpu_pagelist_high_fraction
>> >> >> > --------------------------------
>> >> >> >
>> >> >> > The kernel version currently deployed in our production environment is the
>> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >> >>
>> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> >> description.  And I don't think that it's necessary to describe the
>> >> >> alternative solution with too much details.
>> >> >>
>> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> >> > page draining, which subsequently leads to a substantial reduction in
>> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> >> > improvement in latency.
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >
>> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> >> > have yet to observe any significant difference in throughput within our
>> >> >> > production environment after implementing this change.
>> >> >> >
>> >> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> >> > -------------------------------------------------
>> >> >>
>> >> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> >> directly.
>> >> >
>> >> > Andrew has requested that I provide a more comprehensive analysis of
>> >> > this issue, and in response, I have endeavored to outline all the
>> >> > pertinent details in a thorough and detailed manner.
>> >>
>> >> IMHO, upstream activity can provide comprehensive analysis of the issue
>> >> too.  And, your patch has changed much from the first version.  It's
>> >> better to describe your current version.
>> >
>> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
>> > code is almost the same with the upstream kernel wrt the pcp. I have
>> > thoroughly documented the detailed data showcasing the changes in the
>> > backported version, providing a clear picture of the results. However,
>> > it's crucial to note that I am unable to directly run the upstream
>> > kernel on our production environment due to practical constraints.
>>
>> IMHO, the patch is for upstream kernel, not some downstream kernel, so
>> focus should be the upstream activity.  The issue of the upstream
>> kernel, and how to resolve it.  The production environment test results
>> can be used to support the upstream change.
>
>  The sole distinction in the pcp between version 6.1.y and the
> upstream kernel lies solely in the modifications made to the code by
> you. Furthermore, given that your code changes have now been
> successfully backported, what else do you expect me to do ?

If you can run the upstream kernel directly with some proxy workloads,
it would be better.  But I understand that this may not be easy for you.

So, what I really expect you to do is to organize the patch description
in an upstream-centric way.  Describe the issue of the upstream kernel
and how you resolve it, even though your test data comes from a
downstream kernel with the same page allocator behavior.

>>
>> >> >> > My second endeavor was to backport the series titled
>> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> >> > production environment, I noted a pronounced reduction in latency. The
>> >> >> > observed outcomes are as enumerated below:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >        512 -> 1023       : 0        |                                        |
>> >> >> >       1024 -> 2047       : 2        |                                        |
>> >> >> >       2048 -> 4095       : 11       |                                        |
>> >> >> >       4096 -> 8191       : 3        |                                        |
>> >> >> >       8192 -> 16383      : 1        |                                        |
>> >> >> >      16384 -> 32767      : 2        |                                        |
>> >> >> >      32768 -> 65535      : 7        |                                        |
>> >> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >> >     131072 -> 262143     : 530      |************************                |
>> >> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >> >
>> >> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> >> > less than 30ms.
>> >> >>
>> >> >> People don't care too much about page freeing latency during processes
>> >> >> exiting.  Instead, they care more about the process exiting time, that
>> >> >> is, throughput.  So, it's better to show the page allocation latency
>> >> >> which is affected by the simultaneous processes exiting.
>> >> >
>> >> > I'm confused also. Is this issue really hard to understand ?
>> >>
>> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> >> directly, you can try alternative one and describe why.
>> >
>> > Not all data can be verified straightforwardly or effortlessly. The
>> > primary focus lies in the zone->lock contention, which necessitates
>> > measuring the latency it incurs. To accomplish this, the
>> > free_pcppages_bulk() function serves as an effective tool for
>> > evaluation. Therefore, I have opted to specifically measure the
>> > latency associated with free_pcppages_bulk().
>> >
>> > The rationale behind not measuring allocation latency is due to the
>> > necessity of finding a willing participant to endure potential delays,
>> > a task that proved unsuccessful as no one expressed interest. In
>> > contrast, assessing free_pcppages_bulk()'s latency solely requires
>> > identifying and experimenting with the source causing the delays,
>> > making it a more feasible approach.
>>
>> Can you run a benchmark program that do quite some memory allocation by
>> yourself to test it?
>
> I can have a try.

Thanks!

> However, is it the key point here?

It's better to prove the issue directly instead of indirectly.

> > Why can't the lock contention be measured on the freeing side?

Have you measured the lock contention after adjusting
CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
worse.  Smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
hurt lock contention.  I have said it several times, but it seems that
you don't agree with me.  Can you prove I'm wrong with data?
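
One rough way to collect such data is to sum the time spent in the
spinlock slowpath per command, e.g. with a Python/bcc sketch like the
one below.  It assumes an x86 kernel where
native_queued_spin_lock_slowpath() is kprobe-able, and it counts every
contended spinlock rather than zone->lock specifically, so it is only a
coarse proxy; on newer kernels `perf lock contention` may give similar
information without a custom script:

#!/usr/bin/env python3
# Sketch: total time spent spinning in the spinlock slowpath, per command.
# Counts all contended spinlocks (not only zone->lock), so treat the
# numbers as a coarse proxy for lock contention.
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    char comm[TASK_COMM_LEN];
};

BPF_HASH(start, u32, u64);
BPF_HASH(spin_ns, struct key_t, u64);

int slowpath_entry(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int slowpath_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    spin_ns.increment(key, bpf_ktime_get_ns() - *tsp);
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="native_queued_spin_lock_slowpath",
                fn_name="slowpath_entry")
b.attach_kretprobe(event="native_queued_spin_lock_slowpath",
                   fn_name="slowpath_return")

print("Summing spinlock slowpath time ... hit Ctrl-C to print the top users")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
for k, v in sorted(b["spin_ns"].items(), key=lambda kv: kv[1].value,
                   reverse=True)[:10]:
    print("%-16s %15d ns" % (k.comm.decode(errors="replace"), v.value))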

--
Best Regards,
Huang, Ying
Yafang Shao July 11, 2024, 12:40 p.m. UTC | #8
On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > Background
> >> >> >> > ==========
> >> >> >> >
> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> > spikes.
> >> >> >> >
> >> >> >> > Investigation
> >> >> >> > =============
> >> >> >> >
> >> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >
> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >             - unmap_page_range
> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >                         - 41.22% release_pages
> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >       - 17.34% remove_vma
> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >                - 17.15% __slab_free
> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >                        free_slab
> >> >> >> >                        __free_slab
> >> >> >> >                        __free_pages
> >> >> >> >                      - free_unref_page
> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >>
> >> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
> >> >> >> I don't find the value of the above data.
> >> >> >>
> >> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> >> >> > with the majority of them being regular order-0 user pages.
> >> >> >> >
> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
> >> >> >> > e=1
> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >> >  => free_pcppages_bulk
> >> >> >> >  => free_unref_page_commit
> >> >> >> >  => free_unref_page_list
> >> >> >> >  => release_pages
> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >  => tlb_flush_mmu
> >> >> >> >  => zap_pte_range
> >> >> >> >  => unmap_page_range
> >> >> >> >  => unmap_single_vma
> >> >> >> >  => unmap_vmas
> >> >> >> >  => exit_mmap
> >> >> >> >  => mmput
> >> >> >> >  => do_exit
> >> >> >> >  => do_group_exit
> >> >> >> >  => get_signal
> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >  => do_syscall_64
> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >
> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >> >
> >> >> >> > Node 0, zone   Normal
> >> >> >> >   pages free     144465775
> >> >> >> >         boost    0
> >> >> >> >         min      1309270
> >> >> >> >         low      1636587
> >> >> >> >         high     1963904
> >> >> >> >         spanned  564133888
> >> >> >> >         present  296747008
> >> >> >> >         managed  291974346
> >> >> >> >         cma      0
> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> > ...
> >> >> >> >   pagesets
> >> >> >> >     cpu: 0
> >> >> >> >               count: 2217
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 1
> >> >> >> >               count: 4510
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 2
> >> >> >> >               count: 3059
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >
> >> >> >> > ...
> >> >> >> >
> >> >> >> > The pcp high is around 100 times the batch size.
> >> >> >> >
> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> >> > function during the container exit process:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >> >
> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >
> >> >> >> > Experimenting
> >> >> >> > =============
> >> >> >> >
> >> >> >> > vm.percpu_pagelist_high_fraction
> >> >> >> > --------------------------------
> >> >> >> >
> >> >> >> > The kernel version currently deployed in our production environment is the
> >> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >> >> >>
> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> >> description.  And I don't think that it's necessary to describe the
> >> >> >> alternative solution with too much details.
> >> >> >>
> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> >> >> > improvement in latency.
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >> >
> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> >> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> >> >> > have yet to observe any significant difference in throughput within our
> >> >> >> > production environment after implementing this change.
> >> >> >> >
> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> >> > -------------------------------------------------
> >> >> >>
> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> >> directly.
> >> >> >
> >> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> >> > this issue, and in response, I have endeavored to outline all the
> >> >> > pertinent details in a thorough and detailed manner.
> >> >>
> >> >> IMHO, upstream activity can provide comprehensive analysis of the issue
> >> >> too.  And, your patch has changed much from the first version.  It's
> >> >> better to describe your current version.
> >> >
> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> >> > code is almost the same with the upstream kernel wrt the pcp. I have
> >> > thoroughly documented the detailed data showcasing the changes in the
> >> > backported version, providing a clear picture of the results. However,
> >> > it's crucial to note that I am unable to directly run the upstream
> >> > kernel on our production environment due to practical constraints.
> >>
> >> IMHO, the patch is for upstream kernel, not some downstream kernel, so
> >> focus should be the upstream activity.  The issue of the upstream
> >> kernel, and how to resolve it.  The production environment test results
> >> can be used to support the upstream change.
> >
> >  The sole distinction in the pcp between version 6.1.y and the
> > upstream kernel lies solely in the modifications made to the code by
> > you. Furthermore, given that your code changes have now been
> > successfully backported, what else do you expect me to do ?
>
> If you can run the upstream kernel directly with some proxy workloads,
> it would be better.  But I understand that this may not be easy for you.
>
> So, what I really expect you to do is to organize the patch description
> in an upstream-centric way.  Describe the issue of the upstream kernel
> and how you resolve it, even though your test data comes from a
> downstream kernel with the same page allocator behavior.
>
> >>
> >> >> >> > My second endeavor was to backport the series titled
> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> >> > observed outcomes are as enumerated below:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 0        |                                        |
> >> >> >> >       1024 -> 2047       : 2        |                                        |
> >> >> >> >       2048 -> 4095       : 11       |                                        |
> >> >> >> >       4096 -> 8191       : 3        |                                        |
> >> >> >> >       8192 -> 16383      : 1        |                                        |
> >> >> >> >      16384 -> 32767      : 2        |                                        |
> >> >> >> >      32768 -> 65535      : 7        |                                        |
> >> >> >> >      65536 -> 131071     : 198      |*********                               |
> >> >> >> >     131072 -> 262143     : 530      |************************                |
> >> >> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >> >> >
> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> >> > less than 30ms.
> >> >> >>
> >> >> >> People don't care too much about page freeing latency during processes
> >> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> >> which is affected by the simultaneous processes exiting.
> >> >> >
> >> >> > I'm confused also. Is this issue really hard to understand ?
> >> >>
> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> >> directly, you can try alternative one and describe why.
> >> >
> >> > Not all data can be verified straightforwardly or effortlessly. The
> >> > primary focus lies in the zone->lock contention, which necessitates
> >> > measuring the latency it incurs. To accomplish this, the
> >> > free_pcppages_bulk() function serves as an effective tool for
> >> > evaluation. Therefore, I have opted to specifically measure the
> >> > latency associated with free_pcppages_bulk().
> >> >
> >> > The rationale behind not measuring allocation latency is due to the
> >> > necessity of finding a willing participant to endure potential delays,
> >> > a task that proved unsuccessful as no one expressed interest. In
> >> > contrast, assessing free_pcppages_bulk()'s latency solely requires
> >> > identifying and experimenting with the source causing the delays,
> >> > making it a more feasible approach.
> >>
> >> Can you run a benchmark program that do quite some memory allocation by
> >> yourself to test it?
> >
> > I can have a try.
>
> Thanks!
>
> > However, is it the key point here?
>
> It's better to prove the issue directly instead of indirectly.
>
> > Why can't the lock contention be measured on the freeing side?
>
> Have you measured the lock contention after adjusting
> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
> worse.  Smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
> hurt lock contention.  I have said it several times, but it seems that
> you don't agree with me.  Can you prove I'm wrong with data?

Now I understand the point. It seems we have different understandings
regarding the zone lock contention.

    CPU A (Freer)                 CPU B (Allocator)
    lock zone->lock
    free pages                    lock zone->lock
    unlock zone->lock             alloc pages
                                  unlock zone->lock

If the Freer holds the zone lock for an extended period, the Allocator
has to wait, right? Isn't that a lock contention issue? Lock
contention affects not only CPU system usage but also latency.
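
To make the trade-off described above concrete, here is a
back-of-envelope Python sketch of how the worst-case allocator wait
scales with the batch scale factor.  It assumes the free batch is
capped at roughly batch << scale, mirroring the CONFIG_PCP_BATCH_SCALE_MAX
cap discussed in this series, takes batch=63 from the zoneinfo quoted
earlier, and uses a purely hypothetical per-page cost, so only the
relative scaling between the settings is meaningful:

#!/usr/bin/env python3
# Back-of-envelope model: how long can an allocator wait while a freer
# holds zone->lock, if up to (batch << scale) pages are freed per hold?
# PER_PAGE_NS is a made-up placeholder; only the relative scaling matters.
BATCH = 63          # pcp->batch from the /proc/zoneinfo quoted earlier
PER_PAGE_NS = 1000  # hypothetical cost to free one page under zone->lock

for scale in range(0, 6):                # candidate batch scale factors
    pages_per_hold = BATCH << scale      # pages freed under one lock hold
    hold_ns = pages_per_hold * PER_PAGE_NS
    print(f"scale={scale}: up to {pages_per_hold:4d} pages per hold, "
          f"~{hold_ns / 1e6:.2f} ms worst-case allocator wait")
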
Huang, Ying July 12, 2024, 2:32 a.m. UTC | #9
Yafang Shao <laoar.shao@gmail.com> writes:

> On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > Background
>> >> >> >> > ==========
>> >> >> >> >
>> >> >> >> > In our containerized environment, we have a specific type of container
>> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> >> > processes are organized as separate processes rather than threads due
>> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> >> > containers hosted on the same machine experience significant latency
>> >> >> >> > spikes.
>> >> >> >> >
>> >> >> >> > Investigation
>> >> >> >> > =============
>> >> >> >> >
>> >> >> >> > My investigation using perf tracing revealed that the root cause of
>> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> >> > primary contributor to the observed latency issues.
>> >> >> >> >
>> >> >> >> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]                                  [k] mmput
>> >> >> >> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]                                  [k] exit_mmap
>> >> >> >> >    - 76.97% exit_mmap
>> >> >> >> >       - 58.58% unmap_vmas
>> >> >> >> >          - 58.55% unmap_single_vma
>> >> >> >> >             - unmap_page_range
>> >> >> >> >                - 58.32% zap_pte_range
>> >> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >> >                         - 41.22% release_pages
>> >> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >> >                           1.07% free_swap_cache
>> >> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >> >       - 17.34% remove_vma
>> >> >> >> >          - 17.25% vm_area_free
>> >> >> >> >             - 17.23% kmem_cache_free
>> >> >> >> >                - 17.15% __slab_free
>> >> >> >> >                   - 14.56% discard_slab
>> >> >> >> >                        free_slab
>> >> >> >> >                        __free_slab
>> >> >> >> >                        __free_pages
>> >> >> >> >                      - free_unref_page
>> >> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >> >                            - free_pcppages_bulk
>> >> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >>
>> >> >> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> >> >> I don't find the value of the above data.
>> >> >> >>
>> >> >> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
>> >> >> >> > with the majority of them being regular order-0 user pages.
>> >> >> >> >
>> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetyp
>> >> >> >> > e=1
>> >> >> >> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >  => free_unref_page_commit
>> >> >> >> >  => free_unref_page_list
>> >> >> >> >  => release_pages
>> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >  => zap_pte_range
>> >> >> >> >  => unmap_page_range
>> >> >> >> >  => unmap_single_vma
>> >> >> >> >  => unmap_vmas
>> >> >> >> >  => exit_mmap
>> >> >> >> >  => mmput
>> >> >> >> >  => do_exit
>> >> >> >> >  => do_group_exit
>> >> >> >> >  => get_signal
>> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >  => do_syscall_64
>> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >
>> >> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >> >
>> >> >> >> > Node 0, zone   Normal
>> >> >> >> >   pages free     144465775
>> >> >> >> >         boost    0
>> >> >> >> >         min      1309270
>> >> >> >> >         low      1636587
>> >> >> >> >         high     1963904
>> >> >> >> >         spanned  564133888
>> >> >> >> >         present  296747008
>> >> >> >> >         managed  291974346
>> >> >> >> >         cma      0
>> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> > ...
>> >> >> >> >   pagesets
>> >> >> >> >     cpu: 0
>> >> >> >> >               count: 2217
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 1
>> >> >> >> >               count: 4510
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 2
>> >> >> >> >               count: 3059
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >
>> >> >> >> > ...
>> >> >> >> >
>> >> >> >> > The pcp high is around 100 times the batch size.
>> >> >> >> >
>> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> >> >> > function during the container exit process:
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >
>> >> >> >> > The latency can reach tens of milliseconds.
>> >> >> >> >
>> >> >> >> > Experimenting
>> >> >> >> > =============
>> >> >> >> >
>> >> >> >> > vm.percpu_pagelist_high_fraction
>> >> >> >> > --------------------------------
>> >> >> >> >
>> >> >> >> > The kernel version currently deployed in our production environment is the
>> >> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
>> >> >> >>
>> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> >> >> description.  And I don't think that it's necessary to describe the
>> >> >> >> alternative solution with too much details.
>> >> >> >>
>> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> >> >> > page draining, which subsequently leads to a substantial reduction in
>> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
>> >> >> >> > improvement in latency.
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >
>> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> >> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> >> >> > have yet to observe any significant difference in throughput within our
>> >> >> >> > production environment after implementing this change.
>> >> >> >> >
>> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> >> >> > -------------------------------------------------
>> >> >> >>
>> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> >> >> directly.
>> >> >> >
>> >> >> > Andrew has requested that I provide a more comprehensive analysis of
>> >> >> > this issue, and in response, I have endeavored to outline all the
>> >> >> > pertinent details in a thorough and detailed manner.
>> >> >>
>> >> >> IMHO, upstream activity can provide comprehensive analysis of the issue
>> >> >> too.  And, your patch has changed much from the first version.  It's
>> >> >> better to describe your current version.
>> >> >
>> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
>> >> > code is almost the same with the upstream kernel wrt the pcp. I have
>> >> > thoroughly documented the detailed data showcasing the changes in the
>> >> > backported version, providing a clear picture of the results. However,
>> >> > it's crucial to note that I am unable to directly run the upstream
>> >> > kernel on our production environment due to practical constraints.
>> >>
>> >> IMHO, the patch is for upstream kernel, not some downstream kernel, so
>> >> focus should be the upstream activity.  The issue of the upstream
>> >> kernel, and how to resolve it.  The production environment test results
>> >> can be used to support the upstream change.
>> >
>> >  The sole distinction in the pcp between version 6.1.y and the
>> > upstream kernel lies solely in the modifications made to the code by
>> > you. Furthermore, given that your code changes have now been
>> > successfully backported, what else do you expect me to do ?
>>
>> If you can run the upstream kernel directly with some proxy workloads,
>> it would be better.  But I understand that this may not be easy for you.
>>
>> So, what I really expect you to do is to organize the patch description
>> in an upstream-centric way.  Describe the issue of the upstream kernel
>> and how you resolve it, even though your test data comes from a
>> downstream kernel with the same page allocator behavior.
>>
>> >>
>> >> >> >> > My second endeavor was to backport the series titled
>> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> >> >> > production environment, I noted a pronounced reduction in latency. The
>> >> >> >> > observed outcomes are as enumerated below:
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 0        |                                        |
>> >> >> >> >       1024 -> 2047       : 2        |                                        |
>> >> >> >> >       2048 -> 4095       : 11       |                                        |
>> >> >> >> >       4096 -> 8191       : 3        |                                        |
>> >> >> >> >       8192 -> 16383      : 1        |                                        |
>> >> >> >> >      16384 -> 32767      : 2        |                                        |
>> >> >> >> >      32768 -> 65535      : 7        |                                        |
>> >> >> >> >      65536 -> 131071     : 198      |*********                               |
>> >> >> >> >     131072 -> 262143     : 530      |************************                |
>> >> >> >> >     262144 -> 524287     : 824      |**************************************  |
>> >> >> >> >     524288 -> 1048575    : 852      |****************************************|
>> >> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
>> >> >> >> >    2097152 -> 4194303    : 389      |******************                      |
>> >> >> >> >    4194304 -> 8388607    : 143      |******                                  |
>> >> >> >> >    8388608 -> 16777215   : 29       |*                                       |
>> >> >> >> >   16777216 -> 33554431   : 1        |                                        |
>> >> >> >> >
>> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> >> >> > less than 30ms.
>> >> >> >>
>> >> >> >> People don't care too much about page freeing latency during processes
>> >> >> >> exiting.  Instead, they care more about the process exiting time, that
>> >> >> >> is, throughput.  So, it's better to show the page allocation latency
>> >> >> >> which is affected by the simultaneous processes exiting.
>> >> >> >
>> >> >> > I'm confused also. Is this issue really hard to understand ?
>> >> >>
>> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> >> >> directly, you can try alternative one and describe why.
>> >> >
>> >> > Not all data can be verified straightforwardly or effortlessly. The
>> >> > primary focus lies in the zone->lock contention, which necessitates
>> >> > measuring the latency it incurs. To accomplish this, the
>> >> > free_pcppages_bulk() function serves as an effective tool for
>> >> > evaluation. Therefore, I have opted to specifically measure the
>> >> > latency associated with free_pcppages_bulk().
>> >> >
>> >> > The rationale behind not measuring allocation latency is due to the
>> >> > necessity of finding a willing participant to endure potential delays,
>> >> > a task that proved unsuccessful as no one expressed interest. In
>> >> > contrast, assessing free_pcppages_bulk()'s latency solely requires
>> >> > identifying and experimenting with the source causing the delays,
>> >> > making it a more feasible approach.
>> >>
>> >> Can you run a benchmark program that do quite some memory allocation by
>> >> yourself to test it?
>> >
>> > I can have a try.
>>
>> Thanks!
>>
>> > However, is it the key point here?
>>
>> It's better to prove the issue directly instead of indirectly.
>>
>> > Why can't the lock contention be measured on the freeing side?
>>
>> Have you measured the lock contention after adjusting
>> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
>> worse.  Smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
>> hurt lock contention.  I have said it several times, but it seems that
>> you don't agree with me.  Can you prove I'm wrong with data?
>
> Now I understand the point. It seems we have different understandings
> regarding the zone lock contention.
>
>     CPU A (Freer)                 CPU B (Allocator)
>     lock zone->lock
>     free pages                    lock zone->lock
>     unlock zone->lock             alloc pages
>                                   unlock zone->lock
>
> If the Freer holds the zone lock for an extended period, the Allocator
> has to wait, right? Isn't that a lock contention issue? Lock
> contention affects not only CPU system usage but also latency.

Thanks for the explanation!  Now I understand where we differ.

When I measure spin lock contention, I usually measure the CPU cycles
spent spinning on the lock: the more spinning cycles, the more severe
the contention.  IIUC, we cannot measure lock contention with latency
alone.  For example, if there is only one "Freer" and no "Allocator"
in the system, there is no lock contention at all, yet increasing the
page free batch number still increases the freeing latency.  As another
example, a larger batch number can reduce the total spin cycles while
increasing the spin cycles of each individual wait, because there are
fewer waits overall.  In that case the lock contention improves at the
cost of latency.

So, I suggest using the latency spike to describe the issue and the
solution.

I also suggest measuring the process exit time, to evaluate the
possible negative impact of a smaller batch number on throughput.
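
A minimal proxy for that kind of measurement might look like the Python
sketch below: it spawns several processes that each touch a large
anonymous mapping, makes them exit at the same moment, and times how
long the simultaneous exits take.  NPROC and SIZE_MB are placeholders
to be scaled to the machine under test (the report earlier in this
thread used 18 processes at roughly 6GB RSS each); the allocation
latency seen by other containers would still have to be traced
separately:

#!/usr/bin/env python3
# Sketch: time how long it takes NPROC memory-heavy processes to exit
# simultaneously.  NPROC and SIZE_MB are placeholders.
import time
import multiprocessing as mp

NPROC = 8        # number of exiting processes (placeholder)
SIZE_MB = 512    # anonymous memory touched per process (placeholder)

def worker(ready, go):
    buf = bytearray(SIZE_MB * 1024 * 1024)
    for i in range(0, len(buf), 4096):   # touch every page
        buf[i] = 1
    ready.release()                      # tell the parent we are populated
    go.wait()                            # exit (and free it all) on signal

if __name__ == "__main__":
    ready = mp.Semaphore(0)
    go = mp.Event()
    procs = [mp.Process(target=worker, args=(ready, go)) for _ in range(NPROC)]
    for p in procs:
        p.start()
    for _ in procs:
        ready.acquire()                  # wait until every worker is populated
    t0 = time.monotonic()
    go.set()                             # let all workers exit at once
    for p in procs:
        p.join()
    print(f"{NPROC} x {SIZE_MB} MB simultaneous exit took "
          f"{time.monotonic() - t0:.3f}s")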

--
Best Regards,
Huang, Ying