
[v2,3/5] mm: memcg: make stats flushing threshold per-memcg

Message ID 20231010032117.1577496-4-yosryahmed@google.com (mailing list archive)
State New
Series mm: memcg: subtree stats flushing and thresholds

Commit Message

Yosry Ahmed Oct. 10, 2023, 3:21 a.m. UTC
A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even when there isn't much to flush. It also avoids
unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg. The same scheme is followed: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this blocker to subtree flushes, which helps avoid
unnecessary work when only the stats of a small subtree are needed.

Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, and so does charging.
Testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path. The following benchmarks were run in a
cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is deeper
than a usual setup:

(a) neper [1] with 1000 flows and 100 threads (single machine). The
values in the table are the average of server and client throughputs in
Mbps over 30 iterations, each running for 30s:

				tcp_rr		tcp_stream
Base				9504218.56	357366.84
Patched				9656205.68	356978.39
Delta				+1.6%		-0.1%
Standard Deviation		0.95%		1.03%

An increase in the performance of tcp_rr doesn't really make sense, but
it's probably in the noise. The same tests were run with 1 flow and 1
thread, but the throughput was too noisy to draw any conclusions (the
averages did not show regressions, though).

Looking at perf for one iteration of the above test, __mod_memcg_state()
(which is where memcg_rstat_updated() is called) does not show up at all
without this patch, but it shows up with this patch as 1.06% for tcp_rr
and 0.36% for tcp_stream.

(b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
stress-ng very well, so I am not sure this is the best way to test it,
but it spawns 384 workers and emits a lot of metrics, which looks nice :)
I picked a few that seem relevant to the stats update path. I also
included cache misses, as this patch introduces more atomics that may
bounce between CPU caches:

Metric			Base		Patched		Delta
Cache Misses		3.394 B/sec 	3.433 B/sec	+1.14%
Cache L1D Read		0.148 T/sec	0.154 T/sec	+4.05%
Cache L1D Read Miss	20.430 B/sec	21.820 B/sec	+6.8%
Page Faults Total	4.304 M/sec	4.535 M/sec	+5.4%
Page Faults Minor	4.304 M/sec	4.535 M/sec	+5.4%
Page Faults Major	18.794 /sec	0.000 /sec
Kmalloc			0.153 M/sec	0.152 M/sec	-0.65%
Kfree			0.152 M/sec	0.153 M/sec	+0.65%
MM Page Alloc		4.640 M/sec	4.898 M/sec	+5.56%
MM Page Free		4.639 M/sec	4.897 M/sec	+5.56%
Lock Contention Begin	0.362 M/sec	0.479 M/sec	+32.32%
Lock Contention End	0.362 M/sec	0.479 M/sec	+32.32%
page-cache add		238.057 /sec	0.000 /sec
page-cache del		6.265 /sec	6.267 /sec	-0.03%

This is only a single run in each case. I am not sure what to make of
most of these numbers, but they mostly seem to be in the noise (some
better, some worse). The lock contention numbers are interesting. I am
not sure whether higher is better or worse here; either way, no new
locks or lock sections are introduced by this patch.

Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
this patch. This is suspicious, but I verified while stress-ng is
running that all the threads are in the right cgroup.

(c) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [2]. These are the
numbers from 30 runs (+ is good):

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 265207.738  | 262941.000  | 12112.379  |
  (B) patched                 | 249249.191  | 248781.000  | 8767.457   |
                              | -6.02%      | -5.39%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 241618.484  | 240209.000  | 10162.207  |
  (B) patched                 | 229820.671  | 229108.000  | 7506.582   |
                              | -4.88%      | -4.62%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.03545     | 0.035705    | 0.0015837  |
  (B) patched                 | 0.029952    | 0.029957    | 0.0013551  |
                              | -9.29%      | -9.35%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203916.148  | 203496.000  | 2908.331   |
  (B) patched                 | 186975.419  | 187023.000  | 1991.100   |
                              | -6.85%      | -6.90%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 170604.972  | 170532.000  | 1624.834   |
  (B) patched                 | 163100.260  | 163263.000  | 1517.967   |
                              | -4.40%      | -4.26%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054603    | 0.054693    | 0.00080196 |
  (B) patched                 | 0.044882    | 0.044957    | 0.0011766  |
                              | -0.05%      | +0.33%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1299821.099 | 1297918.000 | 9882.872   |
  (B) patched                 | 1248700.839 | 1247168.000 | 8454.891   |
                              | -3.93%      | -3.91%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 387216.963  | 387115.000  | 1605.760   |
  (B) patched                 | 368538.213  | 368826.000  | 1852.594   |
                              | -4.82%      | -4.72%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.59909     | 0.59367     | 0.01256    |
  (B) patched                 | 0.59995     | 0.59769     | 0.010088   |
                              | +0.14%      | +0.68%      |            |

There are some microbenchmark regressions (and some minute improvements),
but nothing outside the normal variance of this benchmark between kernel
versions. The fix for [2] assumed that 3% variation is noise (and there
were no further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.

[1] https://github.com/google/neper
[2] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 16 deletions(-)

Comments

Yosry Ahmed Oct. 10, 2023, 3:24 a.m. UTC | #1
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]

Johannes, as I mentioned in a reply to v1, I think this might be what
you suggested in our previous discussion [1], but I am not sure this
is what you meant for the update path, so I did not add a
Suggested-by.

Please let me know if this is what you meant and I can amend the tag as such.

[1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
Shakeel Butt Oct. 10, 2023, 8:45 p.m. UTC | #2
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
> (3) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) detected a 25.9% regression before
> for a change in the stats update path [2]. These are the
> numbers from 30 runs (+ is good):
>
>              LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
> ------------------------------+-------------+-------------+-------------
>   page_fault1_per_process_ops |             |             |            |
>   (A) base                    | 265207.738  | 262941.000  | 12112.379  |
>   (B) patched                 | 249249.191  | 248781.000  | 8767.457   |
>                               | -6.02%      | -5.39%      |            |
>   page_fault1_per_thread_ops  |             |             |            |
>   (A) base                    | 241618.484  | 240209.000  | 10162.207  |
>   (B) patched                 | 229820.671  | 229108.000  | 7506.582   |
>                               | -4.88%      | -4.62%      |            |
>   page_fault1_scalability     |             |             |
>   (A) base                    | 0.03545     | 0.035705    | 0.0015837  |
>   (B) patched                 | 0.029952    | 0.029957    | 0.0013551  |
>                               | -9.29%      | -9.35%      |            |

This much regression is not acceptable.

In addition, I ran netperf with the same 4 level hierarchy as you have
run and I am seeing ~11% regression.

More specifically on a machine with 44 CPUs (HT disabled ixion machine):

# for server
$ netserver -6

# 22 instances of netperf clients
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

(averaged over 4 runs)

base (next-20231009): 33081 MBPS
patched: 29267 MBPS

So, this series is not acceptable unless this regression is resolved.
Yosry Ahmed Oct. 10, 2023, 9:02 p.m. UTC | #3
On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > [...]
>
> This much regression is not acceptable.
>
> In addition, I ran netperf with the same 4 level hierarchy as you have
> run and I am seeing ~11% regression.

Interesting, I thought neper and netperf should be similar. Let me try
to reproduce this.

Thanks for testing!

>
> More specifically on a machine with 44 CPUs (HT disabled ixion machine):
>
> # for server
> $ netserver -6
>
> # 22 instances of netperf clients
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> (averaged over 4 runs)
>
> base (next-20231009): 33081 MBPS
> patched: 29267 MBPS
>
> So, this series is not acceptable unless this regression is resolved.
Yosry Ahmed Oct. 10, 2023, 10:21 p.m. UTC | #4
On Tue, Oct 10, 2023 at 2:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > A global counter for the magnitude of memcg stats update is maintained
> > > on the memcg side to avoid invoking rstat flushes when the pending
> > > updates are not significant. This avoids unnecessary flushes, which are
> > > not very cheap even if there isn't a lot of stats to flush. It also
> > > avoids unnecessary lock contention on the underlying global rstat lock.
> > >
> > > Make this threshold per-memcg. The same scheme is used: percpu (now
> > > also per-memcg) counters are incremented in the update path, and only
> > > propagated to per-memcg atomics when they exceed a certain threshold.
> > >
> > > This provides two benefits:
> > > (a) On large machines with a lot of memcgs, the global threshold can be
> > > reached relatively fast, so guarding the underlying lock becomes less
> > > effective. Making the threshold per-memcg avoids this.
> > >
> > > (b) Having a global threshold makes it hard to do subtree flushes, as we
> > > cannot reset the global counter except for a full flush. Per-memcg
> > > counters remove this as a blocker for subtree flushes, which
> > > helps avoid unnecessary work when the stats of a small subtree are
> > > needed.
> > >
> > > Nothing is free, of course. This comes at a cost:
> > > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> > > bytes. The extra memory usage is insignificant.
> > >
> > > (b) More work on the update side, although in the common case it will
> > > only be percpu counter updates. The amount of work scales with the
> > > number of ancestors (i.e. tree depth). This is not a new concept; adding
> > > a cgroup to the rstat tree involves a parent loop, as does charging.
> > > Testing results below show no significant regressions.
> > >
> > > (c) The error margin in the stats for the system as a whole increases
> > > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> > > NR_MEMCGS. This is probably fine because we have a similar per-memcg
> > > error in charges coming from percpu stocks, and we have a periodic
> > > flusher that makes sure we always flush all the stats every 2s anyway.
> > >
> > > This patch was tested to make sure no significant regressions are
> > > introduced on the update path as follows. The following benchmarks were
> > > run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> > > deeper than a usual setup:
> > >
> > > (a) neper [1] with 1000 flows and 100 threads (single machine). The
> > > values in the table are the average of server and client throughputs in
> > > mbps after 30 iterations, each running for 30s:
> > >
> > >                                 tcp_rr          tcp_stream
> > > Base                            9504218.56      357366.84
> > > Patched                         9656205.68      356978.39
> > > Delta                           +1.6%           -0.1%
> > > Standard Deviation              0.95%           1.03%
> > >
> > > An increase in the performance of tcp_rr doesn't really make sense, but
> > > it's probably in the noise. The same tests were run with 1 flow and 1
> > > thread but the throughput was too noisy to make any conclusions (the
> > > averages did not show regressions nonetheless).
> > >
> > > Looking at perf for one iteration of the above test, __mod_memcg_state()
> > > (which is where memcg_rstat_updated() is called) does not show up at all
> > > without this patch, but it shows up with this patch as 1.06% for tcp_rr
> > > and 0.36% for tcp_stream.
> > >
> > > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> > > stress-ng very well, so I am not sure that's the best way to test this,
> > > but it spawns 384 workers and spits out a lot of metrics, which looks nice :)
> > > I picked a few that seem relevant to the stats update path. I
> > > also included cache misses as this patch introduces more atomics that may
> > > bounce between cpu caches:
> > >
> > > Metric                  Base            Patched         Delta
> > > Cache Misses            3.394 B/sec     3.433 B/sec     +1.14%
> > > Cache L1D Read          0.148 T/sec     0.154 T/sec     +4.05%
> > > Cache L1D Read Miss     20.430 B/sec    21.820 B/sec    +6.8%
> > > Page Faults Total       4.304 M/sec     4.535 M/sec     +5.4%
> > > Page Faults Minor       4.304 M/sec     4.535 M/sec     +5.4%
> > > Page Faults Major       18.794 /sec     0.000 /sec
> > > Kmalloc                 0.153 M/sec     0.152 M/sec     -0.65%
> > > Kfree                   0.152 M/sec     0.153 M/sec     +0.65%
> > > MM Page Alloc           4.640 M/sec     4.898 M/sec     +5.56%
> > > MM Page Free            4.639 M/sec     4.897 M/sec     +5.56%
> > > Lock Contention Begin   0.362 M/sec     0.479 M/sec     +32.32%
> > > Lock Contention End     0.362 M/sec     0.479 M/sec     +32.32%
> > > page-cache add          238.057 /sec    0.000 /sec
> > > page-cache del          6.265 /sec      6.267 /sec      -0.03%
> > >
> > > This is only using a single run in each case. I am not sure what to
> > > make of most of these numbers, but they mostly seem in the noise
> > > (some better, some worse). The lock contention numbers are interesting.
> > > I am not sure if higher is better or worse here. No new locks or lock
> > > sections are introduced by this patch either way.
> > >
> > > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> > > this patch. This is suspicious, but I verified while stress-ng is
> > > running that all the threads are in the right cgroup.
> > >
> > > (c) will-it-scale page_fault tests. These tests (specifically
> > > per_process_ops in the page_fault3 test) previously detected a 25.9%
> > > regression for a change in the stats update path [2]. These are the
> > > numbers from 30 runs (+ is good):
> > >
> > >              LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
> > > ------------------------------+-------------+-------------+-------------
> > >   page_fault1_per_process_ops |             |             |            |
> > >   (A) base                    | 265207.738  | 262941.000  | 12112.379  |
> > >   (B) patched                 | 249249.191  | 248781.000  | 8767.457   |
> > >                               | -6.02%      | -5.39%      |            |
> > >   page_fault1_per_thread_ops  |             |             |            |
> > >   (A) base                    | 241618.484  | 240209.000  | 10162.207  |
> > >   (B) patched                 | 229820.671  | 229108.000  | 7506.582   |
> > >                               | -4.88%      | -4.62%      |            |
> > >   page_fault1_scalability     |             |             |            |
> > >   (A) base                    | 0.03545     | 0.035705    | 0.0015837  |
> > >   (B) patched                 | 0.029952    | 0.029957    | 0.0013551  |
> > >                               | -9.29%      | -9.35%      |            |
> >
> > This much regression is not acceptable.
> >
> > In addition, I ran netperf with the same 4 level hierarchy as you have
> > run and I am seeing ~11% regression.
>
> Interesting, I thought neper and netperf should be similar. Let me try
> to reproduce this.
>
> Thanks for testing!
>
> >
> > More specifically on a machine with 44 CPUs (HT disabled ixion machine):
> >
> > # for server
> > $ netserver -6
> >
> > # 22 instances of netperf clients
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > (averaged over 4 runs)
> >
> > base (next-20231009): 33081 MBPS
> > patched: 29267 MBPS
> >
> > So, this series is not acceptable unless this regression is resolved.

I tried this on a machine with 72 cpus (also ixion), running both
netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
# echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a
# echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b
# echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b/c
# echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b/c/d
# echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
# ./netserver -6

# echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
# for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
-m 10K; done

Base:
540000 262144 10240 60.00 54613.89
540000 262144 10240 60.00 54940.52
540000 262144 10240 60.00 55168.86
540000 262144 10240 60.00 54800.15
540000 262144 10240 60.00 54452.55
540000 262144 10240 60.00 54501.60
540000 262144 10240 60.00 55036.11
540000 262144 10240 60.00 52018.91
540000 262144 10240 60.00 54877.78
540000 262144 10240 60.00 55342.38

Average: 54575.275

Patched:
540000 262144 10240 60.00 53694.86
540000 262144 10240 60.00 54807.68
540000 262144 10240 60.00 54782.89
540000 262144 10240 60.00 51404.91
540000 262144 10240 60.00 55024.00
540000 262144 10240 60.00 54725.84
540000 262144 10240 60.00 51400.40
540000 262144 10240 60.00 54212.63
540000 262144 10240 60.00 51951.47
540000 262144 10240 60.00 51978.27

Average: 53398.295

That's ~2% regression. Did I do anything incorrectly?
Shakeel Butt Oct. 11, 2023, 12:36 a.m. UTC | #5
On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
[...]
> 
> I tried this on a machine with 72 cpus (also ixion), running both
> netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a
> # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b
> # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c
> # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c/d
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # ./netserver -6
> 
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> -m 10K; done

You are missing '&' at the end. Use something like below:

#!/bin/bash
for i in {1..22}
do
   /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
done
wait
Yosry Ahmed Oct. 11, 2023, 1:48 a.m. UTC | #6
On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> [...]
> >
> > I tried this on a machine with 72 cpus (also ixion), running both
> > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a
> > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b
> > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c
> > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c/d
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # ./netserver -6
> >
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > -m 10K; done
>
> You are missing '&' at the end. Use something like below:
>
> #!/bin/bash
> for i in {1..22}
> do
>    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> done
> wait
>

Oh sorry I missed the fact that you are running instances in parallel, my bad.

So I ran 36 instances on a machine with 72 cpus. I did this 10 times
and got an average from all instances for all runs to reduce noise:

#!/bin/bash

ITER=10
NR_INSTANCES=36

for i in $(seq $ITER); do
  echo "iteration $i"
  for j in $(seq $NR_INSTANCES); do
    echo "iteration $i" >> "out$j"
    ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
  done
  wait
done

cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'

Base: 22169 mbps
Patched: 21331.9 mbps

The difference is ~3.7% in my runs. I am not sure what's different.
Perhaps it's the number of runs?
Shakeel Butt Oct. 11, 2023, 12:45 p.m. UTC | #7
On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > [...]
> > >
> > > I tried this on a machine with 72 cpus (also ixion), running both
> > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a
> > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b
> > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c
> > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # ./netserver -6
> > >
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > -m 10K; done
> >
> > You are missing '&' at the end. Use something like below:
> >
> > #!/bin/bash
> > for i in {1..22}
> > do
> >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > done
> > wait
> >
>
> Oh sorry I missed the fact that you are running instances in parallel, my bad.
>
> So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> and got an average from all instances for all runs to reduce noise:
>
> #!/bin/bash
>
> ITER=10
> NR_INSTANCES=36
>
> for i in $(seq $ITER); do
>   echo "iteration $i"
>   for j in $(seq $NR_INSTANCES); do
>     echo "iteration $i" >> "out$j"
>     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
>   done
>   wait
> done
>
> cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
>
> Base: 22169 mbps
> Patched: 21331.9 mbps
>
> The difference is ~3.7% in my runs. I am not sure what's different.
> Perhaps it's the number of runs?

My base kernel is next-20231009 and I am running experiments with
hyperthreading disabled.
Yosry Ahmed Oct. 12, 2023, 3:13 a.m. UTC | #8
On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > [...]
> > > >
> > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a
> > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # ./netserver -6
> > > >
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > -m 10K; done
> > >
> > > You are missing '&' at the end. Use something like below:
> > >
> > > #!/bin/bash
> > > for i in {1..22}
> > > do
> > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > done
> > > wait
> > >
> >
> > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> >
> > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > and got an average from all instances for all runs to reduce noise:
> >
> > #!/bin/bash
> >
> > ITER=10
> > NR_INSTANCES=36
> >
> > for i in $(seq $ITER); do
> >   echo "iteration $i"
> >   for j in $(seq $NR_INSTANCES); do
> >     echo "iteration $i" >> "out$j"
> >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> >   done
> >   wait
> > done
> >
> > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> >
> > Base: 22169 mbps
> > Patched: 21331.9 mbps
> >
> > The difference is ~3.7% in my runs. I am not sure what's different.
> > Perhaps it's the number of runs?
>
> My base kernel is next-20231009 and I am running experiments with
> hyperthreading disabled.

Using next-20231009 and a similar 44 core machine with hyperthreading
disabled, I ran 22 instances of netperf in parallel and got the
following numbers from averaging 20 runs:

Base: 33076.5 mbps
Patched: 31410.1 mbps

That's about 5% diff. I guess the number of iterations helps reduce
the noise? I am not sure.

Please also keep in mind that in this case all netperf instances are
in the same cgroup and at a 4-level depth. I imagine in a practical
setup processes would be a little more spread out, which means less
common ancestors, so less contended atomic operations.
Yosry Ahmed Oct. 12, 2023, 8:01 a.m. UTC | #9
On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > [...]
> > > > >
> > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a
> > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # ./netserver -6
> > > > >
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > -m 10K; done
> > > >
> > > > You are missing '&' at the end. Use something like below:
> > > >
> > > > #!/bin/bash
> > > > for i in {1..22}
> > > > do
> > > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > done
> > > > wait
> > > >
> > >
> > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > >
> > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > and got an average from all instances for all runs to reduce noise:
> > >
> > > #!/bin/bash
> > >
> > > ITER=10
> > > NR_INSTANCES=36
> > >
> > > for i in $(seq $ITER); do
> > >   echo "iteration $i"
> > >   for j in $(seq $NR_INSTANCES); do
> > >     echo "iteration $i" >> "out$j"
> > >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > >   done
> > >   wait
> > > done
> > >
> > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > >
> > > Base: 22169 mbps
> > > Patched: 21331.9 mbps
> > >
> > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > Perhaps it's the number of runs?
> >
> > My base kernel is next-20231009 and I am running experiments with
> > hyperthreading disabled.
>
> Using next-20231009 and a similar 44 core machine with hyperthreading
> disabled, I ran 22 instances of netperf in parallel and got the
> following numbers from averaging 20 runs:
>
> Base: 33076.5 mbps
> Patched: 31410.1 mbps
>
> That's about 5% diff. I guess the number of iterations helps reduce
> the noise? I am not sure.
>
> Please also keep in mind that in this case all netperf instances are
> in the same cgroup and at a 4-level depth. I imagine in a practical
> setup processes would be a little more spread out, which means less
> common ancestors, so less contended atomic operations.

I was curious, so I ran the same testing in a cgroup 2 levels deep (i.e.
/sys/fs/cgroup/a/b), which is a much more common setup in my experience.
Here are the numbers:

Base: 40198.0 mbps
Patched: 38629.7 mbps

The regression is reduced to ~3.9%.

What's more interesting is that going from a level 2 cgroup to a level 4
cgroup is already a big hit with or without this patch:

Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
Patched: 38629.7 -> 31410.1 (~18.7% regression)

So going from level 2 to 4 is already a significant regression for other
reasons (e.g. hierarchical charging). This patch only makes it marginally
worse. This puts the numbers more into perspective imo than comparing
values at level 4. What do you think?
Yosry Ahmed Oct. 12, 2023, 8:04 a.m. UTC | #10
On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > [...]
> > > > >
> > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a
> > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # ./netserver -6
> > > > >
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > -m 10K; done
> > > >
> > > > You are missing '&' at the end. Use something like below:
> > > >
> > > > #!/bin/bash
> > > > for i in {1..22}
> > > > do
> > > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > done
> > > > wait
> > > >
> > >
> > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > >
> > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > and got an average from all instances for all runs to reduce noise:
> > >
> > > #!/bin/bash
> > >
> > > ITER=10
> > > NR_INSTANCES=36
> > >
> > > for i in $(seq $ITER); do
> > >   echo "iteration $i"
> > >   for j in $(seq $NR_INSTANCES); do
> > >     echo "iteration $i" >> "out$j"
> > >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > >   done
> > >   wait
> > > done
> > >
> > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > >
> > > Base: 22169 mbps
> > > Patched: 21331.9 mbps
> > >
> > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > Perhaps it's the number of runs?
> >
> > My base kernel is next-20231009 and I am running experiments with
> > hyperthreading disabled.
>
> Using next-20231009 and a similar 44 core machine with hyperthreading
> disabled, I ran 22 instances of netperf in parallel and got the
> following numbers from averaging 20 runs:
>
> Base: 33076.5 mbps
> Patched: 31410.1 mbps
>
> That's about 5% diff. I guess the number of iterations helps reduce
> the noise? I am not sure.
>
> Please also keep in mind that in this case all netperf instances are
> in the same cgroup and at a 4-level depth. I imagine in a practical
> setup processes would be a little more spread out, which means less
> common ancestors, so less contended atomic operations.


(Resending the reply as I messed up the last one, was not in plain text)

I was curious, so I ran the same testing in a cgroup 2 levels deep
(i.e. /sys/fs/cgroup/a/b), which is a much more common setup in my
experience. Here are the numbers:

Base: 40198.0 mbps
Patched: 38629.7 mbps

The regression is reduced to ~3.9%.

What's more interesting is that going from a level 2 cgroup to a level
4 cgroup is already a big hit with or without this patch:

Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
Patched: 38629.7 -> 31410.1 (~18.7% regression)

So going from level 2 to 4 is already a significant regression for
other reasons (e.g. hierarchical charging). This patch only makes it
marginally worse. This puts the numbers more into perspective imo than
comparing values at level 4. What do you think?
Johannes Weiner Oct. 12, 2023, 1:29 p.m. UTC | #11
On Thu, Oct 12, 2023 at 01:04:03AM -0700, Yosry Ahmed wrote:
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > >   echo "iteration $i"
> > > >   for j in $(seq $NR_INSTANCES); do
> > > >     echo "iteration $i" >> "out$j"
> > > >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > >   done
> > > >   wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
> 
> 
> (Resending the reply as I messed up the last one, was not in plain text)
> 
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
> 
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
> 
> The regression is reduced to ~3.9%.
> 
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
> 
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
> 
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?

I think it's reasonable.

Especially comparing to how many cachelines we used to touch on the
write side when all flushing happened there. This looks like a good
trade-off to me.
Shakeel Butt Oct. 12, 2023, 1:35 p.m. UTC | #12
On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > >   echo "iteration $i"
> > > >   for j in $(seq $NR_INSTANCES); do
> > > >     echo "iteration $i" >> "out$j"
> > > >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > >   done
> > > >   wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
>
>
> (Resending the reply as I messed up the last one, was not in plain text)
>
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
>
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
>
> The regression is reduced to ~3.9%.
>
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
>
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
>
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?

This is weird as we are running the experiments on the same machine. I
will rerun with 2 levels as well. Also can you rerun the page fault
benchmark as well which was showing 9% regression in your original
commit message?
Yosry Ahmed Oct. 12, 2023, 3:10 p.m. UTC | #13
On Thu, Oct 12, 2023 at 6:35 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > >
> > > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # ./netserver -6
> > > > > > >
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > > -m 10K; done
> > > > > >
> > > > > > You are missing '&' at the end. Use something like below:
> > > > > >
> > > > > > #!/bin/bash
> > > > > > for i in {1..22}
> > > > > > do
> > > > > >    /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > > done
> > > > > > wait
> > > > > >
> > > > >
> > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > > >
> > > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > > and got an average from all instances for all runs to reduce noise:
> > > > >
> > > > > #!/bin/bash
> > > > >
> > > > > ITER=10
> > > > > NR_INSTANCES=36
> > > > >
> > > > > for i in $(seq $ITER); do
> > > > >   echo "iteration $i"
> > > > >   for j in $(seq $NR_INSTANCES); do
> > > > >     echo "iteration $i" >> "out$j"
> > > > >     ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > >   done
> > > > >   wait
> > > > > done
> > > > >
> > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > > >
> > > > > Base: 22169 mbps
> > > > > Patched: 21331.9 mbps
> > > > >
> > > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > > Perhaps it's the number of runs?
> > > >
> > > > My base kernel is next-20231009 and I am running experiments with
> > > > hyperthreading disabled.
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means less
> > > common ancestors, so less contended atomic operations.
> >
> >
> > (Resending the reply as I messed up the last one, was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> This is weird as we are running the experiments on the same machine. I
> will rerun with 2 levels as well. Also can you rerun the page fault
> benchmark as well which was showing 9% regression in your original
> commit message?

Thanks. I will re-run the page_fault tests, but keep in mind that the
page fault benchmarks in will-it-scale are highly variable. We run
them between kernel versions internally, and I think we ignore any
changes below 10% as the benchmark is naturally noisy.

I have a couple of runs for page_fault3_scalability showing a 2-3%
improvement with this patch :)
Yosry Ahmed Oct. 12, 2023, 9:05 p.m. UTC | #14
[..]
> > > >
> > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > following numbers from averaging 20 runs:
> > > >
> > > > Base: 33076.5 mbps
> > > > Patched: 31410.1 mbps
> > > >
> > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > the noise? I am not sure.
> > > >
> > > > Please also keep in mind that in this case all netperf instances are
> > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > setup processes would be a little more spread out, which means less
> > > > common ancestors, so less contended atomic operations.
> > >
> > >
> > > (Resending the reply as I messed up the last one, was not in plain text)
> > >
> > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > experience. Here are the numbers:
> > >
> > > Base: 40198.0 mbps
> > > Patched: 38629.7 mbps
> > >
> > > The regression is reduced to ~3.9%.
> > >
> > > What's more interesting is that going from a level 2 cgroup to a level
> > > 4 cgroup is already a big hit with or without this patch:
> > >
> > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > >
> > > So going from level 2 to 4 is already a significant regression for
> > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > marginally worse. This puts the numbers more into perspective imo than
> > > comparing values at level 4. What do you think?
> >
> > This is weird as we are running the experiments on the same machine. I
> > will rerun with 2 levels as well. Also can you rerun the page fault
> > benchmark as well which was showing 9% regression in your original
> > commit message?
>
> Thanks. I will re-run the page_fault tests, but keep in mind that the
> page fault benchmarks in will-it-scale are highly variable. We run
> them between kernel versions internally, and I think we ignore any
> changes below 10% as the benchmark is naturally noisy.
>
> I have a couple of runs for page_fault3_scalability showing a 2-3%
> improvement with this patch :)

I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
level 2 cgroup; here are the results (the results in the original
commit message are for 384 cpus in a level 4 cgroup):

               LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

The numbers are much better. I can modify the commit log to include
the testing in the replies instead of what's currently there if this
helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
256 cpus -- all in a level 2 cgroup).
Shakeel Butt Oct. 12, 2023, 9:16 p.m. UTC | #15
On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > This is weird as we are running the experiments on the same machine. I
> > > will rerun with 2 levels as well. Also can you rerun the page fault
> > > benchmark as well which was showing 9% regression in your original
> > > commit message?
> >
> > Thanks. I will re-run the page_fault tests, but keep in mind that the
> > page fault benchmarks in will-it-scale are highly variable. We run
> > them between kernel versions internally, and I think we ignore any
> > changes below 10% as the benchmark is naturally noisy.
> >
> > I have a couple of runs for page_fault3_scalability showing a 2-3%
> > improvement with this patch :)
>
> I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> level 2 cgroup, here are the results (the results in the original
> commit message are for 384 cpus in a level 4 cgroup):
>
>                LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
> ------------------------------+-------------+-------------+-------------
>   page_fault1_per_process_ops |             |             |            |
>   (A) base                    | 270249.164  | 265437.000  | 13451.836  |
>   (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
>                               | -3.29%      | -3.66%      |            |
>   page_fault1_per_thread_ops  |             |             |            |
>   (A) base                    | 242111.345  | 239737.000  | 10026.031  |
>   (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
>                               | -2.09%      | -1.85%      |            |
>   page_fault1_scalability     |             |             |
>   (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
>   (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
>                               | -1.16%      | -1.69%      |            |
>   page_fault2_per_process_ops |             |             |
>   (A) base                    | 203561.836  | 203301.000  | 2550.764   |
>   (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
>                               | -3.13%      | -2.73%      |            |
>   page_fault2_per_thread_ops  |             |             |
>   (A) base                    | 171046.473  | 170776.000  | 1509.679   |
>   (B) patched                 | 166626.327  | 166406.000  | 768.753    |
>                               | -2.58%      | -2.56%      |            |
>   page_fault2_scalability     |             |             |
>   (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
>   (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
>                               | -1.29%      | -1.41%      |            |
>   page_fault3_per_process_ops |             |             |
>   (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
>   (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
>                               | -1.56%      | -1.86%      |            |
>   page_fault3_per_thread_ops  |             |             |
>   (A) base                    | 391234.164  | 390860.000  | 1760.720   |
>   (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
>                               | -3.58%      | -3.71%      |            |
>   page_fault3_scalability     |             |             |
>   (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
>   (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
>                               | +2.26%      | +2.45%      |            |
>
> The numbers are much better. I can modify the commit log to include
> the testing in the replies instead of what's currently there if this
> helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> 256 cpus -- all in a level 2 cgroup).

Yes this looks better. I think we should also ask intel perf and
phoronix folks to run their benchmarks as well (but no need to block
on them).
Yosry Ahmed Oct. 12, 2023, 9:19 p.m. UTC | #16
On Thu, Oct 12, 2023 at 2:16 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> > > > > >
> > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > > following numbers from averaging 20 runs:
> > > > > >
> > > > > > Base: 33076.5 mbps
> > > > > > Patched: 31410.1 mbps
> > > > > >
> > > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > > the noise? I am not sure.
> > > > > >
> > > > > > Please also keep in mind that in this case all netperf instances are
> > > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > > setup processes would be a little more spread out, which means less
> > > > > > common ancestors, so less contended atomic operations.
> > > > >
> > > > >
> > > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > > >
> > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > > experience. Here are the numbers:
> > > > >
> > > > > Base: 40198.0 mbps
> > > > > Patched: 38629.7 mbps
> > > > >
> > > > > The regression is reduced to ~3.9%.
> > > > >
> > > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > > 4 cgroup is already a big hit with or without this patch:
> > > > >
> > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > > >
> > > > > So going from level 2 to 4 is already a significant regression for
> > > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > > marginally worse. This puts the numbers more into perspective imo than
> > > > > comparing values at level 4. What do you think?
> > > >
> > > > This is weird as we are running the experiments on the same machine. I
> > > > will rerun with 2 levels as well. Also can you rerun the page fault
> > > > benchmark as well which was showing 9% regression in your original
> > > > commit message?
> > >
> > > Thanks. I will re-run the page_fault tests, but keep in mind that the
> > > page fault benchmarks in will-it-scale are highly variable. We run
> > > them between kernel versions internally, and I think we ignore any
> > > changes below 10% as the benchmark is naturally noisy.
> > >
> > > I have a couple of runs for page_fault3_scalability showing a 2-3%
> > > improvement with this patch :)
> >
> > I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> > level 2 cgroup, here are the results (the results in the original
> > commit message are for 384 cpus in a level 4 cgroup):
> >
> >                LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
> > ------------------------------+-------------+-------------+-------------
> >   page_fault1_per_process_ops |             |             |            |
> >   (A) base                    | 270249.164  | 265437.000  | 13451.836  |
> >   (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
> >                               | -3.29%      | -3.66%      |            |
> >   page_fault1_per_thread_ops  |             |             |            |
> >   (A) base                    | 242111.345  | 239737.000  | 10026.031  |
> >   (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
> >                               | -2.09%      | -1.85%      |            |
> >   page_fault1_scalability     |             |             |
> >   (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
> >   (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
> >                               | -1.16%      | -1.69%      |            |
> >   page_fault2_per_process_ops |             |             |
> >   (A) base                    | 203561.836  | 203301.000  | 2550.764   |
> >   (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
> >                               | -3.13%      | -2.73%      |            |
> >   page_fault2_per_thread_ops  |             |             |
> >   (A) base                    | 171046.473  | 170776.000  | 1509.679   |
> >   (B) patched                 | 166626.327  | 166406.000  | 768.753    |
> >                               | -2.58%      | -2.56%      |            |
> >   page_fault2_scalability     |             |             |
> >   (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
> >   (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
> >                               | -1.29%      | -1.41%      |            |
> >   page_fault3_per_process_ops |             |             |
> >   (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
> >   (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
> >                               | -1.56%      | -1.86%      |            |
> >   page_fault3_per_thread_ops  |             |             |
> >   (A) base                    | 391234.164  | 390860.000  | 1760.720   |
> >   (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
> >                               | -3.58%      | -3.71%      |            |
> >   page_fault3_scalability     |             |             |
> >   (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
> >   (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
> >                               | +2.26%      | +2.45%      |            |
> >
> > The numbers are much better. I can modify the commit log to include
> > the testing in the replies instead of what's currently there if this
> > helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> > 256 cpus -- all in a level 2 cgroup).
>
> Yes this looks better. I think we should also ask intel perf and
> phoronix folks to run their benchmarks as well (but no need to block
> on them).

Anything I need to do for this to happen? (I thought such testing is
already done on linux-next)

Also, any further comments on the patch (or the series in general)? If
not, I can send a new commit message for this patch in-place.
Shakeel Butt Oct. 12, 2023, 9:38 p.m. UTC | #17
On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> >
> > Yes this looks better. I think we should also ask intel perf and
> > phoronix folks to run their benchmarks as well (but no need to block
> > on them).
>
> Anything I need to do for this to happen? (I thought such testing is
> already done on linux-next)

Just Cced the relevant folks.

Michael, Oliver & Feng, if you have some time/resource available,
please do trigger your performance benchmarks on the following series
(but nothing urgent):

https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/

>
> Also, any further comments on the patch (or the series in general)? If
> not, I can send a new commit message for this patch in-place.

Sorry, I haven't taken a look yet but will try in a week or so.
Yosry Ahmed Oct. 12, 2023, 10:23 p.m. UTC | #18
On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > >
> > > Yes this looks better. I think we should also ask intel perf and
> > > phoronix folks to run their benchmarks as well (but no need to block
> > > on them).
> >
> > Anything I need to do for this to happen? (I thought such testing is
> > already done on linux-next)
>
> Just Cced the relevant folks.
>
> Michael, Oliver & Feng, if you have some time/resource available,
> please do trigger your performance benchmarks on the following series
> (but nothing urgent):
>
> https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/

Thanks for that.

>
> >
> > Also, any further comments on the patch (or the series in general)? If
> > not, I can send a new commit message for this patch in-place.
>
> Sorry, I haven't taken a look yet but will try in a week or so.

Sounds good, thanks.

Meanwhile, Andrew, could you please replace the commit log of this
patch as follows for more updated testing info:

Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg

A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there aren't many stats to flush. It also avoids
unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg. The same scheme is used: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics once they exceed a certain threshold.

This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this blocker, enabling subtree flushes, which help
avoid unnecessary work when the stats of a small subtree are needed.

Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a similar parent loop, and so does
charging. Testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path, as follows. The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (a positive delta is an improvement) on a machine
with 256 cpus:

               LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

All regressions seem minimal and within the normal variance for the
benchmark. The fix for [1] assumed that 3% is noise (and there were no
further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical
workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Yosry Ahmed Oct. 12, 2023, 11:28 p.m. UTC | #19
[..]
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means less
> > > common ancestors, so less contended atomic operations.
> >
> >
> > (Resending the reply as I messed up the last one, was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> I think it's reasonable.
>
> Especially comparing to how many cachelines we used to touch on the
> write side when all flushing happened there. This looks like a good
> trade-off to me.

Thanks.

Still wanting to figure out if this patch is what you suggested in our
previous discussion [1], to add a
Suggested-by if appropriate :)

[1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
Johannes Weiner Oct. 13, 2023, 2:33 a.m. UTC | #20
On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> [..]
> > > >
> > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > following numbers from averaging 20 runs:
> > > >
> > > > Base: 33076.5 mbps
> > > > Patched: 31410.1 mbps
> > > >
> > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > the noise? I am not sure.
> > > >
> > > > Please also keep in mind that in this case all netperf instances are
> > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > setup processes would be a little more spread out, which means less
> > > > common ancestors, so less contended atomic operations.
> > >
> > >
> > > (Resending the reply as I messed up the last one, was not in plain text)
> > >
> > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > experience. Here are the numbers:
> > >
> > > Base: 40198.0 mbps
> > > Patched: 38629.7 mbps
> > >
> > > The regression is reduced to ~3.9%.
> > >
> > > What's more interesting is that going from a level 2 cgroup to a level
> > > 4 cgroup is already a big hit with or without this patch:
> > >
> > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > >
> > > So going from level 2 to 4 is already a significant regression for
> > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > marginally worse. This puts the numbers more into perspective imo than
> > > comparing values at level 4. What do you think?
> >
> > I think it's reasonable.
> >
> > Especially comparing to how many cachelines we used to touch on the
> > write side when all flushing happened there. This looks like a good
> > trade-off to me.
> 
> Thanks.
> 
> Still wanting to figure out if this patch is what you suggested in our
> previous discussion [1], to add a
> Suggested-by if appropriate :)
> 
> [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/

Haha, sort of. I suggested the cgroup-level flush-batching, but my
proposal was missing the clever upward propagation of the pending stat
updates that you added.

You can add the tag if you're feeling generous, but I wouldn't be mad
if you don't!
Yosry Ahmed Oct. 13, 2023, 2:38 a.m. UTC | #21
On Thu, Oct 12, 2023 at 7:33 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> > [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > I think it's reasonable.
> > >
> > > Especially comparing to how many cachelines we used to touch on the
> > > write side when all flushing happened there. This looks like a good
> > > trade-off to me.
> >
> > Thanks.
> >
> > Still wanting to figure out if this patch is what you suggested in our
> > previous discussion [1], to add a
> > Suggested-by if appropriate :)
> >
> > [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
>
> Haha, sort of. I suggested the cgroup-level flush-batching, but my
> proposal was missing the clever upward propagation of the pending stat
> updates that you added.
>
> You can add the tag if you're feeling generous, but I wouldn't be mad
> if you don't!

I like to think that I am a generous person :)

Will add it in the next respin.
Andrew Morton Oct. 14, 2023, 11:08 p.m. UTC | #22
On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:

Done.
Yosry Ahmed Oct. 16, 2023, 6:42 p.m. UTC | #23
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.

Thanks!
Yosry Ahmed Oct. 17, 2023, 11:52 p.m. UTC | #24
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.


Sorry Andrew, but could you please also take this fixlet?

From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 17 Oct 2023 23:07:59 +0000
Subject: [PATCH] mm: memcg: clear percpu stats_pending during stats flush

When flushing memcg stats, we clear the per-memcg count of pending stat
updates, as they are captured by the flush. Also clear the percpu count
for the cpu being flushed.

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b1377b16b3e0..fa92de780ac89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5653,6 +5653,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
                        }
                }
        }
+       statc->stats_updates = 0;
        /* We are in a per-cpu loop here, only do the atomic write once */
        if (atomic64_read(&memcg->vmstats->stats_updates))
                atomic64_set(&memcg->vmstats->stats_updates, 0);
--
2.42.0.655.g421f12c284-goog
Oliver Sang Oct. 18, 2023, 8:22 a.m. UTC | #25
hi, Yosry Ahmed, hi, Shakeel Butt,

On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > > >
> > > > Yes this looks better. I think we should also ask intel perf and
> > > > phoronix folks to run their benchmarks as well (but no need to block
> > > > on them).
> > >
> > > Anything I need to do for this to happen? (I thought such testing is
> > > already done on linux-next)
> >
> > Just Cced the relevant folks.
> >
> > Michael, Oliver & Feng, if you have some time/resource available,
> > please do trigger your performance benchmarks on the following series
> > (but nothing urgent):
> >
> > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/
> 
> Thanks for that.

we (0day team) have already applied the patch-set as:

c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set

they're already in our so-called hourly-kernel, which is under various
functional and performance tests.

our 0day test logic is: if we find any regression in these hourly-kernels
compared to base (e.g. a milestone release), auto-bisect will be triggered.
then we only report when we capture a first bad commit for a regression.

based on this, if you don't receive any report in the following 2-3 weeks,
you can assume 0day did not capture any regression from your patch-set.

*However*, please be aware that 0day is not a traditional CI system, and also
due to resource constraints, we cannot guarantee coverage, and we cannot
trigger specific tests for your patchset, either.
(sorry if this is not your expectation)


> 
> >
> > >
> > > Also, any further comments on the patch (or the series in general)? If
> > > not, I can send a new commit message for this patch in-place.
> >
> > Sorry, I haven't taken a look yet but will try in a week or so.
> 
> Sounds good, thanks.
> 
> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:
> 
> Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> 
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
> 
> Make this threshold per-memcg. The scheme is followed where percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
> 
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
> 
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters remove this blocker and allow subtree flushes, which help
> avoid unnecessary work when the stats of a small subtree are needed.
> 
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
> 
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept; adding
> a cgroup to the rstat tree involves a parent loop, as does charging.
> Testing results below show no significant regressions.
> 
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
> 
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> ran in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
> 
> (1) Running 22 instances of netperf on a 44 cpu machine with
> hyperthreading disabled. All instances are run in a level 2 cgroup, as
> well as netserver:
>   # netserver -6
>   # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Averaging 20 runs, the numbers are as follows:
> Base: 40198.0 mbps
> Patched: 38629.7 mbps (-3.9%)
> 
> The regression is minimal, especially for 22 instances in the same
> cgroup sharing all ancestors (so updating the same atomics).
> 
> (2) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) detected a 25.9% regression before
> for a change in the stats update path [1]. These are the
> numbers from 10 runs (+ is good) on a machine with 256 cpus:
> 
>                LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
> ------------------------------+-------------+-------------+-------------
>   page_fault1_per_process_ops |             |             |            |
>   (A) base                    | 270249.164  | 265437.000  | 13451.836  |
>   (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
>                               | -3.29%      | -3.66%      |            |
>   page_fault1_per_thread_ops  |             |             |            |
>   (A) base                    | 242111.345  | 239737.000  | 10026.031  |
>   (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
>                               | -2.09%      | -1.85%      |            |
>   page_fault1_scalability     |             |             |
>   (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
>   (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
>                               | -1.16%      | -1.69%      |            |
>   page_fault2_per_process_ops |             |             |
>   (A) base                    | 203561.836  | 203301.000  | 2550.764   |
>   (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
>                               | -3.13%      | -2.73%      |            |
>   page_fault2_per_thread_ops  |             |             |
>   (A) base                    | 171046.473  | 170776.000  | 1509.679   |
>   (B) patched                 | 166626.327  | 166406.000  | 768.753    |
>                               | -2.58%      | -2.56%      |            |
>   page_fault2_scalability     |             |             |
>   (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
>   (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
>                               | -1.29%      | -1.41%      |            |
>   page_fault3_per_process_ops |             |             |
>   (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
>   (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
>                               | -1.56%      | -1.86%      |            |
>   page_fault3_per_thread_ops  |             |             |
>   (A) base                    | 391234.164  | 390860.000  | 1760.720   |
>   (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
>                               | -3.58%      | -3.71%      |            |
>   page_fault3_scalability     |             |             |
>   (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
>   (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
>                               | +2.26%      | +2.45%      |            |
> 
> All regressions seem to be minimal, and within the normal variance for
> the benchmark. The fix for [1] assumed that 3% is noise (and there were no
> further practical complaints), so hopefully such variations in these
> microbenchmarks do not reflect regressions in practical workloads.
> 
> (3) I also ran stress-ng in a nested cgroup and did not observe any
> obvious regressions.
> 
> [1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Yosry Ahmed Oct. 18, 2023, 8:54 a.m. UTC | #26
On Wed, Oct 18, 2023 at 1:22 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed, hi, Shakeel Butt,
>
> On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> > On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > > >
> > > > > Yes this looks better. I think we should also ask intel perf and
> > > > > phoronix folks to run their benchmarks as well (but no need to block
> > > > > on them).
> > > >
> > > > Anything I need to do for this to happen? (I thought such testing is
> > > > already done on linux-next)
> > >
> > > Just Cced the relevant folks.
> > >
> > > Michael, Oliver & Feng, if you have some time/resource available,
> > > please do trigger your performance benchmarks on the following series
> > > (but nothing urgent):
> > >
> > > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/
> >
> > Thanks for that.
>
> we (0day team) have already applied the patch-set as:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
>
> they're already in our so-called hourly-kernel, which is under various
> functional and performance tests.
>
> our 0day test logic is: if we find any regression in these hourly-kernels
> compared to base (e.g. a milestone release), auto-bisect will be triggered.
> then we only report when we capture a first bad commit for a regression.
>
> based on this, if you don't receive any report in the following 2-3 weeks,
> you can assume 0day did not capture any regression from your patch-set.
>
> *However*, please be aware that 0day is not a traditional CI system, and also
> due to resource constraints, we cannot guarantee coverage, and we cannot
> trigger specific tests for your patchset, either.
> (sorry if this is not your expectation)
>

Thanks for taking a look and clarifying this, much appreciated.
Fingers crossed for not getting any reports :)
Oliver Sang Oct. 20, 2023, 4:17 p.m. UTC | #27
Hello,

kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:


commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg

testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:

	nr_task: 100%
	mode: thread
	test: fallocate1
	cpufreq_governor: performance


In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
| test machine     | 104 threads 2 sockets (Skylake) with 192G memory              |
| test parameters  | cpufreq_governor=performance                                  |
|                  | mode=thread                                                   |
|                  | nr_task=50%                                                   |
|                  | test=fallocate1                                               |
+------------------+---------------------------------------------------------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310202303.c68e7639-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231020/202310202303.c68e7639-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

commit: 
  130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
  51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      2.09            -0.5        1.61 ±  2%  mpstat.cpu.all.usr%
     27.58            +3.7%      28.59        turbostat.RAMWatt
      3324           -10.0%       2993        vmstat.system.cs
      1056          -100.0%       0.00        numa-meminfo.node0.Inactive(file)
      6.67 ±141%  +15799.3%       1059        numa-meminfo.node1.Inactive(file)
    120.83 ± 11%     +79.6%     217.00 ±  9%  perf-c2c.DRAM.local
    594.50 ±  6%     +43.8%     854.83 ±  5%  perf-c2c.DRAM.remote
   3797041           -25.8%    2816352        will-it-scale.104.threads
     36509           -25.8%      27079        will-it-scale.per_thread_ops
   3797041           -25.8%    2816352        will-it-scale.workload
 1.142e+09           -26.2%  8.437e+08        numa-numastat.node0.local_node
 1.143e+09           -26.1%  8.439e+08        numa-numastat.node0.numa_hit
 1.148e+09           -25.4%  8.563e+08 ±  2%  numa-numastat.node1.local_node
 1.149e+09           -25.4%  8.564e+08 ±  2%  numa-numastat.node1.numa_hit
     32933            -2.6%      32068        proc-vmstat.nr_slab_reclaimable
 2.291e+09           -25.8%    1.7e+09        proc-vmstat.numa_hit
 2.291e+09           -25.8%    1.7e+09        proc-vmstat.numa_local
  2.29e+09           -25.8%  1.699e+09        proc-vmstat.pgalloc_normal
 2.289e+09           -25.8%  1.699e+09        proc-vmstat.pgfree
      1.00 ± 93%    +154.2%       2.55 ± 16%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
    191.10 ±  2%     +18.0%     225.55 ±  2%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    385.50 ± 14%     +39.6%     538.17 ± 12%  perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
    118.67 ± 11%     -62.6%      44.33 ±100%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      5043 ±  2%     -13.0%       4387 ±  6%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    167.12 ±222%    +200.1%     501.48 ± 99%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
    191.09 ±  2%     +18.0%     225.53 ±  2%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    293.46 ±  4%     +12.8%     330.98 ±  6%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    199.33          -100.0%       0.00        numa-vmstat.node0.nr_active_file
    264.00          -100.0%       0.00        numa-vmstat.node0.nr_inactive_file
    199.33          -100.0%       0.00        numa-vmstat.node0.nr_zone_active_file
    264.00          -100.0%       0.00        numa-vmstat.node0.nr_zone_inactive_file
 1.143e+09           -26.1%  8.439e+08        numa-vmstat.node0.numa_hit
 1.142e+09           -26.2%  8.437e+08        numa-vmstat.node0.numa_local
      1.67 ±141%  +15799.3%     264.99        numa-vmstat.node1.nr_inactive_file
      1.67 ±141%  +15799.3%     264.99        numa-vmstat.node1.nr_zone_inactive_file
 1.149e+09           -25.4%  8.564e+08 ±  2%  numa-vmstat.node1.numa_hit
 1.148e+09           -25.4%  8.563e+08 ±  2%  numa-vmstat.node1.numa_local
      0.59 ±  3%    +125.2%       1.32 ±  2%  perf-stat.i.MPKI
 9.027e+09           -17.9%  7.408e+09        perf-stat.i.branch-instructions
      0.64            -0.0        0.60        perf-stat.i.branch-miss-rate%
  58102855           -23.3%   44580037 ±  2%  perf-stat.i.branch-misses
     15.28            +7.0       22.27        perf-stat.i.cache-miss-rate%
  25155306 ±  2%     +82.7%   45953601 ±  3%  perf-stat.i.cache-misses
 1.644e+08           +25.4%  2.062e+08 ±  2%  perf-stat.i.cache-references
      3258           -10.3%       2921        perf-stat.i.context-switches
      6.73           +23.3%       8.30        perf-stat.i.cpi
    145.97            -1.3%     144.13        perf-stat.i.cpu-migrations
     11519 ±  3%     -45.4%       6293 ±  3%  perf-stat.i.cycles-between-cache-misses
      0.04            -0.0        0.03        perf-stat.i.dTLB-load-miss-rate%
   3921408           -25.3%    2929564        perf-stat.i.dTLB-load-misses
 1.098e+10           -18.1%  8.993e+09        perf-stat.i.dTLB-loads
      0.00 ±  2%      +0.0        0.00 ±  4%  perf-stat.i.dTLB-store-miss-rate%
 5.606e+09           -23.2%  4.304e+09        perf-stat.i.dTLB-stores
     95.65            -1.2       94.49        perf-stat.i.iTLB-load-miss-rate%
   3876741           -25.0%    2905764        perf-stat.i.iTLB-load-misses
 4.286e+10           -18.9%  3.477e+10        perf-stat.i.instructions
     11061            +8.2%      11969        perf-stat.i.instructions-per-iTLB-miss
      0.15           -18.9%       0.12        perf-stat.i.ipc
     48.65 ±  2%     +46.2%      71.11 ±  2%  perf-stat.i.metric.K/sec
    247.84           -18.9%     201.05        perf-stat.i.metric.M/sec
   3138385 ±  2%     +77.7%    5578401 ±  2%  perf-stat.i.node-load-misses
    375827 ±  3%     +69.2%     635857 ± 11%  perf-stat.i.node-loads
   1343194           -26.8%     983668        perf-stat.i.node-store-misses
     51550 ±  3%     -19.0%      41748 ±  7%  perf-stat.i.node-stores
      0.59 ±  3%    +125.1%       1.32 ±  2%  perf-stat.overall.MPKI
      0.64            -0.0        0.60        perf-stat.overall.branch-miss-rate%
     15.30            +7.0       22.28        perf-stat.overall.cache-miss-rate%
      6.73           +23.3%       8.29        perf-stat.overall.cpi
     11470 ±  2%     -45.3%       6279 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.04            -0.0        0.03        perf-stat.overall.dTLB-load-miss-rate%
      0.00 ±  2%      +0.0        0.00 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
     95.56            -1.4       94.17        perf-stat.overall.iTLB-load-miss-rate%
     11059            +8.2%      11967        perf-stat.overall.instructions-per-iTLB-miss
      0.15           -18.9%       0.12        perf-stat.overall.ipc
   3396437            +9.5%    3718021        perf-stat.overall.path-length
 8.997e+09           -17.9%  7.383e+09        perf-stat.ps.branch-instructions
  57910417           -23.3%   44426577 ±  2%  perf-stat.ps.branch-misses
  25075498 ±  2%     +82.7%   45803186 ±  3%  perf-stat.ps.cache-misses
 1.639e+08           +25.4%  2.056e+08 ±  2%  perf-stat.ps.cache-references
      3247           -10.3%       2911        perf-stat.ps.context-switches
    145.47            -1.3%     143.61        perf-stat.ps.cpu-migrations
   3908900           -25.3%    2920218        perf-stat.ps.dTLB-load-misses
 1.094e+10           -18.1%  8.963e+09        perf-stat.ps.dTLB-loads
 5.587e+09           -23.2%  4.289e+09        perf-stat.ps.dTLB-stores
   3863663           -25.0%    2895895        perf-stat.ps.iTLB-load-misses
 4.272e+10           -18.9%  3.466e+10        perf-stat.ps.instructions
   3128132 ±  2%     +77.7%    5559939 ±  2%  perf-stat.ps.node-load-misses
    375403 ±  3%     +69.0%     634300 ± 11%  perf-stat.ps.node-loads
   1338688           -26.8%     980311        perf-stat.ps.node-store-misses
     51546 ±  3%     -19.1%      41692 ±  7%  perf-stat.ps.node-stores
  1.29e+13           -18.8%  1.047e+13        perf-stat.total.instructions
      0.96            -0.3        0.70 ±  2%  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.97            -0.3        0.72        perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
      0.76 ±  2%      -0.2        0.54 ±  3%  perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.82            -0.2        0.60 ±  2%  perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.91            -0.2        0.72        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.68            +0.1        0.76 ±  2%  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.67            +0.1        1.77        perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.78 ±  2%      +0.1        1.92 ±  2%  perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
      0.69 ±  5%      +0.1        0.84 ±  4%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      1.56 ±  2%      +0.2        1.76 ±  2%  perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
      0.85 ±  4%      +0.4        1.23 ±  2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.78 ±  4%      +0.4        1.20 ±  3%  perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
      0.73 ±  4%      +0.4        1.17 ±  3%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
     48.39            +0.8       49.14        perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.00            +0.8        0.77 ±  4%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
     40.24            +0.8       41.03        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     40.22            +0.8       41.01        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
      0.00            +0.8        0.79 ±  3%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     40.19            +0.8       40.98        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
      1.33 ±  5%      +0.8        2.13 ±  4%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
     48.16            +0.8       48.98        perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.00            +0.9        0.88 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
     47.92            +0.9       48.81        perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     47.07            +0.9       48.01        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
     46.59            +1.1       47.64        perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
      0.99            -0.3        0.73 ±  2%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.96            -0.3        0.70 ±  2%  perf-profile.children.cycles-pp.shmem_alloc_folio
      0.78 ±  2%      -0.2        0.56 ±  3%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
      0.83            -0.2        0.61 ±  2%  perf-profile.children.cycles-pp.alloc_pages_mpol
      0.92            -0.2        0.73        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.74 ±  2%      -0.2        0.55 ±  2%  perf-profile.children.cycles-pp.xas_store
      0.67            -0.2        0.50 ±  3%  perf-profile.children.cycles-pp.__alloc_pages
      0.43            -0.1        0.31 ±  2%  perf-profile.children.cycles-pp.__entry_text_start
      0.41 ±  2%      -0.1        0.30 ±  3%  perf-profile.children.cycles-pp.free_unref_page_list
      0.35            -0.1        0.25 ±  2%  perf-profile.children.cycles-pp.xas_load
      0.35 ±  2%      -0.1        0.25 ±  4%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.39            -0.1        0.30 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.27 ±  2%      -0.1        0.19 ±  4%  perf-profile.children.cycles-pp.__mod_node_page_state
      0.32 ±  3%      -0.1        0.24 ±  3%  perf-profile.children.cycles-pp.find_lock_entries
      0.23 ±  2%      -0.1        0.15 ±  4%  perf-profile.children.cycles-pp.xas_descend
      0.28 ±  3%      -0.1        0.20 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.25 ±  3%      -0.1        0.18 ±  3%  perf-profile.children.cycles-pp.__dquot_alloc_space
      0.16 ±  3%      -0.1        0.10 ±  5%  perf-profile.children.cycles-pp.xas_find_conflict
      0.26 ±  2%      -0.1        0.20 ±  3%  perf-profile.children.cycles-pp.filemap_get_entry
      0.26            -0.1        0.20 ±  2%  perf-profile.children.cycles-pp.rmqueue
      0.20 ±  3%      -0.1        0.14 ±  3%  perf-profile.children.cycles-pp.truncate_cleanup_folio
      0.19 ±  5%      -0.1        0.14 ±  4%  perf-profile.children.cycles-pp.xas_clear_mark
      0.17 ±  5%      -0.0        0.12 ±  4%  perf-profile.children.cycles-pp.xas_init_marks
      0.15 ±  4%      -0.0        0.10 ±  4%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.18 ±  3%      -0.0        0.14 ±  3%  perf-profile.children.cycles-pp.__cond_resched
      0.07 ±  5%      -0.0        0.02 ± 99%  perf-profile.children.cycles-pp.xas_find
      0.13 ±  2%      -0.0        0.09        perf-profile.children.cycles-pp.security_vm_enough_memory_mm
      0.14 ±  4%      -0.0        0.10 ±  7%  perf-profile.children.cycles-pp.__fget_light
      0.06 ±  6%      -0.0        0.02 ± 99%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.12 ±  4%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.xas_start
      0.08 ±  5%      -0.0        0.05        perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.12            -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.folio_unlock
      0.14 ±  3%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.try_charge_memcg
      0.12 ±  6%      -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.12 ±  3%      -0.0        0.09 ±  4%  perf-profile.children.cycles-pp.noop_dirty_folio
      0.20 ±  2%      -0.0        0.17 ±  5%  perf-profile.children.cycles-pp.page_counter_uncharge
      0.10            -0.0        0.07 ±  5%  perf-profile.children.cycles-pp.cap_vm_enough_memory
      0.09 ±  6%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp._raw_spin_trylock
      0.09 ±  5%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp.inode_add_bytes
      0.06 ±  6%      -0.0        0.03 ± 70%  perf-profile.children.cycles-pp.filemap_free_folio
      0.06 ±  6%      -0.0        0.03 ± 70%  perf-profile.children.cycles-pp.percpu_counter_add_batch
      0.12 ±  3%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.__folio_cancel_dirty
      0.12 ±  3%      -0.0        0.10 ±  5%  perf-profile.children.cycles-pp.shmem_recalc_inode
      0.09 ±  5%      -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.__vm_enough_memory
      0.08 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.08 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.security_file_permission
      0.08 ±  6%      -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.apparmor_file_permission
      0.09 ±  4%      -0.0        0.07 ±  8%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
      0.08 ±  6%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.__list_add_valid_or_report
      0.07 ±  8%      -0.0        0.05        perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.14 ±  3%      -0.0        0.12 ±  6%  perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.07 ±  5%      -0.0        0.05        perf-profile.children.cycles-pp.policy_nodemask
      0.24 ±  2%      -0.0        0.22 ±  2%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.08            -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.xas_create
      0.69            +0.1        0.78        perf-profile.children.cycles-pp.lru_add_fn
      1.72 ±  2%      +0.1        1.80        perf-profile.children.cycles-pp.shmem_add_to_page_cache
      1.79 ±  2%      +0.1        1.93 ±  2%  perf-profile.children.cycles-pp.filemap_remove_folio
      0.13 ±  5%      +0.1        0.28        perf-profile.children.cycles-pp.file_modified
      0.69 ±  5%      +0.1        0.84 ±  3%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      0.09 ±  7%      +0.2        0.24 ±  2%  perf-profile.children.cycles-pp.inode_needs_update_time
      1.58 ±  3%      +0.2        1.77 ±  2%  perf-profile.children.cycles-pp.__filemap_remove_folio
      0.15 ±  3%      +0.4        0.50 ±  3%  perf-profile.children.cycles-pp.__count_memcg_events
      0.79 ±  4%      +0.4        1.20 ±  3%  perf-profile.children.cycles-pp.filemap_unaccount_folio
      0.36 ±  5%      +0.4        0.77 ±  4%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
     98.33            +0.5       98.78        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     97.74            +0.6       98.34        perf-profile.children.cycles-pp.do_syscall_64
     48.39            +0.8       49.15        perf-profile.children.cycles-pp.__x64_sys_fallocate
      1.34 ±  5%      +0.8        2.14 ±  4%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      1.61 ±  4%      +0.8        2.42 ±  2%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
     48.17            +0.8       48.98        perf-profile.children.cycles-pp.vfs_fallocate
     47.94            +0.9       48.82        perf-profile.children.cycles-pp.shmem_fallocate
     47.10            +0.9       48.04        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     84.34            +0.9       85.28        perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
     84.31            +0.9       85.26        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     84.24            +1.0       85.21        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     46.65            +1.1       47.70        perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
      1.23 ±  4%      +1.4        2.58 ±  2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.98            -0.3        0.73 ±  2%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.88            -0.2        0.70        perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.60            -0.2        0.45        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.41 ±  3%      -0.1        0.27 ±  3%  perf-profile.self.cycles-pp.release_pages
      0.41            -0.1        0.30 ±  3%  perf-profile.self.cycles-pp.xas_store
      0.41 ±  3%      -0.1        0.29 ±  2%  perf-profile.self.cycles-pp.folio_batch_move_lru
      0.30 ±  3%      -0.1        0.18 ±  5%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
      0.38 ±  2%      -0.1        0.27 ±  2%  perf-profile.self.cycles-pp.__entry_text_start
      0.30 ±  3%      -0.1        0.20 ±  6%  perf-profile.self.cycles-pp.lru_add_fn
      0.28 ±  2%      -0.1        0.20 ±  5%  perf-profile.self.cycles-pp.shmem_fallocate
      0.26 ±  2%      -0.1        0.18 ±  5%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.27 ±  3%      -0.1        0.20 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.21 ±  2%      -0.1        0.15 ±  4%  perf-profile.self.cycles-pp.__alloc_pages
      0.20 ±  2%      -0.1        0.14 ±  3%  perf-profile.self.cycles-pp.xas_descend
      0.26 ±  3%      -0.1        0.20 ±  4%  perf-profile.self.cycles-pp.find_lock_entries
      0.18 ±  4%      -0.0        0.13 ±  5%  perf-profile.self.cycles-pp.xas_clear_mark
      0.15 ±  7%      -0.0        0.10 ± 11%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
      0.16 ±  4%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.__dquot_alloc_space
      0.13 ±  4%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.free_unref_page_commit
      0.13            -0.0        0.09 ±  5%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.16 ±  4%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
      0.13 ±  5%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.__filemap_remove_folio
      0.13 ±  2%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.12 ±  4%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.vfs_fallocate
      0.06 ±  7%      -0.0        0.02 ± 99%  perf-profile.self.cycles-pp.apparmor_file_permission
      0.13 ±  3%      -0.0        0.10 ±  5%  perf-profile.self.cycles-pp.fallocate64
      0.11 ±  4%      -0.0        0.07        perf-profile.self.cycles-pp.xas_start
      0.07 ±  5%      -0.0        0.03 ± 70%  perf-profile.self.cycles-pp.shmem_alloc_folio
      0.14 ±  4%      -0.0        0.10 ±  7%  perf-profile.self.cycles-pp.__fget_light
      0.10 ±  4%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.rmqueue
      0.12 ±  3%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.xas_load
      0.11 ±  4%      -0.0        0.08 ±  7%  perf-profile.self.cycles-pp.folio_unlock
      0.10 ±  4%      -0.0        0.07 ±  8%  perf-profile.self.cycles-pp.alloc_pages_mpol
      0.15 ±  2%      -0.0        0.12 ±  5%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.10            -0.0        0.07        perf-profile.self.cycles-pp.cap_vm_enough_memory
      0.16 ±  2%      -0.0        0.13 ±  6%  perf-profile.self.cycles-pp.page_counter_uncharge
      0.12 ±  5%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.__cond_resched
      0.06 ±  6%      -0.0        0.03 ± 70%  perf-profile.self.cycles-pp.filemap_free_folio
      0.12 ±  3%      -0.0        0.10 ±  5%  perf-profile.self.cycles-pp.free_unref_page_list
      0.12            -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.noop_dirty_folio
      0.10 ±  3%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.filemap_remove_folio
      0.10 ±  5%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.try_charge_memcg
      0.12 ±  3%      -0.0        0.10 ±  8%  perf-profile.self.cycles-pp.cgroup_rstat_updated
      0.09 ±  4%      -0.0        0.07 ±  7%  perf-profile.self.cycles-pp.__folio_cancel_dirty
      0.08 ±  4%      -0.0        0.06 ±  8%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.08 ±  5%      -0.0        0.06        perf-profile.self.cycles-pp._raw_spin_trylock
      0.08            -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.folio_add_lru
      0.08 ±  8%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.__mod_lruvec_state
      0.07 ±  5%      -0.0        0.05        perf-profile.self.cycles-pp.xas_find_conflict
      0.08 ± 10%      -0.0        0.06 ±  9%  perf-profile.self.cycles-pp.truncate_cleanup_folio
      0.07 ± 10%      -0.0        0.05        perf-profile.self.cycles-pp.xas_init_marks
      0.08 ±  4%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
      0.07 ±  7%      -0.0        0.05        perf-profile.self.cycles-pp.get_pfnblock_flags_mask
      0.07 ±  5%      -0.0        0.06 ±  8%  perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.02 ±141%      +0.0        0.06 ±  8%  perf-profile.self.cycles-pp.uncharge_batch
      0.21 ±  9%      +0.1        0.31 ±  7%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
      0.69 ±  5%      +0.1        0.83 ±  4%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.06 ±  6%      +0.2        0.22 ±  2%  perf-profile.self.cycles-pp.inode_needs_update_time
      0.14 ±  8%      +0.3        0.42 ±  7%  perf-profile.self.cycles-pp.__mem_cgroup_charge
      0.13 ±  7%      +0.4        0.49 ±  3%  perf-profile.self.cycles-pp.__count_memcg_events
     84.24            +1.0       85.21        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      1.12 ±  5%      +1.4        2.50 ±  2%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state


***************************************************************************************************
lkp-skl-fpga01: 104 threads 2 sockets (Skylake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

commit: 
  130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
  51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      1.87            -0.4        1.43 ±  3%  mpstat.cpu.all.usr%
      3171            -5.3%       3003 ±  2%  vmstat.system.cs
     84.83 ±  9%     +55.8%     132.17 ± 16%  perf-c2c.DRAM.local
    484.17 ±  3%     +37.1%     663.67 ± 10%  perf-c2c.DRAM.remote
     72763 ±  5%     +14.4%      83212 ± 12%  turbostat.C1
      0.08           -25.0%       0.06        turbostat.IPC
     27.90            +4.6%      29.18        turbostat.RAMWatt
   3982212           -30.0%    2785941        will-it-scale.52.threads
     76580           -30.0%      53575        will-it-scale.per_thread_ops
   3982212           -30.0%    2785941        will-it-scale.workload
 1.175e+09 ±  2%     -28.6%  8.392e+08 ±  2%  numa-numastat.node0.local_node
 1.175e+09 ±  2%     -28.6%  8.394e+08 ±  2%  numa-numastat.node0.numa_hit
 1.231e+09 ±  2%     -31.3%  8.463e+08 ±  3%  numa-numastat.node1.local_node
 1.232e+09 ±  2%     -31.3%  8.466e+08 ±  3%  numa-numastat.node1.numa_hit
 1.175e+09 ±  2%     -28.6%  8.394e+08 ±  2%  numa-vmstat.node0.numa_hit
 1.175e+09 ±  2%     -28.6%  8.392e+08 ±  2%  numa-vmstat.node0.numa_local
 1.232e+09 ±  2%     -31.3%  8.466e+08 ±  3%  numa-vmstat.node1.numa_hit
 1.231e+09 ±  2%     -31.3%  8.463e+08 ±  3%  numa-vmstat.node1.numa_local
 2.408e+09           -30.0%  1.686e+09        proc-vmstat.numa_hit
 2.406e+09           -30.0%  1.685e+09        proc-vmstat.numa_local
 2.404e+09           -29.9%  1.684e+09        proc-vmstat.pgalloc_normal
 2.404e+09           -29.9%  1.684e+09        proc-vmstat.pgfree
      0.04 ±  9%     -19.3%       0.03 ±  6%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.04 ±  8%     -22.3%       0.03 ±  5%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.91 ±  2%     +11.3%       1.01 ±  5%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.04 ± 13%     -90.3%       0.00 ±223%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      1.14           +15.1%       1.31        perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    189.94 ±  3%     +18.3%     224.73 ±  4%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1652 ±  4%     -13.4%       1431 ±  4%  perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
     83.67 ±  7%     -87.6%      10.33 ±223%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      3827 ±  4%     -13.0%       3328 ±  3%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.71 ±165%     -83.4%       0.28 ± 21%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.43 ± 17%     -43.8%       0.24 ± 26%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.46 ± 17%     -36.7%       0.29 ± 12%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.30 ± 34%     -90.7%       0.03 ±223%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      0.04 ±  9%     -19.3%       0.03 ±  6%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.04 ±  8%     -22.3%       0.03 ±  5%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.04 ± 11%     -33.1%       0.03 ± 17%  perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.90 ±  2%     +11.5%       1.00 ±  5%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.04 ± 13%     -26.6%       0.03 ± 12%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      1.13           +15.2%       1.30        perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    189.93 ±  3%     +18.3%     224.72 ±  4%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.71 ±165%     -83.4%       0.28 ± 21%  perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.43 ± 17%     -43.8%       0.24 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.46 ± 17%     -36.7%       0.29 ± 12%  perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.75          +142.0%       1.83 ±  2%  perf-stat.i.MPKI
  8.47e+09           -24.4%  6.407e+09        perf-stat.i.branch-instructions
      0.66            -0.0        0.63        perf-stat.i.branch-miss-rate%
  56364992           -28.3%   40421603 ±  3%  perf-stat.i.branch-misses
     14.64            +6.7       21.30        perf-stat.i.cache-miss-rate%
  30868184           +81.3%   55977240 ±  3%  perf-stat.i.cache-misses
 2.107e+08           +24.7%  2.627e+08 ±  2%  perf-stat.i.cache-references
      3106            -5.5%       2934 ±  2%  perf-stat.i.context-switches
      3.55           +33.4%       4.74        perf-stat.i.cpi
      4722           -44.8%       2605 ±  3%  perf-stat.i.cycles-between-cache-misses
      0.04            -0.0        0.04        perf-stat.i.dTLB-load-miss-rate%
   4117232           -29.1%    2917107        perf-stat.i.dTLB-load-misses
 1.051e+10           -24.1%  7.979e+09        perf-stat.i.dTLB-loads
      0.00 ±  3%      +0.0        0.00 ±  6%  perf-stat.i.dTLB-store-miss-rate%
 5.886e+09           -27.5%  4.269e+09        perf-stat.i.dTLB-stores
     78.16            -6.6       71.51        perf-stat.i.iTLB-load-miss-rate%
   4131074 ±  3%     -30.0%    2891515        perf-stat.i.iTLB-load-misses
 4.098e+10           -25.0%  3.072e+10        perf-stat.i.instructions
      9929 ±  2%      +7.0%      10627        perf-stat.i.instructions-per-iTLB-miss
      0.28           -25.0%       0.21        perf-stat.i.ipc
     63.49           +43.8%      91.27 ±  3%  perf-stat.i.metric.K/sec
    241.12           -24.6%     181.87        perf-stat.i.metric.M/sec
   3735316           +78.6%    6669641 ±  3%  perf-stat.i.node-load-misses
    377465 ±  4%     +86.1%     702512 ± 11%  perf-stat.i.node-loads
   1322217           -27.6%     957081 ±  5%  perf-stat.i.node-store-misses
     37459 ±  3%     -23.0%      28826 ±  5%  perf-stat.i.node-stores
      0.75          +141.8%       1.82 ±  2%  perf-stat.overall.MPKI
      0.67            -0.0        0.63        perf-stat.overall.branch-miss-rate%
     14.65            +6.7       21.30        perf-stat.overall.cache-miss-rate%
      3.55           +33.4%       4.73        perf-stat.overall.cpi
      4713           -44.8%       2601 ±  3%  perf-stat.overall.cycles-between-cache-misses
      0.04            -0.0        0.04        perf-stat.overall.dTLB-load-miss-rate%
      0.00 ±  3%      +0.0        0.00 ±  5%  perf-stat.overall.dTLB-store-miss-rate%
     78.14            -6.7       71.47        perf-stat.overall.iTLB-load-miss-rate%
      9927 ±  2%      +7.0%      10624        perf-stat.overall.instructions-per-iTLB-miss
      0.28           -25.0%       0.21        perf-stat.overall.ipc
   3098901            +7.1%    3318983        perf-stat.overall.path-length
 8.441e+09           -24.4%  6.385e+09        perf-stat.ps.branch-instructions
  56179581           -28.3%   40286337 ±  3%  perf-stat.ps.branch-misses
  30759982           +81.3%   55777812 ±  3%  perf-stat.ps.cache-misses
   2.1e+08           +24.6%  2.618e+08 ±  2%  perf-stat.ps.cache-references
      3095            -5.5%       2923 ±  2%  perf-stat.ps.context-switches
   4103292           -29.1%    2907270        perf-stat.ps.dTLB-load-misses
 1.048e+10           -24.1%  7.952e+09        perf-stat.ps.dTLB-loads
 5.866e+09           -27.5%  4.255e+09        perf-stat.ps.dTLB-stores
   4117020 ±  3%     -30.0%    2881750        perf-stat.ps.iTLB-load-misses
 4.084e+10           -25.0%  3.062e+10        perf-stat.ps.instructions
   3722149           +78.5%    6645867 ±  3%  perf-stat.ps.node-load-misses
    376240 ±  4%     +86.1%     700053 ± 11%  perf-stat.ps.node-loads
   1317772           -27.6%     953773 ±  5%  perf-stat.ps.node-store-misses
     37408 ±  3%     -23.2%      28748 ±  5%  perf-stat.ps.node-stores
 1.234e+13           -25.1%  9.246e+12        perf-stat.total.instructions
      1.28            -0.4        0.90 ±  2%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
      1.26 ±  2%      -0.4        0.90 ±  3%  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.08 ±  2%      -0.3        0.77 ±  3%  perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.92 ±  2%      -0.3        0.62 ±  3%  perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.84 ±  3%      -0.2        0.61 ±  3%  perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.26            -0.2        1.08        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
      1.26            -0.2        1.08        perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
      1.24            -0.2        1.06        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
      1.24            -0.2        1.06        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
      1.23            -0.2        1.06        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
      1.20            -0.2        1.04 ±  2%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.68 ±  3%      +0.0        0.72 ±  4%  perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
      1.08            +0.1        1.20        perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      2.91            +0.3        3.18 ±  2%  perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      2.56            +0.4        2.92 ±  2%  perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
      1.36 ±  3%      +0.4        1.76 ±  9%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      2.22            +0.5        2.68 ±  2%  perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
      0.00            +0.6        0.60 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
      2.33            +0.6        2.94        perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.00            +0.7        0.72 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
      0.69 ±  4%      +0.8        1.47 ±  3%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
      1.24 ±  2%      +0.8        2.04 ±  2%  perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
      0.00            +0.8        0.82 ±  4%  perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.17 ±  2%      +0.8        2.00 ±  2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
      0.59 ±  4%      +0.9        1.53        perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.38            +1.0        2.33 ±  2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.62 ±  3%      +1.0        1.66 ±  5%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
     38.70            +1.2       39.90        perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
     38.34            +1.3       39.65        perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     37.24            +1.6       38.86        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
     36.64            +1.8       38.40        perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
      2.47 ±  2%      +2.1        4.59 ±  8%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.30            -0.4        0.92 ±  2%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.28 ±  2%      -0.4        0.90 ±  3%  perf-profile.children.cycles-pp.shmem_alloc_folio
      1.10 ±  2%      -0.3        0.78 ±  3%  perf-profile.children.cycles-pp.alloc_pages_mpol
      0.96 ±  2%      -0.3        0.64 ±  3%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
      0.88            -0.3        0.58 ±  2%  perf-profile.children.cycles-pp.xas_store
      0.88 ±  3%      -0.2        0.64 ±  3%  perf-profile.children.cycles-pp.__alloc_pages
      0.61 ±  2%      -0.2        0.43 ±  3%  perf-profile.children.cycles-pp.__entry_text_start
      1.26            -0.2        1.09        perf-profile.children.cycles-pp.lru_add_drain_cpu
      0.56            -0.2        0.39 ±  4%  perf-profile.children.cycles-pp.free_unref_page_list
      1.22            -0.2        1.06 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.46            -0.1        0.32 ±  3%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.41 ±  3%      -0.1        0.28 ±  4%  perf-profile.children.cycles-pp.xas_load
      0.44 ±  4%      -0.1        0.31 ±  4%  perf-profile.children.cycles-pp.find_lock_entries
      0.50 ±  3%      -0.1        0.37 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.24 ±  7%      -0.1        0.12 ±  5%  perf-profile.children.cycles-pp.__list_add_valid_or_report
      0.34 ±  2%      -0.1        0.24 ±  4%  perf-profile.children.cycles-pp.__mod_node_page_state
      0.38 ±  3%      -0.1        0.28 ±  4%  perf-profile.children.cycles-pp._raw_spin_lock
      0.32 ±  2%      -0.1        0.22 ±  5%  perf-profile.children.cycles-pp.__dquot_alloc_space
      0.26 ±  2%      -0.1        0.17 ±  2%  perf-profile.children.cycles-pp.xas_descend
      0.22 ±  3%      -0.1        0.14 ±  4%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.25            -0.1        0.17 ±  3%  perf-profile.children.cycles-pp.xas_clear_mark
      0.32 ±  4%      -0.1        0.25 ±  3%  perf-profile.children.cycles-pp.rmqueue
      0.23 ±  2%      -0.1        0.16 ±  2%  perf-profile.children.cycles-pp.xas_init_marks
      0.24 ±  2%      -0.1        0.17 ±  5%  perf-profile.children.cycles-pp.__cond_resched
      0.25 ±  4%      -0.1        0.18 ±  2%  perf-profile.children.cycles-pp.truncate_cleanup_folio
      0.30 ±  3%      -0.1        0.23 ±  4%  perf-profile.children.cycles-pp.filemap_get_entry
      0.20 ±  2%      -0.1        0.13 ±  5%  perf-profile.children.cycles-pp.folio_unlock
      0.16 ±  4%      -0.1        0.10 ±  5%  perf-profile.children.cycles-pp.xas_find_conflict
      0.19 ±  3%      -0.1        0.13 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.17 ±  5%      -0.1        0.12 ±  3%  perf-profile.children.cycles-pp.noop_dirty_folio
      0.13 ±  4%      -0.1        0.08 ±  9%  perf-profile.children.cycles-pp.security_vm_enough_memory_mm
      0.18 ±  8%      -0.1        0.13 ±  4%  perf-profile.children.cycles-pp.shmem_recalc_inode
      0.16 ±  2%      -0.1        0.11 ±  3%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.09 ±  5%      -0.1        0.04 ± 45%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
      0.10 ±  7%      -0.0        0.05 ± 45%  perf-profile.children.cycles-pp.cap_vm_enough_memory
      0.14 ±  5%      -0.0        0.10        perf-profile.children.cycles-pp.__folio_cancel_dirty
      0.14 ±  5%      -0.0        0.10 ±  4%  perf-profile.children.cycles-pp.security_file_permission
      0.10 ±  5%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.xas_find
      0.15 ±  4%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.__fget_light
      0.14 ±  5%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.file_modified
      0.12 ±  3%      -0.0        0.09 ±  7%  perf-profile.children.cycles-pp.__vm_enough_memory
      0.12 ±  3%      -0.0        0.09 ±  4%  perf-profile.children.cycles-pp.apparmor_file_permission
      0.12 ±  3%      -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.12 ±  4%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.xas_start
      0.09            -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.12 ±  6%      -0.0        0.08 ±  8%  perf-profile.children.cycles-pp._raw_spin_trylock
      0.12 ±  4%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
      0.12 ±  4%      -0.0        0.09 ±  4%  perf-profile.children.cycles-pp.inode_add_bytes
      0.20 ±  2%      -0.0        0.17 ±  7%  perf-profile.children.cycles-pp.try_charge_memcg
      0.10 ±  5%      -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.policy_nodemask
      0.09 ±  6%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.09 ±  6%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp.filemap_free_folio
      0.07 ±  6%      -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.down_write
      0.08 ±  4%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.get_task_policy
      0.09 ±  5%      -0.0        0.07 ±  5%  perf-profile.children.cycles-pp.xas_create
      0.09 ±  7%      -0.0        0.07        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.09 ±  7%      -0.0        0.07        perf-profile.children.cycles-pp.inode_needs_update_time
      0.16 ±  2%      -0.0        0.14 ±  5%  perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.08 ±  7%      -0.0        0.06 ±  9%  perf-profile.children.cycles-pp.percpu_counter_add_batch
      0.07 ±  5%      -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.folio_mark_dirty
      0.08 ± 10%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.shmem_is_huge
      0.07 ±  6%      +0.0        0.09 ± 10%  perf-profile.children.cycles-pp.propagate_protected_usage
      0.43 ±  3%      +0.0        0.46 ±  5%  perf-profile.children.cycles-pp.uncharge_batch
      0.68 ±  3%      +0.0        0.73 ±  4%  perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
      1.11            +0.1        1.22        perf-profile.children.cycles-pp.lru_add_fn
      2.91            +0.3        3.18 ±  2%  perf-profile.children.cycles-pp.truncate_inode_folio
      2.56            +0.4        2.92 ±  2%  perf-profile.children.cycles-pp.filemap_remove_folio
      1.37 ±  3%      +0.4        1.76 ±  9%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      2.24            +0.5        2.70 ±  2%  perf-profile.children.cycles-pp.__filemap_remove_folio
      2.38            +0.6        2.97        perf-profile.children.cycles-pp.shmem_add_to_page_cache
      0.18 ±  4%      +0.7        0.91 ±  4%  perf-profile.children.cycles-pp.__count_memcg_events
      1.26            +0.8        2.04 ±  2%  perf-profile.children.cycles-pp.filemap_unaccount_folio
      0.63 ±  2%      +1.0        1.67 ±  5%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
     38.71            +1.2       39.91        perf-profile.children.cycles-pp.vfs_fallocate
     38.37            +1.3       39.66        perf-profile.children.cycles-pp.shmem_fallocate
     37.28            +1.6       38.89        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     36.71            +1.7       38.45        perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
      2.58            +1.8        4.36 ±  2%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
      2.48 ±  2%      +2.1        4.60 ±  8%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      1.93 ±  3%      +2.4        4.36 ±  2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      1.30            -0.4        0.92 ±  2%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.73            -0.2        0.52 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.54 ±  2%      -0.2        0.36 ±  3%  perf-profile.self.cycles-pp.release_pages
      0.48            -0.2        0.30 ±  3%  perf-profile.self.cycles-pp.xas_store
      0.54 ±  2%      -0.2        0.38 ±  3%  perf-profile.self.cycles-pp.__entry_text_start
      1.17            -0.1        1.03 ±  2%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.36 ±  2%      -0.1        0.22 ±  3%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
      0.43 ±  5%      -0.1        0.30 ±  7%  perf-profile.self.cycles-pp.lru_add_fn
      0.24 ±  7%      -0.1        0.12 ±  6%  perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.38 ±  4%      -0.1        0.27 ±  4%  perf-profile.self.cycles-pp._raw_spin_lock
      0.52 ±  3%      -0.1        0.41        perf-profile.self.cycles-pp.folio_batch_move_lru
      0.32 ±  2%      -0.1        0.22 ±  4%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.36 ±  4%      -0.1        0.26 ±  4%  perf-profile.self.cycles-pp.find_lock_entries
      0.36 ±  2%      -0.1        0.26 ±  2%  perf-profile.self.cycles-pp.shmem_fallocate
      0.28 ±  3%      -0.1        0.20 ±  5%  perf-profile.self.cycles-pp.__alloc_pages
      0.24 ±  2%      -0.1        0.16 ±  4%  perf-profile.self.cycles-pp.xas_descend
      0.23 ±  2%      -0.1        0.16 ±  3%  perf-profile.self.cycles-pp.xas_clear_mark
      0.18 ±  3%      -0.1        0.11 ±  6%  perf-profile.self.cycles-pp.free_unref_page_commit
      0.18 ±  3%      -0.1        0.12 ±  4%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
      0.21 ±  3%      -0.1        0.15 ±  2%  perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
      0.18 ±  2%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp.__filemap_remove_folio
      0.18 ±  7%      -0.1        0.12 ±  7%  perf-profile.self.cycles-pp.vfs_fallocate
      0.20 ±  2%      -0.1        0.14 ±  6%  perf-profile.self.cycles-pp.__dquot_alloc_space
      0.18 ±  2%      -0.1        0.13 ±  3%  perf-profile.self.cycles-pp.folio_unlock
      0.18 ±  2%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.15 ±  3%      -0.1        0.10 ±  7%  perf-profile.self.cycles-pp.xas_load
      0.17 ±  3%      -0.1        0.12 ±  8%  perf-profile.self.cycles-pp.__cond_resched
      0.17 ±  2%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.17 ±  5%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp.noop_dirty_folio
      0.10 ±  7%      -0.0        0.05 ± 45%  perf-profile.self.cycles-pp.cap_vm_enough_memory
      0.12 ±  3%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.rmqueue
      0.07 ±  5%      -0.0        0.02 ± 99%  perf-profile.self.cycles-pp.xas_find
      0.13 ±  3%      -0.0        0.09 ±  6%  perf-profile.self.cycles-pp.alloc_pages_mpol
      0.07 ±  6%      -0.0        0.03 ± 70%  perf-profile.self.cycles-pp.xas_find_conflict
      0.16 ±  2%      -0.0        0.12 ±  6%  perf-profile.self.cycles-pp.free_unref_page_list
      0.12 ±  5%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.fallocate64
      0.20 ±  4%      -0.0        0.16 ±  3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.06 ±  7%      -0.0        0.02 ± 99%  perf-profile.self.cycles-pp.shmem_recalc_inode
      0.13 ±  3%      -0.0        0.09        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.22 ±  3%      -0.0        0.19 ±  6%  perf-profile.self.cycles-pp.page_counter_uncharge
      0.14 ±  3%      -0.0        0.10 ±  6%  perf-profile.self.cycles-pp.filemap_remove_folio
      0.15 ±  5%      -0.0        0.11 ±  3%  perf-profile.self.cycles-pp.__fget_light
      0.12 ±  4%      -0.0        0.08        perf-profile.self.cycles-pp.__folio_cancel_dirty
      0.11 ±  4%      -0.0        0.08 ±  7%  perf-profile.self.cycles-pp._raw_spin_trylock
      0.12 ±  3%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.__mod_lruvec_state
      0.11 ±  5%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.truncate_cleanup_folio
      0.11 ±  3%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
      0.11 ±  3%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.xas_start
      0.10 ±  6%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.xas_init_marks
      0.09 ±  6%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
      0.11            -0.0        0.08 ±  5%  perf-profile.self.cycles-pp.folio_add_lru
      0.09 ±  6%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.filemap_free_folio
      0.09 ±  4%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.shmem_alloc_folio
      0.14 ±  5%      -0.0        0.12 ±  5%  perf-profile.self.cycles-pp.cgroup_rstat_updated
      0.10 ±  4%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.apparmor_file_permission
      0.07 ±  7%      -0.0        0.04 ± 44%  perf-profile.self.cycles-pp.policy_nodemask
      0.07 ± 11%      -0.0        0.04 ± 45%  perf-profile.self.cycles-pp.shmem_is_huge
      0.08 ±  4%      -0.0        0.06 ±  8%  perf-profile.self.cycles-pp.get_task_policy
      0.08 ±  6%      -0.0        0.05 ±  8%  perf-profile.self.cycles-pp.__x64_sys_fallocate
      0.12 ±  3%      -0.0        0.10 ±  6%  perf-profile.self.cycles-pp.try_charge_memcg
      0.07            -0.0        0.05        perf-profile.self.cycles-pp.free_unref_page_prepare
      0.07 ±  6%      -0.0        0.06 ±  9%  perf-profile.self.cycles-pp.percpu_counter_add_batch
      0.08 ±  4%      -0.0        0.06        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.09 ±  7%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.filemap_get_entry
      0.07 ±  9%      +0.0        0.09 ± 10%  perf-profile.self.cycles-pp.propagate_protected_usage
      0.96 ±  2%      +0.2        1.12 ±  7%  perf-profile.self.cycles-pp.__mod_lruvec_page_state
      0.45 ±  4%      +0.4        0.82 ±  8%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
      1.36 ±  3%      +0.4        1.75 ±  9%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.29            +0.7        1.00 ± 10%  perf-profile.self.cycles-pp.__mem_cgroup_charge
      0.16 ±  4%      +0.7        0.90 ±  4%  perf-profile.self.cycles-pp.__count_memcg_events
      1.80 ±  2%      +2.5        4.26 ±  2%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Shakeel Butt Oct. 20, 2023, 5:23 p.m. UTC | #28
On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
>
>
> commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> testcase: will-it-scale
> test machine: 104 threads 2 sockets (Skylake) with 192G memory
> parameters:
>
>         nr_task: 100%
>         mode: thread
>         test: fallocate1
>         cpufreq_governor: performance
>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> | test machine     | 104 threads 2 sockets (Skylake) with 192G memory              |
> | test parameters  | cpufreq_governor=performance                                  |
> |                  | mode=thread                                                   |
> |                  | nr_task=50%                                                   |
> |                  | test=fallocate1                                               |
> +------------------+---------------------------------------------------------------+
>

Yosry, I don't think 25% to 30% regression can be ignored. Unless
there is a quick fix, IMO this series should be skipped for the
upcoming kernel open window.
Yosry Ahmed Oct. 20, 2023, 5:42 p.m. UTC | #29
On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > [...]
> > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > [...]
> >
>
> Yosry, I don't think 25% to 30% regression can be ignored. Unless
> there is a quick fix, IMO this series should be skipped for the
> upcoming kernel open window.

I am currently looking into it. It's reasonable to skip the next merge
window if a quick fix isn't found soon.

I am surprised by the size of the regression given the following:
      1.12 ±  5%      +1.4        2.50 ±  2%
perf-profile.self.cycles-pp.__mod_memcg_lruvec_state

IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
Feng Tang Oct. 23, 2023, 1:25 a.m. UTC | #30
On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > [...]
> > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > > [...]
> >
> > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > there is a quick fix, IMO this series should be skipped for the
> > upcoming kernel open window.
> 
> I am currently looking into it. It's reasonable to skip the next merge
> window if a quick fix isn't found soon.
> 
> I am surprised by the size of the regression given the following:
>       1.12 ą  5%      +1.4        2.50 ą  2%
> perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> 
> IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().

Yes, this is kind of confusing. And we have seen similar cases before,
especially for micro benchmarks like will-it-scale, stress-ng, netperf,
etc., where changes to those hot-path functions were greatly amplified
in the final benchmark score.

In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
the affected functions showed only around a 10% change in perf's
cpu-cycles, yet triggered a 69% regression. IIRC, micro benchmarks are
very sensitive to those statistics updates, like memcg's and vmstat's.

Thanks,
Feng
Yosry Ahmed Oct. 23, 2023, 6:25 p.m. UTC | #31
On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > [...]
> > > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > > there is a quick fix, IMO this series should be skipped for the
> > > upcoming kernel open window.
> >
> > I am currently looking into it. It's reasonable to skip the next merge
> > window if a quick fix isn't found soon.
> >
> > I am surprised by the size of the regression given the following:
> >       1.12 ±  5%      +1.4        2.50 ±  2%
> > perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> >
> > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
>
> Yes, this is kind of confusing. And we have seen similar cases before,
> especially for micro benchmarks like will-it-scale, stress-ng, netperf,
> etc., where changes to those hot-path functions were greatly amplified
> in the final benchmark score.
>
> In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> the affected functions showed only around a 10% change in perf's
> cpu-cycles, yet triggered a 69% regression. IIRC, micro benchmarks are
> very sensitive to those statistics updates, like memcg's and vmstat's.
> to those statistics update, like memcg's and vmstat.
>

Thanks for clarifying. I am still trying to reproduce locally but I am
running into some quirks with tooling. I may have to run a modified
version of the fallocate test manually. Meanwhile, I noticed that the
patch was tested without the fixlet that I posted [1] for it,
understandably. Would it be possible to get some numbers with that
fixlet? It should reduce the total number of contended atomic
operations, so it may help.

[1]https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/

I am also wondering if aligning the stats_updates atomic will help.
Right now it may share a cacheline with some items of the
events_pending array. The latter may be dirtied during a flush and
unnecessarily dirty the former, but the chances are slim to be honest.
If it's easy to test such a diff, that would be nice, but I don't
expect a lot of difference:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..a35fce653262 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -646,7 +646,7 @@ struct memcg_vmstats {
        unsigned long           events_pending[NR_MEMCG_EVENTS];

        /* Stats updates since the last flush */
-       atomic64_t              stats_updates;
+       atomic64_t              stats_updates ____cacheline_aligned_in_smp;
 };

 /*
Yosry Ahmed Oct. 24, 2023, 2:13 a.m. UTC | #32
On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
> > [...]
>
> Thanks for clarifying. I am still trying to reproduce locally but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually. Meanwhile, I noticed that the
> patch was tested without the fixlet that I posted [1] for it,
> understandably. Would it be possible to get some numbers with that
> fixlet? It should reduce the total number of contended atomic
> operations, so it may help.
>
> [1]https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
>
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim to be honest.
> If it's easy to test such a diff, that would be nice, but I don't
> expect a lot of difference:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..a35fce653262 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -646,7 +646,7 @@ struct memcg_vmstats {
>         unsigned long           events_pending[NR_MEMCG_EVENTS];
>
>         /* Stats updates since the last flush */
> -       atomic64_t              stats_updates;
> +       atomic64_t              stats_updates ____cacheline_aligned_in_smp;
>  };
>
>  /*

I still could not run the benchmark, but I used a version of
fallocate1.c that does 1 million iterations. I ran 100 in parallel.
This showed ~13% regression with the patch, so not the same as the
will-it-scale version, but it could be an indicator.

With that, I did not see any improvement with the fixlet above or
____cacheline_aligned_in_smp. So you can scratch that.

I did, however, see some improvement with reducing the indirection
layers by moving stats_updates directly into struct mem_cgroup. The
regression in my manual testing went down to 9%. Still not great, but
I am wondering how this reflects on the benchmark. If you're able to
test it, that would be great; the diff is below. Meanwhile I am still
looking for other improvements that can be made.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {

        CACHELINE_PADDING(_pad1_);

+       /* Stats updates since the last flush */
+       atomic64_t              stats_updates;
+
        /* memory.stat */
        struct memcg_vmstats    *vmstats;

@@ -309,6 +312,7 @@ struct mem_cgroup {
        atomic_t                moving_account;
        struct task_struct      *move_lock_task;

+       unsigned int __percpu *stats_updates_percpu;
        struct memcg_vmstats_percpu __percpu *vmstats_percpu;

 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
        /* Cgroup1: threshold notifications & softlimit tree updates */
        unsigned long           nr_page_events;
        unsigned long           targets[MEM_CGROUP_NTARGETS];
-
-       /* Stats updates since the last flush */
-       unsigned int            stats_updates;
 };

 struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
        /* Pending child counts during tree propagation */
        long                    state_pending[MEMCG_NR_STAT];
        unsigned long           events_pending[NR_MEMCG_EVENTS];
-
-       /* Stats updates since the last flush */
-       atomic64_t              stats_updates;
 };

 /*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)

 static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
 {
-       return atomic64_read(&memcg->vmstats->stats_updates) >
+       return atomic64_read(&memcg->stats_updates) >
                MEMCG_CHARGE_BATCH * num_online_cpus();
 }

 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
        int cpu = smp_processor_id();
-       unsigned int x;
+       unsigned int *stats_updates_percpu;

        if (!val)
                return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        cgroup_rstat_updated(memcg->css.cgroup, cpu);

        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-               x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
-                                         abs(val));
+               stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);

-               if (x < MEMCG_CHARGE_BATCH)
+               *stats_updates_percpu += abs(val);
+               if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
                        continue;

                /*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
                 * redundant. Avoid the overhead of the atomic update.
                 */
                if (!memcg_should_flush_stats(memcg))
-                       atomic64_add(x, &memcg->vmstats->stats_updates);
-               __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+                       atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+               *stats_updates_percpu = 0;
        }
 }

@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
                free_mem_cgroup_per_node_info(memcg, node);
        kfree(memcg->vmstats);
        free_percpu(memcg->vmstats_percpu);
+       free_percpu(memcg->stats_updates_percpu);
        kfree(memcg);
 }

@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
        if (!memcg->vmstats_percpu)
                goto fail;

+       memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+                                                      GFP_KERNEL_ACCOUNT);
+       if (!memcg->stats_updates_percpu)
+               goto fail;
+
        for_each_node(node)
                if (alloc_mem_cgroup_per_node_info(memcg, node))
                        goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
        struct mem_cgroup *parent = parent_mem_cgroup(memcg);
        struct memcg_vmstats_percpu *statc;
+       int *stats_updates_percpu;
        long delta, delta_cpu, v;
        int i, nid;

        statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+       stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);

        for (i = 0; i < MEMCG_NR_STAT; i++) {
                /*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
                        }
                }
        }
-       statc->stats_updates = 0;
+       *stats_updates_percpu = 0;
        /* We are in a per-cpu loop here, only do the atomic write once */
-       if (atomic64_read(&memcg->vmstats->stats_updates))
-               atomic64_set(&memcg->vmstats->stats_updates, 0);
+       if (atomic64_read(&memcg->stats_updates))
+               atomic64_set(&memcg->stats_updates, 0);
 }

 #ifdef CONFIG_MMU
Oliver Sang Oct. 24, 2023, 6:56 a.m. UTC | #33
hi, Yosry Ahmed,

On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:

...

> 
> I still could not run the benchmark, but I used a version of
> fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> This showed ~13% regression with the patch, so not the same as the
> will-it-scale version, but it could be an indicator.
> 
> With that, I did not see any improvement with the fixlet above or
> > ____cacheline_aligned_in_smp. So you can scratch that.
> 
> I did, however, see some improvement with reducing the indirection
> layers by moving stats_updates directly into struct mem_cgroup. The
> regression in my manual testing went down to 9%. Still not great, but
> I am wondering how this reflects on the benchmark. If you're able to
> test it that would be great, the diff is below. Meanwhile I am still
> looking for other improvements that can be made.

we applied previous patch-set as below:

c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set

I tried to apply the below patch to either 51d74c18a9c61 or c5f50d8b23c79,
but failed. Could you guide us on how to apply this patch?
Thanks

...
Yosry Ahmed Oct. 24, 2023, 7:14 a.m. UTC | #34
On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
>
> ...
>
> > ...
>
> we applied previous patch-set as below:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
>
> I tried to apply the below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> but failed. Could you guide us on how to apply this patch?
> Thanks
>

Thanks for looking into this. I rebased the diff on top of
c5f50d8b23c79. Please find it attached.
Oliver Sang Oct. 25, 2023, 6:09 a.m. UTC | #35
hi, Yosry Ahmed,

On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Yosry Ahmed,
> >
> > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> >
> > ...
> >
> > > ...
> >
> > we applied previous patch-set as below:
> >
> > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
> >
> > I tried to apply the below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > but failed. Could you guide us on how to apply this patch?
> > Thanks
> >
> 
> Thanks for looking into this. I rebased the diff on top of
> c5f50d8b23c79. Please find it attached.

from our tests, this patch has little impact.

it was applied as ac6a9444dec85, shown below:

ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything

for the first regression reported in the original report, data are very close
for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
and ac6a9444dec85.
the full comparison is at [1].

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
     36509           -25.8%      27079           -25.2%      27305           -25.0%      27383        will-it-scale.per_thread_ops

for the second regression reported in the original report, there seems to be a
small impact from ac6a9444dec85.
the full comparison is at [2].

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
     76580           -30.0%      53575           -28.9%      54415           -26.7%      56152        will-it-scale.per_thread_ops

[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
      2.09            -0.5        1.61 ±  2%      -0.5        1.61            -0.5        1.60        mpstat.cpu.all.usr%
      3324           -10.0%       2993            +3.6%       3444 ± 20%      -6.2%       3118 ±  4%  vmstat.system.cs
    120.83 ± 11%     +79.6%     217.00 ±  9%    +105.8%     248.67 ± 10%    +115.2%     260.00 ± 10%  perf-c2c.DRAM.local
    594.50 ±  6%     +43.8%     854.83 ±  5%     +56.6%     931.17 ± 10%     +21.2%     720.67 ±  7%  perf-c2c.DRAM.remote
    -16.64           +39.7%     -23.25          +177.3%     -46.14           +13.9%     -18.94        sched_debug.cpu.nr_uninterruptible.min
      6.59 ± 13%      +6.5%       7.02 ± 11%     +84.7%      12.18 ± 51%      -6.6%       6.16 ± 10%  sched_debug.cpu.nr_uninterruptible.stddev
      0.04           -20.8%       0.03 ± 11%     -20.8%       0.03 ± 11%     -25.0%       0.03        turbostat.IPC
     27.58            +3.7%      28.59            +4.2%      28.74            +3.8%      28.63        turbostat.RAMWatt
     71000 ± 68%     +66.4%     118174 ± 60%     -49.8%      35634 ± 13%     -59.9%      28485 ± 10%  numa-meminfo.node0.AnonHugePages
      1056          -100.0%       0.00            +1.9%       1076           -12.6%     923.33 ± 44%  numa-meminfo.node0.Inactive(file)
      6.67 ±141%  +15799.3%       1059          -100.0%       0.00         +2669.8%     184.65 ±223%  numa-meminfo.node1.Inactive(file)
   3797041           -25.8%    2816352           -25.2%    2839803           -25.0%    2847955        will-it-scale.104.threads
     36509           -25.8%      27079           -25.2%      27305           -25.0%      27383        will-it-scale.per_thread_ops
   3797041           -25.8%    2816352           -25.2%    2839803           -25.0%    2847955        will-it-scale.workload
 1.142e+09           -26.2%  8.437e+08           -26.6%  8.391e+08           -25.7%  8.489e+08        numa-numastat.node0.local_node
 1.143e+09           -26.1%  8.439e+08           -26.6%  8.392e+08           -25.7%  8.491e+08        numa-numastat.node0.numa_hit
 1.148e+09           -25.4%  8.563e+08 ±  2%     -23.7%  8.756e+08 ±  2%     -24.2%  8.702e+08        numa-numastat.node1.local_node
 1.149e+09           -25.4%  8.564e+08 ±  2%     -23.8%  8.758e+08 ±  2%     -24.2%  8.707e+08        numa-numastat.node1.numa_hit
     10842            +0.9%      10941            +2.9%      11153 ±  2%      +0.3%      10873        proc-vmstat.nr_mapped
     32933            -2.6%      32068            +0.1%      32956 ±  2%      -1.5%      32450 ±  2%  proc-vmstat.nr_slab_reclaimable
 2.291e+09           -25.8%    1.7e+09           -25.1%  1.715e+09           -24.9%   1.72e+09        proc-vmstat.numa_hit
 2.291e+09           -25.8%    1.7e+09           -25.1%  1.715e+09           -25.0%  1.719e+09        proc-vmstat.numa_local
  2.29e+09           -25.8%  1.699e+09           -25.1%  1.714e+09           -24.9%  1.718e+09        proc-vmstat.pgalloc_normal
 2.289e+09           -25.8%  1.699e+09           -25.1%  1.714e+09           -24.9%  1.718e+09        proc-vmstat.pgfree
    199.33          -100.0%       0.00            -0.3%     198.66           -16.4%     166.67 ± 44%  numa-vmstat.node0.nr_active_file
    264.00          -100.0%       0.00            +1.9%     269.00           -12.6%     230.83 ± 44%  numa-vmstat.node0.nr_inactive_file
    199.33          -100.0%       0.00            -0.3%     198.66           -16.4%     166.67 ± 44%  numa-vmstat.node0.nr_zone_active_file
    264.00          -100.0%       0.00            +1.9%     269.00           -12.6%     230.83 ± 44%  numa-vmstat.node0.nr_zone_inactive_file
 1.143e+09           -26.1%  8.439e+08           -26.6%  8.392e+08           -25.7%  8.491e+08        numa-vmstat.node0.numa_hit
 1.142e+09           -26.2%  8.437e+08           -26.6%  8.391e+08           -25.7%  8.489e+08        numa-vmstat.node0.numa_local
      1.67 ±141%  +15799.3%     264.99          -100.0%       0.00         +2669.8%      46.16 ±223%  numa-vmstat.node1.nr_inactive_file
      1.67 ±141%  +15799.3%     264.99          -100.0%       0.00         +2669.8%      46.16 ±223%  numa-vmstat.node1.nr_zone_inactive_file
 1.149e+09           -25.4%  8.564e+08 ±  2%     -23.8%  8.758e+08 ±  2%     -24.2%  8.707e+08        numa-vmstat.node1.numa_hit
 1.148e+09           -25.4%  8.563e+08 ±  2%     -23.7%  8.756e+08 ±  2%     -24.2%  8.702e+08        numa-vmstat.node1.numa_local
      0.04 ±108%     -76.2%       0.01 ± 23%    +154.8%       0.10 ± 34%    +110.0%       0.08 ± 88%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      1.00 ± 93%    +154.2%       2.55 ± 16%    +133.4%       2.34 ± 39%    +174.6%       2.76 ± 22%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.71 ±131%     -91.3%       0.06 ± 74%    +184.4%       2.02 ± 40%    +122.6%       1.58 ± 98%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      1.84 ± 45%     +35.2%       2.48 ± 31%     +66.1%       3.05 ± 25%     +61.9%       2.98 ± 10%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
    191.10 ±  2%     +18.0%     225.55 ±  2%     +18.9%     227.22 ±  4%     +19.8%     228.89 ±  4%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      3484            -7.8%       3211 ±  6%      -7.3%       3230 ±  7%     -11.0%       3101 ±  3%  perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
    385.50 ± 14%     +39.6%     538.17 ± 12%    +104.5%     788.17 ± 54%     +30.9%     504.67 ± 41%  perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      3784            -7.5%       3500 ±  6%      -7.1%       3516 ±  2%     -10.6%       3383 ±  4%  perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
    118.67 ± 11%     -62.6%      44.33 ±100%     -45.9%      64.17 ± 71%     -64.9%      41.67 ±100%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      5043 ±  2%     -13.0%       4387 ±  6%     -14.7%       4301 ±  3%     -16.5%       4210 ±  4%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    167.12 ±222%    +200.1%     501.48 ± 99%      +2.9%     171.99 ±215%    +399.7%     835.05 ± 44%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      2.17 ± 21%      +8.9%       2.36 ± 16%     +94.3%       4.21 ± 36%     +40.4%       3.04 ± 21%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    191.09 ±  2%     +18.0%     225.53 ±  2%     +18.9%     227.21 ±  4%     +19.8%     228.88 ±  4%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    293.46 ±  4%     +12.8%     330.98 ±  6%     +21.0%     355.18 ± 16%      +7.1%     314.31 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     30.33 ±105%     -35.1%      19.69 ±138%    +494.1%     180.18 ± 79%    +135.5%      71.43 ± 76%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.59 ±  3%    +125.2%       1.32 ±  2%    +139.3%       1.41          +128.6%       1.34        perf-stat.i.MPKI
 9.027e+09           -17.9%  7.408e+09           -17.5%  7.446e+09           -17.3%  7.465e+09        perf-stat.i.branch-instructions
      0.64            -0.0        0.60            -0.0        0.60            -0.0        0.60        perf-stat.i.branch-miss-rate%
  58102855           -23.3%   44580037 ±  2%     -23.4%   44524712 ±  2%     -22.9%   44801374        perf-stat.i.branch-misses
     15.28            +7.0       22.27            +7.9       23.14            +7.2       22.50        perf-stat.i.cache-miss-rate%
  25155306 ±  2%     +82.7%   45953601 ±  3%     +95.2%   49105558 ±  2%     +87.7%   47212483        perf-stat.i.cache-misses
 1.644e+08           +25.4%  2.062e+08 ±  2%     +29.0%   2.12e+08           +27.6%  2.098e+08        perf-stat.i.cache-references
      3258           -10.3%       2921            +2.5%       3341 ± 19%      -6.7%       3041 ±  4%  perf-stat.i.context-switches
      6.73           +23.3%       8.30           +22.7%       8.26           +21.8%       8.20        perf-stat.i.cpi
    145.97            -1.3%     144.13            -1.4%     143.89            -1.2%     144.29        perf-stat.i.cpu-migrations
     11519 ±  3%     -45.4%       6293 ±  3%     -48.9%       5892 ±  2%     -46.9%       6118        perf-stat.i.cycles-between-cache-misses
      0.04            -0.0        0.03            -0.0        0.03            -0.0        0.03        perf-stat.i.dTLB-load-miss-rate%
   3921408           -25.3%    2929564           -24.6%    2957991           -24.5%    2961168        perf-stat.i.dTLB-load-misses
 1.098e+10           -18.1%  8.993e+09           -17.6%  9.045e+09           -16.3%  9.185e+09        perf-stat.i.dTLB-loads
      0.00 ±  2%      +0.0        0.00 ±  4%      +0.0        0.00 ±  5%      +0.0        0.00 ±  3%  perf-stat.i.dTLB-store-miss-rate%
 5.606e+09           -23.2%  4.304e+09           -22.6%  4.338e+09           -22.4%  4.349e+09        perf-stat.i.dTLB-stores
     95.65            -1.2       94.49            -0.9       94.74            -0.8       94.87        perf-stat.i.iTLB-load-miss-rate%
   3876741           -25.0%    2905764           -24.8%    2915184           -25.0%    2909099        perf-stat.i.iTLB-load-misses
 4.286e+10           -18.9%  3.477e+10           -18.4%  3.496e+10           -17.9%  3.517e+10        perf-stat.i.instructions
     11061            +8.2%      11969            +8.4%      11996            +9.3%      12091        perf-stat.i.instructions-per-iTLB-miss
      0.15           -18.9%       0.12           -18.5%       0.12           -17.9%       0.12        perf-stat.i.ipc
      0.01 ± 96%      -8.9%       0.01 ± 96%     +72.3%       0.01 ± 73%    +174.6%       0.02 ± 32%  perf-stat.i.major-faults
     48.65 ±  2%     +46.2%      71.11 ±  2%     +57.0%      76.37 ±  2%     +45.4%      70.72        perf-stat.i.metric.K/sec
    247.84           -18.9%     201.05           -18.4%     202.30           -17.7%     203.92        perf-stat.i.metric.M/sec
     89.33            +0.5       89.79            -0.7       88.67            -2.1       87.23        perf-stat.i.node-load-miss-rate%
   3138385 ±  2%     +77.7%    5578401 ±  2%     +89.9%    5958861 ±  2%     +70.9%    5363943        perf-stat.i.node-load-misses
    375827 ±  3%     +69.2%     635857 ± 11%    +102.6%     761334 ±  4%    +109.3%     786773 ±  5%  perf-stat.i.node-loads
   1343194           -26.8%     983668           -22.6%    1039799 ±  2%     -23.6%    1026076        perf-stat.i.node-store-misses
     51550 ±  3%     -19.0%      41748 ±  7%     -22.5%      39954 ±  4%     -20.6%      40921 ±  7%  perf-stat.i.node-stores
      0.59 ±  3%    +125.1%       1.32 ±  2%    +139.2%       1.40          +128.7%       1.34        perf-stat.overall.MPKI
      0.64            -0.0        0.60            -0.0        0.60            -0.0        0.60        perf-stat.overall.branch-miss-rate%
     15.30            +7.0       22.28            +7.9       23.16            +7.2       22.50        perf-stat.overall.cache-miss-rate%
      6.73           +23.3%       8.29           +22.6%       8.25           +21.9%       8.20        perf-stat.overall.cpi
     11470 ±  2%     -45.3%       6279 ±  2%     -48.8%       5875 ±  2%     -46.7%       6108        perf-stat.overall.cycles-between-cache-misses
      0.04            -0.0        0.03            -0.0        0.03            -0.0        0.03        perf-stat.overall.dTLB-load-miss-rate%
      0.00 ±  2%      +0.0        0.00 ±  4%      +0.0        0.00 ±  5%      +0.0        0.00 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
     95.56            -1.4       94.17            -1.0       94.56            -0.9       94.66        perf-stat.overall.iTLB-load-miss-rate%
     11059            +8.2%      11967            +8.5%      11994            +9.3%      12091        perf-stat.overall.instructions-per-iTLB-miss
      0.15           -18.9%       0.12           -18.4%       0.12           -17.9%       0.12        perf-stat.overall.ipc
     89.29            +0.5       89.78            -0.6       88.67            -2.1       87.20        perf-stat.overall.node-load-miss-rate%
   3396437            +9.5%    3718021            +9.1%    3705386            +9.6%    3721307        perf-stat.overall.path-length
 8.997e+09           -17.9%  7.383e+09           -17.5%  7.421e+09           -17.3%   7.44e+09        perf-stat.ps.branch-instructions
  57910417           -23.3%   44426577 ±  2%     -23.4%   44376780 ±  2%     -22.9%   44649215        perf-stat.ps.branch-misses
  25075498 ±  2%     +82.7%   45803186 ±  3%     +95.2%   48942749 ±  2%     +87.7%   47057228        perf-stat.ps.cache-misses
 1.639e+08           +25.4%  2.056e+08 ±  2%     +28.9%  2.113e+08           +27.6%  2.091e+08        perf-stat.ps.cache-references
      3247           -10.3%       2911            +2.5%       3329 ± 19%      -6.7%       3030 ±  4%  perf-stat.ps.context-switches
    145.47            -1.3%     143.61            -1.4%     143.38            -1.2%     143.70        perf-stat.ps.cpu-migrations
   3908900           -25.3%    2920218           -24.6%    2949112           -24.5%    2951633        perf-stat.ps.dTLB-load-misses
 1.094e+10           -18.1%  8.963e+09           -17.6%  9.014e+09           -16.3%  9.154e+09        perf-stat.ps.dTLB-loads
 5.587e+09           -23.2%  4.289e+09           -22.6%  4.324e+09           -22.4%  4.335e+09        perf-stat.ps.dTLB-stores
   3863663           -25.0%    2895895           -24.8%    2905355           -25.0%    2899323        perf-stat.ps.iTLB-load-misses
 4.272e+10           -18.9%  3.466e+10           -18.4%  3.484e+10           -17.9%  3.505e+10        perf-stat.ps.instructions
   3128132 ±  2%     +77.7%    5559939 ±  2%     +89.9%    5938929 ±  2%     +70.9%    5346027        perf-stat.ps.node-load-misses
    375403 ±  3%     +69.0%     634300 ± 11%    +102.3%     759484 ±  4%    +109.1%     784913 ±  5%  perf-stat.ps.node-loads
   1338688           -26.8%     980311           -22.6%    1036279 ±  2%     -23.6%    1022618        perf-stat.ps.node-store-misses
     51546 ±  3%     -19.1%      41692 ±  7%     -22.6%      39921 ±  4%     -20.7%      40875 ±  7%  perf-stat.ps.node-stores
  1.29e+13           -18.8%  1.047e+13           -18.4%  1.052e+13           -17.8%   1.06e+13        perf-stat.total.instructions
      0.96            -0.3        0.70 ±  2%      -0.3        0.70 ±  2%      -0.3        0.70        perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.97            -0.3        0.72            -0.2        0.72            -0.2        0.72        perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
      0.76 ±  2%      -0.2        0.54 ±  3%      -0.2        0.59 ±  3%      -0.1        0.68        perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.82            -0.2        0.60 ±  2%      -0.2        0.60 ±  2%      -0.2        0.60        perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.91            -0.2        0.72            -0.2        0.72            -0.2        0.70 ±  2%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
     51.50            -0.0       51.47            -0.5       50.99            -0.3       51.21        perf-profile.calltrace.cycles-pp.fallocate64
     48.31            +0.0       48.35            +0.5       48.83            +0.3       48.61        perf-profile.calltrace.cycles-pp.ftruncate64
     48.29            +0.0       48.34            +0.5       48.81            +0.3       48.60        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
     48.28            +0.0       48.33            +0.5       48.80            +0.3       48.59        perf-profile.calltrace.cycles-pp.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
     48.29            +0.1       48.34            +0.5       48.82            +0.3       48.60        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
     48.28            +0.1       48.33            +0.5       48.80            +0.3       48.58        perf-profile.calltrace.cycles-pp.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
     48.27            +0.1       48.33            +0.5       48.80            +0.3       48.58        perf-profile.calltrace.cycles-pp.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     48.27            +0.1       48.32            +0.5       48.80            +0.3       48.58        perf-profile.calltrace.cycles-pp.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64
     48.25            +0.1       48.31            +0.5       48.78            +0.3       48.57        perf-profile.calltrace.cycles-pp.shmem_undo_range.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate
      2.06 ±  2%      +0.1        2.13 ±  2%      +0.1        2.16            +0.0        2.09        perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.68            +0.1        0.76 ±  2%      +0.1        0.75            +0.1        0.74        perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.67            +0.1        1.77            +0.1        1.81 ±  2%      +0.0        1.71        perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
     45.76            +0.1       45.86            +0.5       46.29            +0.4       46.13        perf-profile.calltrace.cycles-pp.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      1.78 ±  2%      +0.1        1.92 ±  2%      +0.2        1.95            +0.1        1.88        perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
      0.69 ±  5%      +0.1        0.84 ±  4%      +0.2        0.86 ±  5%      +0.1        0.79 ±  2%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      1.56 ±  2%      +0.2        1.76 ±  2%      +0.2        1.79            +0.2        1.71        perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
      0.85 ±  4%      +0.4        1.23 ±  2%      +0.4        1.26 ±  3%      +0.3        1.14        perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.78 ±  4%      +0.4        1.20 ±  3%      +0.4        1.22            +0.3        1.11        perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
      0.73 ±  4%      +0.4        1.17 ±  3%      +0.5        1.19 ±  2%      +0.4        1.08        perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
     41.60            +0.7       42.30            +0.1       41.73            +0.5       42.06        perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
     41.50            +0.7       42.23            +0.2       41.66            +0.5       41.99        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
     48.39            +0.8       49.14            +0.2       48.64            +0.5       48.89        perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.00            +0.8        0.77 ±  4%      +0.8        0.80 ±  2%      +0.8        0.78 ±  2%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
     40.24            +0.8       41.03            +0.2       40.48            +0.6       40.80        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     40.22            +0.8       41.01            +0.2       40.47            +0.6       40.79        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
      0.00            +0.8        0.79 ±  3%      +0.8        0.82 ±  3%      +0.8        0.79 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     40.19            +0.8       40.98            +0.3       40.44            +0.6       40.76        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
      1.33 ±  5%      +0.8        2.13 ±  4%      +0.9        2.21 ±  4%      +0.8        2.09 ±  2%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
     48.16            +0.8       48.98            +0.3       48.48            +0.6       48.72        perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
      0.00            +0.9        0.88 ±  2%      +0.9        0.91            +0.9        0.86        perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
     47.92            +0.9       48.81            +0.4       48.30            +0.6       48.56        perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     47.07            +0.9       48.01            +0.5       47.60            +0.7       47.79        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
     46.59            +1.1       47.64            +0.7       47.24            +0.8       47.44        perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
      0.99            -0.3        0.73 ±  2%      -0.3        0.74            -0.3        0.74        perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.96            -0.3        0.70 ±  2%      -0.3        0.70 ±  2%      -0.3        0.71        perf-profile.children.cycles-pp.shmem_alloc_folio
      0.78 ±  2%      -0.2        0.56 ±  3%      -0.2        0.61 ±  3%      -0.1        0.69 ±  2%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
      0.83            -0.2        0.61 ±  2%      -0.2        0.61 ±  2%      -0.2        0.62        perf-profile.children.cycles-pp.alloc_pages_mpol
      0.92            -0.2        0.73            -0.2        0.73            -0.2        0.71 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.74 ±  2%      -0.2        0.55 ±  2%      -0.2        0.56 ±  2%      -0.2        0.58 ±  3%  perf-profile.children.cycles-pp.xas_store
      0.67            -0.2        0.50 ±  3%      -0.2        0.50 ±  2%      -0.2        0.50        perf-profile.children.cycles-pp.__alloc_pages
      0.43            -0.1        0.31 ±  2%      -0.1        0.31            -0.1        0.31        perf-profile.children.cycles-pp.__entry_text_start
      0.41 ±  2%      -0.1        0.30 ±  3%      -0.1        0.31 ±  2%      -0.1        0.31 ±  2%  perf-profile.children.cycles-pp.free_unref_page_list
      0.35            -0.1        0.25 ±  2%      -0.1        0.25 ±  2%      -0.1        0.25        perf-profile.children.cycles-pp.xas_load
      0.35 ±  2%      -0.1        0.25 ±  4%      -0.1        0.25 ±  2%      -0.1        0.26 ±  2%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.39            -0.1        0.30 ±  2%      -0.1        0.29 ±  3%      -0.1        0.30        perf-profile.children.cycles-pp.get_page_from_freelist
      0.27 ±  2%      -0.1        0.19 ±  4%      -0.1        0.19 ±  5%      -0.1        0.19 ±  3%  perf-profile.children.cycles-pp.__mod_node_page_state
      0.32 ±  3%      -0.1        0.24 ±  3%      -0.1        0.25            -0.1        0.26 ±  4%  perf-profile.children.cycles-pp.find_lock_entries
      0.23 ±  2%      -0.1        0.15 ±  4%      -0.1        0.16 ±  3%      -0.1        0.16 ±  5%  perf-profile.children.cycles-pp.xas_descend
      0.25 ±  3%      -0.1        0.18 ±  3%      -0.1        0.18 ±  3%      -0.1        0.18 ±  2%  perf-profile.children.cycles-pp.__dquot_alloc_space
      0.28 ±  3%      -0.1        0.20 ±  3%      -0.1        0.21 ±  2%      -0.1        0.20 ±  2%  perf-profile.children.cycles-pp._raw_spin_lock
      0.16 ±  3%      -0.1        0.10 ±  5%      -0.1        0.10 ±  4%      -0.1        0.10 ±  4%  perf-profile.children.cycles-pp.xas_find_conflict
      0.26 ±  2%      -0.1        0.20 ±  3%      -0.1        0.19 ±  3%      -0.1        0.19        perf-profile.children.cycles-pp.filemap_get_entry
      0.26            -0.1        0.20 ±  2%      -0.1        0.20 ±  4%      -0.1        0.20 ±  2%  perf-profile.children.cycles-pp.rmqueue
      0.20 ±  3%      -0.1        0.14 ±  3%      -0.0        0.15 ±  3%      -0.0        0.16 ±  3%  perf-profile.children.cycles-pp.truncate_cleanup_folio
      0.19 ±  5%      -0.1        0.14 ±  4%      -0.0        0.15 ±  5%      -0.0        0.15 ±  4%  perf-profile.children.cycles-pp.xas_clear_mark
      0.17 ±  5%      -0.0        0.12 ±  4%      -0.0        0.12 ±  6%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.xas_init_marks
      0.15 ±  4%      -0.0        0.10 ±  4%      -0.0        0.10 ±  4%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.15 ± 12%      -0.0        0.10 ± 20%      -0.1        0.10 ± 15%      -0.1        0.10 ± 14%  perf-profile.children.cycles-pp._raw_spin_lock_irq
     51.56            -0.0       51.51            -0.5       51.03            -0.3       51.26        perf-profile.children.cycles-pp.fallocate64
      0.18 ±  3%      -0.0        0.14 ±  3%      -0.0        0.13 ±  5%      -0.0        0.14 ±  2%  perf-profile.children.cycles-pp.__cond_resched
      0.07 ±  5%      -0.0        0.02 ± 99%      -0.0        0.04 ± 44%      -0.0        0.04 ± 44%  perf-profile.children.cycles-pp.xas_find
      0.13 ±  2%      -0.0        0.09            -0.0        0.10 ±  5%      -0.0        0.12 ±  4%  perf-profile.children.cycles-pp.security_vm_enough_memory_mm
      0.14 ±  4%      -0.0        0.10 ±  7%      -0.0        0.10 ±  6%      -0.0        0.10 ±  3%  perf-profile.children.cycles-pp.__fget_light
      0.06 ±  6%      -0.0        0.02 ± 99%      -0.0        0.05            -0.0        0.05        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.12 ±  4%      -0.0        0.08 ±  4%      -0.0        0.08 ±  4%      -0.0        0.08        perf-profile.children.cycles-pp.xas_start
      0.08 ±  5%      -0.0        0.05            -0.0        0.05            -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.12            -0.0        0.08 ±  5%      -0.0        0.08 ±  5%      -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.folio_unlock
      0.14 ±  3%      -0.0        0.11 ±  3%      -0.0        0.11 ±  4%      -0.0        0.12 ±  3%  perf-profile.children.cycles-pp.try_charge_memcg
      0.12 ±  6%      -0.0        0.08 ±  5%      -0.0        0.09 ±  5%      -0.0        0.09 ±  7%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.12 ±  3%      -0.0        0.09 ±  4%      -0.0        0.09 ±  7%      -0.0        0.09        perf-profile.children.cycles-pp.noop_dirty_folio
      0.20 ±  2%      -0.0        0.17 ±  5%      -0.0        0.18            -0.0        0.19 ±  2%  perf-profile.children.cycles-pp.page_counter_uncharge
      0.10            -0.0        0.07 ±  5%      -0.0        0.08 ±  8%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.cap_vm_enough_memory
      0.09 ±  6%      -0.0        0.06 ±  6%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp._raw_spin_trylock
      0.09 ±  5%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%      -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.inode_add_bytes
      0.06 ±  6%      -0.0        0.03 ± 70%      -0.0        0.04 ± 44%      -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.filemap_free_folio
      0.06 ±  6%      -0.0        0.03 ± 70%      +0.0        0.07 ±  7%      +0.1        0.14 ±  6%  perf-profile.children.cycles-pp.percpu_counter_add_batch
      0.12 ±  3%      -0.0        0.10 ±  5%      -0.0        0.09 ±  4%      -0.0        0.09        perf-profile.children.cycles-pp.shmem_recalc_inode
      0.12 ±  3%      -0.0        0.09 ±  5%      -0.0        0.09 ±  5%      -0.0        0.10 ±  4%  perf-profile.children.cycles-pp.__folio_cancel_dirty
      0.09 ±  5%      -0.0        0.07 ±  7%      -0.0        0.09 ±  4%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.__vm_enough_memory
      0.08 ±  5%      -0.0        0.06            -0.0        0.06 ±  6%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.security_file_permission
      0.08 ±  5%      -0.0        0.06            -0.0        0.06 ±  6%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.08 ±  6%      -0.0        0.05 ±  7%      -0.0        0.05 ±  8%      -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.apparmor_file_permission
      0.09 ±  4%      -0.0        0.07 ±  8%      -0.0        0.09 ±  8%      -0.0        0.07 ±  6%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
      0.08 ±  6%      -0.0        0.06 ±  8%      -0.0        0.06            -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.__list_add_valid_or_report
      0.07 ±  8%      -0.0        0.05            -0.0        0.05 ±  7%      -0.0        0.06 ±  9%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.14 ±  3%      -0.0        0.12 ±  6%      -0.0        0.12 ±  3%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.07 ±  5%      -0.0        0.05            -0.0        0.05            -0.0        0.05        perf-profile.children.cycles-pp.policy_nodemask
      0.24 ±  2%      -0.0        0.22 ±  2%      -0.0        0.22 ±  2%      -0.0        0.22 ±  2%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.08            -0.0        0.07 ±  7%      -0.0        0.06 ±  6%      -0.0        0.07 ±  6%  perf-profile.children.cycles-pp.xas_create
      0.08 ±  8%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%      -0.0        0.07 ±  7%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
      0.00            +0.0        0.00            +0.0        0.00            +0.1        0.08 ±  8%  perf-profile.children.cycles-pp.__file_remove_privs
      0.28 ±  2%      +0.0        0.28 ±  4%      +0.0        0.30            +0.0        0.30        perf-profile.children.cycles-pp.uncharge_batch
      0.14 ±  5%      +0.0        0.17 ±  4%      +0.0        0.17 ±  2%      +0.0        0.16        perf-profile.children.cycles-pp.uncharge_folio
      0.43            +0.0        0.46 ±  4%      +0.0        0.48            +0.0        0.47        perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
     48.31            +0.0       48.35            +0.5       48.83            +0.3       48.61        perf-profile.children.cycles-pp.ftruncate64
     48.28            +0.0       48.33            +0.5       48.80            +0.3       48.59        perf-profile.children.cycles-pp.do_sys_ftruncate
     48.28            +0.1       48.33            +0.5       48.80            +0.3       48.58        perf-profile.children.cycles-pp.do_truncate
     48.27            +0.1       48.33            +0.5       48.80            +0.3       48.58        perf-profile.children.cycles-pp.notify_change
     48.27            +0.1       48.32            +0.5       48.80            +0.3       48.58        perf-profile.children.cycles-pp.shmem_setattr
     48.26            +0.1       48.32            +0.5       48.79            +0.3       48.57        perf-profile.children.cycles-pp.shmem_undo_range
      2.06 ±  2%      +0.1        2.13 ±  2%      +0.1        2.16            +0.0        2.10        perf-profile.children.cycles-pp.truncate_inode_folio
      0.69            +0.1        0.78            +0.1        0.77            +0.1        0.76        perf-profile.children.cycles-pp.lru_add_fn
      1.72 ±  2%      +0.1        1.80            +0.1        1.83 ±  2%      +0.0        1.74        perf-profile.children.cycles-pp.shmem_add_to_page_cache
     45.77            +0.1       45.86            +0.5       46.29            +0.4       46.13        perf-profile.children.cycles-pp.__folio_batch_release
      1.79 ±  2%      +0.1        1.93 ±  2%      +0.2        1.96            +0.1        1.88        perf-profile.children.cycles-pp.filemap_remove_folio
      0.13 ±  5%      +0.1        0.28            +0.1        0.19 ±  5%      +0.1        0.24 ±  2%  perf-profile.children.cycles-pp.file_modified
      0.69 ±  5%      +0.1        0.84 ±  3%      +0.2        0.86 ±  5%      +0.1        0.79 ±  2%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      0.09 ±  7%      +0.2        0.24 ±  2%      +0.1        0.15 ±  3%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.inode_needs_update_time
      1.58 ±  3%      +0.2        1.77 ±  2%      +0.2        1.80            +0.1        1.72        perf-profile.children.cycles-pp.__filemap_remove_folio
      0.15 ±  3%      +0.4        0.50 ±  3%      +0.4        0.52 ±  2%      +0.4        0.52 ±  2%  perf-profile.children.cycles-pp.__count_memcg_events
      0.79 ±  4%      +0.4        1.20 ±  3%      +0.4        1.22            +0.3        1.12        perf-profile.children.cycles-pp.filemap_unaccount_folio
      0.36 ±  5%      +0.4        0.77 ±  4%      +0.4        0.81 ±  2%      +0.4        0.78 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
     98.33            +0.5       98.78            +0.4       98.77            +0.4       98.77        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     97.74            +0.6       98.34            +0.6       98.32            +0.6       98.33        perf-profile.children.cycles-pp.do_syscall_64
     41.62            +0.7       42.33            +0.1       41.76            +0.5       42.08        perf-profile.children.cycles-pp.folio_add_lru
     43.91            +0.7       44.64            +0.2       44.09            +0.5       44.40        perf-profile.children.cycles-pp.folio_batch_move_lru
     48.39            +0.8       49.15            +0.2       48.64            +0.5       48.89        perf-profile.children.cycles-pp.__x64_sys_fallocate
      1.34 ±  5%      +0.8        2.14 ±  4%      +0.9        2.22 ±  4%      +0.8        2.10 ±  2%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      1.61 ±  4%      +0.8        2.42 ±  2%      +0.9        2.47 ±  2%      +0.6        2.24        perf-profile.children.cycles-pp.__mod_lruvec_page_state
     48.17            +0.8       48.98            +0.3       48.48            +0.6       48.72        perf-profile.children.cycles-pp.vfs_fallocate
     47.94            +0.9       48.82            +0.4       48.32            +0.6       48.56        perf-profile.children.cycles-pp.shmem_fallocate
     47.10            +0.9       48.04            +0.5       47.64            +0.7       47.83        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     84.34            +0.9       85.28            +0.8       85.11            +0.9       85.28        perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
     84.31            +0.9       85.26            +0.8       85.08            +0.9       85.26        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     84.24            +1.0       85.21            +0.8       85.04            +1.0       85.21        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     46.65            +1.1       47.70            +0.7       47.30            +0.8       47.48        perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
      1.23 ±  4%      +1.4        2.58 ±  2%      +1.4        2.63 ±  2%      +1.3        2.52        perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.98            -0.3        0.73 ±  2%      -0.2        0.74            -0.2        0.74        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.88            -0.2        0.70            -0.2        0.70            -0.2        0.69 ±  2%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.60            -0.2        0.45            -0.1        0.46 ±  2%      -0.2        0.46 ±  3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.41 ±  3%      -0.1        0.27 ±  3%      -0.1        0.27 ±  2%      -0.1        0.28 ±  2%  perf-profile.self.cycles-pp.release_pages
      0.41 ±  3%      -0.1        0.29 ±  2%      -0.1        0.28 ±  3%      -0.1        0.29 ±  2%  perf-profile.self.cycles-pp.folio_batch_move_lru
      0.41            -0.1        0.30 ±  3%      -0.1        0.30 ±  2%      -0.1        0.32 ±  4%  perf-profile.self.cycles-pp.xas_store
      0.30 ±  3%      -0.1        0.18 ±  5%      -0.1        0.19 ±  2%      -0.1        0.19 ±  2%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
      0.38 ±  2%      -0.1        0.27 ±  2%      -0.1        0.27 ±  2%      -0.1        0.27        perf-profile.self.cycles-pp.__entry_text_start
      0.30 ±  3%      -0.1        0.20 ±  6%      -0.1        0.20 ±  5%      -0.1        0.21 ±  2%  perf-profile.self.cycles-pp.lru_add_fn
      0.28 ±  2%      -0.1        0.20 ±  5%      -0.1        0.20 ±  2%      -0.1        0.20 ±  3%  perf-profile.self.cycles-pp.shmem_fallocate
      0.26 ±  2%      -0.1        0.18 ±  5%      -0.1        0.18 ±  4%      -0.1        0.19 ±  3%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.27 ±  3%      -0.1        0.20 ±  2%      -0.1        0.20 ±  3%      -0.1        0.20 ±  3%  perf-profile.self.cycles-pp._raw_spin_lock
      0.21 ±  2%      -0.1        0.15 ±  4%      -0.1        0.15 ±  4%      -0.1        0.16 ±  2%  perf-profile.self.cycles-pp.__alloc_pages
      0.20 ±  2%      -0.1        0.14 ±  3%      -0.1        0.14 ±  2%      -0.1        0.14 ±  5%  perf-profile.self.cycles-pp.xas_descend
      0.26 ±  3%      -0.1        0.20 ±  4%      -0.1        0.21 ±  3%      -0.0        0.22 ±  4%  perf-profile.self.cycles-pp.find_lock_entries
      0.06 ±  6%      -0.1        0.00            +0.0        0.06 ±  7%      +0.1        0.13 ±  6%  perf-profile.self.cycles-pp.percpu_counter_add_batch
      0.18 ±  4%      -0.0        0.13 ±  5%      -0.0        0.13 ±  3%      -0.0        0.14 ±  4%  perf-profile.self.cycles-pp.xas_clear_mark
      0.15 ±  7%      -0.0        0.10 ± 11%      -0.0        0.11 ±  8%      -0.0        0.10 ±  6%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
      0.13 ±  4%      -0.0        0.09 ±  5%      -0.0        0.08 ±  5%      -0.0        0.09        perf-profile.self.cycles-pp.free_unref_page_commit
      0.13            -0.0        0.09 ±  5%      -0.0        0.09 ±  5%      -0.0        0.09 ±  6%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.16 ±  4%      -0.0        0.12 ±  4%      -0.0        0.12 ±  3%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.__dquot_alloc_space
      0.16 ±  4%      -0.0        0.12 ±  4%      -0.0        0.11 ±  6%      -0.0        0.11        perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
      0.13 ±  5%      -0.0        0.09 ±  7%      -0.0        0.09            -0.0        0.10 ±  7%  perf-profile.self.cycles-pp.__filemap_remove_folio
      0.13 ±  2%      -0.0        0.09 ±  5%      -0.0        0.09 ±  4%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.06 ±  7%      -0.0        0.02 ± 99%      -0.0        0.02 ± 99%      -0.0        0.02 ±141%  perf-profile.self.cycles-pp.apparmor_file_permission
      0.12 ±  4%      -0.0        0.09 ±  5%      -0.0        0.09 ±  5%      -0.0        0.08 ±  8%  perf-profile.self.cycles-pp.vfs_fallocate
      0.13 ±  3%      -0.0        0.10 ±  5%      -0.0        0.10 ±  4%      -0.0        0.10 ±  4%  perf-profile.self.cycles-pp.fallocate64
      0.11 ±  4%      -0.0        0.07            -0.0        0.08 ±  6%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.xas_start
      0.07 ±  5%      -0.0        0.03 ± 70%      -0.0        0.04 ± 44%      -0.1        0.02 ±141%  perf-profile.self.cycles-pp.shmem_alloc_folio
      0.14 ±  4%      -0.0        0.10 ±  7%      -0.0        0.10 ±  5%      -0.0        0.10 ±  3%  perf-profile.self.cycles-pp.__fget_light
      0.10 ±  4%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.rmqueue
      0.10 ±  4%      -0.0        0.07 ±  8%      -0.0        0.07 ±  5%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.alloc_pages_mpol
      0.12 ±  3%      -0.0        0.09 ±  4%      -0.0        0.09 ±  4%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.xas_load
      0.11 ±  4%      -0.0        0.08 ±  7%      -0.0        0.08 ±  5%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.folio_unlock
      0.15 ±  2%      -0.0        0.12 ±  5%      -0.0        0.12 ±  4%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.10            -0.0        0.07            -0.0        0.08 ±  7%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.cap_vm_enough_memory
      0.16 ±  2%      -0.0        0.13 ±  6%      -0.0        0.14            -0.0        0.14        perf-profile.self.cycles-pp.page_counter_uncharge
      0.12 ±  5%      -0.0        0.09 ±  4%      -0.0        0.09 ±  7%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.__cond_resched
      0.06 ±  6%      -0.0        0.03 ± 70%      -0.0        0.04 ± 44%      -0.0        0.05        perf-profile.self.cycles-pp.filemap_free_folio
      0.12            -0.0        0.09 ±  4%      -0.0        0.09 ±  4%      -0.0        0.09        perf-profile.self.cycles-pp.noop_dirty_folio
      0.12 ±  3%      -0.0        0.10 ±  5%      -0.0        0.10 ±  7%      -0.0        0.10 ±  5%  perf-profile.self.cycles-pp.free_unref_page_list
      0.10 ±  3%      -0.0        0.07 ±  5%      -0.0        0.07 ±  5%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.filemap_remove_folio
      0.10 ±  5%      -0.0        0.07 ±  5%      -0.0        0.07            -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.try_charge_memcg
      0.12 ±  3%      -0.0        0.10 ±  8%      -0.0        0.10            -0.0        0.10 ±  4%  perf-profile.self.cycles-pp.cgroup_rstat_updated
      0.09 ±  4%      -0.0        0.07 ±  7%      -0.0        0.07 ±  5%      -0.0        0.07 ±  7%  perf-profile.self.cycles-pp.__folio_cancel_dirty
      0.08 ±  4%      -0.0        0.06 ±  8%      -0.0        0.06 ±  6%      -0.0        0.06 ±  8%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.08 ±  5%      -0.0        0.06            -0.0        0.06            -0.0        0.06        perf-profile.self.cycles-pp._raw_spin_trylock
      0.08            -0.0        0.06 ±  6%      -0.0        0.06 ±  8%      -0.0        0.06        perf-profile.self.cycles-pp.folio_add_lru
      0.07 ±  5%      -0.0        0.05            -0.0        0.05            -0.0        0.04 ± 44%  perf-profile.self.cycles-pp.xas_find_conflict
      0.08 ±  8%      -0.0        0.06 ±  6%      -0.0        0.06 ±  6%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__mod_lruvec_state
      0.56 ±  6%      -0.0        0.54 ±  9%      -0.0        0.55 ±  5%      -0.2        0.40 ±  3%  perf-profile.self.cycles-pp.__mod_lruvec_page_state
      0.08 ± 10%      -0.0        0.06 ±  9%      -0.0        0.06            -0.0        0.06        perf-profile.self.cycles-pp.truncate_cleanup_folio
      0.07 ± 10%      -0.0        0.05            -0.0        0.05 ±  7%      -0.0        0.05 ±  8%  perf-profile.self.cycles-pp.xas_init_marks
      0.08 ±  4%      -0.0        0.06 ±  7%      +0.0        0.08 ±  4%      -0.0        0.07 ± 10%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
      0.07 ±  7%      -0.0        0.05            -0.0        0.05 ±  7%      -0.0        0.06 ±  9%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
      0.07 ±  5%      -0.0        0.06 ±  8%      -0.0        0.06 ±  6%      -0.0        0.05 ±  7%  perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.07 ±  5%      -0.0        0.06 ±  9%      -0.0        0.06 ±  7%      -0.0        0.06        perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
      0.08 ±  4%      -0.0        0.07 ±  5%      -0.0        0.06            -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.filemap_get_entry
      0.00            +0.0        0.00            +0.0        0.00            +0.1        0.08 ±  8%  perf-profile.self.cycles-pp.__file_remove_privs
      0.14 ±  2%      +0.0        0.16 ±  6%      +0.0        0.17 ±  3%      +0.0        0.16        perf-profile.self.cycles-pp.uncharge_folio
      0.02 ±141%      +0.0        0.06 ±  8%      +0.0        0.06            +0.0        0.06 ±  9%  perf-profile.self.cycles-pp.uncharge_batch
      0.21 ±  9%      +0.1        0.31 ±  7%      +0.1        0.32 ±  5%      +0.1        0.30 ±  4%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
      0.69 ±  5%      +0.1        0.83 ±  4%      +0.2        0.86 ±  5%      +0.1        0.79 ±  2%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.06 ±  6%      +0.2        0.22 ±  2%      +0.1        0.13 ±  5%      +0.1        0.11 ±  4%  perf-profile.self.cycles-pp.inode_needs_update_time
      0.14 ±  8%      +0.3        0.42 ±  7%      +0.3        0.44 ±  6%      +0.3        0.40 ±  3%  perf-profile.self.cycles-pp.__mem_cgroup_charge
      0.13 ±  7%      +0.4        0.49 ±  3%      +0.4        0.51 ±  2%      +0.4        0.51 ±  2%  perf-profile.self.cycles-pp.__count_memcg_events
     84.24            +1.0       85.21            +0.8       85.04            +1.0       85.21        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      1.12 ±  5%      +1.4        2.50 ±  2%      +1.4        2.55 ±  2%      +1.3        2.43        perf-profile.self.cycles-pp.__mod_memcg_lruvec_state


[2]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
  10544810 ± 11%      +1.7%   10720938 ±  4%      +1.7%   10719232 ±  4%     +24.8%   13160448        meminfo.DirectMap2M
      1.87            -0.4        1.43 ±  3%      -0.4        1.47 ±  2%      -0.4        1.46        mpstat.cpu.all.usr%
      3171            -5.3%       3003 ±  2%     +17.4%       3725 ± 30%      +2.6%       3255 ±  5%  vmstat.system.cs
     93.97 ±130%    +360.8%     433.04 ± 83%   +5204.4%       4984 ±150%   +1540.1%       1541 ± 56%  boot-time.boot
      6762 ±101%     +96.3%      13275 ± 75%   +3212.0%     223971 ±150%    +752.6%      57655 ± 60%  boot-time.idle
     84.83 ±  9%     +55.8%     132.17 ± 16%     +75.6%     149.00 ± 11%     +98.0%     168.00 ±  6%  perf-c2c.DRAM.local
    484.17 ±  3%     +37.1%     663.67 ± 10%     +44.1%     697.67 ±  7%      -0.2%     483.00 ±  5%  perf-c2c.DRAM.remote
     72763 ±  5%     +14.4%      83212 ± 12%    +141.5%     175744 ± 83%     +55.7%     113321 ± 21%  turbostat.C1
      0.08           -25.0%       0.06           -27.1%       0.06 ±  6%     -25.0%       0.06        turbostat.IPC
     27.90            +4.6%      29.18            +4.9%      29.27            +3.9%      29.00        turbostat.RAMWatt
   3982212           -30.0%    2785941           -28.9%    2829631           -26.7%    2919929        will-it-scale.52.threads
     76580           -30.0%      53575           -28.9%      54415           -26.7%      56152        will-it-scale.per_thread_ops
   3982212           -30.0%    2785941           -28.9%    2829631           -26.7%    2919929        will-it-scale.workload
 1.175e+09 ±  2%     -28.6%  8.392e+08 ±  2%     -28.2%  8.433e+08 ±  2%     -25.4%  8.762e+08        numa-numastat.node0.local_node
 1.175e+09 ±  2%     -28.6%  8.394e+08 ±  2%     -28.3%  8.434e+08 ±  2%     -25.4%  8.766e+08        numa-numastat.node0.numa_hit
 1.231e+09 ±  2%     -31.3%  8.463e+08 ±  3%     -29.5%  8.683e+08 ±  3%     -27.7%  8.901e+08        numa-numastat.node1.local_node
 1.232e+09 ±  2%     -31.3%  8.466e+08 ±  3%     -29.5%  8.688e+08 ±  3%     -27.7%  8.907e+08        numa-numastat.node1.numa_hit
 2.408e+09           -30.0%  1.686e+09           -28.9%  1.712e+09           -26.6%  1.767e+09        proc-vmstat.numa_hit
 2.406e+09           -30.0%  1.685e+09           -28.9%  1.712e+09           -26.6%  1.766e+09        proc-vmstat.numa_local
 2.404e+09           -29.9%  1.684e+09           -28.8%   1.71e+09           -26.6%  1.765e+09        proc-vmstat.pgalloc_normal
 2.404e+09           -29.9%  1.684e+09           -28.8%   1.71e+09           -26.6%  1.765e+09        proc-vmstat.pgfree
   2302080            -0.9%    2280448            -0.5%    2290432            -1.2%    2274688        proc-vmstat.unevictable_pgs_scanned
     83444 ± 71%     +34.2%     111978 ± 65%      -9.1%      75877 ± 86%     -76.2%      19883 ± 12%  numa-meminfo.node0.AnonHugePages
    150484 ± 55%      +9.3%     164434 ± 46%      -9.3%     136435 ± 53%     -62.4%      56548 ± 18%  numa-meminfo.node0.AnonPages
    167427 ± 50%      +8.2%     181159 ± 41%      -8.3%     153613 ± 47%     -56.1%      73487 ± 14%  numa-meminfo.node0.Inactive
    166720 ± 50%      +8.7%     181159 ± 41%      -8.3%     152902 ± 48%     -56.6%      72379 ± 14%  numa-meminfo.node0.Inactive(anon)
    111067 ± 62%     -13.7%      95819 ± 59%     +14.6%     127294 ± 60%     +86.1%     206693 ±  8%  numa-meminfo.node1.AnonHugePages
    179594 ± 47%      -4.2%     172027 ± 43%      +9.3%     196294 ± 39%     +55.8%     279767 ±  3%  numa-meminfo.node1.AnonPages
    257406 ± 30%      -2.1%     251990 ± 32%      +9.9%     282766 ± 26%     +42.2%     366131 ±  8%  numa-meminfo.node1.AnonPages.max
    196741 ± 43%      -3.6%     189753 ± 39%      +8.1%     212645 ± 36%     +50.9%     296827 ±  3%  numa-meminfo.node1.Inactive
    196385 ± 43%      -3.9%     188693 ± 39%      +8.1%     212288 ± 36%     +51.1%     296827 ±  3%  numa-meminfo.node1.Inactive(anon)
     37621 ± 55%      +9.3%      41115 ± 46%      -9.3%      34116 ± 53%     -62.4%      14141 ± 18%  numa-vmstat.node0.nr_anon_pages
     41664 ± 50%      +8.6%      45233 ± 41%      -8.2%      38240 ± 47%     -56.6%      18079 ± 14%  numa-vmstat.node0.nr_inactive_anon
     41677 ± 50%      +8.6%      45246 ± 41%      -8.2%      38250 ± 47%     -56.6%      18092 ± 14%  numa-vmstat.node0.nr_zone_inactive_anon
 1.175e+09 ±  2%     -28.6%  8.394e+08 ±  2%     -28.3%  8.434e+08 ±  2%     -25.4%  8.766e+08        numa-vmstat.node0.numa_hit
 1.175e+09 ±  2%     -28.6%  8.392e+08 ±  2%     -28.2%  8.433e+08 ±  2%     -25.4%  8.762e+08        numa-vmstat.node0.numa_local
     44903 ± 47%      -4.2%      43015 ± 43%      +9.3%      49079 ± 39%     +55.8%      69957 ±  3%  numa-vmstat.node1.nr_anon_pages
     49030 ± 43%      -3.9%      47139 ± 39%      +8.3%      53095 ± 36%     +51.4%      74210 ±  3%  numa-vmstat.node1.nr_inactive_anon
     49035 ± 43%      -3.9%      47135 ± 39%      +8.3%      53098 ± 36%     +51.3%      74212 ±  3%  numa-vmstat.node1.nr_zone_inactive_anon
 1.232e+09 ±  2%     -31.3%  8.466e+08 ±  3%     -29.5%  8.688e+08 ±  3%     -27.7%  8.907e+08        numa-vmstat.node1.numa_hit
 1.231e+09 ±  2%     -31.3%  8.463e+08 ±  3%     -29.5%  8.683e+08 ±  3%     -27.7%  8.901e+08        numa-vmstat.node1.numa_local
   5256095 ± 59%    +557.5%   34561019 ± 89%   +4549.1%  2.444e+08 ±146%   +1646.7%   91810708 ± 50%  sched_debug.cfs_rq:/.avg_vruntime.avg
   8288083 ± 52%    +365.0%   38543329 ± 81%   +3020.3%  2.586e+08 ±145%   +1133.9%  1.023e+08 ± 49%  sched_debug.cfs_rq:/.avg_vruntime.max
   1364475 ± 40%     +26.7%    1728262 ± 29%    +346.8%    6096205 ±118%    +180.4%    3826288 ± 41%  sched_debug.cfs_rq:/.avg_vruntime.stddev
    161.62 ± 99%     -42.4%      93.09 ±144%     -57.3%      69.01 ± 74%     -86.6%      21.73 ± 10%  sched_debug.cfs_rq:/.load_avg.avg
    902.70 ±107%     -46.8%     480.28 ±171%     -57.3%     385.28 ±120%     -94.8%      47.03 ±  8%  sched_debug.cfs_rq:/.load_avg.stddev
   5256095 ± 59%    +557.5%   34561019 ± 89%   +4549.1%  2.444e+08 ±146%   +1646.7%   91810708 ± 50%  sched_debug.cfs_rq:/.min_vruntime.avg
   8288083 ± 52%    +365.0%   38543329 ± 81%   +3020.3%  2.586e+08 ±145%   +1133.9%  1.023e+08 ± 49%  sched_debug.cfs_rq:/.min_vruntime.max
   1364475 ± 40%     +26.7%    1728262 ± 29%    +346.8%    6096205 ±118%    +180.4%    3826288 ± 41%  sched_debug.cfs_rq:/.min_vruntime.stddev
     31.84 ±161%     -71.8%       8.98 ± 44%     -84.0%       5.10 ± 43%     -79.0%       6.68 ± 24%  sched_debug.cfs_rq:/.removed.load_avg.avg
    272.14 ±192%     -84.9%      41.10 ± 29%     -89.7%      28.08 ± 21%     -87.8%      33.19 ± 12%  sched_debug.cfs_rq:/.removed.load_avg.stddev
    334.70 ± 17%     +32.4%     443.13 ± 19%     +34.3%     449.66 ± 11%     +14.6%     383.66 ± 24%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    322.95 ± 23%     +12.5%     363.30 ± 19%     +27.9%     412.92 ±  6%     +11.2%     359.17 ± 18%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
    240924 ± 52%    +136.5%     569868 ± 62%   +2031.9%    5136297 ±145%    +600.7%    1688103 ± 51%  sched_debug.cpu.clock.avg
    240930 ± 52%    +136.5%     569874 ± 62%   +2031.9%    5136304 ±145%    +600.7%    1688109 ± 51%  sched_debug.cpu.clock.max
    240917 ± 52%    +136.5%     569861 ± 62%   +2032.0%    5136290 ±145%    +600.7%    1688095 ± 51%  sched_debug.cpu.clock.min
    239307 ± 52%    +136.6%     566140 ± 62%   +2009.9%    5049095 ±145%    +600.7%    1676912 ± 51%  sched_debug.cpu.clock_task.avg
    239479 ± 52%    +136.5%     566334 ± 62%   +2014.9%    5064818 ±145%    +600.4%    1677208 ± 51%  sched_debug.cpu.clock_task.max
    232462 ± 53%    +140.6%     559281 ± 63%   +2064.0%    5030381 ±146%    +617.9%    1668793 ± 52%  sched_debug.cpu.clock_task.min
    683.22 ±  3%      +0.7%     688.14 ±  4%   +1762.4%      12724 ±138%     +19.2%     814.55 ±  8%  sched_debug.cpu.clock_task.stddev
      3267 ± 57%    +146.0%       8040 ± 63%   +2127.2%      72784 ±146%    +652.5%      24591 ± 52%  sched_debug.cpu.curr->pid.avg
     10463 ± 39%    +101.0%      21030 ± 54%   +1450.9%     162275 ±143%    +448.5%      57391 ± 49%  sched_debug.cpu.curr->pid.max
      3373 ± 57%    +149.1%       8403 ± 64%   +2141.6%      75621 ±146%    +657.7%      25561 ± 52%  sched_debug.cpu.curr->pid.stddev
     58697 ± 14%      +1.6%      59612 ±  7%  +1.9e+05%  1.142e+08 ±156%    +105.4%     120565 ± 32%  sched_debug.cpu.nr_switches.max
      6023 ± 10%     +13.6%       6843 ± 11%  +2.9e+05%   17701514 ±151%    +124.8%      13541 ± 32%  sched_debug.cpu.nr_switches.stddev
    240917 ± 52%    +136.5%     569862 ± 62%   +2032.0%    5136291 ±145%    +600.7%    1688096 ± 51%  sched_debug.cpu_clk
    240346 ± 52%    +136.9%     569288 ± 62%   +2036.8%    5135723 ±145%    +602.1%    1687529 ± 51%  sched_debug.ktime
    241481 ± 51%    +136.2%     570443 ± 62%   +2027.2%    5136856 ±145%    +599.3%    1688672 ± 51%  sched_debug.sched_clk
      0.04 ±  9%     -19.3%       0.03 ±  6%     -19.7%       0.03 ±  6%     -14.3%       0.03 ±  8%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.04 ± 11%     -18.0%       0.03 ± 13%     -22.8%       0.03 ± 10%     -14.0%       0.04 ± 15%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.04 ±  8%     -22.3%       0.03 ±  5%     -19.4%       0.03 ±  3%     -12.6%       0.04 ±  9%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.91 ±  2%     +11.3%       1.01 ±  5%     +65.3%       1.51 ± 53%     +28.8%       1.17 ± 11%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.04 ± 13%     -90.3%       0.00 ±223%     -66.4%       0.01 ±101%     -83.8%       0.01 ±223%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
     24.11 ±  3%      -8.5%      22.08 ± 11%     -25.2%      18.04 ± 50%     -29.5%      17.01 ± 21%  perf-sched.wait_and_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      1.14           +15.1%       1.31           -24.1%       0.86 ± 70%     +13.7%       1.29        perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    189.94 ±  3%     +18.3%     224.73 ±  4%     +20.3%     228.52 ±  3%     +22.1%     231.82 ±  3%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1652 ±  4%     -13.4%       1431 ±  4%     -13.4%       1431 ±  2%     -14.3%       1416 ±  6%  perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      1628 ±  8%     -15.0%       1383 ±  9%     -16.6%       1357 ±  2%     -16.6%       1358 ±  7%  perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
     83.67 ±  7%     -87.6%      10.33 ±223%     -59.2%      34.17 ±100%     -85.5%      12.17 ±223%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      2835 ±  3%     +10.6%       3135 ± 10%    +123.8%       6345 ± 80%     +48.4%       4207 ± 19%  perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
      3827 ±  4%     -13.0%       3328 ±  3%     -12.9%       3335 ±  2%     -14.7%       3264 ±  2%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.71 ±165%     -83.4%       0.28 ± 21%     -82.3%       0.30 ± 16%     -74.6%       0.43 ± 60%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.43 ± 17%     -43.8%       0.24 ± 26%     -44.4%       0.24 ± 27%     -32.9%       0.29 ± 23%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.46 ± 17%     -36.7%       0.29 ± 12%     -35.7%       0.30 ± 19%     -35.3%       0.30 ± 21%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
     45.41 ±  4%     +13.4%      51.51 ± 12%    +148.6%     112.88 ± 86%     +56.7%      71.18 ± 21%  perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.30 ± 34%     -90.7%       0.03 ±223%     -66.0%       0.10 ±110%     -88.2%       0.04 ±223%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      2.39           +10.7%       2.65 ±  2%     -24.3%       1.81 ± 70%     +12.1%       2.68 ±  2%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.04 ±  9%     -19.3%       0.03 ±  6%     -19.7%       0.03 ±  6%     -14.3%       0.03 ±  8%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.04 ± 11%     -18.0%       0.03 ± 13%     -22.8%       0.03 ± 10%     -14.0%       0.04 ± 15%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.04 ±  8%     -22.3%       0.03 ±  5%     -19.4%       0.03 ±  3%     -12.6%       0.04 ±  9%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.04 ± 11%     -33.1%       0.03 ± 17%     -32.3%       0.03 ± 22%     -16.3%       0.04 ± 12%  perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.90 ±  2%     +11.5%       1.00 ±  5%     +66.1%       1.50 ± 53%     +29.2%       1.16 ± 11%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.04 ± 13%     -26.6%       0.03 ± 12%     -33.6%       0.03 ± 11%     -18.1%       0.04 ± 16%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
     24.05 ±  3%      -9.0%      21.90 ± 10%     -25.0%      18.04 ± 50%     -29.4%      16.97 ± 21%  perf-sched.wait_time.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      1.13           +15.2%       1.30           +15.0%       1.30           +13.7%       1.29        perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    189.93 ±  3%     +18.3%     224.72 ±  4%     +20.3%     228.50 ±  3%     +22.1%     231.81 ±  3%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.71 ±165%     -83.4%       0.28 ± 21%     -82.3%       0.30 ± 16%     -74.6%       0.43 ± 60%  perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
      0.43 ± 17%     -43.8%       0.24 ± 26%     -44.4%       0.24 ± 27%     -32.9%       0.29 ± 23%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.46 ± 17%     -36.7%       0.29 ± 12%     -35.7%       0.30 ± 19%     -35.3%       0.30 ± 21%  perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      0.31 ± 26%     -42.1%       0.18 ± 58%     -64.1%       0.11 ± 40%     -28.5%       0.22 ± 30%  perf-sched.wait_time.max.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     45.41 ±  4%     +13.4%      51.50 ± 12%    +148.6%     112.87 ± 86%     +56.8%      71.18 ± 21%  perf-sched.wait_time.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      2.39           +10.7%       2.64 ±  2%     +12.9%       2.69 ±  2%     +12.1%       2.68 ±  2%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.75          +142.0%       1.83 ±  2%    +146.9%       1.86          +124.8%       1.70        perf-stat.i.MPKI
  8.47e+09           -24.4%  6.407e+09           -23.2%  6.503e+09           -21.2%  6.674e+09        perf-stat.i.branch-instructions
      0.66            -0.0        0.63            -0.0        0.64            -0.0        0.63        perf-stat.i.branch-miss-rate%
  56364992           -28.3%   40421603 ±  3%     -26.0%   41734061 ±  2%     -25.8%   41829975        perf-stat.i.branch-misses
     14.64            +6.7       21.30            +6.9       21.54            +6.5       21.10        perf-stat.i.cache-miss-rate%
  30868184           +81.3%   55977240 ±  3%     +87.7%   57950237           +76.2%   54404466        perf-stat.i.cache-misses
 2.107e+08           +24.7%  2.627e+08 ±  2%     +27.6%   2.69e+08           +22.3%  2.578e+08        perf-stat.i.cache-references
      3106            -5.5%       2934 ±  2%     +16.4%       3615 ± 29%      +2.4%       3181 ±  5%  perf-stat.i.context-switches
      3.55           +33.4%       4.74           +31.5%       4.67           +27.4%       4.52        perf-stat.i.cpi
      4722           -44.8%       2605 ±  3%     -46.7%       2515           -43.3%       2675        perf-stat.i.cycles-between-cache-misses
      0.04            -0.0        0.04            -0.0        0.04            -0.0        0.04        perf-stat.i.dTLB-load-miss-rate%
   4117232           -29.1%    2917107           -28.1%    2961876           -25.8%    3056956        perf-stat.i.dTLB-load-misses
 1.051e+10           -24.1%  7.979e+09           -23.0%    8.1e+09           -19.7%   8.44e+09        perf-stat.i.dTLB-loads
      0.00 ±  3%      +0.0        0.00 ±  6%      +0.0        0.00 ±  5%      +0.0        0.00 ±  4%  perf-stat.i.dTLB-store-miss-rate%
 5.886e+09           -27.5%  4.269e+09           -26.3%   4.34e+09           -24.1%  4.467e+09        perf-stat.i.dTLB-stores
     78.16            -6.6       71.51            -6.4       71.75            -5.9       72.23        perf-stat.i.iTLB-load-miss-rate%
   4131074 ±  3%     -30.0%    2891515           -29.2%    2922789           -26.2%    3048227        perf-stat.i.iTLB-load-misses
 4.098e+10           -25.0%  3.072e+10           -23.9%  3.119e+10           -21.6%  3.214e+10        perf-stat.i.instructions
      9929 ±  2%      +7.0%      10627            +7.5%      10673            +6.2%      10547        perf-stat.i.instructions-per-iTLB-miss
      0.28           -25.0%       0.21           -23.9%       0.21           -21.5%       0.22        perf-stat.i.ipc
     63.49           +43.8%      91.27 ±  3%     +48.2%      94.07           +38.6%      87.97        perf-stat.i.metric.K/sec
    241.12           -24.6%     181.87           -23.4%     184.70           -20.9%     190.75        perf-stat.i.metric.M/sec
     90.84            -0.4       90.49            -0.9       89.98            -2.9       87.93        perf-stat.i.node-load-miss-rate%
   3735316           +78.6%    6669641 ±  3%     +83.1%    6839047           +62.4%    6067727        perf-stat.i.node-load-misses
    377465 ±  4%     +86.1%     702512 ± 11%    +101.7%     761510 ±  4%    +120.8%     833359        perf-stat.i.node-loads
   1322217           -27.6%     957081 ±  5%     -22.9%    1019779 ±  2%     -19.4%    1066178        perf-stat.i.node-store-misses
     37459 ±  3%     -23.0%      28826 ±  5%     -19.2%      30253 ±  6%     -23.4%      28682 ±  3%  perf-stat.i.node-stores
      0.75          +141.8%       1.82 ±  2%    +146.6%       1.86          +124.7%       1.69        perf-stat.overall.MPKI
      0.67            -0.0        0.63            -0.0        0.64            -0.0        0.63        perf-stat.overall.branch-miss-rate%
     14.65            +6.7       21.30            +6.9       21.54            +6.5       21.11        perf-stat.overall.cache-miss-rate%
      3.55           +33.4%       4.73           +31.4%       4.66           +27.4%       4.52        perf-stat.overall.cpi
      4713           -44.8%       2601 ±  3%     -46.7%       2511           -43.3%       2671        perf-stat.overall.cycles-between-cache-misses
      0.04            -0.0        0.04            -0.0        0.04            -0.0        0.04        perf-stat.overall.dTLB-load-miss-rate%
      0.00 ±  3%      +0.0        0.00 ±  5%      +0.0        0.00 ±  5%      +0.0        0.00        perf-stat.overall.dTLB-store-miss-rate%
     78.14            -6.7       71.47            -6.4       71.70            -5.9       72.20        perf-stat.overall.iTLB-load-miss-rate%
      9927 ±  2%      +7.0%      10624            +7.5%      10672            +6.2%      10547        perf-stat.overall.instructions-per-iTLB-miss
      0.28           -25.0%       0.21           -23.9%       0.21           -21.5%       0.22        perf-stat.overall.ipc
     90.82            -0.3       90.49            -0.8       89.98            -2.9       87.92        perf-stat.overall.node-load-miss-rate%
   3098901            +7.1%    3318983            +6.9%    3313112            +7.0%    3316044        perf-stat.overall.path-length
 8.441e+09           -24.4%  6.385e+09           -23.2%   6.48e+09           -21.2%  6.652e+09        perf-stat.ps.branch-instructions
  56179581           -28.3%   40286337 ±  3%     -26.0%   41593521 ±  2%     -25.8%   41687151        perf-stat.ps.branch-misses
  30759982           +81.3%   55777812 ±  3%     +87.7%   57746279           +76.3%   54217757        perf-stat.ps.cache-misses
   2.1e+08           +24.6%  2.618e+08 ±  2%     +27.6%   2.68e+08           +22.3%  2.569e+08        perf-stat.ps.cache-references
      3095            -5.5%       2923 ±  2%     +16.2%       3597 ± 29%      +2.3%       3167 ±  5%  perf-stat.ps.context-switches
    135.89            -0.8%     134.84            -0.7%     134.99            -1.0%     134.55        perf-stat.ps.cpu-migrations
   4103292           -29.1%    2907270           -28.1%    2951746           -25.7%    3046739        perf-stat.ps.dTLB-load-misses
 1.048e+10           -24.1%  7.952e+09           -23.0%  8.072e+09           -19.7%  8.412e+09        perf-stat.ps.dTLB-loads
 5.866e+09           -27.5%  4.255e+09           -26.3%  4.325e+09           -24.1%  4.452e+09        perf-stat.ps.dTLB-stores
   4117020 ±  3%     -30.0%    2881750           -29.3%    2912744           -26.2%    3037970        perf-stat.ps.iTLB-load-misses
 4.084e+10           -25.0%  3.062e+10           -23.9%  3.109e+10           -21.6%  3.203e+10        perf-stat.ps.instructions
   3722149           +78.5%    6645867 ±  3%     +83.1%    6814976           +62.5%    6046854        perf-stat.ps.node-load-misses
    376240 ±  4%     +86.1%     700053 ± 11%    +101.7%     758898 ±  4%    +120.8%     830575        perf-stat.ps.node-loads
   1317772           -27.6%     953773 ±  5%     -22.9%    1016183 ±  2%     -19.4%    1062457        perf-stat.ps.node-store-misses
     37408 ±  3%     -23.2%      28748 ±  5%     -19.3%      30192 ±  6%     -23.5%      28607 ±  3%  perf-stat.ps.node-stores
 1.234e+13           -25.1%  9.246e+12           -24.0%  9.375e+12           -21.5%  9.683e+12        perf-stat.total.instructions
      1.28            -0.4        0.90 ±  2%      -0.4        0.91            -0.3        0.94 ±  2%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
      1.26 ±  2%      -0.4        0.90 ±  3%      -0.3        0.92 ±  2%      -0.3        0.94 ±  2%  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.08 ±  2%      -0.3        0.77 ±  3%      -0.3        0.79 ±  2%      -0.3        0.81 ±  2%  perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.92 ±  2%      -0.3        0.62 ±  3%      -0.3        0.63            -0.3        0.66 ±  2%  perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.84 ±  3%      -0.2        0.61 ±  3%      -0.2        0.63 ±  2%      -0.2        0.65 ±  2%  perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     29.27            -0.2       29.09            -1.0       28.32            -0.2       29.04        perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.26            -0.2        1.08            -0.2        1.07            -0.2        1.10        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
      1.26            -0.2        1.08            -0.2        1.07            -0.2        1.10        perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
      1.24            -0.2        1.06            -0.2        1.05            -0.2        1.08        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
      1.23            -0.2        1.06            -0.2        1.05            -0.2        1.08        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
      1.24            -0.2        1.06            -0.2        1.05            -0.2        1.08        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
     29.15            -0.2       28.99            -0.9       28.23            -0.2       28.94        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      1.20            -0.2        1.04 ±  2%      -0.2        1.05            -0.2        1.02 ±  2%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
     27.34            -0.1       27.22 ±  2%      -0.9       26.49            -0.1       27.20        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
     27.36            -0.1       27.24 ±  2%      -0.9       26.51            -0.1       27.22        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
     27.28            -0.1       27.17 ±  2%      -0.8       26.44            -0.1       27.16        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
     25.74            -0.1       25.67 ±  2%      +0.2       25.98            +0.9       26.62        perf-profile.calltrace.cycles-pp.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
     23.43            +0.0       23.43 ±  2%      +0.3       23.70            +0.9       24.34        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range
     23.45            +0.0       23.45 ±  2%      +0.3       23.73            +0.9       24.35        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
     23.37            +0.0       23.39 ±  2%      +0.3       23.67            +0.9       24.30        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release
      0.68 ±  3%      +0.0        0.72 ±  4%      +0.1        0.73 ±  3%      +0.1        0.74        perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
      1.08            +0.1        1.20            +0.1        1.17            +0.1        1.15 ±  2%  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      2.91            +0.3        3.18 ±  2%      +0.3        3.23            +0.1        3.02        perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
      2.56            +0.4        2.92 ±  2%      +0.4        2.98            +0.2        2.75        perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
      1.36 ±  3%      +0.4        1.76 ±  9%      +0.4        1.75 ±  5%      +0.3        1.68 ±  3%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      2.22            +0.5        2.68 ±  2%      +0.5        2.73            +0.3        2.50        perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
      0.00            +0.6        0.60 ±  2%      +0.6        0.61 ±  2%      +0.6        0.61        perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
      2.33            +0.6        2.94            +0.6        2.96 ±  3%      +0.3        2.59        perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      0.00            +0.7        0.72 ±  2%      +0.7        0.72 ±  2%      +0.7        0.68 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
      0.69 ±  4%      +0.8        1.47 ±  3%      +0.8        1.48 ±  2%      +0.7        1.42        perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
      1.24 ±  2%      +0.8        2.04 ±  2%      +0.8        2.07 ±  2%      +0.6        1.82        perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
      0.00            +0.8        0.82 ±  4%      +0.8        0.85 ±  3%      +0.8        0.78 ±  2%  perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.17 ±  2%      +0.8        2.00 ±  2%      +0.9        2.04 ±  2%      +0.6        1.77        perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
      0.59 ±  4%      +0.9        1.53            +0.9        1.53 ±  4%      +0.8        1.37 ±  2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
      1.38            +1.0        2.33 ±  2%      +1.0        2.34 ±  3%      +0.6        1.94 ±  2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
      0.62 ±  3%      +1.0        1.66 ±  5%      +1.1        1.68 ±  4%      +1.0        1.57 ±  2%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
     38.70            +1.2       39.90            +0.5       39.23            +0.7       39.45        perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
     38.34            +1.3       39.65            +0.6       38.97            +0.9       39.20        perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
     37.24            +1.6       38.86            +0.9       38.17            +1.1       38.35        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
     36.64            +1.8       38.40            +1.1       37.72            +1.2       37.88        perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
      2.47 ±  2%      +2.1        4.59 ±  8%      +2.1        4.61 ±  5%      +1.9        4.37 ±  2%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
      1.30            -0.4        0.92 ±  2%      -0.4        0.93            -0.4        0.96        perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.28 ±  2%      -0.4        0.90 ±  3%      -0.3        0.93 ±  2%      -0.3        0.95 ±  2%  perf-profile.children.cycles-pp.shmem_alloc_folio
     30.44            -0.3       30.11            -1.1       29.33            -0.4       30.07        perf-profile.children.cycles-pp.folio_batch_move_lru
      1.10 ±  2%      -0.3        0.78 ±  3%      -0.3        0.81 ±  2%      -0.3        0.82 ±  2%  perf-profile.children.cycles-pp.alloc_pages_mpol
      0.96 ±  2%      -0.3        0.64 ±  3%      -0.3        0.65            -0.3        0.68 ±  2%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
      0.88            -0.3        0.58 ±  2%      -0.3        0.60 ±  2%      -0.3        0.62 ±  2%  perf-profile.children.cycles-pp.xas_store
      0.88 ±  3%      -0.2        0.64 ±  3%      -0.2        0.66 ±  2%      -0.2        0.67 ±  2%  perf-profile.children.cycles-pp.__alloc_pages
     29.29            -0.2       29.10            -1.0       28.33            -0.2       29.06        perf-profile.children.cycles-pp.folio_add_lru
      0.61 ±  2%      -0.2        0.43 ±  3%      -0.2        0.44 ±  2%      -0.2        0.45 ±  3%  perf-profile.children.cycles-pp.__entry_text_start
      1.26            -0.2        1.09            -0.2        1.08            -0.2        1.10        perf-profile.children.cycles-pp.lru_add_drain_cpu
      0.56            -0.2        0.39 ±  4%      -0.2        0.40 ±  3%      -0.2        0.40 ±  3%  perf-profile.children.cycles-pp.free_unref_page_list
      1.22            -0.2        1.06 ±  2%      -0.2        1.06            -0.2        1.04 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.46            -0.1        0.32 ±  3%      -0.1        0.32            -0.1        0.32 ±  3%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.41 ±  3%      -0.1        0.28 ±  4%      -0.1        0.28 ±  3%      -0.1        0.29 ±  2%  perf-profile.children.cycles-pp.xas_load
      0.44 ±  4%      -0.1        0.31 ±  4%      -0.1        0.32 ±  2%      -0.1        0.34 ±  3%  perf-profile.children.cycles-pp.find_lock_entries
      0.50 ±  3%      -0.1        0.37 ±  2%      -0.1        0.39 ±  4%      -0.1        0.39 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.24 ±  7%      -0.1        0.12 ±  5%      -0.1        0.13 ±  2%      -0.1        0.13 ±  3%  perf-profile.children.cycles-pp.__list_add_valid_or_report
     25.89            -0.1       25.78 ±  2%      +0.2       26.08            +0.8       26.73        perf-profile.children.cycles-pp.release_pages
      0.34 ±  2%      -0.1        0.24 ±  4%      -0.1        0.23 ±  2%      -0.1        0.23 ±  4%  perf-profile.children.cycles-pp.__mod_node_page_state
      0.38 ±  3%      -0.1        0.28 ±  4%      -0.1        0.29 ±  3%      -0.1        0.28        perf-profile.children.cycles-pp._raw_spin_lock
      0.32 ±  2%      -0.1        0.22 ±  5%      -0.1        0.23 ±  2%      -0.1        0.23 ±  2%  perf-profile.children.cycles-pp.__dquot_alloc_space
      0.26 ±  2%      -0.1        0.17 ±  2%      -0.1        0.18 ±  3%      -0.1        0.18 ±  2%  perf-profile.children.cycles-pp.xas_descend
      0.22 ±  3%      -0.1        0.14 ±  4%      -0.1        0.14 ±  3%      -0.1        0.14 ±  2%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.25            -0.1        0.17 ±  3%      -0.1        0.18 ±  4%      -0.1        0.18 ±  4%  perf-profile.children.cycles-pp.xas_clear_mark
      0.32 ±  4%      -0.1        0.25 ±  3%      -0.1        0.26 ±  4%      -0.1        0.26 ±  2%  perf-profile.children.cycles-pp.rmqueue
      0.23 ±  2%      -0.1        0.16 ±  2%      -0.1        0.16 ±  4%      -0.1        0.16 ±  6%  perf-profile.children.cycles-pp.xas_init_marks
      0.24 ±  2%      -0.1        0.17 ±  5%      -0.1        0.17 ±  4%      -0.1        0.18 ±  2%  perf-profile.children.cycles-pp.__cond_resched
      0.25 ±  4%      -0.1        0.18 ±  2%      -0.1        0.18 ±  2%      -0.1        0.18 ±  4%  perf-profile.children.cycles-pp.truncate_cleanup_folio
      0.30 ±  3%      -0.1        0.23 ±  4%      -0.1        0.22 ±  3%      -0.1        0.22 ±  2%  perf-profile.children.cycles-pp.filemap_get_entry
      0.20 ±  2%      -0.1        0.13 ±  5%      -0.1        0.13 ±  3%      -0.1        0.14 ±  4%  perf-profile.children.cycles-pp.folio_unlock
      0.16 ±  4%      -0.1        0.10 ±  5%      -0.1        0.10 ±  7%      -0.1        0.11 ±  6%  perf-profile.children.cycles-pp.xas_find_conflict
      0.19 ±  3%      -0.1        0.13 ±  5%      -0.0        0.14 ± 12%      -0.1        0.14 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.17 ±  5%      -0.1        0.12 ±  3%      -0.1        0.12 ±  4%      -0.0        0.13 ±  3%  perf-profile.children.cycles-pp.noop_dirty_folio
      0.13 ±  4%      -0.1        0.08 ±  9%      -0.1        0.08 ±  8%      -0.0        0.09        perf-profile.children.cycles-pp.security_vm_enough_memory_mm
      0.18 ±  8%      -0.1        0.13 ±  4%      -0.0        0.13 ±  5%      -0.0        0.13 ±  5%  perf-profile.children.cycles-pp.shmem_recalc_inode
      0.16 ±  2%      -0.1        0.11 ±  3%      -0.0        0.12 ±  4%      -0.0        0.12 ±  6%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.09 ±  5%      -0.1        0.04 ± 45%      -0.0        0.05            -0.0        0.05 ±  7%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
      0.10 ±  7%      -0.0        0.05 ± 45%      -0.0        0.06 ± 13%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp.cap_vm_enough_memory
      0.14 ±  5%      -0.0        0.10            -0.0        0.10 ±  4%      -0.0        0.11 ±  5%  perf-profile.children.cycles-pp.__folio_cancel_dirty
      0.14 ±  5%      -0.0        0.10 ±  4%      -0.0        0.10 ±  3%      -0.0        0.10 ±  6%  perf-profile.children.cycles-pp.security_file_permission
      0.10 ±  5%      -0.0        0.06 ±  6%      -0.0        0.06 ±  7%      -0.0        0.07 ± 10%  perf-profile.children.cycles-pp.xas_find
      0.15 ±  4%      -0.0        0.11 ±  3%      -0.0        0.11 ±  6%      -0.0        0.11 ±  3%  perf-profile.children.cycles-pp.__fget_light
      0.12 ±  3%      -0.0        0.09 ±  7%      -0.0        0.09 ±  7%      -0.0        0.09 ±  6%  perf-profile.children.cycles-pp.__vm_enough_memory
      0.12 ±  3%      -0.0        0.09 ±  4%      -0.0        0.09 ±  4%      -0.0        0.09 ±  6%  perf-profile.children.cycles-pp.apparmor_file_permission
      0.12 ±  3%      -0.0        0.08 ±  5%      -0.0        0.08 ±  5%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.14 ±  5%      -0.0        0.11 ±  3%      -0.0        0.11 ±  4%      -0.0        0.12 ±  3%  perf-profile.children.cycles-pp.file_modified
      0.12 ±  4%      -0.0        0.08 ±  4%      -0.0        0.08 ±  7%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.xas_start
      0.09            -0.0        0.06 ±  8%      -0.0        0.04 ± 45%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.12 ±  4%      -0.0        0.08 ±  4%      -0.0        0.08 ±  4%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
      0.12 ±  6%      -0.0        0.08 ±  8%      -0.0        0.08 ±  8%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp._raw_spin_trylock
      0.12 ±  4%      -0.0        0.09 ±  4%      -0.0        0.09 ±  4%      -0.0        0.09        perf-profile.children.cycles-pp.inode_add_bytes
      0.20 ±  2%      -0.0        0.17 ±  7%      -0.0        0.17 ±  4%      -0.0        0.18 ±  3%  perf-profile.children.cycles-pp.try_charge_memcg
      0.10 ±  5%      -0.0        0.07 ±  7%      -0.0        0.07 ±  7%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp.policy_nodemask
      0.09 ±  6%      -0.0        0.06 ±  6%      -0.0        0.06 ±  7%      -0.0        0.06 ±  6%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.09 ±  6%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%      -0.0        0.07 ±  5%  perf-profile.children.cycles-pp.filemap_free_folio
      0.07 ±  6%      -0.0        0.05 ±  7%      -0.0        0.06 ±  9%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.down_write
      0.08 ±  4%      -0.0        0.06 ±  8%      -0.0        0.06 ±  9%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.get_task_policy
      0.09 ±  7%      -0.0        0.07            -0.0        0.07 ±  7%      -0.0        0.07        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.09 ±  7%      -0.0        0.07            -0.0        0.07 ±  5%      -0.0        0.08 ±  6%  perf-profile.children.cycles-pp.inode_needs_update_time
      0.09 ±  5%      -0.0        0.07 ±  5%      -0.0        0.08 ±  4%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.xas_create
      0.16 ±  2%      -0.0        0.14 ±  5%      -0.0        0.14 ±  2%      -0.0        0.15 ±  4%  perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.08 ±  7%      -0.0        0.06 ±  9%      -0.0        0.06 ±  6%      -0.0        0.06        perf-profile.children.cycles-pp.percpu_counter_add_batch
      0.07 ±  5%      -0.0        0.05 ±  7%      -0.0        0.03 ± 70%      -0.0        0.06 ± 14%  perf-profile.children.cycles-pp.folio_mark_dirty
      0.08 ± 10%      -0.0        0.06 ±  6%      -0.0        0.06 ± 13%      -0.0        0.05        perf-profile.children.cycles-pp.shmem_is_huge
      0.07 ±  6%      +0.0        0.09 ± 10%      +0.0        0.09 ±  5%      +0.0        0.09 ±  6%  perf-profile.children.cycles-pp.propagate_protected_usage
      0.43 ±  3%      +0.0        0.46 ±  5%      +0.0        0.47 ±  3%      +0.0        0.48 ±  2%  perf-profile.children.cycles-pp.uncharge_batch
      0.68 ±  3%      +0.0        0.73 ±  4%      +0.0        0.74 ±  3%      +0.1        0.74        perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
      1.11            +0.1        1.22            +0.1        1.19            +0.1        1.17 ±  2%  perf-profile.children.cycles-pp.lru_add_fn
      2.91            +0.3        3.18 ±  2%      +0.3        3.23            +0.1        3.02        perf-profile.children.cycles-pp.truncate_inode_folio
      2.56            +0.4        2.92 ±  2%      +0.4        2.98            +0.2        2.75        perf-profile.children.cycles-pp.filemap_remove_folio
      1.37 ±  3%      +0.4        1.76 ±  9%      +0.4        1.76 ±  5%      +0.3        1.69 ±  2%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      2.24            +0.5        2.70 ±  2%      +0.5        2.75            +0.3        2.51        perf-profile.children.cycles-pp.__filemap_remove_folio
      2.38            +0.6        2.97            +0.6        2.99 ±  3%      +0.2        2.63        perf-profile.children.cycles-pp.shmem_add_to_page_cache
      0.18 ±  4%      +0.7        0.91 ±  4%      +0.8        0.94 ±  4%      +0.7        0.87 ±  2%  perf-profile.children.cycles-pp.__count_memcg_events
      1.26            +0.8        2.04 ±  2%      +0.8        2.08 ±  2%      +0.6        1.82        perf-profile.children.cycles-pp.filemap_unaccount_folio
      0.63 ±  2%      +1.0        1.67 ±  5%      +1.1        1.68 ±  5%      +1.0        1.58 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
     38.71            +1.2       39.91            +0.5       39.23            +0.7       39.46        perf-profile.children.cycles-pp.vfs_fallocate
     38.37            +1.3       39.66            +0.6       38.99            +0.8       39.21        perf-profile.children.cycles-pp.shmem_fallocate
     37.28            +1.6       38.89            +0.9       38.20            +1.1       38.39        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     36.71            +1.7       38.45            +1.1       37.77            +1.2       37.94        perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
      2.58            +1.8        4.36 ±  2%      +1.8        4.40 ±  3%      +1.2        3.74        perf-profile.children.cycles-pp.__mod_lruvec_page_state
      2.48 ±  2%      +2.1        4.60 ±  8%      +2.1        4.62 ±  5%      +1.9        4.38 ±  2%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      1.93 ±  3%      +2.4        4.36 ±  2%      +2.5        4.38 ±  3%      +2.2        4.09        perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      1.30            -0.4        0.92 ±  2%      -0.4        0.93            -0.3        0.95        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.73            -0.2        0.52 ±  2%      -0.2        0.53            -0.2        0.54 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.54 ±  2%      -0.2        0.36 ±  3%      -0.2        0.36 ±  3%      -0.2        0.37 ±  2%  perf-profile.self.cycles-pp.release_pages
      0.48            -0.2        0.30 ±  3%      -0.2        0.32 ±  3%      -0.2        0.33 ±  2%  perf-profile.self.cycles-pp.xas_store
      0.54 ±  2%      -0.2        0.38 ±  3%      -0.1        0.39 ±  2%      -0.1        0.39 ±  3%  perf-profile.self.cycles-pp.__entry_text_start
      1.17            -0.1        1.03 ±  2%      -0.1        1.03            -0.2        1.00 ±  2%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.36 ±  2%      -0.1        0.22 ±  3%      -0.1        0.22 ±  3%      -0.1        0.24 ±  2%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
      0.43 ±  5%      -0.1        0.30 ±  7%      -0.2        0.27 ±  7%      -0.1        0.29 ±  2%  perf-profile.self.cycles-pp.lru_add_fn
      0.24 ±  7%      -0.1        0.12 ±  6%      -0.1        0.13 ±  2%      -0.1        0.12 ±  6%  perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.38 ±  4%      -0.1        0.27 ±  4%      -0.1        0.28 ±  3%      -0.1        0.28 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.52 ±  3%      -0.1        0.41            -0.1        0.41            -0.1        0.43 ±  3%  perf-profile.self.cycles-pp.folio_batch_move_lru
      0.32 ±  2%      -0.1        0.22 ±  4%      -0.1        0.22 ±  3%      -0.1        0.22 ±  5%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.36 ±  2%      -0.1        0.26 ±  2%      -0.1        0.26 ±  2%      -0.1        0.27        perf-profile.self.cycles-pp.shmem_fallocate
      0.36 ±  4%      -0.1        0.26 ±  4%      -0.1        0.26 ±  3%      -0.1        0.27 ±  3%  perf-profile.self.cycles-pp.find_lock_entries
      0.28 ±  3%      -0.1        0.20 ±  5%      -0.1        0.20 ±  2%      -0.1        0.21 ±  3%  perf-profile.self.cycles-pp.__alloc_pages
      0.24 ±  2%      -0.1        0.16 ±  4%      -0.1        0.16 ±  4%      -0.1        0.16 ±  3%  perf-profile.self.cycles-pp.xas_descend
      0.09 ±  5%      -0.1        0.01 ±223%      -0.1        0.03 ± 70%      -0.1        0.03 ± 70%  perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
      0.23 ±  2%      -0.1        0.16 ±  3%      -0.1        0.16 ±  2%      -0.1        0.16 ±  4%  perf-profile.self.cycles-pp.xas_clear_mark
      0.18 ±  3%      -0.1        0.11 ±  6%      -0.1        0.12 ±  4%      -0.1        0.11 ±  4%  perf-profile.self.cycles-pp.free_unref_page_commit
      0.18 ±  3%      -0.1        0.12 ±  4%      -0.1        0.12 ±  3%      -0.0        0.13 ±  5%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
      0.21 ±  3%      -0.1        0.15 ±  2%      -0.1        0.15 ±  2%      -0.1        0.16 ±  3%  perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
      0.18 ±  2%      -0.1        0.12 ±  3%      -0.1        0.12 ±  4%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp.__filemap_remove_folio
      0.18 ±  7%      -0.1        0.12 ±  7%      -0.0        0.13 ±  5%      -0.1        0.12 ±  3%  perf-profile.self.cycles-pp.vfs_fallocate
      0.18 ±  2%      -0.1        0.13 ±  3%      -0.1        0.13            -0.1        0.13 ±  5%  perf-profile.self.cycles-pp.folio_unlock
      0.20 ±  2%      -0.1        0.14 ±  6%      -0.1        0.15 ±  3%      -0.1        0.15 ±  6%  perf-profile.self.cycles-pp.__dquot_alloc_space
      0.18 ±  2%      -0.1        0.12 ±  3%      -0.1        0.13 ±  3%      -0.0        0.13 ±  4%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.15 ±  3%      -0.1        0.10 ±  7%      -0.0        0.10 ±  3%      -0.0        0.10 ±  3%  perf-profile.self.cycles-pp.xas_load
      0.17 ±  3%      -0.1        0.12 ±  8%      -0.1        0.12 ±  3%      -0.0        0.12 ±  4%  perf-profile.self.cycles-pp.__cond_resched
      0.17 ±  2%      -0.1        0.12 ±  3%      -0.1        0.12 ±  7%      -0.0        0.13 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.17 ±  5%      -0.1        0.12 ±  3%      -0.0        0.12 ±  4%      -0.0        0.12 ±  6%  perf-profile.self.cycles-pp.noop_dirty_folio
      0.10 ±  7%      -0.0        0.05 ± 45%      -0.0        0.06 ± 13%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.cap_vm_enough_memory
      0.12 ±  3%      -0.0        0.08 ±  4%      -0.0        0.08            -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.rmqueue
      0.06            -0.0        0.02 ±141%      -0.0        0.03 ± 70%      -0.0        0.04 ± 44%  perf-profile.self.cycles-pp.inode_needs_update_time
      0.07 ±  5%      -0.0        0.02 ± 99%      -0.0        0.05            -0.0        0.05 ±  7%  perf-profile.self.cycles-pp.xas_find
      0.13 ±  3%      -0.0        0.09 ±  6%      -0.0        0.10 ±  5%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.alloc_pages_mpol
      0.07 ±  6%      -0.0        0.03 ± 70%      -0.0        0.04 ± 44%      -0.0        0.05        perf-profile.self.cycles-pp.xas_find_conflict
      0.16 ±  2%      -0.0        0.12 ±  6%      -0.0        0.12 ±  3%      -0.0        0.13 ±  5%  perf-profile.self.cycles-pp.free_unref_page_list
      0.12 ±  5%      -0.0        0.08 ±  4%      -0.0        0.08 ±  4%      -0.0        0.09 ±  7%  perf-profile.self.cycles-pp.fallocate64
      0.20 ±  4%      -0.0        0.16 ±  3%      -0.0        0.16 ±  3%      -0.0        0.18 ±  4%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.06 ±  7%      -0.0        0.02 ± 99%      -0.0        0.02 ± 99%      -0.0        0.03 ± 70%  perf-profile.self.cycles-pp.shmem_recalc_inode
      0.13 ±  3%      -0.0        0.09            -0.0        0.09 ±  6%      -0.0        0.09 ±  6%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.22 ±  3%      -0.0        0.19 ±  6%      -0.0        0.20 ±  3%      -0.0        0.21 ±  4%  perf-profile.self.cycles-pp.page_counter_uncharge
      0.14 ±  3%      -0.0        0.10 ±  6%      -0.0        0.10 ±  8%      -0.0        0.10 ±  4%  perf-profile.self.cycles-pp.filemap_remove_folio
      0.15 ±  5%      -0.0        0.11 ±  3%      -0.0        0.11 ±  6%      -0.0        0.11 ±  3%  perf-profile.self.cycles-pp.__fget_light
      0.12 ±  4%      -0.0        0.08            -0.0        0.08 ±  5%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.__folio_cancel_dirty
      0.11 ±  4%      -0.0        0.08 ±  7%      -0.0        0.08 ±  8%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp._raw_spin_trylock
      0.11 ±  3%      -0.0        0.08 ±  6%      -0.0        0.07 ±  9%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.xas_start
      0.11 ±  3%      -0.0        0.08 ±  6%      -0.0        0.08 ±  6%      -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
      0.12 ±  3%      -0.0        0.09 ±  5%      -0.0        0.08 ±  5%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.__mod_lruvec_state
      0.11 ±  5%      -0.0        0.08 ±  4%      -0.0        0.08 ±  6%      -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.truncate_cleanup_folio
      0.10 ±  6%      -0.0        0.07 ±  5%      -0.0        0.07 ±  7%      -0.0        0.07 ± 11%  perf-profile.self.cycles-pp.xas_init_marks
      0.09 ±  6%      -0.0        0.06 ±  6%      -0.0        0.06 ±  7%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
      0.11            -0.0        0.08 ±  5%      -0.0        0.08            -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.folio_add_lru
      0.09 ±  6%      -0.0        0.06 ±  7%      -0.0        0.06 ±  7%      -0.0        0.07 ±  5%  perf-profile.self.cycles-pp.filemap_free_folio
      0.09 ±  4%      -0.0        0.06 ±  6%      -0.0        0.06 ±  6%      -0.0        0.06 ±  6%  perf-profile.self.cycles-pp.shmem_alloc_folio
      0.10 ±  4%      -0.0        0.08 ±  4%      -0.0        0.08 ±  6%      -0.0        0.08 ±  7%  perf-profile.self.cycles-pp.apparmor_file_permission
      0.14 ±  5%      -0.0        0.12 ±  5%      -0.0        0.12 ±  3%      -0.0        0.13 ±  4%  perf-profile.self.cycles-pp.cgroup_rstat_updated
      0.07 ±  7%      -0.0        0.04 ± 44%      -0.0        0.04 ± 44%      -0.0        0.04 ± 71%  perf-profile.self.cycles-pp.policy_nodemask
      0.07 ± 11%      -0.0        0.04 ± 45%      -0.0        0.05 ±  7%      -0.0        0.03 ± 70%  perf-profile.self.cycles-pp.shmem_is_huge
      0.08 ±  4%      -0.0        0.06 ±  8%      -0.0        0.06 ±  9%      -0.0        0.06 ±  9%  perf-profile.self.cycles-pp.get_task_policy
      0.08 ±  6%      -0.0        0.05 ±  8%      -0.0        0.06 ±  8%      -0.0        0.05 ±  8%  perf-profile.self.cycles-pp.__x64_sys_fallocate
      0.12 ±  3%      -0.0        0.10 ±  6%      -0.0        0.10 ±  6%      -0.0        0.10 ±  3%  perf-profile.self.cycles-pp.try_charge_memcg
      0.07            -0.0        0.05            -0.0        0.05            -0.0        0.04 ± 45%  perf-profile.self.cycles-pp.free_unref_page_prepare
      0.07 ±  6%      -0.0        0.06 ±  9%      -0.0        0.06 ±  8%      -0.0        0.06 ±  9%  perf-profile.self.cycles-pp.percpu_counter_add_batch
      0.08 ±  4%      -0.0        0.06            -0.0        0.06 ±  6%      -0.0        0.06        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.09 ±  7%      -0.0        0.07 ±  5%      -0.0        0.07 ±  5%      -0.0        0.07 ±  7%  perf-profile.self.cycles-pp.filemap_get_entry
      0.07 ±  9%      +0.0        0.09 ± 10%      +0.0        0.09 ±  5%      +0.0        0.09 ±  6%  perf-profile.self.cycles-pp.propagate_protected_usage
      0.96 ±  2%      +0.2        1.12 ±  7%      +0.2        1.16 ±  4%      -0.2        0.72 ±  2%  perf-profile.self.cycles-pp.__mod_lruvec_page_state
      0.45 ±  4%      +0.4        0.82 ±  8%      +0.4        0.81 ±  6%      +0.3        0.77 ±  3%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
      1.36 ±  3%      +0.4        1.75 ±  9%      +0.4        1.75 ±  5%      +0.3        1.68 ±  2%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.29            +0.7        1.00 ± 10%      +0.7        1.01 ±  7%      +0.6        0.93 ±  2%  perf-profile.self.cycles-pp.__mem_cgroup_charge
      0.16 ±  4%      +0.7        0.90 ±  4%      +0.8        0.92 ±  4%      +0.7        0.85 ±  2%  perf-profile.self.cycles-pp.__count_memcg_events
      1.80 ±  2%      +2.5        4.26 ±  2%      +2.5        4.28 ±  3%      +2.2        3.98        perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
Yosry Ahmed Oct. 25, 2023, 6:22 a.m. UTC | #36
On Tue, Oct 24, 2023 at 11:09 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yosry Ahmed,
> > >
> > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> > >
> > > ...
> > >
> > > >
> > > > I still could not run the benchmark, but I used a version of
> > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > > This showed ~13% regression with the patch, so not the same as the
> > > > will-it-scale version, but it could be an indicator.
> > > >
> > > > With that, I did not see any improvement with the fixlet above or
> > > > ___cacheline_aligned_in_smp. So you can scratch that.
> > > >
> > > > I did, however, see some improvement with reducing the indirection
> > > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > > regression in my manual testing went down to 9%. Still not great, but
> > > > I am wondering how this reflects on the benchmark. If you're able to
> > > > test it that would be great, the diff is below. Meanwhile I am still
> > > > looking for other improvements that can be made.
> > >
> > > we applied the previous patch-set as below:
> > >
> > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
> > >
> > > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > > but failed. could you guide how to apply this patch?
> > > Thanks
> > >
> >
> > Thanks for looking into this. I rebased the diff on top of
> > c5f50d8b23c79. Please find it attached.
>
> from our tests, this patch has little impact.
>
> it was applied as below ac6a9444dec85:
>
> ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
>
> for the first regression reported in the original report, data are very close
> for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
> and ac6a9444dec85.
> full comparison is as [1]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>      36509           -25.8%      27079           -25.2%      27305           -25.0%      27383        will-it-scale.per_thread_ops
>
> for the second regression reported in the original report, there seems to be
> a small impact from ac6a9444dec85.
> full comparison is as [2]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>      76580           -30.0%      53575           -28.9%      54415           -26.7%      56152        will-it-scale.per_thread_ops
>
> [1]
>

Thanks Oliver for running the numbers. If I understand correctly the
will-it-scale.fallocate1 microbenchmark is the only one showing
significant regression here, is this correct?

In my runs, other more representative microbenchmarks like netperf and
will-it-scale.page_fault* show minimal regression. I would
expect practical workloads to have high concurrency of page faults or
networking, but maybe not fallocate/ftruncate.

Oliver, in your experience, how often does such a regression in such a
microbenchmark translate to a real regression that people care about?
(or how often do people dismiss it?)

I tried optimizing this further for the fallocate/ftruncate case but
without luck. I even tried moving stats_updates into cgroup core
(struct cgroup_rstat_cpu) to reuse the existing loop in
cgroup_rstat_updated() -- but it somehow made it worse.

On the other hand, we do have some machines in production running this
series together with a previous optimization for non-hierarchical
stats [1] on an older kernel, and we do see significant reduction in
cpu time spent on reading the stats. Domenico did a similar experiment
with only this series and reported similar results [2].

Shakeel, Johannes, (and other memcg folks), I personally think the
benefits here outweigh a regression in this particular benchmark, but
I am obviously biased. What do you think?

[1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
[2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
Shakeel Butt Oct. 25, 2023, 5:06 p.m. UTC | #37
On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> Thanks Oliver for running the numbers. If I understand correctly the
> will-it-scale.fallocate1 microbenchmark is the only one showing
> significant regression here, is this correct?
>
> In my runs, other more representative microbenchmarks like netperf and
> will-it-scale.page_fault* show minimal regression. I would
> expect practical workloads to have high concurrency of page faults or
> networking, but maybe not fallocate/ftruncate.
>
> Oliver, in your experience, how often does such a regression in such a
> microbenchmark translate to a real regression that people care about?
> (or how often do people dismiss it?)
>
> I tried optimizing this further for the fallocate/ftruncate case but
> without luck. I even tried moving stats_updates into cgroup core
> (struct cgroup_rstat_cpu) to reuse the existing loop in
> cgroup_rstat_updated() -- but it somehow made it worse.
>
> On the other hand, we do have some machines in production running this
> series together with a previous optimization for non-hierarchical
> stats [1] on an older kernel, and we do see significant reduction in
> cpu time spent on reading the stats. Domenico did a similar experiment
> with only this series and reported similar results [2].
>
> Shakeel, Johannes, (and other memcg folks), I personally think the
> benefits here outweigh a regression in this particular benchmark, but
> I am obviously biased. What do you think?
>
> [1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
> [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/

I am still not convinced that the benefits outweigh the regression,
but I would not block this. So, let's do this: skip this open window,
get the patch series reviewed, and hopefully we can work together on
fixing that regression and make an informed decision about accepting
the regression for this series in the next cycle.
Yosry Ahmed Oct. 25, 2023, 6:36 p.m. UTC | #38
On Wed, Oct 25, 2023 at 10:06 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> >
> > Thanks Oliver for running the numbers. If I understand correctly the
> > will-it-scale.fallocate1 microbenchmark is the only one showing
> > significant regression here, is this correct?
> >
> > In my runs, other more representative microbenchmarks like netperf and
> > will-it-scale.page_fault* show minimal regression. I would
> > expect practical workloads to have high concurrency of page faults or
> > networking, but maybe not fallocate/ftruncate.
> >
> > Oliver, in your experience, how often does such a regression in such a
> > microbenchmark translate to a real regression that people care about?
> > (or how often do people dismiss it?)
> >
> > I tried optimizing this further for the fallocate/ftruncate case but
> > without luck. I even tried moving stats_updates into cgroup core
> > (struct cgroup_rstat_cpu) to reuse the existing loop in
> > cgroup_rstat_updated() -- but it somehow made it worse.
> >
> > On the other hand, we do have some machines in production running this
> > series together with a previous optimization for non-hierarchical
> > stats [1] on an older kernel, and we do see significant reduction in
> > cpu time spent on reading the stats. Domenico did a similar experiment
> > with only this series and reported similar results [2].
> >
> > Shakeel, Johannes, (and other memcg folks), I personally think the
> > benefits here outweigh a regression in this particular benchmark, but
> > I am obviously biased. What do you think?
> >
> > [1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
> > [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
>
> I am still not convinced that the benefits outweigh the regression,
> but I would not block this. So, let's do this: skip this open window,
> get the patch series reviewed, and hopefully we can work together on
> fixing that regression and make an informed decision about accepting
> the regression for this series in the next cycle.

Skipping this open window sounds okay to me.

FWIW, I think with this patch series we can keep the old behavior
(roughly) and hide the changes behind a tunable (config option or
sysfs file). I think the only changes that need to be done to the code
to approximate the previous behavior are:
- Use root when updating the pending stats in memcg_rstat_updated()
instead of the passed memcg.
- Use root in mem_cgroup_flush_stats() instead of the passed memcg.
- Use mutex_trylock() instead of mutex_lock() in mem_cgroup_flush_stats().

So I think it should be doable to hide most changes behind a tunable,
but let's not do this unless necessary.
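
For reference, the three changes above could be sketched roughly as below.
This is untested pseudocode against the helpers introduced in this series;
the `stats_flush_legacy` knob and its plumbing are hypothetical, and the
exact signatures would depend on which revision of the series it lands on:

```c
/* Pseudocode sketch only -- the knob and its wiring are hypothetical. */
static bool stats_flush_legacy;	/* e.g. a Kconfig default or sysfs file */

static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
	/*
	 * Old behavior: account every update against the root, so the
	 * threshold is effectively global again and the parent loop
	 * degenerates to a single iteration.
	 */
	if (stats_flush_legacy)
		memcg = root_mem_cgroup;
	/* ... per-memcg accounting from this series ... */
}

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
	/* Old behavior: only full-tree flushes, skip if one is ongoing. */
	if (stats_flush_legacy)
		memcg = root_mem_cgroup;
	if (memcg_should_flush_stats(memcg))
		do_flush_stats(memcg);
}
```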
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a393f1399a2b..9a586893bd3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,6 +627,9 @@  struct memcg_vmstats_percpu {
 	/* Cgroup1: threshold notifications & softlimit tree updates */
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
+
+	/* Stats updates since the last flush */
+	unsigned int		stats_updates;
 };
 
 struct memcg_vmstats {
@@ -641,6 +644,9 @@  struct memcg_vmstats {
 	/* Pending child counts during tree propagation */
 	long			state_pending[MEMCG_NR_STAT];
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
+
+	/* Stats updates since the last flush */
+	atomic64_t		stats_updates;
 };
 
 /*
@@ -660,9 +666,7 @@  struct memcg_vmstats {
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
@@ -689,26 +693,37 @@  static void memcg_stats_unlock(void)
 	preempt_enable_nested();
 }
 
+
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+	return atomic64_read(&memcg->vmstats->stats_updates) >
+		MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+	int cpu = smp_processor_id();
 	unsigned int x;
 
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+					  abs(val));
+
+		if (x < MEMCG_CHARGE_BATCH)
+			continue;
 
-	x = __this_cpu_add_return(stats_updates, abs(val));
-	if (x > MEMCG_CHARGE_BATCH) {
 		/*
-		 * If stats_flush_threshold exceeds the threshold
-		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
-		 * is redundant and simply adds overhead in atomic update.
+		 * If @memcg is already flush-able, increasing stats_updates is
+		 * redundant. Avoid the overhead of the atomic update.
 		 */
-		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
-			atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
-		__this_cpu_write(stats_updates, 0);
+		if (!memcg_should_flush_stats(memcg))
+			atomic64_add(x, &memcg->vmstats->stats_updates);
+		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
 	}
 }
 
@@ -727,13 +742,12 @@  static void do_flush_stats(void)
 
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+	if (memcg_should_flush_stats(root_mem_cgroup))
 		do_flush_stats();
 }
 
@@ -747,8 +761,8 @@  void mem_cgroup_flush_stats_ratelimited(void)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	/*
-	 * Always flush here so that flushing in latency-sensitive paths is
-	 * as cheap as possible.
+	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
+	 * in latency-sensitive paths is as cheap as possible.
 	 */
 	do_flush_stats();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5803,6 +5817,9 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			}
 		}
 	}
+	/* We are in a per-cpu loop here, only do the atomic write once */
+	if (atomic64_read(&memcg->vmstats->stats_updates))
+		atomic64_set(&memcg->vmstats->stats_updates, 0);
 }
 
 #ifdef CONFIG_MMU
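
To illustrate the scheme the patch implements, here is a small userspace
model of the per-memcg update batching. The names mirror the kernel code
for readability, but this is a simplified single-threaded stand-in: plain
variables replace the percpu counters and atomic64_t, and none of this is
the kernel's actual API.

```c
#include <assert.h>
#include <stdlib.h>

#define MEMCG_CHARGE_BATCH 64
#define NR_CPUS 4

struct memcg {
	struct memcg *parent;
	/* models memcg_vmstats_percpu->stats_updates */
	unsigned int percpu_updates[NR_CPUS];
	/* models memcg_vmstats->stats_updates (atomic64_t) */
	long long stats_updates;
};

static int should_flush(const struct memcg *memcg)
{
	return memcg->stats_updates > (long long)MEMCG_CHARGE_BATCH * NR_CPUS;
}

/*
 * Mirrors memcg_rstat_updated(): charge the per-cpu counter of every
 * ancestor, and fold a full batch into the shared counter unless that
 * memcg is already flushable.
 */
static void rstat_updated(struct memcg *memcg, int cpu, int val)
{
	if (!val)
		return;

	for (; memcg; memcg = memcg->parent) {
		unsigned int x = memcg->percpu_updates[cpu] += abs(val);

		if (x < MEMCG_CHARGE_BATCH)
			continue;
		if (!should_flush(memcg))
			memcg->stats_updates += x;
		memcg->percpu_updates[cpu] = 0;
	}
}
```

With MEMCG_CHARGE_BATCH = 64 and NR_CPUS = 4, a memcg (and each of its
ancestors) only becomes flushable once more than 256 updates have been
folded into its shared counter -- the per-memcg analogue of the old global
stats_flush_threshold check, which is why a subtree flush can now reset
just its own counters.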