
memcg: performance degradation since v5.9

Message ID 20210408193948.vfktg3azh2wrt56t@gabell (mailing list archive)
State New

Commit Message

Masayoshi Mizuma April 8, 2021, 7:39 p.m. UTC
Hello,

I detected a performance degradation issue for a benchmark of PostgreSQL [1],
and the issue seems to be related to the object-level memory cgroup work [2].
I would appreciate it if you could give me some ideas to solve it.

The benchmark reports transactions per second (tps), and the tps for v5.9
and later kernels is about 10%-20% lower than for v5.8.

The benchmark does sendto() and recvfrom() system calls repeatedly,
and the duration of the system calls gets longer than on v5.8.
The result of perf trace of the benchmark is as follows:

  - v5.8

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   sendto            699574      0  2595.220     0.001     0.004     0.462      0.03%
   recvfrom         1391089 694427  2163.458     0.001     0.002     0.442      0.04%

  - v5.9

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   sendto            699187      0  3316.948     0.002     0.005     0.044      0.02%
   recvfrom         1397042 698828  2464.995     0.001     0.002     0.025      0.04%

  - v5.12-rc6

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   sendto            699445      0  3015.642     0.002     0.004     0.027      0.02%
   recvfrom         1395929 697909  2338.783     0.001     0.002     0.024      0.03%

I bisected the kernel patches, and found that the patch series which adds
object-level memory cgroup support causes the degradation.

I confirmed the slowdown with a kernel module which just runs
kmem_cache_alloc()/kmem_cache_free() as follows. The duration is about
2-3 times longer than on v5.8.

   dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
   for (i = 0; i < 100000000; i++)
   {
           p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
           kmem_cache_free(dummy_cache, p);
   }

It seems that the per-object accounting work in slab_pre_alloc_hook() and
slab_post_alloc_hook() is the overhead.

The cgroup.memory=nokmem kernel parameter doesn't work for my case because
it disables all of the kmem accounting.

The degradation is gone when I apply a patch (at the bottom of this email)
that adds a kernel parameter to fall back to page-level accounting;
however, I'm not sure it's a good approach...

I use the kernel config which is based on Fedora33 kernel.
The related configs are as follows:

  CONFIG_CGROUPS=y
  CONFIG_MEMCG=y
  CONFIG_MEMCG_KMEM=y
  CONFIG_SLUB=y

I would appreciate it if you could give me advice and ideas on how to
reduce the overhead.

[1]: https://www.postgresql.org/docs/10/pgbench.html
[2]: https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/

---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 10 ++++++++++
 mm/slab.h                  |  9 +++++++++
 3 files changed, 21 insertions(+)

Comments

Roman Gushchin April 8, 2021, 8:53 p.m. UTC | #1
On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> Hello,
> 
> I detected a performance degradation issue for a benchmark of PostgresSQL [1],
> and the issue seems to be related to object level memory cgroup [2].
> I would appreciate it if you could give me some ideas to solve it.
> 
> The benchmark shows the transaction per second (tps) and the tps for v5.9
> and later kernel get about 10%-20% smaller than v5.8.
> 
> The benchmark does sendto() and recvfrom() system calls repeatedly,
> and the duration of the system calls get longer than v5.8.
> The result of perf trace of the benchmark is as follows:
> 
> [...]
> 
> I bisected the kernel patches, then I found the patch series, which add
> object level memory cgroup support, causes the degradation.
> 
> I confirmed the delay with a kernel module which just runs
> kmem_cache_alloc/kmem_cache_free as follows. The duration is about
> 2-3 times than v5.8.
> 
>    dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
>    for (i = 0; i < 100000000; i++)
>    {
>            p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
>            kmem_cache_free(dummy_cache, p);
>    }
> 
> It seems that the object accounting work in slab_pre_alloc_hook() and
> slab_post_alloc_hook() is the overhead.
> 
> cgroup.nokmem kernel parameter doesn't work for my case because it disables
> all of kmem accounting.
> 
> The degradation is gone when I apply a patch (at the bottom of this email)
> that adds a kernel parameter that expects to fallback to the page level
> accounting, however, I'm not sure it's a good approach though...

Hello Masayoshi!

Thank you for the report!

It's not a secret that per-object accounting is more expensive than per-page
accounting. I had micro-benchmark results similar to yours: accounted
allocations are about 2x slower. But in general it tends not to affect real
workloads, because the cost of allocations is still low and tends to be only
a small fraction of the whole cpu load. And because it brings significant
benefits: 40%+ slab memory savings, less fragmentation, a more stable
workingset, etc., real workloads tend to perform on par or better.

So my first question is: do you see the regression in any real workload,
or only in the benchmark?

Second, I'll try to take a look at the benchmark to figure out why it's
affected so badly, but I'm not sure we can easily fix it. If you have any
ideas about what kind of objects the benchmark allocates in large numbers,
please let me know.

Thanks!
Shakeel Butt April 8, 2021, 9:08 p.m. UTC | #2
On Thu, Apr 8, 2021 at 1:54 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > Hello,
> >
> > I detected a performance degradation issue for a benchmark of PostgresSQL [1],
> > and the issue seems to be related to object level memory cgroup [2].
> > I would appreciate it if you could give me some ideas to solve it.
> >
> > The benchmark shows the transaction per second (tps) and the tps for v5.9
> > and later kernel get about 10%-20% smaller than v5.8.
> >
> > The benchmark does sendto() and recvfrom() system calls repeatedly,
> > and the duration of the system calls get longer than v5.8.
> > The result of perf trace of the benchmark is as follows:
> >
> >   - v5.8
> >
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699574      0  2595.220     0.001     0.004     0.462      0.03%
> >    recvfrom         1391089 694427  2163.458     0.001     0.002     0.442      0.04%
> >
> >   - v5.9
> >
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699187      0  3316.948     0.002     0.005     0.044      0.02%
> >    recvfrom         1397042 698828  2464.995     0.001     0.002     0.025      0.04%
> >
> >   - v5.12-rc6
> >
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699445      0  3015.642     0.002     0.004     0.027      0.02%
> >    recvfrom         1395929 697909  2338.783     0.001     0.002     0.024      0.03%
> >

Can you please explain how to read these numbers? Or at least put a %
regression.

> > I bisected the kernel patches, then I found the patch series, which add
> > object level memory cgroup support, causes the degradation.
> >
> > I confirmed the delay with a kernel module which just runs
> > kmem_cache_alloc/kmem_cache_free as follows. The duration is about
> > 2-3 times than v5.8.
> >
> >    dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
> >    for (i = 0; i < 100000000; i++)
> >    {
> >            p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
> >            kmem_cache_free(dummy_cache, p);
> >    }
> >
> > It seems that the object accounting work in slab_pre_alloc_hook() and
> > slab_post_alloc_hook() is the overhead.
> >
> > cgroup.nokmem kernel parameter doesn't work for my case because it disables
> > all of kmem accounting.

The patch is somewhat doing that, i.e. disabling memcg accounting for slab.

> >
> > The degradation is gone when I apply a patch (at the bottom of this email)
> > that adds a kernel parameter that expects to fallback to the page level
> > accounting, however, I'm not sure it's a good approach though...
>
> Hello Masayoshi!
>
> Thank you for the report!
>
> It's not a secret that per-object accounting is more expensive than a per-page
> allocation. I had micro-benchmark results similar to yours: accounted
> allocations are about 2x slower. But in general it tends to not affect real
> workloads, because the cost of allocations is still low and tends to be only
> a small fraction of the whole cpu load. And because it brings up significant
> benefits: 40%+ slab memory savings, less fragmentation, more stable workingset,
> etc, real workloads tend to perform on par or better.
>
> So my first question is if you see the regression in any real workload
> or it's only about the benchmark?
>
> Second, I'll try to take a look into the benchmark to figure out why it's
> affected so badly, but I'm not sure we can easily fix it. If you have any
> ideas what kind of objects the benchmark is allocating in big numbers,
> please let me know.
>

One idea would be to increase MEMCG_CHARGE_BATCH.
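For reference, the batch size is a compile-time constant, so trying a larger value means rebuilding the kernel. Against a v5.9-era tree the change would look roughly like the following sketch (the definition lives in include/linux/memcontrol.h; the exact line may differ between versions):

```diff
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U
```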
Masayoshi Mizuma April 9, 2021, 4:05 p.m. UTC | #3
On Thu, Apr 08, 2021 at 01:53:47PM -0700, Roman Gushchin wrote:
> On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > [...]
> 
> Hello Masayoshi!
> 
> Thank you for the report!

Hi!

> 
> It's not a secret that per-object accounting is more expensive than a per-page
> allocation. I had micro-benchmark results similar to yours: accounted
> allocations are about 2x slower. But in general it tends to not affect real
> workloads, because the cost of allocations is still low and tends to be only
> a small fraction of the whole cpu load. And because it brings up significant
> benefits: 40%+ slab memory savings, less fragmentation, more stable workingset,
> etc, real workloads tend to perform on par or better.
> 
> So my first question is if you see the regression in any real workload
> or it's only about the benchmark?

It's only about the benchmark so far. I'll let you know if I get the issue with
real workload.

> 
> Second, I'll try to take a look into the benchmark to figure out why it's
> affected so badly, but I'm not sure we can easily fix it. If you have any
> ideas what kind of objects the benchmark is allocating in big numbers,
> please let me know.

The benchmark does sendto() and recvfrom() on a unix domain socket
repeatedly, and kmem_cache_alloc_node()/kmem_cache_free() is called
to allocate/free the socket buffers.
The call graph for allocating the object is as follows.

  do_syscall_64
    __x64_sys_sendto
      __sys_sendto
        sock_sendmsg
          unix_stream_sendmsg
            sock_alloc_send_pskb
              alloc_skb_with_frags
                __alloc_skb
                  kmem_cache_alloc_node

kmem_cache_alloc_node()/kmem_cache_free() is called about 1,400,000 times
during the benchmark. The object size is 216 bytes, and the GFP flags are 0x400cc0:
 ___GFP_ACCOUNT | ___GFP_KSWAPD_RECLAIM | ___GFP_DIRECT_RECLAIM | ___GFP_FS | ___GFP_IO

I got the data with the following bpftrace script.

  # cat kmem.bt 
  #!/usr/bin/env bpftrace

  tracepoint:kmem:kmem_cache_alloc_node /comm == "pgbench"/
  {
	@alloc[comm, args->bytes_req, args->bytes_alloc, args->gfp_flags] = count();
  }

  tracepoint:kmem:kmem_cache_free /comm == "pgbench"/
  {
	@free[comm] = count();
  }
  # ./kmem.bt 
  Attaching 2 probes...
  ^C

  @alloc[pgbench, 11784, 11840, 3264]: 1
  @alloc[pgbench, 216, 256, 3264]: 23
  @alloc[pgbench, 216, 256, 4197568]: 1400046

  @free[pgbench]: 1400560

  # 

I hope this helps...

Thanks!
Masa
Masayoshi Mizuma April 9, 2021, 4:35 p.m. UTC | #4
On Thu, Apr 08, 2021 at 02:08:13PM -0700, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 1:54 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > > [...]
> 
> Can you please explain how to read these numbers? Or at least put a %
> regression.

Let me summarize them here.
The total duration ('total' column above) of each system call is as follows
if v5.8 is assumed as 100%:

- sendto:
  - v5.8         100%
  - v5.9         128%
  - v5.12-rc6    116%

- recvfrom:
  - v5.8         100%
  - v5.9         114%
  - v5.12-rc6    108%

> 
[...]
> 
> One idea would be to increase MEMCG_CHARGE_BATCH.

Thank you for the idea! It's hard-coded as 32 now, so I'm wondering if it might be
a good idea to make MEMCG_CHARGE_BATCH tunable via a kernel parameter or something.

Thanks!
Masa
Shakeel Butt April 9, 2021, 4:50 p.m. UTC | #5
On Fri, Apr 9, 2021 at 9:35 AM Masayoshi Mizuma <msys.mizuma@gmail.com> wrote:
>
[...]
> > Can you please explain how to read these numbers? Or at least put a %
> > regression.
>
> Let me summarize them here.
> The total duration ('total' column above) of each system call is as follows
> if v5.8 is assumed as 100%:
>
> - sendto:
>   - v5.8         100%
>   - v5.9         128%
>   - v5.12-rc6    116%
>
> - recvfrom:
>   - v5.8         100%
>   - v5.9         114%
>   - v5.12-rc6    108%
>

Thanks, that is helpful. Most probably the improvement of 5.12 over
5.9 is due to 3de7d4f25a7438f ("mm: memcg/slab: optimize objcg stock
draining").

[...]
> >
> > One idea would be to increase MEMCG_CHARGE_BATCH.
>
> Thank you for the idea! It's hard-coded as 32 now, so I'm wondering it may be
> a good idea to make MEMCG_CHARGE_BATCH tunable from a kernel parameter or something.
>

Can you rerun the benchmark with MEMCG_CHARGE_BATCH equal 64UL?

I think with memcg stats moving to rstat, the stat accuracy is not an
issue if we increase MEMCG_CHARGE_BATCH to 64UL. Not sure if we want
this to be tunable, but most probably we do want this to be sync'ed
with SWAP_CLUSTER_MAX.
Masayoshi Mizuma April 12, 2021, 3:22 p.m. UTC | #6
On Fri, Apr 09, 2021 at 09:50:45AM -0700, Shakeel Butt wrote:
> On Fri, Apr 9, 2021 at 9:35 AM Masayoshi Mizuma <msys.mizuma@gmail.com> wrote:
> >
> [...]
> > > Can you please explain how to read these numbers? Or at least put a %
> > > regression.
> >
> > Let me summarize them here.
> > The total duration ('total' column above) of each system call is as follows
> > if v5.8 is assumed as 100%:
> >
> > - sendto:
> >   - v5.8         100%
> >   - v5.9         128%
> >   - v5.12-rc6    116%
> >
> > - recvfrom:
> >   - v5.8         100%
> >   - v5.9         114%
> >   - v5.12-rc6    108%
> >
> 
> Thanks, that is helpful. Most probably the improvement of 5.12 from
> 5.9 is due to 3de7d4f25a7438f ("mm: memcg/slab: optimize objcg stock
> draining").
> 
> [...]
> > >
> > > One idea would be to increase MEMCG_CHARGE_BATCH.
> >
> > Thank you for the idea! It's hard-coded as 32 now, so I'm wondering it may be
> > a good idea to make MEMCG_CHARGE_BATCH tunable from a kernel parameter or something.
> >
> 
Hi!

Thank you for your comments!

> Can you rerun the benchmark with MEMCG_CHARGE_BATCH equal 64UL?

Yes, I reran the benchmark with MEMCG_CHARGE_BATCH == 64UL, but it seems that
it doesn't reduce the duration of system calls...

- v5.12-rc6 vanilla

   syscall      total  
               (msec) 
   --------- --------
   sendto    3049.221
   recvfrom  2421.601

- v5.12-rc6 with MEMCG_CHARGE_BATCH==64

   syscall      total  
               (msec) 
   --------- --------
   sendto    3071.607
   recvfrom  2436.488

> I think with memcg stats moving to rstat, the stat accuracy is not an
> issue if we increase MEMCG_CHARGE_BATCH to 64UL. Not sure if we want
> this to be tuneable but most probably we do want this to be sync'ed
> with SWAP_CLUSTER_MAX.

Thanks. I understand that. 
Waiman posted some patches to reduce the overhead [1]. I'll try the patch.

[1]: https://lore.kernel.org/linux-mm/51ea6b09-b7ee-36e9-a500-b7141bd3a42b@redhat.com/T/#me75806a3555e7a42e793f099b98c42e687962d10

Thanks!
Masa

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..d447bfc8cf5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1597,6 +1597,8 @@  extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
 void memcg_put_cache_ids(void);
 
+extern bool memcg_obj_account_disabled;
+
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
  * check if the cache is valid (it is either valid or NULL).
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..cc07d89bc449 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,6 +400,16 @@  void memcg_put_cache_ids(void)
  */
 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
+
+bool memcg_obj_account_disabled __read_mostly;
+
+static int __init memcg_obj_account_disable(char *s)
+{
+	memcg_obj_account_disabled = true;
+	pr_debug("object memory cgroup account disabled\n");
+	return 1;
+}
+__setup("cgroup.nomemobj", memcg_obj_account_disable);
 #endif
 
 static int memcg_shrinker_map_size;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..7f7f7867f636 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -270,6 +270,9 @@  static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
 		return true;
 
+	if (memcg_obj_account_disabled)
+		return true;
+
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
 		return true;
@@ -309,6 +312,9 @@  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 	if (!memcg_kmem_enabled() || !objcg)
 		return;
 
+	if (memcg_obj_account_disabled)
+		return;
+
 	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
@@ -346,6 +352,9 @@  static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 	if (!memcg_kmem_enabled())
 		return;
 
+	if (memcg_obj_account_disabled)
+		return;
+
 	for (i = 0; i < objects; i++) {
 		if (unlikely(!p[i]))
 			continue;