| Message ID | 20210408193948.vfktg3azh2wrt56t@gabell (mailing list archive) |
|---|---|
| State | New, archived |
| Series | memcg: performance degradation since v5.9 |
On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> Hello,
>
> I detected a performance degradation issue for a benchmark of PostgreSQL [1],
> and the issue seems to be related to the object-level memory cgroup [2].
> I would appreciate it if you could give me some ideas to solve it.
>
> The benchmark measures transactions per second (tps), and the tps for v5.9
> and later kernels is about 10%-20% lower than for v5.8.
>
> The benchmark does sendto() and recvfrom() system calls repeatedly,
> and the duration of the system calls gets longer than on v5.8.
> The result of perf trace of the benchmark is as follows:
>
> - v5.8
>
>   syscall      calls  errors    total     min     avg     max  stddev
>                                (msec)  (msec)  (msec)  (msec)     (%)
>   --------- -------- ------ -------- ------- ------- ------- -------
>   sendto      699574      0 2595.220   0.001   0.004   0.462   0.03%
>   recvfrom   1391089 694427 2163.458   0.001   0.002   0.442   0.04%
>
> - v5.9
>
>   syscall      calls  errors    total     min     avg     max  stddev
>                                (msec)  (msec)  (msec)  (msec)     (%)
>   --------- -------- ------ -------- ------- ------- ------- -------
>   sendto      699187      0 3316.948   0.002   0.005   0.044   0.02%
>   recvfrom   1397042 698828 2464.995   0.001   0.002   0.025   0.04%
>
> - v5.12-rc6
>
>   syscall      calls  errors    total     min     avg     max  stddev
>                                (msec)  (msec)  (msec)  (msec)     (%)
>   --------- -------- ------ -------- ------- ------- ------- -------
>   sendto      699445      0 3015.642   0.002   0.004   0.027   0.02%
>   recvfrom   1395929 697909 2338.783   0.001   0.002   0.024   0.03%
>
> I bisected the kernel patches and found that the patch series which adds
> object-level memory cgroup support [2] causes the degradation.
>
> I confirmed the delay with a kernel module which just runs
> kmem_cache_alloc()/kmem_cache_free() as follows. The duration is about
> 2-3 times that of v5.8:
>
>     dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
>     for (i = 0; i < 100000000; i++) {
>         p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
>         kmem_cache_free(dummy_cache, p);
>     }
>
> It seems that the object accounting work in slab_pre_alloc_hook() and
> slab_post_alloc_hook() is the overhead.
>
> The cgroup.nokmem kernel parameter doesn't work for my case because it
> disables all kmem accounting.
>
> The degradation is gone when I apply a patch (at the bottom of this email)
> that adds a kernel parameter which falls back to the page-level accounting;
> however, I'm not sure it's a good approach...

Hello Masayoshi!

Thank you for the report!

It's not a secret that per-object accounting is more expensive than per-page
accounting. I had micro-benchmark results similar to yours: accounted
allocations are about 2x slower. But in general it tends not to affect real
workloads, because the cost of allocations is still low and tends to be only
a small fraction of the whole cpu load. And because it brings significant
benefits (40%+ slab memory savings, less fragmentation, a more stable
workingset, etc.), real workloads tend to perform on par or better.

So my first question is whether you see the regression in any real workload,
or whether it's only about the benchmark.

Second, I'll try to take a look into the benchmark to figure out why it's
affected so badly, but I'm not sure we can easily fix it. If you have any
ideas about what kind of objects the benchmark is allocating in big numbers,
please let me know.

Thanks!
On Thu, Apr 8, 2021 at 1:54 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > Hello,
> >
> > I detected a performance degradation issue for a benchmark of PostgreSQL [1],
> > and the issue seems to be related to the object-level memory cgroup [2].
> > I would appreciate it if you could give me some ideas to solve it.
> >
> > The benchmark measures transactions per second (tps), and the tps for v5.9
> > and later kernels is about 10%-20% lower than for v5.8.
> >
> > The benchmark does sendto() and recvfrom() system calls repeatedly,
> > and the duration of the system calls gets longer than on v5.8.
> > The result of perf trace of the benchmark is as follows:
> >
> > [perf trace tables for v5.8, v5.9 and v5.12-rc6 snipped]

Can you please explain how to read these numbers? Or at least put a %
regression.

> > I bisected the kernel patches and found that the patch series which adds
> > object-level memory cgroup support [2] causes the degradation.
> >
> > I confirmed the delay with a kernel module which just runs
> > kmem_cache_alloc()/kmem_cache_free() as follows. The duration is about
> > 2-3 times that of v5.8:
> >
> >     dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
> >     for (i = 0; i < 100000000; i++) {
> >         p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
> >         kmem_cache_free(dummy_cache, p);
> >     }
> >
> > It seems that the object accounting work in slab_pre_alloc_hook() and
> > slab_post_alloc_hook() is the overhead.
> >
> > The cgroup.nokmem kernel parameter doesn't work for my case because it
> > disables all kmem accounting.

The patch is somewhat doing that, i.e. disabling memcg accounting for slab.

> > The degradation is gone when I apply a patch (at the bottom of this email)
> > that adds a kernel parameter which falls back to the page-level accounting;
> > however, I'm not sure it's a good approach...
>
> Hello Masayoshi!
>
> Thank you for the report!
>
[...]
>
> Second, I'll try to take a look into the benchmark to figure out why it's
> affected so badly, but I'm not sure we can easily fix it. If you have any
> ideas about what kind of objects the benchmark is allocating in big numbers,
> please let me know.

One idea would be to increase MEMCG_CHARGE_BATCH.
On Thu, Apr 08, 2021 at 01:53:47PM -0700, Roman Gushchin wrote:
> On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > Hello,
> >
[...]
>
> Hello Masayoshi!
>
> Thank you for the report!

Hi!

> It's not a secret that per-object accounting is more expensive than per-page
> accounting. I had micro-benchmark results similar to yours: accounted
> allocations are about 2x slower. But in general it tends not to affect real
> workloads, because the cost of allocations is still low and tends to be only
> a small fraction of the whole cpu load. And because it brings significant
> benefits (40%+ slab memory savings, less fragmentation, a more stable
> workingset, etc.), real workloads tend to perform on par or better.
>
> So my first question is whether you see the regression in any real workload,
> or whether it's only about the benchmark.

It's only about the benchmark so far. I'll let you know if I hit the issue
with a real workload.

> Second, I'll try to take a look into the benchmark to figure out why it's
> affected so badly, but I'm not sure we can easily fix it. If you have any
> ideas about what kind of objects the benchmark is allocating in big numbers,
> please let me know.

The benchmark does sendto() and recvfrom() on a unix domain socket
repeatedly, and kmem_cache_alloc_node()/kmem_cache_free() is called to
allocate/free the socket buffers. The call graph to allocate the object is
as follows:

    do_syscall_64
      __x64_sys_sendto
        __sys_sendto
          sock_sendmsg
            unix_stream_sendmsg
              sock_alloc_send_pskb
                alloc_skb_with_frags
                  __alloc_skb
                    kmem_cache_alloc_node

kmem_cache_alloc_node()/kmem_cache_free() is called about 1,400,000 times
during the benchmark. The object size is 216 bytes, and the GFP flags are
0x400cc0:

    ___GFP_ACCOUNT | ___GFP_KSWAPD_RECLAIM | ___GFP_DIRECT_RECLAIM |
    ___GFP_FS | ___GFP_IO

I got the data with the following bpftrace script:

    # cat kmem.bt
    #!/usr/bin/env bpftrace

    tracepoint:kmem:kmem_cache_alloc_node
    /comm == "pgbench"/
    {
        @alloc[comm, args->bytes_req, args->bytes_alloc, args->gfp_flags] = count();
    }

    tracepoint:kmem:kmem_cache_free
    /comm == "pgbench"/
    {
        @free[comm] = count();
    }

    # ./kmem.bt
    Attaching 2 probes...
    ^C

    @alloc[pgbench, 11784, 11840, 3264]: 1
    @alloc[pgbench, 216, 256, 3264]: 23
    @alloc[pgbench, 216, 256, 4197568]: 1400046
    @free[pgbench]: 1400560
    #

I hope this helps...

Thanks!
Masa
On Thu, Apr 08, 2021 at 02:08:13PM -0700, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 1:54 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
[...]
>
> Can you please explain how to read these numbers? Or at least put a %
> regression.

Let me summarize them here.
The total duration (the 'total' column above) of each system call is as
follows, with v5.8 taken as 100%:

- sendto:
  - v5.8       100%
  - v5.9       128%
  - v5.12-rc6  116%

- recvfrom:
  - v5.8       100%
  - v5.9       114%
  - v5.12-rc6  108%

[...]
>
> One idea would be to increase MEMCG_CHARGE_BATCH.

Thank you for the idea! It's hard-coded as 32 now, so I'm wondering whether
it may be a good idea to make MEMCG_CHARGE_BATCH tunable via a kernel
parameter or something.

Thanks!
Masa
On Fri, Apr 9, 2021 at 9:35 AM Masayoshi Mizuma <msys.mizuma@gmail.com> wrote:
>
[...]
> > Can you please explain how to read these numbers? Or at least put a %
> > regression.
>
> Let me summarize them here.
> The total duration (the 'total' column above) of each system call is as
> follows, with v5.8 taken as 100%:
>
> - sendto:
>   - v5.8       100%
>   - v5.9       128%
>   - v5.12-rc6  116%
>
> - recvfrom:
>   - v5.8       100%
>   - v5.9       114%
>   - v5.12-rc6  108%

Thanks, that is helpful. Most probably the improvement of 5.12 over 5.9 is
due to 3de7d4f25a7438f ("mm: memcg/slab: optimize objcg stock draining").

[...]
> > One idea would be to increase MEMCG_CHARGE_BATCH.
>
> Thank you for the idea! It's hard-coded as 32 now, so I'm wondering whether
> it may be a good idea to make MEMCG_CHARGE_BATCH tunable via a kernel
> parameter or something.

Can you rerun the benchmark with MEMCG_CHARGE_BATCH equal to 64UL? I think
with memcg stats moving to rstat, stat accuracy is not an issue if we
increase MEMCG_CHARGE_BATCH to 64UL. I'm not sure we want this to be
tunable, but most probably we do want it to be kept in sync with
SWAP_CLUSTER_MAX.
On Fri, Apr 09, 2021 at 09:50:45AM -0700, Shakeel Butt wrote:
> On Fri, Apr 9, 2021 at 9:35 AM Masayoshi Mizuma <msys.mizuma@gmail.com> wrote:
[...]

Hi! Thank you for your comments!

> Can you rerun the benchmark with MEMCG_CHARGE_BATCH equal to 64UL?

Yes, I reran the benchmark with MEMCG_CHARGE_BATCH == 64UL, but it seems
that it doesn't reduce the duration of the system calls...

- v5.12-rc6 vanilla

    syscall    total (msec)
    ---------  ------------
    sendto         3049.221
    recvfrom       2421.601

- v5.12-rc6 with MEMCG_CHARGE_BATCH == 64

    syscall    total (msec)
    ---------  ------------
    sendto         3071.607
    recvfrom       2436.488

> I think with memcg stats moving to rstat, stat accuracy is not an issue if
> we increase MEMCG_CHARGE_BATCH to 64UL. I'm not sure we want this to be
> tunable, but most probably we do want it to be kept in sync with
> SWAP_CLUSTER_MAX.

Thanks. I understand that.

Waiman posted some patches to reduce the overhead [1]. I'll try them.

[1]: https://lore.kernel.org/linux-mm/51ea6b09-b7ee-36e9-a500-b7141bd3a42b@redhat.com/T/#me75806a3555e7a42e793f099b98c42e687962d10

Thanks!
Masa
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..d447bfc8cf5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1597,6 +1597,8 @@ extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
 void memcg_put_cache_ids(void);
 
+extern bool memcg_obj_account_disabled;
+
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
  * check if the cache is valid (it is either valid or NULL).
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..cc07d89bc449 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,6 +400,16 @@ void memcg_put_cache_ids(void)
  */
 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
+
+bool memcg_obj_account_disabled __read_mostly;
+
+static int __init memcg_obj_account_disable(char *s)
+{
+	memcg_obj_account_disabled = true;
+	pr_debug("object memory cgroup account disabled\n");
+	return 1;
+}
+__setup("cgroup.nomemobj", memcg_obj_account_disable);
 #endif
 
 static int memcg_shrinker_map_size;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..7f7f7867f636 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -270,6 +270,9 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
 		return true;
 
+	if (memcg_obj_account_disabled)
+		return true;
+
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
 		return true;
@@ -309,6 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 	if (!memcg_kmem_enabled() || !objcg)
 		return;
 
+	if (memcg_obj_account_disabled)
+		return;
+
 	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
@@ -346,6 +352,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 	if (!memcg_kmem_enabled())
 		return;
 
+	if (memcg_obj_account_disabled)
+		return;
+
 	for (i = 0; i < objects; i++) {
 		if (unlikely(!p[i]))
 			continue;