
[6.6.y] mm: ratelimit stat flush from workingset shrinker

Message ID 171776806121.384105.7980809581420394573.stgit@firesoul (mailing list archive)
State New
Series [6.6.y] mm: ratelimit stat flush from workingset shrinker

Commit Message

Jesper Dangaard Brouer June 7, 2024, 1:48 p.m. UTC
From: Shakeel Butt <shakeelb@google.com>

commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.

One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer
upstream kernel and on further investigation, it seems like the cause is
the always synchronous rstat flush in the count_shadow_nodes() added by
the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical
stats").  On further inspection it seems like we don't really need
accurate stats in this function as it was already approximating the amount
of appropriate shadow entries to keep for maintaining the refault
information.  Since there is already 2 sec periodic rstat flush, we don't
need exact stats here.  Let's ratelimit the rstat flush in this code path.

Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

---
On production with kernel v6.6 we are observing issues with excessive
cgroup rstat flushing due to the extra call to mem_cgroup_flush_stats()
in count_shadow_nodes() introduced by commit f82e6bf9bb9b ("mm: memcg:
use rstat for non-hierarchical stats"), which is part of v6.6.
We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
from workingset shrinker"), as it carries a Fixes tag for that commit.

IMHO it is worth explaining the call path that makes count_shadow_nodes()
cause excessive cgroup rstat flushing. shrink_node() first calls
mem_cgroup_flush_stats() on its own and then invokes
shrink_node_memcgs(), which iterates over cgroups via mem_cgroup_iter(),
calling shrink_slab() for each of them. shrink_slab() calls
do_shrink_slab(), which via shrinker->count_objects() invokes
count_shadow_nodes(), and count_shadow_nodes() performs yet another
mem_cgroup_flush_stats() call, which seems unnecessary.
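
A heavily simplified C sketch of that path (not the actual mm/vmscan.c
or mm/workingset.c source; function bodies are abbreviated to only the
calls discussed above):

/* Sketch of the v6.6 reclaim path described above (abbreviated). */
static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
        mem_cgroup_flush_stats();       /* one flush per shrink_node() call */
        shrink_node_memcgs(pgdat, sc);
}

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(sc->target_mem_cgroup, NULL, NULL);
        do {
                /*
                 * shrink_slab() -> do_shrink_slab() ->
                 * shrinker->count_objects() == count_shadow_nodes(),
                 * which (before this patch) flushed rstat again for
                 * every memcg visited in this loop.
                 */
                shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
                            sc->priority);
        } while ((memcg = mem_cgroup_iter(sc->target_mem_cgroup, memcg, NULL)));
}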

The backport differs slightly because v6.6.32 does not contain commit
7d7ef0a4686a ("mm: memcg: restore subtree stats flushing") from v6.8.
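
For context, mem_cgroup_flush_stats_ratelimited() only performs a
synchronous flush when the 2-second periodic flusher has fallen behind.
A paraphrased sketch of the relevant v6.6 mm/memcontrol.c logic (not a
verbatim copy; details may differ slightly):

#define FLUSH_TIME (2UL * HZ)   /* period of the deferrable flush worker */
static u64 flush_next_time;     /* jiffies_64 deadline kept by the flusher */

void mem_cgroup_flush_stats_ratelimited(void)
{
        /* Flush synchronously only if the periodic flush is overdue. */
        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                mem_cgroup_flush_stats();
}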
---
 mm/workingset.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Shakeel Butt June 7, 2024, 2:32 p.m. UTC | #1
On Fri, Jun 07, 2024 at 03:48:06PM GMT, Jesper Dangaard Brouer wrote:
> From: Shakeel Butt <shakeelb@google.com>
> 
> commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
> 
> One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer
> upstream kernel and on further investigation, it seems like the cause is
> the always synchronous rstat flush in the count_shadow_nodes() added by
> the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical
> stats").  On further inspection it seems like we don't really need
> accurate stats in this function as it was already approximating the amount
> of appropriate shadow entries to keep for maintaining the refault
> information.  Since there is already 2 sec periodic rstat flush, we don't
> need exact stats here.  Let's ratelimit the rstat flush in this code path.
> 
> Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> ---
> On production with kernel v6.6 we are observing issues with excessive
> cgroup rstat flushing due to the extra call to mem_cgroup_flush_stats()
> in count_shadow_nodes() introduced by commit f82e6bf9bb9b ("mm: memcg:
> use rstat for non-hierarchical stats"), which is part of v6.6.
> We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
> from workingset shrinker"), as it carries a Fixes tag for that commit.
> 
> IMHO it is worth explaining the call path that makes count_shadow_nodes()
> cause excessive cgroup rstat flushing. shrink_node() first calls
> mem_cgroup_flush_stats() on its own and then invokes
> shrink_node_memcgs(), which iterates over cgroups via mem_cgroup_iter(),
> calling shrink_slab() for each of them. shrink_slab() calls
> do_shrink_slab(), which via shrinker->count_objects() invokes
> count_shadow_nodes(), and count_shadow_nodes() performs yet another
> mem_cgroup_flush_stats() call, which seems unnecessary.
> 

Actually, in Meta production we have also replaced
mem_cgroup_flush_stats() in shrink_node() with
mem_cgroup_flush_stats_ratelimited(), as it was causing excessive
flushing. We have not observed any issues after the change. I will
propose that patch upstream as well.
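
The follow-up change described above would presumably amount to a
one-line switch in mm/vmscan.c where shrink_node() triggers the flush,
along these lines (illustrative sketch only, not the actual posted
patch):

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ shrink_node()
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_ratelimited();
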
Jesper Dangaard Brouer June 7, 2024, 5:26 p.m. UTC | #2
On 07/06/2024 16.32, Shakeel Butt wrote:
> On Fri, Jun 07, 2024 at 03:48:06PM GMT, Jesper Dangaard Brouer wrote:
>> From: Shakeel Butt <shakeelb@google.com>
>>
>> commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
>>
>> One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer
>> upstream kernel and on further investigation, it seems like the cause is
>> the always synchronous rstat flush in the count_shadow_nodes() added by
>> the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical
>> stats").  On further inspection it seems like we don't really need
>> accurate stats in this function as it was already approximating the amount
>> of appropriate shadow entries to keep for maintaining the refault
>> information.  Since there is already 2 sec periodic rstat flush, we don't
>> need exact stats here.  Let's ratelimit the rstat flush in this code path.
>>
>> Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
>> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Yosry Ahmed <yosryahmed@google.com>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>> Cc: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> ---
>> On production with kernel v6.6 we are observing issues with excessive
>> cgroup rstat flushing due to the extra call to mem_cgroup_flush_stats()
>> in count_shadow_nodes() introduced by commit f82e6bf9bb9b ("mm: memcg:
>> use rstat for non-hierarchical stats"), which is part of v6.6.
>> We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
>> from workingset shrinker"), as it carries a Fixes tag for that commit.
>>
>> IMHO it is worth explaining the call path that makes count_shadow_nodes()
>> cause excessive cgroup rstat flushing. shrink_node() first calls
>> mem_cgroup_flush_stats() on its own and then invokes
>> shrink_node_memcgs(), which iterates over cgroups via mem_cgroup_iter(),
>> calling shrink_slab() for each of them. shrink_slab() calls
>> do_shrink_slab(), which via shrinker->count_objects() invokes
>> count_shadow_nodes(), and count_shadow_nodes() performs yet another
>> mem_cgroup_flush_stats() call, which seems unnecessary.
>>
> 
> Actually, in Meta production we have also replaced
> mem_cgroup_flush_stats() in shrink_node() with
> mem_cgroup_flush_stats_ratelimited(), as it was causing excessive
> flushing. We have not observed any issues after the change. I will
> propose that patch upstream as well.

(Please Cc me as I'm not subscribed to cgroups@vger.kernel.org)

Yes, we also see mem_cgroup_flush_stats() in shrink_node() cause issues.

So, I can confirm the issue. What we see is that it originates from
kswapd, which has a kthread per NUMA node, all running concurrently; we
measure cgroup rstat lock contention due to the call in shrink_node().

See the call stacks I captured with the bpftrace script[1]:

@stack_wait[695, kswapd0, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[696, kswapd1, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[697, kswapd2, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[698, kswapd3, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[699, kswapd4, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[700, kswapd5, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[701, kswapd6, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27
@stack_wait[702, kswapd7, 1]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27

--Jesper


[1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency.bt
Greg KH June 12, 2024, 1:38 p.m. UTC | #3
On Fri, Jun 07, 2024 at 03:48:06PM +0200, Jesper Dangaard Brouer wrote:
> From: Shakeel Butt <shakeelb@google.com>
> 
> commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
> 
> One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer
> upstream kernel and on further investigation, it seems like the cause is
> the always synchronous rstat flush in the count_shadow_nodes() added by
> the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical
> stats").  On further inspection it seems like we don't really need
> accurate stats in this function as it was already approximating the amount
> of appropriate shadow entries to keep for maintaining the refault
> information.  Since there is already 2 sec periodic rstat flush, we don't
> need exact stats here.  Let's ratelimit the rstat flush in this code path.
> 
> Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> ---
> On production with kernel v6.6 we are observing issues with excessive
> cgroup rstat flushing due to the extra call to mem_cgroup_flush_stats()
> in count_shadow_nodes() introduced by commit f82e6bf9bb9b ("mm: memcg:
> use rstat for non-hierarchical stats"), which is part of v6.6.
> We request a backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
> from workingset shrinker"), as it carries a Fixes tag for that commit.
> 
> IMHO it is worth explaining the call path that makes count_shadow_nodes()
> cause excessive cgroup rstat flushing. shrink_node() first calls
> mem_cgroup_flush_stats() on its own and then invokes
> shrink_node_memcgs(), which iterates over cgroups via mem_cgroup_iter(),
> calling shrink_slab() for each of them. shrink_slab() calls
> do_shrink_slab(), which via shrinker->count_objects() invokes
> count_shadow_nodes(), and count_shadow_nodes() performs yet another
> mem_cgroup_flush_stats() call, which seems unnecessary.
> 
> The backport differs slightly because v6.6.32 does not contain commit
> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing") from v6.8.

Now queued up, thanks.

greg k-h

Patch

diff --git a/mm/workingset.c b/mm/workingset.c
index 2559a1f2fc1c..9110957bec5b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,7 +664,7 @@  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats_ratelimited();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,