
[1/7] mm: memcontrol: fix cpuhotplug statistics flushing

Message ID 20210202184746.119084-2-hannes@cmpxchg.org
State New, archived
Series: mm: memcontrol: switch to rstat

Commit Message

Johannes Weiner Feb. 2, 2021, 6:47 p.m. UTC
The memcg hotunplug callback erroneously flushes counts on the local
CPU, not the counts of the CPU going away; those counts will be lost.

Flush the CPU that is actually going away.

Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)
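For illustration, a minimal sketch of the CPU-targeting mistake using
generic per-cpu primitives - the names below are made up for the
example and are not the real memcg structures:

#include <linux/percpu.h>
#include <linux/atomic.h>

static DEFINE_PER_CPU(long, example_count);
static atomic_long_t example_total;

static int example_cpu_dead(unsigned int cpu)
{
	long x;

	/* Wrong: this_cpu_xchg() acts on the CPU running the callback,
	 * i.e. a surviving CPU, so the dying CPU's delta is lost. */
	x = this_cpu_xchg(example_count, 0);

	/* Right: address the dying CPU's counter explicitly. */
	x = per_cpu(example_count, cpu);
	per_cpu(example_count, cpu) = 0;

	atomic_long_add(x, &example_total);
	return 0;
}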

Comments

Shakeel Butt Feb. 2, 2021, 10:23 p.m. UTC | #1
On Tue, Feb 2, 2021 at 12:18 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I think we need Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive
complexity in memory.stat reporting")

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Roman Gushchin Feb. 2, 2021, 11:07 p.m. UTC | #2
On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
> 
> Flush the CPU that is actually going away.
> 
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.

To the whole series: it's really nice to have accurate stats at
non-leaf levels. Just as an illustration: if there are 32 CPUs and
1000 sub-cgroups (which is an absolutely realistic number, because
often there are many dying generations of each cgroup), the error
margin is 3.9GB. It makes all numbers pretty much random and all
possible tests extremely flaky.
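
For reference, a back-of-envelope check of that figure, assuming the
usual per-cpu batch of MEMCG_CHARGE_BATCH (32) pages of 4KiB each,
i.e. up to 128KiB of unflushed delta per CPU per cgroup:

#include <stdio.h>

int main(void)
{
	/* 32 CPUs * 1000 cgroups * 32 pages of batch * 4096 bytes */
	long long drift = 32LL * 1000 * 32 * 4096;

	printf("%.1f GiB\n", drift / (1024.0 * 1024 * 1024));	/* 3.9 GiB */
	return 0;
}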

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

To this patch:

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!
Roman Gushchin Feb. 3, 2021, 2:28 a.m. UTC | #3
On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > The memcg hotunplug callback erroneously flushes counts on the local
> > CPU, not the counts of the CPU going away; those counts will be lost.
> > 
> > Flush the CPU that is actually going away.
> > 
> > Also simplify the code a bit by using mod_memcg_state() and
> > count_memcg_events() instead of open-coding the upward flush - this is
> > comparable to how vmstat.c handles hotunplug flushing.
> 
> To the whole series: it's really nice to have accurate stats at
> non-leaf levels. Just as an illustration: if there are 32 CPUs and
> 1000 sub-cgroups (which is an absolutely realistic number, because
> often there are many dying generations of each cgroup), the error
> margin is 3.9GB. It makes all numbers pretty much random and all
> possible tests extremely flaky.

Btw, I was just looking into kmem kselftest failures/flakiness,
which are caused by exactly this problem: without waiting for dying
cgroups to finish reclaim, we can't make any reliable assumptions
about what to expect from memcg stats.

So looking forward to having this patchset merged!
Michal Hocko Feb. 4, 2021, 1:28 p.m. UTC | #4
On Tue 02-02-21 13:47:40, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
> 
> Flush the CPU that is actually going away.
> 
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Shakeel has already pointed out the missing Fixes tag.

> ---
>  mm/memcontrol.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..8120d565dd79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
>  	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *memcg, *mi;
> +	struct mem_cgroup *memcg;
>  
>  	stock = &per_cpu(memcg_stock, cpu);
>  	drain_stock(stock);
>  
>  	for_each_mem_cgroup(memcg) {
> +		struct memcg_vmstats_percpu *statc;
>  		int i;
>  
> +		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> +
>  		for (i = 0; i < MEMCG_NR_STAT; i++) {
>  			int nid;
> -			long x;
>  
> -			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmstats[i]);
> +			if (statc->stat[i]) {
> +				mod_memcg_state(memcg, i, statc->stat[i]);
> +				statc->stat[i] = 0;
> +			}
>  
>  			if (i >= NR_VM_NODE_STAT_ITEMS)
>  				continue;
>  
>  			for_each_node(nid) {
> +				struct batched_lruvec_stat *lstatc;
>  				struct mem_cgroup_per_node *pn;
> +				long x;
>  
>  				pn = mem_cgroup_nodeinfo(memcg, nid);
> -				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> -				if (x)
> +				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
> +
> +				x = lstatc->count[i];
> +				lstatc->count[i] = 0;
> +
> +				if (x) {
>  					do {
>  						atomic_long_add(x, &pn->lruvec_stat[i]);
>  					} while ((pn = parent_nodeinfo(pn, nid)));
> +				}
>  			}
>  		}
>  
>  		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -			long x;
> -
> -			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmevents[i]);
> +			if (statc->events[i]) {
> +				count_memcg_events(memcg, i, statc->events[i]);
> +				statc->events[i] = 0;
> +			}
>  		}
>  	}
>  
> -- 
> 2.30.0
>
Johannes Weiner Feb. 4, 2021, 7:29 p.m. UTC | #5
On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > The memcg hotunplug callback erroneously flushes counts on the local
> > > CPU, not the counts of the CPU going away; those counts will be lost.
> > > 
> > > Flush the CPU that is actually going away.
> > > 
> > > Also simplify the code a bit by using mod_memcg_state() and
> > > count_memcg_events() instead of open-coding the upward flush - this is
> > > comparable to how vmstat.c handles hotunplug flushing.
> > 
> > To the whole series: it's really nice to have accurate stats at
> > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > 1000 sub-cgroups (which is an absolutely realistic number, because
> > often there are many dying generations of each cgroup), the error
> > margin is 3.9GB. It makes all numbers pretty much random and all
> > possible tests extremely flaky.
> 
> Btw, I was just looking into kmem kselftest failures/flakiness,
> which are caused by exactly this problem: without waiting for dying
> cgroups to finish reclaim, we can't make any reliable assumptions
> about what to expect from memcg stats.

Good point about the selftests. I gave them a shot, and indeed this
series makes test_kmem work again:

vanilla:
ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

patched:
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

It even passes with a reduced margin in the patched kernel, since the
percpu drift - which this test already tried to account for - is now
only on the page_counter side (whereas memory.stat is always precise).

I'm going to include that data in the v2 changelog, as well as a patch
to update test_kmem.c to the more stringent error tolerances.
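
As a sketch of what a tightened check might look like - values_close()
below mirrors the shape of the helper the cgroup selftests use, and
the margins are illustrative rather than the committed values:

#include <stdio.h>
#include <stdlib.h>

/* Same shape as the cgroup selftests' values_close() helper. */
static int values_close(long a, long b, int err_pct)
{
	return labs(a - b) <= (a + b) / 100 * err_pct;
}

int main(void)
{
	long current_bytes = 8810496;	/* memory.current, vanilla run above */
	long stat_sum = 17074568;	/* slab + anon + file + kernel_stack */

	/* The vanilla numbers above fail even a generous 10% margin: */
	printf("%d\n", values_close(current_bytes, stat_sum, 10));	/* 0 */

	/* On a patched kernel the sums agree, so the margin can be
	 * tightened to cover only the remaining page_counter drift. */
	return 0;
}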

> So looking forward to having this patchset merged!

Thanks
Roman Gushchin Feb. 4, 2021, 7:34 p.m. UTC | #6
On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > > The memcg hotunplug callback erroneously flushes counts on the local
> > > > CPU, not the counts of the CPU going away; those counts will be lost.
> > > > 
> > > > Flush the CPU that is actually going away.
> > > > 
> > > > Also simplify the code a bit by using mod_memcg_state() and
> > > > count_memcg_events() instead of open-coding the upward flush - this is
> > > > comparable to how vmstat.c handles hotunplug flushing.
> > > 
> > > To the whole series: it's really nice to have accurate stats at
> > > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > > 1000 sub-cgroups (which is an absolutely realistic number, because
> > > often there are many dying generations of each cgroup), the error
> > > margin is 3.9GB. It makes all numbers pretty much random and all
> > > possible tests extremely flaky.
> > 
> > Btw, I was just looking into kmem kselftest failures/flakiness,
> > which are caused by exactly this problem: without waiting for dying
> > cgroups to finish reclaim, we can't make any reliable assumptions
> > about what to expect from memcg stats.
> 
> Good point about the selftests. I gave them a shot, and indeed this
> series makes test_kmem work again:
> 
> vanilla:
> ok 1 test_kmem_basic
> memory.current = 8810496
> slab + anon + file + kernel_stack = 17074568
> slab = 6101384
> anon = 946176
> file = 0
> kernel_stack = 10027008
> not ok 2 test_kmem_memcg_deletion
> ok 3 test_kmem_proc_kpagecgroup
> ok 4 test_kmem_kernel_stacks
> ok 5 test_kmem_dead_cgroups
> ok 6 test_percpu_basic
> 
> patched:
> ok 1 test_kmem_basic
> ok 2 test_kmem_memcg_deletion
> ok 3 test_kmem_proc_kpagecgroup
> ok 4 test_kmem_kernel_stacks
> ok 5 test_kmem_dead_cgroups
> ok 6 test_percpu_basic

Nice! Thanks for checking.

> 
> It even passes with a reduced margin in the patched kernel, since the
> percpu drift - which this test already tried to account for - is now
> only on the page_counter side (whereas memory.stat is always precise).
> 
> I'm going to include that data in the v2 changelog, as well as a patch
> to update test_kmem.c to the more stringent error tolerances.

Hm, I'm not sure it's a good idea to unconditionally lower the error tolerance:
it's convenient to be able to run the same test on older kernels.
Johannes Weiner Feb. 5, 2021, 5:50 p.m. UTC | #7
On Thu, Feb 04, 2021 at 11:34:46AM -0800, Roman Gushchin wrote:
> On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> > It even passes with a reduced margin in the patched kernel, since the
> > percpu drift - which this test already tried to account for - is now
> > only on the page_counter side (whereas memory.stat is always precise).
> > 
> > I'm going to include that data in the v2 changelog, as well as a patch
> > to update test_kmem.c to the more stringent error tolerances.
> 
> Hm, I'm not sure it's a good idea to unconditionally lower the error tolerance:
> it's convenient to be able to run the same test on older kernels.

Well, an older version of the kernel will have an older version of the
test that is tailored towards that kernel's specific behavior. That's
sort of the point of tracking code and tests in the same git tree: to
have meaningful, effective and precise tests of an ever-changing
implementation. Trying to be backward compatible will lower the test
signal and miss regressions, when a backward compatible version is at
most one git checkout away.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..8120d565dd79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg, *mi;
+	struct mem_cgroup *memcg;
 
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
+		struct memcg_vmstats_percpu *statc;
 		int i;
 
+		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
 		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			int nid;
-			long x;
 
-			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmstats[i]);
+			if (statc->stat[i]) {
+				mod_memcg_state(memcg, i, statc->stat[i]);
+				statc->stat[i] = 0;
+			}
 
 			if (i >= NR_VM_NODE_STAT_ITEMS)
 				continue;
 
 			for_each_node(nid) {
+				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
+				long x;
 
 				pn = mem_cgroup_nodeinfo(memcg, nid);
-				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
-				if (x)
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+
+				x = lstatc->count[i];
+				lstatc->count[i] = 0;
+
+				if (x) {
 					do {
 						atomic_long_add(x, &pn->lruvec_stat[i]);
 					} while ((pn = parent_nodeinfo(pn, nid)));
+				}
 			}
 		}
 
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			long x;
-
-			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmevents[i]);
+			if (statc->events[i]) {
+				count_memcg_events(memcg, i, statc->events[i]);
+				statc->events[i] = 0;
+			}
 		}
 	}