[3/3] mm: memcg: optimize stats flushing for latency and accuracy

Message ID 20230913073846.1528938-4-yosryahmed@google.com (mailing list archive)
State New
Series memcg: more sophisticated stats flushing

Commit Message

Yosry Ahmed Sept. 13, 2023, 7:38 a.m. UTC
Stats flushing for memcg currently follows these rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
  concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.

This approach is followed because all flushes are serialized by a
global rstat spinlock. On the memcg side, flushing is
invoked from userspace reads as well as in-kernel flushers (e.g.
reclaim, refault, etc). This approach aims to avoid serializing all
flushers on the global lock, which can cause a significant performance
hit under high concurrency.

This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
  be too expensive as it has to flush the entire hierarchy [1].
- Sometimes stats accuracy is compromised if there is an ongoing
  flush, and we skip and return before the subtree of interest is
  actually flushed. This is more visible when reading stats from
  userspace, but can also affect in-kernel flushers.

This patch aims to solve both problems by reworking how flushing
currently works as follows:
- Without contention, there is no need to flush the entire tree. In this
  case, only flush the subtree of interest. This avoids the latency of a
  full root flush if unnecessary.
- With contention, fall back to a coalesced (aka unified) flush of the
  entire hierarchy, a root flush. In this case, instead of returning
  immediately if a root flush is ongoing, wait for it to finish
  *without* attempting to acquire the lock or flush. This is done using
  a completion. Compared to competing directly on the underlying lock,
  this approach makes concurrent flushing a synchronization point
  instead of a serialization point. Once a root flush finishes, *all*
  waiters can wake up and continue at once.
- Finally, with very high contention, bound the number of waiters to the
  number of online cpus. This keeps the flush latency bounded at the tail
  (very high concurrency). We fall back to sacrificing stats freshness only
  in such cases in favor of performance.
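
A simplified sketch of this scheme is below. It is not the exact
patch: the periodic flush bookkeeping is omitted, and the
completion/waiter handling is an approximation of the behavior
described above rather than the code in the diff.

        static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
        {
                static DECLARE_COMPLETION(root_flush_done);
                static DEFINE_SPINLOCK(root_flusher_lock);
                static DEFINE_MUTEX(subtree_flush_mutex);
                static atomic_t waiters = ATOMIC_INIT(0);
                static bool root_flush_ongoing;
                bool root_flusher = false;

                /* An ongoing root flush? Just wait for it (unless asked not to). */
                if (READ_ONCE(root_flush_ongoing))
                        goto root_flush_or_wait;

                /* No contention: flush only the subtree of interest. */
                if (!mem_cgroup_is_root(memcg) &&
                    mutex_trylock(&subtree_flush_mutex)) {
                        cgroup_rstat_flush(memcg->css.cgroup);
                        mutex_unlock(&subtree_flush_mutex);
                        return;
                }

        root_flush_or_wait:
                /* Contention: coalesce everyone into a single root flush. */
                spin_lock(&root_flusher_lock);
                if (!READ_ONCE(root_flush_ongoing)) {
                        reinit_completion(&root_flush_done);
                        WRITE_ONCE(root_flush_ongoing, true);
                        root_flusher = true;
                }
                spin_unlock(&root_flusher_lock);

                if (root_flusher) {
                        cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
                        WRITE_ONCE(root_flush_ongoing, false);
                        complete_all(&root_flush_done);
                } else if (wait) {
                        /* Bound the waiters; excess callers skip (stale stats). */
                        if (atomic_add_unless(&waiters, 1, num_online_cpus())) {
                                wait_for_completion(&root_flush_done);
                                atomic_dec(&waiters);
                        }
                }
        }

The important property is that concurrent callers wait on the
completion rather than queue up on the rstat lock, so they all wake at
once when the designated flusher is done.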

This was tested in two ways on a machine with 384 cpus:
- A synthetic test with 5000 concurrent workers doing allocations and
  reclaim, as well as 1000 readers for memory.stat (variation of [2]).
  No significant regressions were noticed in the total runtime.
  Note that if concurrent flushers compete directly on the spinlock
  instead of waiting for a completion, this test shows 2x-3x slowdowns.
  Even though subsequent flushers would have nothing to flush, just the
  serialization and lock contention is a major problem. Using a
  completion for synchronization instead seems to overcome this problem.

- A synthetic stress test for concurrently reading memcg stats provided
  by Wei Xu.
  With 10k threads reading the stats every 100ms:
  - 98.8% of reads take <100us
  - 1.09% of reads take 100us to 1ms.
  - 0.11% of reads take 1ms to 10ms.
  - Almost no reads take more than 10ms.
  With 10k threads reading the stats every 10ms:
  - 82.3% of reads take <100us.
  - 4.2% of reads take 100us to 1ms.
  - 4.7% of reads take 1ms to 10ms.
  - 8.8% of reads take 10ms to 100ms.
  - Almost no reads take more than 100ms.

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/

[weixugc@google.com: suggested the fallback logic and bounding the
number of waiters]

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |   4 +-
 mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
 mm/vmscan.c                |   2 +-
 mm/workingset.c            |   8 ++-
 4 files changed, 85 insertions(+), 29 deletions(-)

Comments

Johannes Weiner Sept. 13, 2023, 3:37 p.m. UTC | #1
On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
> 
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
> 
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes the stats accuracy are compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.
> 
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
> - Without contention, there is no need to flush the entire tree. In this
>   case, only flush the subtree of interest. This avoids the latency of a
>   full root flush if unnecessary.
> - With contention, fallback to a coalesced (aka unified) flush of the
>   entire hierarchy, a root flush. In this case, instead of returning
>   immediately if a root flush is ongoing, wait for it to finish
>   *without* attempting to acquire the lock or flush. This is done using
>   a completion. Compared to competing directly on the underlying lock,
>   this approach makes concurrent flushing a synchronization point
>   instead of a serialization point. Once  a root flush finishes, *all*
>   waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to the
>   number of online cpus. This keeps the flush latency bounded at the tail
>   (very high concurrency). We fallback to sacrificing stats freshness only
>   in such cases in favor of performance.
> 
> This was tested in two ways on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers doing allocations and
>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>   No significant regressions were noticed in the total runtime.
>   Note that if concurrent flushers compete directly on the spinlock
>   instead of waiting for a completion, this test shows 2x-3x slowdowns.
>   Even though subsequent flushers would have nothing to flush, just the
>   serialization and lock contention is a major problem. Using a
>   completion for synchronization instead seems to overcome this problem.
> 
> - A synthetic stress test for concurrently reading memcg stats provided
>   by Wei Xu.
>   With 10k threads reading the stats every 100ms:
>   - 98.8% of reads take <100us
>   - 1.09% of reads take 100us to 1ms.
>   - 0.11% of reads take 1ms to 10ms.
>   - Almost no reads take more than 10ms.
>   With 10k threads reading the stats every 10ms:
>   - 82.3% of reads take <100us.
>   - 4.2% of reads take 100us to 1ms.
>   - 4.7% of reads take 1ms to 10ms.
>   - 8.8% of reads take 10ms to 100ms.
>   - Almost no reads take more than 100ms.
> 
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> 
> [weixugc@google.com: suggested the fallback logic and bounding the
> number of waiters]
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

>  	/*
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
>  	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>  		return;
> +	}
>  
> -	WRITE_ONCE(flush_last_time, jiffies_64);
> -
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	/* A coalesced root flush is in order. Are we the designated flusher? */
> +	spin_lock(&root_flusher_lock);

I can't say I'm crazy about this.

Let's say you have a fairly sprawling and active cgroup tree with lots
of updates in lots of groups in the system.

Add a periodic memory.stat reader to one of the subgroups, and that
reader will do very fast, localized aggregation.

Now add a periodic memory.stat reader to some other subgroup. They
might still both do very fast, localized aggregation. Or they might
collide; and now - despite having nothing in common, and sharing no
data besides the rstat lock - they will have to flush the entire massive
tree. The rate at which this happens is purely dependent on timing of
what should be independent actors. This is really not great for the
p50 vs p99 gap.

I think this whole thing is getting away from us.

When we first switched to rstat, every reader would take the global
rstat lock and then flush its local subtree of interest.

This was changed in the following commit:

commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
Author: Shakeel Butt <shakeelb@google.com>
Date:   Fri Nov 5 13:37:34 2021 -0700

    memcg: unify memcg stat flushing
    
    The memcg stats can be flushed in multiple context and potentially in
    parallel too.  For example multiple parallel user space readers for
    memcg stats will contend on the rstat locks with each other.  There is
    no need for that.  We just need one flusher and everyone else can
    benefit.
    
    In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats") the kernel periodically flush the memcg stats from the root, so,
    the other flushers will potentially have much less work to do.
    
    Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: "Michal Koutný" <mkoutny@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

The idea was that we can avoid lock contention if somebody is already
doing the flushing. However, you're now bringing global serialization.
Clearly that idea didn't work out. What would be an obstacle to go
back to the original way of doing it?

With one reader, this will work the same as in your proposal.

With two readers, just like in your proposal, flushing must be
serialized against the root level. But at least the two flushes only
aggregate the local data they actually care about - not the entire
tree data that doesn't even have readers! This is much better for lock
contention and performance.

One concern is the thresholding code. The cost of flushing on every
read is too high: even when there is no flush work, checking for it is
kind of expensive due to the locks and the for_each_possible_cpu().

Could we do something like the following?

	mem_cgroup_flush(memcg)
	{
		mutex_lock(&memcg_flush_mutex);
		if (atomic_read(&memcg->stat_delta) > THRESHOLD)
			cgroup_rstat_flush(memcg);
		mutex_unlock(&memcg_flush_mutex);
	}

	mem_cgroup_css_rstat_flush(css, cpu)
	{
		...
		/*
		 * Reset here instead of mem_cgroup_flush()
		 * so that each flushed subgroup is reset
		 * recursively. A recent parent flush will
		 * allow a child flush to skip.
		 */
		atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
	}

	memcg_rstat_updated(memcg, val)
	{
		atomic_add(&memcg->stat_delta, val);
	}
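
To make the above a bit more concrete, here is one way that sketch
could be spelled out in kernel style. The stat_delta field and the
threshold are hypothetical (nothing like them exists in the tree yet);
note that atomic_add() takes the value first and the pointer second.

        static DEFINE_MUTEX(memcg_flush_mutex);

        static void mem_cgroup_flush(struct mem_cgroup *memcg)
        {
                mutex_lock(&memcg_flush_mutex);
                /* Skip the whole rstat walk when too little has changed. */
                if (atomic_read(&memcg->stat_delta) >
                    MEMCG_CHARGE_BATCH * num_online_cpus())
                        cgroup_rstat_flush(memcg->css.cgroup);
                mutex_unlock(&memcg_flush_mutex);
        }

        static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css,
                                               int cpu)
        {
                /* ... existing per-cpu aggregation ... */

                /*
                 * Reset here instead of in mem_cgroup_flush() so that each
                 * flushed subgroup is reset recursively; a recent parent
                 * flush then lets a child flush skip.
                 */
                atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
        }

        static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        {
                /* Accumulate the magnitude of pending updates. */
                atomic_add(abs(val), &memcg->stat_delta);
        }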
Yosry Ahmed Sept. 13, 2023, 4:26 p.m. UTC | #2
On Wed, Sep 13, 2023 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> > Stats flushing for memcg currently follows the following rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is because all flushes are
> > serialized by a global rstat spinlock. On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
> > - Sometimes the stats accuracy are compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
> >
> > This patch aims to solve both problems by reworking how flushing
> > currently works as follows:
> > - Without contention, there is no need to flush the entire tree. In this
> >   case, only flush the subtree of interest. This avoids the latency of a
> >   full root flush if unnecessary.
> > - With contention, fallback to a coalesced (aka unified) flush of the
> >   entire hierarchy, a root flush. In this case, instead of returning
> >   immediately if a root flush is ongoing, wait for it to finish
> >   *without* attempting to acquire the lock or flush. This is done using
> >   a completion. Compared to competing directly on the underlying lock,
> >   this approach makes concurrent flushing a synchronization point
> >   instead of a serialization point. Once  a root flush finishes, *all*
> >   waiters can wake up and continue at once.
> > - Finally, with very high contention, bound the number of waiters to the
> >   number of online cpus. This keeps the flush latency bounded at the tail
> >   (very high concurrency). We fallback to sacrificing stats freshness only
> >   in such cases in favor of performance.
> >
> > This was tested in two ways on a machine with 384 cpus:
> > - A synthetic test with 5000 concurrent workers doing allocations and
> >   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
> >   No significant regressions were noticed in the total runtime.
> >   Note that if concurrent flushers compete directly on the spinlock
> >   instead of waiting for a completion, this test shows 2x-3x slowdowns.
> >   Even though subsequent flushers would have nothing to flush, just the
> >   serialization and lock contention is a major problem. Using a
> >   completion for synchronization instead seems to overcome this problem.
> >
> > - A synthetic stress test for concurrently reading memcg stats provided
> >   by Wei Xu.
> >   With 10k threads reading the stats every 100ms:
> >   - 98.8% of reads take <100us
> >   - 1.09% of reads take 100us to 1ms.
> >   - 0.11% of reads take 1ms to 10ms.
> >   - Almost no reads take more than 10ms.
> >   With 10k threads reading the stats every 10ms:
> >   - 82.3% of reads take <100us.
> >   - 4.2% of reads take 100us to 1ms.
> >   - 4.7% of reads take 1ms to 10ms.
> >   - 8.8% of reads take 10ms to 100ms.
> >   - Almost no reads take more than 100ms.
> >
> > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> >
> > [weixugc@google.com: suggested the fallback logic and bounding the
> > number of waiters]
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> >       /*
> > +      * Opportunistically try to only flush the requested subtree. Otherwise
> > +      * fallback to a coalesced flush below.
> >        */
> > -     if (atomic_read(&stats_flush_ongoing) ||
> > -         atomic_xchg(&stats_flush_ongoing, 1))
> > +     if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> > +             cgroup_rstat_flush(memcg->css.cgroup);
> > +             mutex_unlock(&subtree_flush_mutex);
> >               return;
> > +     }
> >
> > -     WRITE_ONCE(flush_last_time, jiffies_64);
> > -
> > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > +     /* A coalesced root flush is in order. Are we the designated flusher? */
> > +     spin_lock(&root_flusher_lock);
>
> I can't say I'm crazy about this.
>
> Let's say you have a fairly sprawling and active cgroup tree with lots
> of updates in lots of groups in the system.
>
> Add a periodic memory.stat reader to one of the subgroups, and that
> reader will do very fast, localized aggregation.
>
> Now add a periodic memory.stat reader to some other subgroup. They
> might still both do very fast, localized aggregation. Or they might
> collide; and now - despite having nothing in common, and sharing no
> data besides the rstat lock - will have to flush the entire massive
> tree. The rate at which this happens is purely dependent on timing of
> what should be independent actors. This is really not great for the
> p50 vs p99 gap.

Initially I had a few retry iterations for the subtree flush, where we
fsleep for a bit and trylock again. I thought it was a little bit too
complicated and the fsleep duration would be a magic value.

>
> I think this whole thing is getting away from us.
>
> When we first switched to rstat, every reader would take the global
> rstat lock and then flush its local subtree of interest.
>
> This was changed in the following commit:
>
> commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> Author: Shakeel Butt <shakeelb@google.com>
> Date:   Fri Nov 5 13:37:34 2021 -0700
>
>     memcg: unify memcg stat flushing
>
>     The memcg stats can be flushed in multiple context and potentially in
>     parallel too.  For example multiple parallel user space readers for
>     memcg stats will contend on the rstat locks with each other.  There is
>     no need for that.  We just need one flusher and everyone else can
>     benefit.
>
>     In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
>     stats") the kernel periodically flush the memcg stats from the root, so,
>     the other flushers will potentially have much less work to do.
>
>     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
>     Signed-off-by: Shakeel Butt <shakeelb@google.com>
>     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>     Cc: Michal Hocko <mhocko@kernel.org>
>     Cc: "Michal Koutný" <mkoutny@suse.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> The idea was that we can avoid lock contention if somebody is already
> doing the flushing. However, you're now bringing global serialization.
> Clearly that idea didn't work out. What would be an obstacle to go
> back to the original way of doing it?

The obstacle is high concurrency among flushers. A good example is the
reclaim code: we can have a lot of concurrent reclaimers. When I tried
to go back to the original way of doing it, a stress test with high
reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think
high concurrency among userspace reads would have a similar outcome,
but I hadn't really checked.

Basically this patch is trying to switch to root flushing when the
cost of contending on the lock is roughly the same as a root flush (or
waiting for one). It's doing that too eagerly now of course (if
contenders > 1); we can try to calibrate this better.

>
> With one reader, this will work the same as in your proposal.
>
> With two readers, just like in your proposal, flushing must be
> serialized against the root level. But at least the two flushes only
> aggregate the local data they actually care about - not the entire
> tree data that doesn't even have readers! This is much better for lock
> contention and performance.

Keep in mind that in my testing, I noticed that synchronization using
a completion is more performant than serialization on a lock. I am
assuming that's because when we contend on the underlying lock, we serially
wake up and do the flush. Even if there is nothing to do (as you
mention below), we still do this work. On the contrary, in this
proposal we just wait for the root flush to finish and return
immediately, and all waiters return at once (that's a lie because of
scheduling internals).

Also, in the current code, in the two reader scenario, the first
reader will flush the entire tree anyway. The only difference is that
the second reader will wait for it to finish instead of just skipping,
which is the right thing to do from a correctness point of view. Right
now it's a coin flip on whether you get updated stats if someone else
is already flushing.

Having said that, I understand your concern, but I am not really sure
how to bring back subtree flushing for all cases without regressing
in-kernel flushers with high concurrency. I tried and it seemed to
harm performance (at least of synthetic benchmarks). I also tried to
break down the underlying rstat global lock, but that didn't turn out
to be simple either.

>
> One concern is the thresholding code. The cost of flushing on every
> read is too high: even when there is no flush work, checking for it is
> kind of expensive due to the locks and the for_each_possible_cpu().
>
> Could we do something like the following?
>
>         mem_cgroup_flush(memcg)
>         {
>                 mutex_lock(&memcg_flush_mutex);
>                 if (atomic_read(&memcg->stat_delta) > THRESHOLD)
>                         cgroup_rstat_flush(memcg);
>                 mutex_unlock(&memcg_flush_mutex);
>         }
>
>         mem_cgroup_css_rstat_flush(css, cpu)
>         {
>                 ...
>                 /*
>                  * Reset here instead of mem_cgroup_flush()
>                  * so that each flushed subgroup is reset
>                  * recursively. A recent parent flush will
>                  * allow a child flush to skip.
>                  */
>                 atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
>         }
>
>         memcg_rstat_updated(memcg, val)
>         {
>                 atomic_add(&memcg->stat_delta, val);

We need to do this for each parent in the hierarchy, not just the
memcg being updated, right? I guess that may be too expensive for the
update path.

I thought about doing it percpu and propagating to global "stat
deltas" when each cpu has enough updates, similar to what we do now
with the global delta. To do this efficiently we would want to move
that to the cgroup core code, since we already have parents iteration
in cgroup_rstat_updated() and we might want to reuse that. I also
thought that may be too complex.

Anyway, I am open to suggestions on how to improve this.

>         }
Johannes Weiner Sept. 14, 2023, 4:06 p.m. UTC | #3
On Wed, Sep 13, 2023 at 09:26:21AM -0700, Yosry Ahmed wrote:
> On Wed, Sep 13, 2023 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> > > Stats flushing for memcg currently follows the following rules:
> > > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > > - Only one flusher is allowed at a time. If someone else tries to flush
> > >   concurrently, they skip and return immediately.
> > > - A periodic flusher flushes all the stats every 2 seconds.
> > >
> > > The reason this approach is followed is because all flushes are
> > > serialized by a global rstat spinlock. On the memcg side, flushing is
> > > invoked from userspace reads as well as in-kernel flushers (e.g.
> > > reclaim, refault, etc). This approach aims to avoid serializing all
> > > flushers on the global lock, which can cause a significant performance
> > > hit under high concurrency.
> > >
> > > This approach has the following problems:
> > > - Occasionally a userspace read of the stats of a non-root cgroup will
> > >   be too expensive as it has to flush the entire hierarchy [1].
> > > - Sometimes the stats accuracy are compromised if there is an ongoing
> > >   flush, and we skip and return before the subtree of interest is
> > >   actually flushed. This is more visible when reading stats from
> > >   userspace, but can also affect in-kernel flushers.
> > >
> > > This patch aims to solve both problems by reworking how flushing
> > > currently works as follows:
> > > - Without contention, there is no need to flush the entire tree. In this
> > >   case, only flush the subtree of interest. This avoids the latency of a
> > >   full root flush if unnecessary.
> > > - With contention, fallback to a coalesced (aka unified) flush of the
> > >   entire hierarchy, a root flush. In this case, instead of returning
> > >   immediately if a root flush is ongoing, wait for it to finish
> > >   *without* attempting to acquire the lock or flush. This is done using
> > >   a completion. Compared to competing directly on the underlying lock,
> > >   this approach makes concurrent flushing a synchronization point
> > >   instead of a serialization point. Once  a root flush finishes, *all*
> > >   waiters can wake up and continue at once.
> > > - Finally, with very high contention, bound the number of waiters to the
> > >   number of online cpus. This keeps the flush latency bounded at the tail
> > >   (very high concurrency). We fallback to sacrificing stats freshness only
> > >   in such cases in favor of performance.
> > >
> > > This was tested in two ways on a machine with 384 cpus:
> > > - A synthetic test with 5000 concurrent workers doing allocations and
> > >   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
> > >   No significant regressions were noticed in the total runtime.
> > >   Note that if concurrent flushers compete directly on the spinlock
> > >   instead of waiting for a completion, this test shows 2x-3x slowdowns.
> > >   Even though subsequent flushers would have nothing to flush, just the
> > >   serialization and lock contention is a major problem. Using a
> > >   completion for synchronization instead seems to overcome this problem.
> > >
> > > - A synthetic stress test for concurrently reading memcg stats provided
> > >   by Wei Xu.
> > >   With 10k threads reading the stats every 100ms:
> > >   - 98.8% of reads take <100us
> > >   - 1.09% of reads take 100us to 1ms.
> > >   - 0.11% of reads take 1ms to 10ms.
> > >   - Almost no reads take more than 10ms.
> > >   With 10k threads reading the stats every 10ms:
> > >   - 82.3% of reads take <100us.
> > >   - 4.2% of reads take 100us to 1ms.
> > >   - 4.7% of reads take 1ms to 10ms.
> > >   - 8.8% of reads take 10ms to 100ms.
> > >   - Almost no reads take more than 100ms.
> > >
> > > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > > [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> > >
> > > [weixugc@google.com: suggested the fallback logic and bounding the
> > > number of waiters]
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > >       /*
> > > +      * Opportunistically try to only flush the requested subtree. Otherwise
> > > +      * fallback to a coalesced flush below.
> > >        */
> > > -     if (atomic_read(&stats_flush_ongoing) ||
> > > -         atomic_xchg(&stats_flush_ongoing, 1))
> > > +     if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> > > +             cgroup_rstat_flush(memcg->css.cgroup);
> > > +             mutex_unlock(&subtree_flush_mutex);
> > >               return;
> > > +     }
> > >
> > > -     WRITE_ONCE(flush_last_time, jiffies_64);
> > > -
> > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > > +     /* A coalesced root flush is in order. Are we the designated flusher? */
> > > +     spin_lock(&root_flusher_lock);
> >
> > I can't say I'm crazy about this.
> >
> > Let's say you have a fairly sprawling and active cgroup tree with lots
> > of updates in lots of groups in the system.
> >
> > Add a periodic memory.stat reader to one of the subgroups, and that
> > reader will do very fast, localized aggregation.
> >
> > Now add a periodic memory.stat reader to some other subgroup. They
> > might still both do very fast, localized aggregation. Or they might
> > collide; and now - despite having nothing in common, and sharing no
> > data besides the rstat lock - will have to flush the entire massive
> > tree. The rate at which this happens is purely dependent on timing of
> > what should be independent actors. This is really not great for the
> > p50 vs p99 gap.
> 
> Initially I had a few retry iterations for the subtree flush, where we
> fsleep for a bit and trylock again. I thought it was a little bit too
> complicated and the fsleep duration would be a magic value.

Hm, how is that different than a mutex / sleepable lock?

> > I think this whole thing is getting away from us.
> >
> > When we first switched to rstat, every reader would take the global
> > rstat lock and then flush its local subtree of interest.
> >
> > This was changed in the following commit:
> >
> > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> > Author: Shakeel Butt <shakeelb@google.com>
> > Date:   Fri Nov 5 13:37:34 2021 -0700
> >
> >     memcg: unify memcg stat flushing
> >
> >     The memcg stats can be flushed in multiple context and potentially in
> >     parallel too.  For example multiple parallel user space readers for
> >     memcg stats will contend on the rstat locks with each other.  There is
> >     no need for that.  We just need one flusher and everyone else can
> >     benefit.
> >
> >     In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
> >     stats") the kernel periodically flush the memcg stats from the root, so,
> >     the other flushers will potentially have much less work to do.
> >
> >     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
> >     Signed-off-by: Shakeel Butt <shakeelb@google.com>
> >     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >     Cc: Michal Hocko <mhocko@kernel.org>
> >     Cc: "Michal Koutný" <mkoutny@suse.com>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > The idea was that we can avoid lock contention if somebody is already
> > doing the flushing. However, you're now bringing global serialization.
> > Clearly that idea didn't work out. What would be an obstacle to go
> > back to the original way of doing it?
> 
> The obstacle is high concurrency among flushers. A good example is
> reclaim code, we can have a lot of concurrent reclaimers. When I tried
> to go back to the original way of doing it, a stress test with high
> reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think
> high concurrency among userspace reads would have a similar outcome,
> but I hadn't really checked.
> 
> Basically this patch is trying to switch to root flushing when the
> cost of contending on the lock is roughly the same as a root flush (or
> waiting for one). It's doing that too eagerly now of course (if
> contenders > 1), we can try to calibrate this better.

I don't quite understand this.

If you have two readers on separate subtrees, why is it cheaper to
flush the entire tree in sequence (where both readers don't care about
the majority of the data), than having each reader flush their own
small subset one after another? It's not like the whole tree flush
parallelizes its work.

Where is that overhead actually coming from?

> > With one reader, this will work the same as in your proposal.
> >
> > With two readers, just like in your proposal, flushing must be
> > serialized against the root level. But at least the two flushes only
> > aggregate the local data they actually care about - not the entire
> > tree data that doesn't even have readers! This is much better for lock
> > contention and performance.
> 
> Keep in mind that in my testing, I noticed that synchronization using
> a completion is more performant than serialization on a lock. I am
> assuming because when we contend on the underlying lock, we serially
> wake up and do the flush. Even if there is nothing to do (as you
> mention below), we still do this work. On the contrary, in this
> proposal we just wait for the root flush to finish and return
> immediately, and all waiters return at once (that's a lie because of
> scheduling internals).

Right, because rstat's do-nothing case is still somewhat costly due to
the per-cpu loop and the tree walk.

But that should be possible to avoid with the outlined caching of
recently-flushed state at the memcg level.

> Also, in the current code, in the two reader scenario, the first
> reader will flush the entire tree anyway. The only difference is that
> the second reader will wait for it to finish instead of just skipping,
> which is the right thing to do from a correctness point of view. Right
> now it's a coin flip on whether you get updated stats if someone else
> is already flushing.

Agreed, it should wait. My mutex would accomplish this, right?

> > One concern is the thresholding code. The cost of flushing on every
> > read is too high: even when there is no flush work, checking for it is
> > kind of expensive due to the locks and the for_each_possible_cpu().
> >
> > Could we do something like the following?
> >
> >         mem_cgroup_flush(memcg)
> >         {
> >                 mutex_lock(&memcg_flush_mutex);
> >                 if (atomic_read(&memcg->stat_delta) > THRESHOLD)
> >                         cgroup_rstat_flush(memcg);
> >                 mutex_unlock(&memcg_flush_mutex);
> >         }
> >
> >         mem_cgroup_css_rstat_flush(css, cpu)
> >         {
> >                 ...
> >                 /*
> >                  * Reset here instead of mem_cgroup_flush()
> >                  * so that each flushed subgroup is reset
> >                  * recursively. A recent parent flush will
> >                  * allow a child flush to skip.
> >                  */
> >                 atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
> >         }
> >
> >         memcg_rstat_updated(memcg, val)
> >         {
> >                 atomic_add(&memcg->stat_delta, val);
> 
> We need to do this for each parent in the hierarchy, not just the
> memcg being updated, right? I guess that may be too expensive for the
> update path.

How so?

We need to mark the subgroups that are flushed, so that if you have

	root - a - a1 - foo
                `- a2

and somebody flushes a, then a1 and a2 also don't need to be flushed
for a while.

But if we flush a1 first, foo is aggregated into a1. Reading a still
needs to aggregate a1 and a2 into a.

Maybe I'm missing something blatant, but I don't see how we have to
mark parents in any way. We only have to tag cgroups up to the point
to which they were recently flushed, and we already visit all those.

Let me know if I'm missing something blatant here.
Waiman Long Sept. 14, 2023, 5:19 p.m. UTC | #4
On 9/13/23 03:38, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>    concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>    be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes the stats accuracy are compromised if there is an ongoing
>    flush, and we skip and return before the subtree of interest is
>    actually flushed. This is more visible when reading stats from
>    userspace, but can also affect in-kernel flushers.
>
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
> - Without contention, there is no need to flush the entire tree. In this
>    case, only flush the subtree of interest. This avoids the latency of a
>    full root flush if unnecessary.
> - With contention, fallback to a coalesced (aka unified) flush of the
>    entire hierarchy, a root flush. In this case, instead of returning
>    immediately if a root flush is ongoing, wait for it to finish
>    *without* attempting to acquire the lock or flush. This is done using
>    a completion. Compared to competing directly on the underlying lock,
>    this approach makes concurrent flushing a synchronization point
>    instead of a serialization point. Once  a root flush finishes, *all*
>    waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to the
>    number of online cpus. This keeps the flush latency bounded at the tail
>    (very high concurrency). We fallback to sacrificing stats freshness only
>    in such cases in favor of performance.
>
> This was tested in two ways on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers doing allocations and
>    reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>    No significant regressions were noticed in the total runtime.
>    Note that if concurrent flushers compete directly on the spinlock
>    instead of waiting for a completion, this test shows 2x-3x slowdowns.
>    Even though subsequent flushers would have nothing to flush, just the
>    serialization and lock contention is a major problem. Using a
>    completion for synchronization instead seems to overcome this problem.
>
> - A synthetic stress test for concurrently reading memcg stats provided
>    by Wei Xu.
>    With 10k threads reading the stats every 100ms:
>    - 98.8% of reads take <100us
>    - 1.09% of reads take 100us to 1ms.
>    - 0.11% of reads take 1ms to 10ms.
>    - Almost no reads take more than 10ms.
>    With 10k threads reading the stats every 10ms:
>    - 82.3% of reads take <100us.
>    - 4.2% of reads take 100us to 1ms.
>    - 4.7% of reads take 1ms to 10ms.
>    - 8.8% of reads take 10ms to 100ms.
>    - Almost no reads take more than 100ms.
>
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>
> [weixugc@google.com: suggested the fallback logic and bounding the
> number of waiters]
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   include/linux/memcontrol.h |   4 +-
>   mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
>   mm/vmscan.c                |   2 +-
>   mm/workingset.c            |   8 ++-
>   4 files changed, 85 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 11810a2cfd2d..4453cd3fc4b8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   	return x;
>   }
>   
> -void mem_cgroup_flush_stats(void);
> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>   void mem_cgroup_flush_stats_ratelimited(void);
>   
>   void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   	return node_page_state(lruvec_pgdat(lruvec), idx);
>   }
>   
> -static inline void mem_cgroup_flush_stats(void)
> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>   {
>   }
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d729870505f1..edff41e4b4e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   static void flush_memcg_stats_dwork(struct work_struct *w);
>   static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>   static DEFINE_PER_CPU(unsigned int, stats_updates);
> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>   /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
>   static atomic_t stats_updates_order = ATOMIC_INIT(0);
>   static u64 flush_last_time;
> @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>   	}
>   }
>   
> -static void do_flush_stats(void)
> +/*
> + * do_flush_stats - flush the statistics of a memory cgroup and its tree
> + * @memcg: the memory cgroup to flush
> + * @wait: wait for an ongoing root flush to complete before returning
> + *
> + * All flushes are serialized by the underlying rstat global lock. If there is
> + * no contention, we try to only flush the subtree of the passed @memcg to
> + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
> + * single flush of the root memcg. When there is an ongoing root flush, we wait
> + * for its completion (unless otherwise requested), to get fresh stats. If the
> + * number of waiters exceeds the number of cpus just skip the flush to bound the
> + * flush latency at the tail with very high concurrency.
> + *
> + * This is a trade-off between stats accuracy and flush latency.
> + */
> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
>   {
> +	static DECLARE_COMPLETION(root_flush_done);
> +	static DEFINE_SPINLOCK(root_flusher_lock);
> +	static DEFINE_MUTEX(subtree_flush_mutex);
> +	static atomic_t waiters = ATOMIC_INIT(0);
> +	static bool root_flush_ongoing;
> +	bool root_flusher = false;
> +
> +	/* Ongoing root flush, just wait for it (unless otherwise requested) */
> +	if (READ_ONCE(root_flush_ongoing))
> +		goto root_flush_or_wait;
> +
>   	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
>   	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>   		return;
> +	}

If mutex_trylock() is the only way to acquire subtree_flush_mutex, you 
don't really need a mutex. Just a simple integer flag with xchg() call 
should be enough.

Cheers,
Longman
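
As an illustration of this suggestion (an assumption about its shape,
not code from the thread), the trylock-only mutex could become an
atomic flag claimed with xchg(); memory-ordering details are glossed
over:

        /* A simple busy flag instead of subtree_flush_mutex. */
        static atomic_t subtree_flush_busy = ATOMIC_INIT(0);

        static bool try_subtree_flush(struct mem_cgroup *memcg)
        {
                /* atomic_xchg() returns the old value; 0 means we claimed it. */
                if (mem_cgroup_is_root(memcg) ||
                    atomic_xchg(&subtree_flush_busy, 1))
                        return false;   /* fall back to the coalesced root flush */

                cgroup_rstat_flush(memcg->css.cgroup);
                atomic_set(&subtree_flush_busy, 0);
                return true;
        }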
Yosry Ahmed Sept. 14, 2023, 5:22 p.m. UTC | #5
[..]
> > > > -
> > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > > > +     /* A coalesced root flush is in order. Are we the designated flusher? */
> > > > +     spin_lock(&root_flusher_lock);
> > >
> > > I can't say I'm crazy about this.
> > >
> > > Let's say you have a fairly sprawling and active cgroup tree with lots
> > > of updates in lots of groups in the system.
> > >
> > > Add a periodic memory.stat reader to one of the subgroups, and that
> > > reader will do very fast, localized aggregation.
> > >
> > > Now add a periodic memory.stat reader to some other subgroup. They
> > > might still both do very fast, localized aggregation. Or they might
> > > collide; and now - despite having nothing in common, and sharing no
> > > data besides the rstat lock - will have to flush the entire massive
> > > tree. The rate at which this happens is purely dependent on timing of
> > > what should be independent actors. This is really not great for the
> > > p50 vs p99 gap.
> >
> > Initially I had a few retry iterations for the subtree flush, where we
> > fsleep for a bit and trylock again. I thought it was a little bit too
> > complicated and the fsleep duration would be a magic value.
>
> Hm, how is that different than a mutex / sleepable lock?

It is essentially a lock with a timeout, which IIUC we don't support
explicitly in the kernel. What I was trying to do was basically to try
and do a subtree flush, but if we are waiting for too long then a lot
of people are probably flushing, so let's all switch to a root flush
and wait for one flusher instead of contending among ourselves.
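
For context, that earlier approach might have looked roughly like the
following (an illustrative reconstruction; the retry count and the
fsleep() period are exactly the magic values mentioned above):

        static bool try_subtree_flush_with_backoff(struct mem_cgroup *memcg)
        {
                int retries = 3;        /* magic value */

                if (mem_cgroup_is_root(memcg))
                        return false;

                while (retries--) {
                        if (mutex_trylock(&subtree_flush_mutex)) {
                                cgroup_rstat_flush(memcg->css.cgroup);
                                mutex_unlock(&subtree_flush_mutex);
                                return true;
                        }
                        fsleep(USEC_PER_MSEC);  /* another magic value */
                }

                /* Contended for too long; caller falls back to a root flush. */
                return false;
        }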

>
> > > I think this whole thing is getting away from us.
> > >
> > > When we first switched to rstat, every reader would take the global
> > > rstat lock and then flush its local subtree of interest.
> > >
> > > This was changed in the following commit:
> > >
> > > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> > > Author: Shakeel Butt <shakeelb@google.com>
> > > Date:   Fri Nov 5 13:37:34 2021 -0700
> > >
> > >     memcg: unify memcg stat flushing
> > >
> > >     The memcg stats can be flushed in multiple context and potentially in
> > >     parallel too.  For example multiple parallel user space readers for
> > >     memcg stats will contend on the rstat locks with each other.  There is
> > >     no need for that.  We just need one flusher and everyone else can
> > >     benefit.
> > >
> > >     In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
> > >     stats") the kernel periodically flush the memcg stats from the root, so,
> > >     the other flushers will potentially have much less work to do.
> > >
> > >     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
> > >     Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > >     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > >     Cc: Michal Hocko <mhocko@kernel.org>
> > >     Cc: "Michal Koutný" <mkoutny@suse.com>
> > >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > >
> > > The idea was that we can avoid lock contention if somebody is already
> > > doing the flushing. However, you're now bringing global serialization.
> > > Clearly that idea didn't work out. What would be an obstacle to go
> > > back to the original way of doing it?
> >
> > The obstacle is high concurrency among flushers. A good example is
> > reclaim code, we can have a lot of concurrent reclaimers. When I tried
> > to go back to the original way of doing it, a stress test with high
> > reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think
> > high concurrency among userspace reads would have a similar outcome,
> > but I hadn't really checked.
> >
> > Basically this patch is trying to switch to root flushing when the
> > cost of contending on the lock is roughly the same as a root flush (or
> > waiting for one). It's doing that too eagerly now of course (if
> > contenders > 1), we can try to calibrate this better.
>
> I don't quite understand this.
>
> If you have two readers on separate subtrees, why is it cheaper to
> flush the entire tree in sequence (where both readers don't care about
> the majority of the data), than having each reader flush their own
> small subset one after another? It's not like the whole tree flush
> parallelizes its work.
>
> Where is that overhead actually coming from?

If you have N concurrent readers flushing different parts of the
subtree, at some point the overhead of N sequential subtree flushes
will exceed the overhead of a single root flush.

Ideally, we would be able to identify this point, and switch to a
single root flush with everyone else waiting.

>
> > > With one reader, this will work the same as in your proposal.
> > >
> > > With two readers, just like in your proposal, flushing must be
> > > serialized against the root level. But at least the two flushes only
> > > aggregate the local data they actually care about - not the entire
> > > tree data that doesn't even have readers! This is much better for lock
> > > contention and performance.
> >
> > Keep in mind that in my testing, I noticed that synchronization using
> > a completion is more performant than serialization on a lock. I am
> > assuming because when we contend on the underlying lock, we serially
> > wake up and do the flush. Even if there is nothing to do (as you
> > mention below), we still do this work. On the contrary, in this
> > proposal we just wait for the root flush to finish and return
> > immediately, and all waiters return at once (that's a lie because of
> > scheduling internals).
>
> Right, because rstat's do-nothing case is still somewhat costly due to
> the per-cpu loop and the tree walk.
>
> But that should be possible to avoid with the outlined caching of
> recently-flushed state at the memcg level.

This helps only if concurrent flushers are overlapping, right?

>
> > Also, in the current code, in the two reader scenario, the first
> > reader will flush the entire tree anyway. The only difference is that
> > the second reader will wait for it to finish instead of just skipping,
> > which is the right thing to do from a correctness point of view. Right
> > now it's a coin flip on whether you get updated stats if someone else
> > is already flushing.
>
> Agreed, it should wait. My mutex would accomplish this, right?

I think what you're describing here is v4 of the patchset I mention in
the cover letter:
https://lore.kernel.org/lkml/20230831165611.2610118-5-yosryahmed@google.com/

The problem with that was that with very high concurrency among
readers, the read latency is unbounded and can get out of hand. Wei
shared some numbers in that thread.

What this patch is trying to do is to switch to root flushing if the
mutex is contended to avoid that scenario. Also, if we keep getting
more flushers, at some point we just skip the flush in favor of
performance (if the number of waiters exceeds the number of cpus). Now
that I think about it, perhaps we can just go back to the mutex
approach w/ limiting the number of waiters, without ever falling back
to a root flush. Not sure how that would work.
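
Purely as an illustration, the mutex approach with a bounded number of
waiters could be sketched like this (with the same caveat that the
callers who skip read stale stats):

        static void mem_cgroup_flush(struct mem_cgroup *memcg)
        {
                static DEFINE_MUTEX(flush_mutex);
                static atomic_t flush_waiters = ATOMIC_INIT(0);

                /* Cap the queue length at the number of online cpus. */
                if (!atomic_add_unless(&flush_waiters, 1, num_online_cpus()))
                        return; /* too many waiters: skip, accept stale stats */

                mutex_lock(&flush_mutex);
                cgroup_rstat_flush(memcg->css.cgroup);
                mutex_unlock(&flush_mutex);

                atomic_dec(&flush_waiters);
        }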

Taking a step back, I think what you are implying is that if we make
the thresholding code per cpu instead of global, this should mitigate
the issue in a different way than falling back to root flushing or
limiting the number of waiters, is that right?
If yes, I think it will work in some cases where the flushes are
overlapping as I mentioned above, but not if concurrent readers are
flushing different parts of the tree. Right?

>
> > > One concern is the thresholding code. The cost of flushing on every
> > > read is too high: even when there is no flush work, checking for it is
> > > kind of expensive due to the locks and the for_each_possible_cpu().
> > >
> > > Could we do something like the following?
> > >
> > >         mem_cgroup_flush(memcg)
> > >         {
> > >                 mutex_lock(&memcg_flush_mutex);
> > >                 if (atomic_read(&memcg->stat_delta) > THRESHOLD)
> > >                         cgroup_rstat_flush(memcg);
> > >                 mutex_unlock(&memcg_flush_mutex);
> > >         }
> > >
> > >         mem_cgroup_css_rstat_flush(css, cpu)
> > >         {
> > >                 ...
> > >                 /*
> > >                  * Reset here instead of mem_cgroup_flush()
> > >                  * so that each flushed subgroup is reset
> > >                  * recursively. A recent parent flush will
> > >                  * allow a child flush to skip.
> > >                  */
> > >                 atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
> > >         }
> > >
> > >         memcg_rstat_updated(memcg, val)
> > >         {
> > >                 atomic_add(&memcg->stat_delta, val);
> >
> > We need to do this for each parent in the hierarchy, not just the
> > memcg being updated, right? I guess that may be too expensive for the
> > update path.
>
> How so?
>
> We need to mark the subgroups that are flushed, so that if you have
>
>         root - a - a1 - foo
>                 `- a2
>
> and somebody flushes a, then a1 and a2 also don't need to be flushed
> for a while.
>
> But if we flush a1 first, foo is aggregated into a1. Reading a still
> needs to aggregate a1 and a2 into a.
>
> Maybe I'm missing something blatant, but I don't see how we have to
> mark parents in any way. We only have to tag cgroups up to the point
> to which they were recently flushed, and we already visit all those.
>
> Let me know if I'm missing something blatant here.

I think we are talking about different paths. I believe you are
talking about resetting memcg->stat_delta, which I agree should be
done during the flush. What I am talking about is updating
memcg->stat_delta when memcg_rstat_updated() is called. We would need
to update that for all parents. In your example hierarchy, if stat
updates happened in a2, and someone tries to flush a, they should be
aware that there are stats to flush.
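
As a sketch of the update-path propagation described here (stat_delta
is still the hypothetical field from the earlier sketch):

        static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        {
                /*
                 * Tag every ancestor, so that a later flush of e.g. 'a'
                 * sees that updates happened somewhere underneath (in 'a2').
                 */
                for (; memcg; memcg = parent_mem_cgroup(memcg))
                        atomic_add(abs(val), &memcg->stat_delta);
        }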
Yosry Ahmed Sept. 14, 2023, 5:23 p.m. UTC | #6
On Thu, Sep 14, 2023 at 10:19 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 9/13/23 03:38, Yosry Ahmed wrote:
> > Stats flushing for memcg currently follows the following rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >    concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is because all flushes are
> > serialized by a global rstat spinlock. On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >    be too expensive as it has to flush the entire hierarchy [1].
> > - Sometimes the stats accuracy are compromised if there is an ongoing
> >    flush, and we skip and return before the subtree of interest is
> >    actually flushed. This is more visible when reading stats from
> >    userspace, but can also affect in-kernel flushers.
> >
> > This patch aims to solve both problems by reworking how flushing
> > currently works as follows:
> > - Without contention, there is no need to flush the entire tree. In this
> >    case, only flush the subtree of interest. This avoids the latency of a
> >    full root flush if unnecessary.
> > - With contention, fallback to a coalesced (aka unified) flush of the
> >    entire hierarchy, a root flush. In this case, instead of returning
> >    immediately if a root flush is ongoing, wait for it to finish
> >    *without* attempting to acquire the lock or flush. This is done using
> >    a completion. Compared to competing directly on the underlying lock,
> >    this approach makes concurrent flushing a synchronization point
> >    instead of a serialization point. Once  a root flush finishes, *all*
> >    waiters can wake up and continue at once.
> > - Finally, with very high contention, bound the number of waiters to the
> >    number of online cpus. This keeps the flush latency bounded at the tail
> >    (very high concurrency). We fallback to sacrificing stats freshness only
> >    in such cases in favor of performance.
> >
> > This was tested in two ways on a machine with 384 cpus:
> > - A synthetic test with 5000 concurrent workers doing allocations and
> >    reclaim, as well as 1000 readers for memory.stat (variation of [2]).
> >    No significant regressions were noticed in the total runtime.
> >    Note that if concurrent flushers compete directly on the spinlock
> >    instead of waiting for a completion, this test shows 2x-3x slowdowns.
> >    Even though subsequent flushers would have nothing to flush, just the
> >    serialization and lock contention is a major problem. Using a
> >    completion for synchronization instead seems to overcome this problem.
> >
> > - A synthetic stress test for concurrently reading memcg stats provided
> >    by Wei Xu.
> >    With 10k threads reading the stats every 100ms:
> >    - 98.8% of reads take <100us
> >    - 1.09% of reads take 100us to 1ms.
> >    - 0.11% of reads take 1ms to 10ms.
> >    - Almost no reads take more than 10ms.
> >    With 10k threads reading the stats every 10ms:
> >    - 82.3% of reads take <100us.
> >    - 4.2% of reads take 100us to 1ms.
> >    - 4.7% of reads take 1ms to 10ms.
> >    - 8.8% of reads take 10ms to 100ms.
> >    - Almost no reads take more than 100ms.
> >
> > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> >
> > [weixugc@google.com: suggested the fallback logic and bounding the
> > number of waiters]
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   include/linux/memcontrol.h |   4 +-
> >   mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
> >   mm/vmscan.c                |   2 +-
> >   mm/workingset.c            |   8 ++-
> >   4 files changed, 85 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 11810a2cfd2d..4453cd3fc4b8 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >       return x;
> >   }
> >
> > -void mem_cgroup_flush_stats(void);
> > +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> >   void mem_cgroup_flush_stats_ratelimited(void);
> >
> >   void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >       return node_page_state(lruvec_pgdat(lruvec), idx);
> >   }
> >
> > -static inline void mem_cgroup_flush_stats(void)
> > +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> >   {
> >   }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d729870505f1..edff41e4b4e7 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >   static void flush_memcg_stats_dwork(struct work_struct *w);
> >   static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >   static DEFINE_PER_CPU(unsigned int, stats_updates);
> > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> >   /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
> >   static atomic_t stats_updates_order = ATOMIC_INIT(0);
> >   static u64 flush_last_time;
> > @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >       }
> >   }
> >
> > -static void do_flush_stats(void)
> > +/*
> > + * do_flush_stats - flush the statistics of a memory cgroup and its tree
> > + * @memcg: the memory cgroup to flush
> > + * @wait: wait for an ongoing root flush to complete before returning
> > + *
> > + * All flushes are serialized by the underlying rstat global lock. If there is
> > + * no contention, we try to only flush the subtree of the passed @memcg to
> > + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
> > + * single flush of the root memcg. When there is an ongoing root flush, we wait
> > + * for its completion (unless otherwise requested), to get fresh stats. If the
> > + * number of waiters exceeds the number of cpus just skip the flush to bound the
> > + * flush latency at the tail with very high concurrency.
> > + *
> > + * This is a trade-off between stats accuracy and flush latency.
> > + */
> > +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
> >   {
> > +     static DECLARE_COMPLETION(root_flush_done);
> > +     static DEFINE_SPINLOCK(root_flusher_lock);
> > +     static DEFINE_MUTEX(subtree_flush_mutex);
> > +     static atomic_t waiters = ATOMIC_INIT(0);
> > +     static bool root_flush_ongoing;
> > +     bool root_flusher = false;
> > +
> > +     /* Ongoing root flush, just wait for it (unless otherwise requested) */
> > +     if (READ_ONCE(root_flush_ongoing))
> > +             goto root_flush_or_wait;
> > +
> >       /*
> > -      * We always flush the entire tree, so concurrent flushers can just
> > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > +      * Opportunistically try to only flush the requested subtree. Otherwise
> > +      * fallback to a coalesced flush below.
> >        */
> > -     if (atomic_read(&stats_flush_ongoing) ||
> > -         atomic_xchg(&stats_flush_ongoing, 1))
> > +     if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> > +             cgroup_rstat_flush(memcg->css.cgroup);
> > +             mutex_unlock(&subtree_flush_mutex);
> >               return;
> > +     }
>
> If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
> don't really need a mutex. Just a simple integer flag with xchg() call
> should be enough.

Thanks for pointing this out. Agreed.

If we keep this approach I will drop that mutex.
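
Something like the following, I assume (rough sketch, the flag name is
made up):

          static atomic_t subtree_flush_ongoing = ATOMIC_INIT(0);

          /* Opportunistic subtree flush; fall back to a coalesced flush */
          if (!mem_cgroup_is_root(memcg) &&
              !atomic_xchg(&subtree_flush_ongoing, 1)) {
                  cgroup_rstat_flush(memcg->css.cgroup);
                  atomic_set(&subtree_flush_ongoing, 0);
                  return;
          }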

>
> Cheers,
> Longman
>
Yosry Ahmed Sept. 14, 2023, 5:26 p.m. UTC | #7
On Thu, Sep 14, 2023 at 10:22 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > > -
> > > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > > > > +     /* A coalesced root flush is in order. Are we the designated flusher? */
> > > > > +     spin_lock(&root_flusher_lock);
> > > >
> > > > I can't say I'm crazy about this.
> > > >
> > > > Let's say you have a fairly sprawling and active cgroup tree with lots
> > > > of updates in lots of groups in the system.
> > > >
> > > > Add a periodic memory.stat reader to one of the subgroups, and that
> > > > reader will do very fast, localized aggregation.
> > > >
> > > > Now add a periodic memory.stat reader to some other subgroup. They
> > > > might still both do very fast, localized aggregation. Or they might
> > > > collide; and now - despite having nothing in common, and sharing no
> > > > data besides the rstat lock - will have to flush the entire massive
> > > > tree. The rate at which this happens is purely dependent on timing of
> > > > what should be independent actors. This is really not great for the
> > > > p50 vs p99 gap.
> > >
> > > Initially I had a few retry iterations for the subtree flush, where we
> > > fsleep for a bit and trylock again. I thought it was a little bit too
> > > complicated and the fsleep duration would be a magic value.
> >
> > Hm, how is that different than a mutex / sleepable lock?
>
> It is essentially a lock with a timeout, which IIUC we don't support
> explicitly in the kernel. What I was trying to do was basically to try
> and do a subtree flush, but if we are waiting for too long then a lot
> of people are probably flushing, so let's all switch to a root flush
> and wait for one flusher instead of contending among ourselves.
>
> >
> > > > I think this whole thing is getting away from us.
> > > >
> > > > When we first switched to rstat, every reader would take the global
> > > > rstat lock and then flush its local subtree of interest.
> > > >
> > > > This was changed in the following commit:
> > > >
> > > > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> > > > Author: Shakeel Butt <shakeelb@google.com>
> > > > Date:   Fri Nov 5 13:37:34 2021 -0700
> > > >
> > > >     memcg: unify memcg stat flushing
> > > >
> > > >     The memcg stats can be flushed in multiple context and potentially in
> > > >     parallel too.  For example multiple parallel user space readers for
> > > >     memcg stats will contend on the rstat locks with each other.  There is
> > > >     no need for that.  We just need one flusher and everyone else can
> > > >     benefit.
> > > >
> > > >     In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
> > > >     stats") the kernel periodically flush the memcg stats from the root, so,
> > > >     the other flushers will potentially have much less work to do.
> > > >
> > > >     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
> > > >     Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > >     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >     Cc: Michal Hocko <mhocko@kernel.org>
> > > >     Cc: "Michal Koutný" <mkoutny@suse.com>
> > > >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > >
> > > > The idea was that we can avoid lock contention if somebody is already
> > > > doing the flushing. However, you're now bringing global serialization.
> > > > Clearly that idea didn't work out. What would be an obstacle to go
> > > > back to the original way of doing it?
> > >
> > > The obstacle is high concurrency among flushers. A good example is
> > > reclaim code, we can have a lot of concurrent reclaimers. When I tried
> > > to go back to the original way of doing it, a stress test with high
> > > reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think
> > > high concurrency among userspace reads would have a similar outcome,
> > > but I hadn't really checked.
> > >
> > > Basically this patch is trying to switch to root flushing when the
> > > cost of contending on the lock is roughly the same as a root flush (or
> > > waiting for one). It's doing that too eagerly now of course (if
> > > contenders > 1), we can try to calibrate this better.
> >
> > I don't quite understand this.
> >
> > If you have two readers on separate subtrees, why is it cheaper to
> > flush the entire tree in sequence (where both readers don't care about
> > the majority of the data), than having each reader flush their own
> > small subset one after another? It's not like the whole tree flush
> > parallelizes its work.
> >
> > Where is that overhead actually coming from?
>
> If you have N concurrent readers flushing different parts of the
> subtree, at some point the overhead of N sequential subtree flushes
> will exceed the overhead of a single root flush.
>
> Ideally, we would be able to identify this point, and switch to a
> single root flush with everyone else waiting.
>
> >
> > > > With one reader, this will work the same as in your proposal.
> > > >
> > > > With two readers, just like in your proposal, flushing must be
> > > > serialized against the root level. But at least the two flushes only
> > > > aggregate the local data they actually care about - not the entire
> > > > tree data that doesn't even have readers! This is much better for lock
> > > > contention and performance.
> > >
> > > Keep in mind that in my testing, I noticed that synchronization using
> > > a completion is more performant than serialization on a lock. I am
> > > assuming because when we contend on the underlying lock, we serially
> > > wake up and do the flush. Even if there is nothing to do (as you
> > > mention below), we still do this work. On the contrary, in this
> > > proposal we just wait for the root flush to finish and return
> > > immediately, and all waiters return at once (that's a lie because of
> > > scheduling internals).
> >
> > Right, because rstat's do-nothing case is still somewhat costly due to
> > the per-cpu loop and the tree walk.
> >
> > But that should be possible to avoid with the outlined caching of
> > recently-flushed state at the memcg level.
>
> This helps only if concurrent flushers are overlapping, right?
>
> >
> > > Also, in the current code, in the two reader scenario, the first
> > > reader will flush the entire tree anyway. The only difference is that
> > > the second reader will wait for it to finish instead of just skipping,
> > > which is the right thing to do from a correctness point of view. Right
> > > now it's a coin flip on whether you get updated stats if someone else
> > > is already flushing.
> >
> > Agreed, it should wait. My mutex would accomplish this, right?
>
> I think what you're describing here is v4 of the patchset I mention in
> the cover letter:
> https://lore.kernel.org/lkml/20230831165611.2610118-5-yosryahmed@google.com/
>
> The problem with that was that with very high concurrency among
> readers, the read latency is unbounded and can get out of hand. Wei
> shared some numbers in that thread.
>
> What this patch is trying to do is to switch to root flushing if the
> mutex is contended to avoid that scenario.  Also, if we keep acquiring
> more flushers at some point we just skip the flush in favor of
> performance (if the number of waiters exceeds the number of cpus). Now
> that I think about it, perhaps we can just go back to the mutex
> approach w/ limiting the number of waiters, without ever falling back
> to a root flush. Not sure how that would work.
>
> Taking a step back, I think what you are implying is that if we make
> the thresholding code per cpu instead of global, this should mitigate

per cgroup*

> the issue in a different way than falling back to root flushing or
> limiting the number of waiters, is that right?
> If yes, I think it will work in some cases where the flushes are
> overlapping as I mentioned above, but not if concurrent readers are
> flushing different parts of the tree. Right?
>
> >
> > > > One concern is the thresholding code. The cost of flushing on every
> > > > read is too high: even when there is no flush work, checking for it is
> > > > kind of expensive due to the locks and the for_each_possible_cpu().
> > > >
> > > > Could we do something like the following?
> > > >
> > > >         mem_cgroup_flush(memcg)
> > > >         {
> > > >                 mutex_lock(&memcg_flush_mutex);
> > > >                 if (atomic_read(&memcg->stat_delta) > THRESHOLD)
> > > >                         cgroup_rstat_flush(memcg);
> > > >                 mutex_unlock(&memcg_flush_mutex);
> > > >         }
> > > >
> > > >         mem_cgroup_css_rstat_flush(css, cpu)
> > > >         {
> > > >                 ...
> > > >                 /*
> > > >                  * Reset here instead of mem_cgroup_flush()
> > > >                  * so that each flushed subgroup is reset
> > > >                  * recursively. A recent parent flush will
> > > >                  * allow a child flush to skip.
> > > >                  */
> > > >                 atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
> > > >         }
> > > >
> > > >         memcg_rstat_updated(memcg, val)
> > > >         {
> > > >                 atomic_add(&memcg->stat_delta, val);
> > >
> > > We need to do this for each parent in the hierarchy, not just the
> > > memcg being updated, right? I guess that may be too expensive for the
> > > update path.
> >
> > How so?
> >
> > We need to mark the subgroups that are flushed, so that if you have
> >
> >         root - a - a1 - foo
> >                 `- a2
> >
> > and somebody flushes a, then a1 and a2 also don't need to be flushed
> > for a while.
> >
> > But if we flush a1 first, foo is aggregated into a1. Reading a still
> > needs to aggregate a1 and a2 into a.
> >
> > Maybe I'm missing something blatant, but I don't see how we have to
> > mark parents in any way. We only have to tag cgroups up to the point
> > to which they were recently flushed, and we already visit all those.
> >
> > Let me know if I'm missing something blatant here.
>
> I think we are talking about different paths. I believe you are
> talking about resetting memcg->stat_delta, which I agree should be
> done during the flush. What I am talking about is updating
> memcg->stat_delta when memcg_rstat_updated() is called. We would need
> to update that for all parents. In your example hierarchy, if stat
> updates happened in a2, and someone tries to flush a, they should be
> aware that there are stats to flush.
Shakeel Butt Sept. 14, 2023, 5:36 p.m. UTC | #8
On Wed, Sep 13, 2023 at 12:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].

This is a real world workload exhibiting the issue which is good.

> - Sometimes the stats accuracy are compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.

Please provide similar data/justification for the above. In addition:

1. How much delayed/stale stats have you observed on real world workload?

2. What is acceptable staleness in the stats for your use-case?

3. What is your use-case?

4. Does your use-case care about staleness of all the stats in
memory.stat or some specific stats?

5. If some specific stats in memory.stat, does it make sense to
decouple them from rstat and just pay the price up front to maintain
them accurately?

Most importantly please please please be concise in your responses.

I know I am going back on some of the previous agreements but this
whole locking back and forth has made me question the original
motivation.

thanks,
Shakeel
Waiman Long Sept. 14, 2023, 5:36 p.m. UTC | #9
On 9/14/23 13:23, Yosry Ahmed wrote:
> On Thu, Sep 14, 2023 at 10:19 AM Waiman Long <longman@redhat.com> wrote:
>>
>> On 9/13/23 03:38, Yosry Ahmed wrote:
>>> Stats flushing for memcg currently follows the following rules:
>>> - Always flush the entire memcg hierarchy (i.e. flush the root).
>>> - Only one flusher is allowed at a time. If someone else tries to flush
>>>     concurrently, they skip and return immediately.
>>> - A periodic flusher flushes all the stats every 2 seconds.
>>>
>>> The reason this approach is followed is because all flushes are
>>> serialized by a global rstat spinlock. On the memcg side, flushing is
>>> invoked from userspace reads as well as in-kernel flushers (e.g.
>>> reclaim, refault, etc). This approach aims to avoid serializing all
>>> flushers on the global lock, which can cause a significant performance
>>> hit under high concurrency.
>>>
>>> This approach has the following problems:
>>> - Occasionally a userspace read of the stats of a non-root cgroup will
>>>     be too expensive as it has to flush the entire hierarchy [1].
>>> - Sometimes the stats accuracy are compromised if there is an ongoing
>>>     flush, and we skip and return before the subtree of interest is
>>>     actually flushed. This is more visible when reading stats from
>>>     userspace, but can also affect in-kernel flushers.
>>>
>>> This patch aims to solve both problems by reworking how flushing
>>> currently works as follows:
>>> - Without contention, there is no need to flush the entire tree. In this
>>>     case, only flush the subtree of interest. This avoids the latency of a
>>>     full root flush if unnecessary.
>>> - With contention, fallback to a coalesced (aka unified) flush of the
>>>     entire hierarchy, a root flush. In this case, instead of returning
>>>     immediately if a root flush is ongoing, wait for it to finish
>>>     *without* attempting to acquire the lock or flush. This is done using
>>>     a completion. Compared to competing directly on the underlying lock,
>>>     this approach makes concurrent flushing a synchronization point
>>>     instead of a serialization point. Once  a root flush finishes, *all*
>>>     waiters can wake up and continue at once.
>>> - Finally, with very high contention, bound the number of waiters to the
>>>     number of online cpus. This keeps the flush latency bounded at the tail
>>>     (very high concurrency). We fallback to sacrificing stats freshness only
>>>     in such cases in favor of performance.
>>>
>>> This was tested in two ways on a machine with 384 cpus:
>>> - A synthetic test with 5000 concurrent workers doing allocations and
>>>     reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>>>     No significant regressions were noticed in the total runtime.
>>>     Note that if concurrent flushers compete directly on the spinlock
>>>     instead of waiting for a completion, this test shows 2x-3x slowdowns.
>>>     Even though subsequent flushers would have nothing to flush, just the
>>>     serialization and lock contention is a major problem. Using a
>>>     completion for synchronization instead seems to overcome this problem.
>>>
>>> - A synthetic stress test for concurrently reading memcg stats provided
>>>     by Wei Xu.
>>>     With 10k threads reading the stats every 100ms:
>>>     - 98.8% of reads take <100us
>>>     - 1.09% of reads take 100us to 1ms.
>>>     - 0.11% of reads take 1ms to 10ms.
>>>     - Almost no reads take more than 10ms.
>>>     With 10k threads reading the stats every 10ms:
>>>     - 82.3% of reads take <100us.
>>>     - 4.2% of reads take 100us to 1ms.
>>>     - 4.7% of reads take 1ms to 10ms.
>>>     - 8.8% of reads take 10ms to 100ms.
>>>     - Almost no reads take more than 100ms.
>>>
>>> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
>>> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
>>> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>>>
>>> [weixugc@google.com: suggested the fallback logic and bounding the
>>> number of waiters]
>>>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>> ---
>>>    include/linux/memcontrol.h |   4 +-
>>>    mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
>>>    mm/vmscan.c                |   2 +-
>>>    mm/workingset.c            |   8 ++-
>>>    4 files changed, 85 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 11810a2cfd2d..4453cd3fc4b8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>>        return x;
>>>    }
>>>
>>> -void mem_cgroup_flush_stats(void);
>>> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>>>    void mem_cgroup_flush_stats_ratelimited(void);
>>>
>>>    void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>>> @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>>        return node_page_state(lruvec_pgdat(lruvec), idx);
>>>    }
>>>
>>> -static inline void mem_cgroup_flush_stats(void)
>>> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>>>    {
>>>    }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index d729870505f1..edff41e4b4e7 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>>>    static void flush_memcg_stats_dwork(struct work_struct *w);
>>>    static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>>>    static DEFINE_PER_CPU(unsigned int, stats_updates);
>>> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>>>    /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
>>>    static atomic_t stats_updates_order = ATOMIC_INIT(0);
>>>    static u64 flush_last_time;
>>> @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>>>        }
>>>    }
>>>
>>> -static void do_flush_stats(void)
>>> +/*
>>> + * do_flush_stats - flush the statistics of a memory cgroup and its tree
>>> + * @memcg: the memory cgroup to flush
>>> + * @wait: wait for an ongoing root flush to complete before returning
>>> + *
>>> + * All flushes are serialized by the underlying rstat global lock. If there is
>>> + * no contention, we try to only flush the subtree of the passed @memcg to
>>> + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
>>> + * single flush of the root memcg. When there is an ongoing root flush, we wait
>>> + * for its completion (unless otherwise requested), to get fresh stats. If the
>>> + * number of waiters exceeds the number of cpus just skip the flush to bound the
>>> + * flush latency at the tail with very high concurrency.
>>> + *
>>> + * This is a trade-off between stats accuracy and flush latency.
>>> + */
>>> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
>>>    {
>>> +     static DECLARE_COMPLETION(root_flush_done);
>>> +     static DEFINE_SPINLOCK(root_flusher_lock);
>>> +     static DEFINE_MUTEX(subtree_flush_mutex);
>>> +     static atomic_t waiters = ATOMIC_INIT(0);
>>> +     static bool root_flush_ongoing;
>>> +     bool root_flusher = false;
>>> +
>>> +     /* Ongoing root flush, just wait for it (unless otherwise requested) */
>>> +     if (READ_ONCE(root_flush_ongoing))
>>> +             goto root_flush_or_wait;
>>> +
>>>        /*
>>> -      * We always flush the entire tree, so concurrent flushers can just
>>> -      * skip. This avoids a thundering herd problem on the rstat global lock
>>> -      * from memcg flushers (e.g. reclaim, refault, etc).
>>> +      * Opportunistically try to only flush the requested subtree. Otherwise
>>> +      * fallback to a coalesced flush below.
>>>         */
>>> -     if (atomic_read(&stats_flush_ongoing) ||
>>> -         atomic_xchg(&stats_flush_ongoing, 1))
>>> +     if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
>>> +             cgroup_rstat_flush(memcg->css.cgroup);
>>> +             mutex_unlock(&subtree_flush_mutex);
>>>                return;
>>> +     }
>> If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
>> don't really need a mutex. Just a simple integer flag with xchg() call
>> should be enough.

Equivalently test_and_set_bit() will work too.

Cheers,
Longman

> Thanks for pointing this out. Agreed.
>
> If we keep this approach I will drop that mutex.
>
Yosry Ahmed Sept. 14, 2023, 5:56 p.m. UTC | #10
On Thu, Sep 14, 2023 at 10:36 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Sep 13, 2023 at 12:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Stats flushing for memcg currently follows the following rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is because all flushes are
> > serialized by a global rstat spinlock. On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
>
> This is a real world workload exhibiting the issue which is good.
>
> > - Sometimes the stats accuracy are compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
>
> Please provide similar data/justification for the above. In addition:
>
> 1. How much delayed/stale stats have you observed on real world workload?

I am not really sure. We don't have a wide deployment of kernels with
rstat yet. These are problems observed in testing and/or concerns
expressed by our userspace team.

I am trying to solve this now because any problems that result from
this staleness will be very hard to debug and link back to stale
stats.

>
> 2. What is acceptable staleness in the stats for your use-case?

Again, unfortunately I am not sure, but right now it can be O(seconds)
which is not acceptable as we have workloads querying the stats every
1s (and sometimes more frequently).

>
> 3. What is your use-case?

A few use cases we have that may be affected by this:
- System overhead: calculations using memory.usage and some stats from
memory.stat. If one of them is fresh and the other one isn't we have
an inconsistent view of the system.
- Userspace OOM killing: We use some stats in memory.stat to gauge the
amount of memory that will be freed by killing a task as sometimes
memory.usage includes shared resources that wouldn't be freed anyway.
- Proactive reclaim: we read memory.stat in a proactive reclaim
feedback loop, stale stats may cause us to mistakenly think reclaim is
ineffective and prematurely stop.

>
> 4. Does your use-case care about staleness of all the stats in
> memory.stat or some specific stats?

We have multiple use cases that can be affected by this, so I don't
think there are some specific stats. I am also not aware of all
possibly affected use cases.

>
> 5. If some specific stats in memory.stat, does it make sense to
> decouple them from rstat and just pay the price up front to maintain
> them accurately?
>
> Most importantly please please please be concise in your responses.

I try, sometimes I am not sure how much detail is needed. Sorry about that :)

>
> I know I am going back on some of the previous agreements but this
> whole locking back and forth has made me question the original
> motivation.

That's okay. Taking a step back, having flushing being indeterministic
in this way is a time bomb in my opinion. Note that this also affects
in-kernel flushers like reclaim or dirty isolation, which I suspect
will be more affected by staleness. No one complained yet AFAICT, but
I think it's a time bomb. The worst part is that if/when a problem
happens, we won't be able to easily tell what went wrong.
Shakeel Butt Sept. 14, 2023, 10:58 p.m. UTC | #11
On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
[...]
> >
> > 1. How much delayed/stale stats have you observed on real world workload?
> 
> I am not really sure. We don't have a wide deployment of kernels with
> rstat yet. These are problems observed in testing and/or concerns
> expressed by our userspace team.
> 

Why is sleep(2) not good enough for the tests?

> I am trying to solve this now because any problems that result from
> this staleness will be very hard to debug and link back to stale
> stats.
> 

I think first you need to show if this (2 sec stale stats) is really a
problem.

> >
> > 2. What is acceptable staleness in the stats for your use-case?
> 
> Again, unfortunately I am not sure, but right now it can be O(seconds)
> which is not acceptable as we have workloads querying the stats every
> 1s (and sometimes more frequently).
> 

It is 2 seconds in most cases and if it is higher, the system is already
in bad shape. O(seconds) seems more dramatic. So, why 2 seconds
staleness is not acceptable? Is 1 second acceptable? or 500 msec? Let's
look at the use-cases below.

> >
> > 3. What is your use-case?
> 
> A few use cases we have that may be affected by this:
> - System overhead: calculations using memory.usage and some stats from
> memory.stat. If one of them is fresh and the other one isn't we have
> an inconsistent view of the system.
> - Userspace OOM killing: We use some stats in memory.stat to gauge the
> amount of memory that will be freed by killing a task as sometimes
> memory.usage includes shared resources that wouldn't be freed anyway.
> - Proactive reclaim: we read memory.stat in a proactive reclaim
> feedback loop, stale stats may cause us to mistakenly think reclaim is
> ineffective and prematurely stop.
> 

I don't see why userspace OOM killing and proactive reclaim need
subsecond accuracy. Please explain. Same for system overhead but I can
see the complication of two different sources for stats. Can you provide
the formula of system overhead? I am wondering why do you need to read
stats from memory.stat files. Why not the memory.current of top level
cgroups and /proc/meminfo be enough. Something like:

Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)

> >
> > I know I am going back on some of the previous agreements but this
> > whole locking back and forth has made me question the original
> > motivation.
> 
> That's okay. Taking a step back, having flushing being indeterministic

I would say at most 2 seconds stale instead of indeterministic.

> in this way is a time bomb in my opinion. Note that this also affects
> in-kernel flushers like reclaim or dirty isolation

Fix the in-kernel flushers separately. Also the problem Cloudflare is
facing does not need to be tied with this.
Yosry Ahmed Sept. 14, 2023, 11:30 p.m. UTC | #12
On Thu, Sep 14, 2023 at 3:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
> [...]
> > >
> > > 1. How much delayed/stale stats have you observed on real world workload?
> >
> > I am not really sure. We don't have a wide deployment of kernels with
> > rstat yet. These are problems observed in testing and/or concerns
> > expressed by our userspace team.
> >
>
> Why is sleep(2) not good enough for the tests?

The problem is not making the tests pass. The tests are just a signal.

>
> > I am trying to solve this now because any problems that result from
> > this staleness will be very hard to debug and link back to stale
> > stats.
> >
>
> I think first you need to show if this (2 sec stale stats) is really a
> problem.

That's the thing, my main concern is that if this causes a problem, we
probably won't be able to tell it was because of stale stats. It's
very hard to make that connection.

Pre-rstat, reading stats would always yield fresh stats (as much as
possible). Now the stats can be up to 2s stale, and we don't really
know how this will affect our existing workloads.

>
> > >
> > > 2. What is acceptable staleness in the stats for your use-case?
> >
> > Again, unfortunately I am not sure, but right now it can be O(seconds)
> > which is not acceptable as we have workloads querying the stats every
> > 1s (and sometimes more frequently).
> >
>
> It is 2 seconds in most cases and if it is higher, the system is already
> in bad shape. O(seconds) seems more dramatic. So, why 2 seconds
> staleness is not acceptable? Is 1 second acceptable? or 500 msec? Let's
> look at the use-cases below.
>
> > >
> > > 3. What is your use-case?
> >
> > A few use cases we have that may be affected by this:
> > - System overhead: calculations using memory.usage and some stats from
> > memory.stat. If one of them is fresh and the other one isn't we have
> > an inconsistent view of the system.
> > - Userspace OOM killing: We use some stats in memory.stat to gauge the
> > amount of memory that will be freed by killing a task as sometimes
> > memory.usage includes shared resources that wouldn't be freed anyway.
> > - Proactive reclaim: we read memory.stat in a proactive reclaim
> > feedback loop, stale stats may cause us to mistakenly think reclaim is
> > ineffective and prematurely stop.
> >
>
> I don't see why userspace OOM killing and proactive reclaim need
> subsecond accuracy. Please explain.

For proactive reclaim it is not about sub-second accuracy. It is about
doing the reclaim then reading the stats immediately to see the
effect. Naturally one would expect that a stat read after reclaim
would show the system state after reclaim.
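
To make that expectation concrete, the userspace pattern is roughly the
following (illustrative sketch only; the cgroup path is made up):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[8192];
          ssize_t n;
          int fd;

          /* Trigger proactive reclaim (hypothetical cgroup path) */
          fd = open("/sys/fs/cgroup/workload/memory.reclaim", O_WRONLY);
          if (fd < 0)
                  return 1;
          if (write(fd, "100M", strlen("100M")) < 0)
                  return 1;
          close(fd);

          /*
           * The expectation is that this read reflects the state *after*
           * the reclaim above, not stats that predate it.
           */
          fd = open("/sys/fs/cgroup/workload/memory.stat", O_RDONLY);
          if (fd < 0)
                  return 1;
          n = read(fd, buf, sizeof(buf) - 1);
          close(fd);
          if (n > 0) {
                  buf[n] = '\0';
                  fputs(buf, stdout);
          }
          return 0;
  }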

For userspace OOM killing I am not really sure. It depends on how
dynamic the workload is. If a task recently had a spike in memory
usage causing a threshold to be hit, userspace can kill a different
task if the stats are stale.

I think the whole point is *not* about the amount of staleness. It is
more about that you expect a stats read after an event to reflect the
system state after the event. Whether this event is proactive reclaim
or a spike in memory usage or something else.

As Tejun mentioned previously [1]: "The only guarantee you need is
that there has been at least one flush since
the read attempt started".

[1] https://lore.kernel.org/lkml/ZP92xP5rdKdeps7Z@mtj.duckdns.org/

> Same for system overhead but I can
> see the complication of two different sources for stats. Can you provide
> the formula of system overhead? I am wondering why do you need to read
> stats from memory.stat files. Why not the memory.current of top level
> cgroups and /proc/meminfo be enough. Something like:
>
> Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)

We use the amount of compressed memory in zswap from memory.stat,
which is not accounted as memory usage in cgroup v1.

>
> > >
> > > I know I am going back on some of the previous agreements but this
> > > whole locking back and forth has made me question the original
> > > motivation.
> >
> > That's okay. Taking a step back, having flushing being indeterministic
>
> I would say at most 2 seconds stale instead of indeterministic.

Ack.

>
> > in this way is a time bomb in my opinion. Note that this also affects
> > in-kernel flushers like reclaim or dirty isolation
>
> Fix the in-kernel flushers separately.

The in-kernel flushers are basically facing the same problem. For
instance, reclaim would expect a stats read after a reclaim iteration
to reflect the system state after the reclaim iteration.

> Also the problem Cloudflare is facing does not need to be tied with this.

When we try to wait for flushing to complete we run into the same
latency problem of the root flush.
Shakeel Butt Sept. 15, 2023, 1:01 a.m. UTC | #13
On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote:
[...]
> >
> > I think first you need to show if this (2 sec stale stats) is really a
> > problem.
> 
> That's the thing, my main concern is that if this causes a problem, we
> probably won't be able to tell it was because of stale stats. It's
> very hard to make that connection.
> 

Please articulate what the problem would look like, which you did in the
use-cases description below; let's discuss there.

> Pre-rstat, reading stats would always yield fresh stats (as much as
> possible). Now the stats can be up to 2s stale, and we don't really
> know how this will affect our existing workloads.
> 

Pre-rstat the stat read would traverse the memcg tree. With rstat
the tradeoff was made between expensive read and staleness.
Yeah there
might still be memcg update tree traversal which I would like to remove
completely. However you are saying to 

[...]
> >
> > I don't see why userspace OOM killing and proactive reclaim need
> > subsecond accuracy. Please explain.
> 
> For proactive reclaim it is not about sub-second accuracy. It is about
> doing the reclaim then reading the stats immediately to see the
> effect. Naturally one would expect that a stat read after reclaim
> would show the system state after reclaim.
> 
> For userspace OOM killing I am not really sure. It depends on how
> dynamic the workload is. If a task recently had a spike in memory
> usage causing a threshold to be hit, userspace can kill a different
> task if the stats are stale.
> 

Please add above reasoning in your commit message (though I am not
convinced but let's leave it at that).

> I think the whole point is *not* about the amount of staleness. It is
> more about that you expect a stats read after an event to reflect the
> system state after the event.

The whole point is to understand the tradeoff between accuracy and cost
of accuracy. I don't think you want to pay the cost of strong
consistency/ordering between stats reading and an event. My worry is
that you are enforcing a tradeoff which *might* be just applicable to
your use-cases. Anyways this is not something that can not be changed
later.

> 
> > Same for system overhead but I can
> > see the complication of two different sources for stats. Can you provide
> > the formula of system overhead? I am wondering why do you need to read
> > stats from memory.stat files. Why not the memory.current of top level
> > cgroups and /proc/meminfo be enough. Something like:
> >
> > Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
> 
> We use the amount of compressed memory in zswap from memory.stat,
> which is not accounted as memory usage in cgroup v1.
> 

There are zswap stats in /proc/meminfo. Will those work for you?

[...]
> > Fix the in-kernel flushers separately.
> 
> The in-kernel flushers are basically facing the same problem. For
> instance, reclaim would expect a stats read after a reclaim iteration
> to reflect the system state after the reclaim iteration.
> 

I have not seen any complaints about memory reclaim recently. Maybe
reclaim does not really need such accuracy :P

> > Also the problem Cloudflare is facing does not need to be tied with this.
> 
> When we try to wait for flushing to complete we run into the same
> latency problem of the root flush.

Not sure what wait for flushing has to do with Cloudflare's report. They
are ok with no sync flushing at all on stat reads.
Yosry Ahmed Sept. 19, 2023, 5:29 a.m. UTC | #14
On Thu, Sep 14, 2023 at 6:01 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote:
> [...]
> > >
> > > I think first you need to show if this (2 sec stale stats) is really a
> > > problem.
> >
> > That's the thing, my main concern is that if this causes a problem, we
> > probably won't be able to tell it was because of stale stats. It's
> > very hard to make that connection.
> >
>
> Please articulate what the problem would look like which you did in the
> use-cases description below, let's discuss there.
>
> > Pre-rstat, reading stats would always yield fresh stats (as much as
> > possible). Now the stats can be up to 2s stale, and we don't really
> > know how this will affect our existing workloads.
> >
>
> Pre-rstat the stat read would traverse the memcg tree. With rstat
> the tradeoff was made between expensive read and staleness.
> Yeah there
> might still be memcg update tree traversal which I would like to remove
> completely. However you are saying to

I think this sentence is truncated.

>
> [...]
> > >
> > > I don't see why userspace OOM killing and proactive reclaim need
> > > subsecond accuracy. Please explain.
> >
> > For proactive reclaim it is not about sub-second accuracy. It is about
> > doing the reclaim then reading the stats immediately to see the
> > effect. Naturally one would expect that a stat read after reclaim
> > would show the system state after reclaim.
> >
> > For userspace OOM killing I am not really sure. It depends on how
> > dynamic the workload is. If a task recently had a spike in memory
> > usage causing a threshold to be hit, userspace can kill a different
> > task if the stats are stale.
> >
>
> Please add above reasoning in your commit message (though I am not
> convinced but let's leave it at that).

Will do in the next version, thanks.

>
> > I think the whole point is *not* about the amount of staleness. It is
> > more about that you expect a stats read after an event to reflect the
> > system state after the event.
>
> The whole point is to understand the tradeoff between accuracy and cost
> of accuracy. I don't think you want to pay the cost of strong
> consistency/ordering between stats reading and an event. My worry is
> that you are enforcing a tradeoff which *might* be just applicable to
> your use-cases. Anyways this is not something that can not be changed
> later.

Given the numbers I got with the patch, it doesn't seem like we are
paying a significant cost for the accuracy. Anyway, as you say, it's
not something that can not be changed. In fact, I have another
proposal that I am currently testing, please see my next response to
Johannes.

>
> >
> > > Same for system overhead but I can
> > > see the complication of two different sources for stats. Can you provide
> > > the formula of system overhead? I am wondering why do you need to read
> > > stats from memory.stat files. Why not the memory.current of top level
> > > cgroups and /proc/meminfo be enough. Something like:
> > >
> > > Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
> >
> > We use the amount of compressed memory in zswap from memory.stat,
> > which is not accounted as memory usage in cgroup v1.
> >
>
> There are zswap stats in /proc/meminfo. Will those work for you?

Yeah this should work for this specific use case, thanks.

>
> [...]
> > > Fix the in-kernel flushers separately.
> >
> > The in-kernel flushers are basically facing the same problem. For
> > instance, reclaim would expect a stats read after a reclaim iteration
> > to reflect the system state after the reclaim iteration.
> >
>
> I have not seen any complaints about memory reclaim recently. Maybe
> reclaim does not really need such accuracy :P

Perhaps, it's full of heuristics anyway :)

>
> > > Also the problem Cloudflare is facing does not need to be tied with this.
> >
> > When we try to wait for flushing to complete we run into the same
> > latency problem of the root flush.
>
> Not sure what wait for flushing has to do with Cloudflare's report. They
> are ok with no sync flushing at all on stat reads.

Oh I am not saying the wait benefits their use case. I am saying when
the wait is implemented, we face the same problem (expensive flush of
the entire hierarchy), so we need to mitigate it anyway -- hence the
relevance to Cloudflare's use case.

Anyway, I have an alternative that I will propose shortly in response
to Johannes's reply.
Yosry Ahmed Sept. 19, 2023, 5:46 a.m. UTC | #15
On Thu, Sep 14, 2023 at 10:22 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > > -
> > > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > > > > +     /* A coalesced root flush is in order. Are we the designated flusher? */
> > > > > +     spin_lock(&root_flusher_lock);
> > > >
> > > > I can't say I'm crazy about this.
> > > >
> > > > Let's say you have a fairly sprawling and active cgroup tree with lots
> > > > of updates in lots of groups in the system.
> > > >
> > > > Add a periodic memory.stat reader to one of the subgroups, and that
> > > > reader will do very fast, localized aggregation.
> > > >
> > > > Now add a periodic memory.stat reader to some other subgroup. They
> > > > might still both do very fast, localized aggregation. Or they might
> > > > collide; and now - despite having nothing in common, and sharing no
> > > > data besides the rstat lock - will have to flush the entire massive
> > > > tree. The rate at which this happens is purely dependent on timing of
> > > > what should be independent actors. This is really not great for the
> > > > p50 vs p99 gap.
> > >
> > > Initially I had a few retry iterations for the subtree flush, where we
> > > fsleep for a bit and trylock again. I thought it was a little bit too
> > > complicated and the fsleep duration would be a magic value.
> >
> > Hm, how is that different than a mutex / sleepable lock?
>
> It is essentially a lock with a timeout, which IIUC we don't support
> explicitly in the kernel. What I was trying to do was basically to try
> and do a subtree flush, but if we are waiting for too long then a lot
> of people are probably flushing, so let's all switch to a root flush
> and wait for one flusher instead of contending among ourselves.
>
> >
> > > > I think this whole thing is getting away from us.
> > > >
> > > > When we first switched to rstat, every reader would take the global
> > > > rstat lock and then flush its local subtree of interest.
> > > >
> > > > This was changed in the following commit:
> > > >
> > > > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> > > > Author: Shakeel Butt <shakeelb@google.com>
> > > > Date:   Fri Nov 5 13:37:34 2021 -0700
> > > >
> > > >     memcg: unify memcg stat flushing
> > > >
> > > >     The memcg stats can be flushed in multiple context and potentially in
> > > >     parallel too.  For example multiple parallel user space readers for
> > > >     memcg stats will contend on the rstat locks with each other.  There is
> > > >     no need for that.  We just need one flusher and everyone else can
> > > >     benefit.
> > > >
> > > >     In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
> > > >     stats") the kernel periodically flush the memcg stats from the root, so,
> > > >     the other flushers will potentially have much less work to do.
> > > >
> > > >     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
> > > >     Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > >     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >     Cc: Michal Hocko <mhocko@kernel.org>
> > > >     Cc: "Michal Koutný" <mkoutny@suse.com>
> > > >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > >
> > > > The idea was that we can avoid lock contention if somebody is already
> > > > doing the flushing. However, you're now bringing global serialization.
> > > > Clearly that idea didn't work out. What would be an obstacle to go
> > > > back to the original way of doing it?
> > >
> > > The obstacle is high concurrency among flushers. A good example is
> > > reclaim code, we can have a lot of concurrent reclaimers. When I tried
> > > to go back to the original way of doing it, a stress test with high
> > > reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think
> > > high concurrency among userspace reads would have a similar outcome,
> > > but I hadn't really checked.
> > >
> > > Basically this patch is trying to switch to root flushing when the
> > > cost of contending on the lock is roughly the same as a root flush (or
> > > waiting for one). It's doing that too eagerly now of course (if
> > > contenders > 1), we can try to calibrate this better.
> >
> > I don't quite understand this.
> >
> > If you have two readers on separate subtrees, why is it cheaper to
> > flush the entire tree in sequence (where both readers don't care about
> > the majority of the data), than having each reader flush their own
> > small subset one after another? It's not like the whole tree flush
> > parallelizes its work.
> >
> > Where is that overhead actually coming from?
>
> If you have N concurrent readers flushing different parts of the
> subtree, at some point the overhead of N sequential subtree flushes
> will exceed the overhead of a single root flush.
>
> Ideally, we would be able to identify this point, and switch to a
> single root flush with everyone else waiting.
>
> >
> > > > With one reader, this will work the same as in your proposal.
> > > >
> > > > With two readers, just like in your proposal, flushing must be
> > > > serialized against the root level. But at least the two flushes only
> > > > aggregate the local data they actually care about - not the entire
> > > > tree data that doesn't even have readers! This is much better for lock
> > > > contention and performance.
> > >
> > > Keep in mind that in my testing, I noticed that synchronization using
> > > a completion is more performant than serialization on a lock. I am
> > > assuming because when we contend on the underlying lock, we serially
> > > wake up and do the flush. Even if there is nothing to do (as you
> > > mention below), we still do this work. On the contrary, in this
> > > proposal we just wait for the root flush to finish and return
> > > immediately, and all waiters return at once (that's a lie because of
> > > scheduling internals).
> >
> > Right, because rstat's do-nothing case is still somewhat costly due to
> > the per-cpu loop and the tree walk.
> >
> > But that should be possible to avoid with the outlined caching of
> > recently-flushed state at the memcg level.
>
> This helps only if concurrent flushers are overlapping, right?
>
> >
> > > Also, in the current code, in the two reader scenario, the first
> > > reader will flush the entire tree anyway. The only difference is that
> > > the second reader will wait for it to finish instead of just skipping,
> > > which is the right thing to do from a correctness point of view. Right
> > > now it's a coin flip on whether you get updated stats if someone else
> > > is already flushing.
> >
> > Agreed, it should wait. My mutex would accomplish this, right?
>
> I think what you're describing here is v4 of the patchset I mention in
> the cover letter:
> https://lore.kernel.org/lkml/20230831165611.2610118-5-yosryahmed@google.com/
>
> The problem with that was that with very high concurrency among
> readers, the read latency is unbounded and can get out of hand. Wei
> shared some numbers in that thread.
>
> What this patch is trying to do is to switch to root flushing if the
> mutex is contended to avoid that scenario.  Also, if we keep acquiring
> more flushers at some point we just skip the flush in favor of
> performance (if the number of waiters exceeds the number of cpus). Now
> that I think about it, perhaps we can just go back to the mutex
> approach w/ limiting the number of waiters, without ever falling back
> to a root flush. Not sure how that would work.
>
> Taking a step back, I think what you are implying is that if we make
> the thresholding code per cpu instead of global, this should mitigate
> the issue in a different way than falling back to root flushing or
> limiting the number of waiters, is that right?
> If yes, I think it will work in some cases where the flushes are
> overlapping as I mentioned above, but not if concurrent readers are
> flushing different parts of the tree. Right?
>
> >
> > > > One concern is the thresholding code. The cost of flushing on every
> > > > read is too high: even when there is no flush work, checking for it is
> > > > kind of expensive due to the locks and the for_each_possible_cpu().
> > > >
> > > > Could we do something like the following?
> > > >
> > > >         mem_cgroup_flush(memcg)
> > > >         {
> > > >                 mutex_lock(&memcg_flush_mutex);
> > > >                 if (atomic_read(&memcg->stat_delta) > THRESHOLD)
> > > >                         cgroup_rstat_flush(memcg);
> > > >                 mutex_unlock(&memcg_flush_mutex);
> > > >         }
> > > >
> > > >         mem_cgroup_css_rstat_flush(css, cpu)
> > > >         {
> > > >                 ...
> > > >                 /*
> > > >                  * Reset here instead of mem_cgroup_flush()
> > > >                  * so that each flushed subgroup is reset
> > > >                  * recursively. A recent parent flush will
> > > >                  * allow a child flush to skip.
> > > >                  */
> > > >                 atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
> > > >         }
> > > >
> > > >         memcg_rstat_updated(memcg, val)
> > > >         {
> > > >                 atomic_add(&memcg->stat_delta, val);
> > >
> > > We need to do this for each parent in the hierarchy, not just the
> > > memcg being updated, right? I guess that may be too expensive for the
> > > update path.
> >
> > How so?
> >
> > We need to mark the subgroups that are flushed, so that if you have
> >
> >         root - a - a1 - foo
> >                 `- a2
> >
> > and somebody flushes a, then a1 and a2 also don't need to be flushed
> > for a while.
> >
> > But if we flush a1 first, foo is aggregated into a1. Reading a still
> > needs to aggregate a1 and a2 into a.
> >
> > Maybe I'm missing something blatant, but I don't see how we have to
> > mark parents in any way. We only have to tag cgroups up to the point
> > to which they were recently flushed, and we already visit all those.
> >
> > Let me know if I'm missing something here.
>
> I think we are talking about different paths. I believe you are
> talking about resetting memcg->stat_delta, which I agree should be
> done during the flush. What I am talking about is updating
> memcg->stat_delta when memcg_rstat_updated() is called. We would need
> to update that for all parents. In your example hierarchy, if stat
> updates happened in a2, and someone tries to flush a, they should be
> aware that there are stats to flush.

Following up on this. I tried a simpler approach where the
stats_updates threshold is broken down to be per-memcg instead of
global. The update path looks like this:

  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          int cpu = smp_processor_id();
          unsigned int x;

          if (!val)
                  return;

          cgroup_rstat_updated(memcg->css.cgroup, cpu);

          for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                  x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
                                            abs(val));

                  if (x < MEMCG_CHARGE_BATCH)
                          continue;

                  /*
                   * If @memcg is already flush-able, increasing stats_updates
                   * is redundant. Avoid the overhead of the atomic update.
                   */
                  if (!memcg_should_flush_stats(memcg))
                          atomic64_add(x, &memcg->vmstats->stats_updates);
                  __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
          }
  }

The flush path looks like this:

  static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
  {
          return atomic64_read(&memcg->vmstats->stats_updates) >
                  MEMCG_CHARGE_BATCH * num_online_cpus();
  }

  static void do_flush_stats(struct mem_cgroup *memcg)
  {
          if (mem_cgroup_is_root(memcg))
                  WRITE_ONCE(flush_last_time, jiffies_64);

          cgroup_rstat_flush(memcg->css.cgroup);
  }

  void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
  {
          static DEFINE_MUTEX(memcg_stats_flush_mutex);

          if (!memcg)
                  memcg = root_mem_cgroup;

          if (!memcg_should_flush_stats(memcg))
                  return;

          mutex_lock(&memcg_stats_flush_mutex);
          /* An overlapping flush may have occurred, check again after locking */
          if (memcg_should_flush_stats(memcg))
                  do_flush_stats(memcg);
          mutex_unlock(&memcg_stats_flush_mutex);
  }
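
For anyone who wants to poke at this outside the kernel, below is a
minimal userspace model of the approach (hypothetical names, a plain
counter standing in for the per-cpu delta, no locking), just to
illustrate how updates propagate up the parent chain and how the
per-memcg threshold lets a subsequent flush attempt skip:

  /* Userspace sketch of per-memcg thresholding; not the kernel code. */
  #include <stdio.h>
  #include <stdlib.h>

  #define BATCH 64    /* stand-in for MEMCG_CHARGE_BATCH */

  struct group {
          const char *name;
          struct group *parent;
          long pending;           /* stand-in for the per-cpu delta */
          long stats_updates;     /* stand-in for vmstats->stats_updates */
  };

  static int should_flush(struct group *g)
  {
          /* the kernel scales the threshold by num_online_cpus() */
          return g->stats_updates > BATCH;
  }

  static void rstat_updated(struct group *g, int val)
  {
          /* propagate the delta to every ancestor, like the parent loop above */
          for (; g; g = g->parent) {
                  g->pending += abs(val);
                  if (g->pending < BATCH)
                          continue;
                  if (!should_flush(g))
                          g->stats_updates += g->pending;
                  g->pending = 0;
          }
  }

  static void maybe_flush(struct group *g)
  {
          if (!should_flush(g)) {
                  printf("skipping %s, not enough pending updates\n", g->name);
                  return;
          }
          /* a real rstat flush would also reset the descendants' counters */
          printf("flushing %s\n", g->name);
          g->stats_updates = 0;
  }

  int main(void)
  {
          struct group root = { "root", NULL, 0, 0 };
          struct group a = { "a", &root, 0, 0 };
          struct group a1 = { "a1", &a, 0, 0 };

          /* updates in a1 make a1, a, and root flush-able */
          for (int i = 0; i < 200; i++)
                  rstat_updated(&a1, 1);

          maybe_flush(&a);        /* flushes: a's counter crossed the threshold */
          maybe_flush(&a);        /* skips: nothing accumulated since the flush */
          return 0;
  }

This is obviously just a toy to reason about the semantics, not a
replacement for testing the real thing.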

I tested this approach on a machine with 384 cpus, using the tests
mentioned in the commit message. The flush latency looks very promising
for both in-kernel flushers and userspace readers under high concurrency.

On the update side, I wanted to check if this introduces a significant
regression, so I ran neper (https://github.com/google/neper) on two
machines running the same kernel with/without the above changes. I
used the tcp_stream test with 100/1000 flows and 50/500 threads. Neper
runs in a cgroup with 4 ancestors (in /$ROOT/a/b/c/d) to
exercise the parent loop. The throughput difference compared to the
base kernel is in the noise:

Base:
100 Flows, 50 Threads (3 runs):
Server: 125140.00 mbps, 122887.50 mbps, 113245.91 mbps (average: 120424.47 mbps)
Client: 133516.86 mbps, 124805.01 mbps, 140530.75 mbps (average: 132950.87 mbps)

1000 Flows, 500 Threads (3 runs):
Server: 116898.27 mbps, 118676.94 mbps, 120715.44 mbps (average: 118763.55 mbps)
Client: 122787.29 mbps, 122280.43 mbps, 126153.22 mbps (average: 123740.31 mbps)

Per-memcg thresholding:
100 Flows, 50 Threads (3 runs):
Server: 128207.34 mbps, 127585.55 mbps, 130776.84 mbps (average: 128856.57 mbps)
Client: 143734.97 mbps, 128632.56 mbps, 131616.10 mbps (average: 134661.21 mbps)

1000 Flows, 500 Threads (3 runs):
Server: 117790.86 mbps, 125489.63 mbps, 124459.77 mbps (average: 122580.09 mbps)
Client: 119257.34 mbps, 124650.42 mbps, 123717.24 mbps (average: 122541.67 mbps)

Looking at perf, the time spent in mem_cgroup_charge_skmem() increases
from 1.2% to 2.2% when running with 100 flows and 50 threads, but
stays almost the same when running with 1000 flows and 500 threads. I
guess at very high concurrency the overhead is negligible compared to
other potential bottlenecks.

I am not sure if that's enough signal that the update side can take
this change, but on the flush side things are looking really promising
with this approach. If the update-side overhead is considered too high,
we can split the extra work between the update and the flush sides:
instead of a single shared atomic per memcg
(memcg->vmstats->stats_updates), we can have N of them and multiplex
cpus onto them. memcg_should_flush_stats() would then iterate N counters
instead of 1 (a rough sketch of this follows below). I'd rather not jump
to such approaches unless they are needed.
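
In case it helps the discussion, here is a rough userspace sketch of
what I mean by that, with made-up names (NR_SHARDS, struct group_stats)
and C11 atomics standing in for atomic64_t; the percpu batching above
is omitted for brevity:

  /* Userspace sketch of sharding the per-memcg update counter; not kernel code. */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define NR_SHARDS 4     /* hypothetical; would need tuning */
  #define BATCH     64    /* stand-in for MEMCG_CHARGE_BATCH */
  #define NR_CPUS   8     /* stand-in for num_online_cpus() */

  struct group_stats {
          /* N counters instead of one shared atomic; cpus are multiplexed onto them */
          atomic_long updates[NR_SHARDS];
  };

  /* update side: each cpu only contends on its own shard */
  static void stats_updated(struct group_stats *s, int cpu, long val)
  {
          atomic_fetch_add_explicit(&s->updates[cpu % NR_SHARDS], val,
                                    memory_order_relaxed);
  }

  /* flush side: sum the shards and compare against the global threshold */
  static bool should_flush(struct group_stats *s)
  {
          long total = 0;

          for (int i = 0; i < NR_SHARDS; i++)
                  total += atomic_load_explicit(&s->updates[i],
                                                memory_order_relaxed);
          return total > (long)BATCH * NR_CPUS;
  }

  int main(void)
  {
          static struct group_stats s;

          for (int cpu = 0; cpu < NR_CPUS; cpu++)
                  stats_updated(&s, cpu, 65);

          printf("should_flush: %d\n", should_flush(&s));   /* 65*8 > 64*8 */
          return 0;
  }

The update side would still do a single atomic add (just spread over N
counters to reduce contention), and the flush-side check becomes an
N-element sum, which should still be much cheaper than walking all
possible cpus.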

Any thoughts?

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..4453cd3fc4b8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,7 +1034,7 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
 void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1519,7 +1519,7 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d729870505f1..edff41e4b4e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,6 @@  mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
 static atomic_t stats_updates_order = ATOMIC_INIT(0);
 static u64 flush_last_time;
@@ -639,36 +638,87 @@  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_flush_stats - flush the statistics of a memory cgroup and its tree
+ * @memcg: the memory cgroup to flush
+ * @wait: wait for an ongoing root flush to complete before returning
+ *
+ * All flushes are serialized by the underlying rstat global lock. If there is
+ * no contention, we try to only flush the subtree of the passed @memcg to
+ * minimize the work. Otherwise, we coalesce multiple flushing requests into a
+ * single flush of the root memcg. When there is an ongoing root flush, we wait
+ * for its completion (unless otherwise requested), to get fresh stats. If the
+ * number of waiters exceeds the number of cpus just skip the flush to bound the
+ * flush latency at the tail with very high concurrency.
+ *
+ * This is a trade-off between stats accuracy and flush latency.
+ */
+static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
 {
+	static DECLARE_COMPLETION(root_flush_done);
+	static DEFINE_SPINLOCK(root_flusher_lock);
+	static DEFINE_MUTEX(subtree_flush_mutex);
+	static atomic_t waiters = ATOMIC_INIT(0);
+	static bool root_flush_ongoing;
+	bool root_flusher = false;
+
+	/* Ongoing root flush, just wait for it (unless otherwise requested) */
+	if (READ_ONCE(root_flush_ongoing))
+		goto root_flush_or_wait;
+
 	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 * Opportunistically try to only flush the requested subtree. Otherwise
+	 * fallback to a coalesced flush below.
 	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
+	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
+		cgroup_rstat_flush(memcg->css.cgroup);
+		mutex_unlock(&subtree_flush_mutex);
 		return;
+	}
 
-	WRITE_ONCE(flush_last_time, jiffies_64);
-
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	/* A coalesced root flush is in order. Are we the designated flusher? */
+	spin_lock(&root_flusher_lock);
+	if (!READ_ONCE(root_flush_ongoing)) {
+		reinit_completion(&root_flush_done);
+		/*
+		 * We must reset the completion before setting
+		 * root_flush_ongoing. Otherwise, waiters may call
+		 * wait_for_completion() and immediately return before we flush.
+		 */
+		smp_wmb();
+		WRITE_ONCE(root_flush_ongoing, true);
+		root_flusher = true;
+	}
+	spin_unlock(&root_flusher_lock);
 
-	atomic_set(&stats_updates_order, 0);
-	atomic_set(&stats_flush_ongoing, 0);
+root_flush_or_wait:
+	if (root_flusher) {
+		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+		complete_all(&root_flush_done);
+		atomic_set(&stats_updates_order, 0);
+		WRITE_ONCE(flush_last_time, jiffies_64);
+		WRITE_ONCE(root_flush_ongoing, false);
+	} else if (wait && atomic_add_unless(&waiters, 1, num_online_cpus())) {
+		smp_rmb(); /* see smp_wmb() above */
+		wait_for_completion_interruptible(&root_flush_done);
+		atomic_dec(&waiters);
+	}
 }
 
-void mem_cgroup_flush_stats(void)
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 {
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	if (atomic_read(&stats_updates_order) > STATS_FLUSH_THRESHOLD)
-		do_flush_stats();
+		do_flush_stats(memcg, true);
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
 	/* Only flush if the periodic flusher is one full cycle late */
 	if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats(root_mem_cgroup);
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -677,7 +727,7 @@  static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Deliberately ignore stats_updates_order here so that flushing in
 	 * latency-sensitive paths is as cheap as possible.
 	 */
-	do_flush_stats();
+	do_flush_stats(root_mem_cgroup, false);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1577,7 +1627,7 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4019,7 +4069,7 @@  static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4094,7 +4144,7 @@  static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4596,7 +4646,7 @@  void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6606,7 +6656,7 @@  static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
@@ -7764,7 +7814,7 @@  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		mem_cgroup_flush_stats(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7829,8 +7879,10 @@  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	mem_cgroup_flush_stats(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..fc356b9bc003 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@  static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..13cbccf907f1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -519,7 +519,11 @@  void workingset_refault(struct folio *folio, void *shadow)
 		return;
 	}
 
-	/* Flush stats (and potentially sleep) before holding RCU read lock */
+	/*
+	 * Flush stats (and potentially sleep) before holding RCU read lock
+	 * XXX: This can be reworked to pass in a memcg, similar to
+	 * mem_cgroup_flush_stats().
+	 */
 	mem_cgroup_flush_stats_ratelimited();
 
 	rcu_read_lock();
@@ -664,7 +668,7 @@  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats(sc->memcg);
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,