[v4,4/4] mm: memcg: use non-unified stats flushing for userspace reads

Message ID 20230831165611.2610118-5-yosryahmed@google.com (mailing list archive)
State New
Series memcg: non-unified flushing for userspace stats

Commit Message

Yosry Ahmed Aug. 31, 2023, 4:56 p.m. UTC
Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Userspace readers are
not expected to have concurrency comparable to in-kernel flushers, so
serializing them among themselves and against in-kernel flushers
should be okay. Nonetheless, for extra safety, introduce a mutex when
flushing for userspace readers to make sure only a single userspace
reader can compete with in-kernel flushers at a time. This takes away
userspace's ability to directly influence or worsen in-kernel lock
contention.
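
As a concrete example, the userspace read path in question is just a
read of memory.stat (or memory.numa_stat); a minimal sketch of such a
reader, assuming a cgroup v2 mount at /sys/fs/cgroup, looks like:

  /*
   * Minimal sketch of a memory.stat reader; the mount point and cgroup
   * name are assumptions. Each read() of memory.stat goes through
   * memcg_stat_format() and hence the userspace flush path added by
   * this patch.
   */
  #include <stdio.h>

  int main(int argc, char **argv)
  {
          char path[512], line[256];
          FILE *f;

          snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.stat",
                   argc > 1 ? argv[1] : "");
          f = fopen(path, "r");
          if (!f) {
                  perror("fopen");
                  return 1;
          }
          while (fgets(line, sizeof(line), f))
                  fputs(line, stdout);
          fclose(f);
          return 0;
  }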

An alternative is to remove flushing from the stats reading path
completely, and rely on the periodic flusher. This should be accompanied
by making the periodic flushing period tunable, and providing an
interface for userspace to force a flush, following a similar model to
/proc/vmstat. However, such a change would be hard to reverse if the
implementation later needs to change, because:
- Reading the stats will become very cheap, and we won't be able to
  take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first and revisit
as needed.

This was tested on a machine with 256 cpus by running a synthetic test
script [2] that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No significant regressions were observed in the
total run time, which means that userspace readers are not significantly
affecting in-kernel flushers:

Base (mm-unstable):

real	0m22.500s
user	0m9.399s
sys	73m41.381s

real	0m22.749s
user	0m15.648s
sys	73m13.113s

real	0m22.466s
user	0m10.000s
sys	73m11.933s

With this patch:

real	0m23.092s
user	0m10.110s
sys	75m42.774s

real	0m22.277s
user	0m10.443s
sys	72m7.182s

real	0m24.127s
user	0m12.617s
sys	78m52.765s

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

Comments

Michal Hocko Sept. 4, 2023, 3:15 p.m. UTC | #1
On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> Unified flushing allows for great concurrency for paths that attempt to
> flush the stats, at the expense of potential staleness and a single
> flusher paying the extra cost of flushing the full tree.
> 
> This tradeoff makes sense for in-kernel flushers that may observe high
> concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> may be unexpected and problematic, especially when such stats are used
> for critical paths such as userspace OOM handling. Additionally, a
> userspace reader will occasionally pay the cost of flushing the entire
> hierarchy, which also causes problems in some cases [1].
> 
> Opt userspace reads out of unified flushing. This makes the cost of
> reading the stats more predictable (proportional to the size of the
> subtree), as well as the freshness of the stats. Userspace readers are
> not expected to have similar concurrency to in-kernel flushers,
> serializing them among themselves and among in-kernel flushers should be
> okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> userspace readers to make sure only a single userspace reader can compete
> with in-kernel flushers at a time. This takes away userspace ability to
> directly influence or hurt in-kernel lock contention.

I think it would be helpful to note that the primary reason this is a
concern is that the spinlock is dropped during flushing under
contention.
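
For reference, the flush loop yields the global rstat lock between
per-CPU passes when it is contended; a simplified sketch of that
behavior (the real code lives in kernel/cgroup/rstat.c and differs in
detail):

  /* Simplified sketch only; not the exact upstream code. */
  static void cgroup_rstat_flush_locked_sketch(struct cgroup *cgrp, bool may_sleep)
  {
          int cpu;

          lockdep_assert_held(&cgroup_rstat_lock);

          for_each_possible_cpu(cpu) {
                  /* ... pop and flush this CPU's list of updated cgroups ... */

                  /*
                   * Under contention, drop the global lock and yield, then
                   * retake it. Because of this, a single flush can take a
                   * long time when many flushers interleave, which is why
                   * the patch gates userspace readers behind a mutex.
                   */
                  if (may_sleep && (need_resched() ||
                                    spin_needbreak(&cgroup_rstat_lock))) {
                          spin_unlock_irq(&cgroup_rstat_lock);
                          if (!cond_resched())
                                  cpu_relax();
                          spin_lock_irq(&cgroup_rstat_lock);
                  }
          }
  }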

> An alternative is to remove flushing from the stats reading path
> completely, and rely on the periodic flusher. This should be accompanied
> by making the periodic flushing period tunable, and providing an
> interface for userspace to force a flush, following a similar model to
> /proc/vmstat. However, such a change will be hard to reverse if the
> implementation needs to be changed because:
> - The cost of reading stats will be very cheap and we won't be able to
>   take that back easily.
> - There are user-visible interfaces involved.
> 
> Hence, let's go with the change that's most reversible first and revisit
> as needed.
> 
> This was tested on a machine with 256 cpus by running a synthetic test
> script [2] that creates 50 top-level cgroups, each with 5 children (250
> leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> unified flusher). Concurrently, one thread is spawned per-cgroup to read
> the stats every second (including root, top-level, and leaf cgroups --
> so total 251 threads). No significant regressions were observed in the
> total run time, which means that userspace readers are not significantly
> affecting in-kernel flushers:
> 
> Base (mm-unstable):
> 
> real	0m22.500s
> user	0m9.399s
> sys	73m41.381s
> 
> real	0m22.749s
> user	0m15.648s
> sys	73m13.113s
> 
> real	0m22.466s
> user	0m10.000s
> sys	73m11.933s
> 
> With this patch:
> 
> real	0m23.092s
> user	0m10.110s
> sys	75m42.774s
> 
> real	0m22.277s
> user	0m10.443s
> sys	72m7.182s
> 
> real	0m24.127s
> user	0m12.617s
> sys	78m52.765s
> 
> [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

OK, I can live with that but I still believe that locking involved in
the user interface only begs for issues later on as there is no control
over that lock contention other than the number of processes involved.
As it seems that we cannot reach a consensus on this concern now, and
this should already be helping existing workloads, let's just buy some
more time ;)

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94d5a6751a9e..46a7abf71c73 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static DEFINE_MUTEX(stats_user_flush_mutex);
>  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
> @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
>  	cgroup_rstat_flush(memcg->css.cgroup);
>  }
>  
> +/*
> + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> + * @memcg: memory cgroup to flush
> + *
> + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> + * the global rstat spinlock. This protects in-kernel flushers from userspace
> + * readers hogging the lock.

readers hogging the lock as do_stats_flush drops the spinlock under
contention.

> + */
> +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> +{
> +	mutex_lock(&stats_user_flush_mutex);
> +	do_stats_flush(memcg);
> +	mutex_unlock(&stats_user_flush_mutex);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
Yosry Ahmed Sept. 5, 2023, 3:57 p.m. UTC | #2
On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real  0m22.500s
> > user  0m9.399s
> > sys   73m41.381s
> >
> > real  0m22.749s
> > user  0m15.648s
> > sys   73m13.113s
> >
> > real  0m22.466s
> > user  0m10.000s
> > sys   73m11.933s
> >
> > With this patch:
> >
> > real  0m23.092s
> > user  0m10.110s
> > sys   75m42.774s
> >
> > real  0m22.277s
> > user  0m10.443s
> > sys   72m7.182s
> >
> > real  0m24.127s
> > user  0m12.617s
> > sys   78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

I agree, let's fix problems if/when they arise, maybe it will be just fine :)

I will send a v5 collecting Ack's and augmenting the changelog and
comment below as you suggested (probably after we resolve patch 3).

>
> Thanks!
>
> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >       cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +     mutex_lock(&stats_user_flush_mutex);
> > +     do_stats_flush(memcg);
> > +     mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> --
> Michal Hocko
> SUSE Labs
Wei Xu Sept. 8, 2023, 12:52 a.m. UTC | #3
On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real  0m22.500s
> > user  0m9.399s
> > sys   73m41.381s
> >
> > real  0m22.749s
> > user  0m15.648s
> > sys   73m13.113s
> >
> > real  0m22.466s
> > user  0m10.000s
> > sys   73m11.933s
> >
> > With this patch:
> >
> > real  0m23.092s
> > user  0m10.110s
> > sys   75m42.774s
> >
> > real  0m22.277s
> > user  0m10.443s
> > sys   72m7.182s
> >
> > real  0m24.127s
> > user  0m12.617s
> > sys   78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)

Indeed, even though the new global mutex protects the kernel from
userspace contention on the rstat spinlock, it does nothing to limit
the lock contention among the userspace threads themselves and can
cause significant delays to memcg stats reads.

I tested this patch on a machine with 384 CPUs using a microbenchmark
that spawns 10K threads, each reading its memory.stat every 100
milliseconds.  Most memory.stat reads take 5ms-10ms in the kernel, with
~5% of reads even exceeding 1 second. This is a significant regression.
In comparison, without contention, each memory.stat read only takes
20us-50us in the kernel.  Almost all of the extra read time is spent
waiting for the new mutex; the rstat flush itself only accounts for
10us-50us (this test creates only 1K memory cgroups and doesn't
generate any load other than these stat reader threads).
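
A rough sketch of that kind of reader microbenchmark (pthread-based;
the paths, cgroup layout, and constants below are assumptions, not the
actual test):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  #define NR_READERS      10000   /* assumption: 10K reader threads */
  #define NR_CGROUPS      1000    /* assumption: 1K memory cgroups */

  /* Each thread reads one cgroup's memory.stat every 100ms and reports
   * reads that take longer than 10ms.
   */
  static void *reader(void *arg)
  {
          char path[256], buf[16384];
          struct timespec t0, t1;
          long ms;
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/fs/cgroup/test/cg%ld/memory.stat",
                   (long)arg % NR_CGROUPS);
          for (;;) {
                  clock_gettime(CLOCK_MONOTONIC, &t0);
                  f = fopen(path, "r");
                  if (f) {
                          while (fread(buf, 1, sizeof(buf), f) > 0)
                                  ;
                          fclose(f);
                  }
                  clock_gettime(CLOCK_MONOTONIC, &t1);
                  ms = (t1.tv_sec - t0.tv_sec) * 1000 +
                       (t1.tv_nsec - t0.tv_nsec) / 1000000;
                  if (ms > 10)
                          fprintf(stderr, "slow read: %ld ms\n", ms);
                  usleep(100 * 1000);
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t tids[NR_READERS];
          long i;

          for (i = 0; i < NR_READERS; i++)
                  pthread_create(&tids[i], NULL, reader, (void *)i);
          for (i = 0; i < NR_READERS; i++)
                  pthread_join(tids[i], NULL);
          return 0;
  }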

Here are some ideas to control the lock contention on the mutex and
reduce both the median and tail latencies of concurrent memcg stat
reads:

- Bring the stats_flush_threshold check from
mem_cgroup_try_flush_stats() into mem_cgroup_user_flush_stats().  This
check provides a reasonable bound on stats staleness while filtering
out a large number of rstat flush requests, which reduces the
contention on the mutex.

- When contended, upgrade the per-memcg rstat flush in
mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
contended flushes together.  We can wait for the ongoing flush to
complete and eliminate repeated flush requests.

- Wait for the mutex and the ongoing flush with a timeout.  We should
not busy-wait, though.  We can bail out and read the stats without a
flush after waiting long enough.  A long-stalled system call is much
worse than somewhat stale stats in these corner cases, in my opinion.
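
A rough sketch of what the first and third ideas could look like on
top of this patch (illustrative only; a kernel mutex has no timed
lock, so the timeout is approximated here with a trylock, and a real
timeout would need something like a semaphore with down_timeout()):

  static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
  {
          /* Idea 1: skip the flush if not enough updates have accumulated. */
          if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
                  return;

          /* Idea 3, approximated: don't stall behind other readers; bail
           * out and serve possibly-stale stats instead of waiting.
           */
          if (!mutex_trylock(&stats_user_flush_mutex))
                  return;

          do_stats_flush(memcg);
          mutex_unlock(&stats_user_flush_mutex);
  }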

Wei Xu

> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>
> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >       cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +     mutex_lock(&stats_user_flush_mutex);
> > +     do_stats_flush(memcg);
> > +     mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> --
> Michal Hocko
> SUSE Labs
>
Ivan Babrou Sept. 8, 2023, 1:02 a.m. UTC | #4
On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc@google.com> wrote:
>
> On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > Unified flushing allows for great concurrency for paths that attempt to
> > > flush the stats, at the expense of potential staleness and a single
> > > flusher paying the extra cost of flushing the full tree.
> > >
> > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > may be unexpected and problematic, especially when such stats are used
> > > for critical paths such as userspace OOM handling. Additionally, a
> > > userspace reader will occasionally pay the cost of flushing the entire
> > > hierarchy, which also causes problems in some cases [1].
> > >
> > > Opt userspace reads out of unified flushing. This makes the cost of
> > > reading the stats more predictable (proportional to the size of the
> > > subtree), as well as the freshness of the stats. Userspace readers are
> > > not expected to have similar concurrency to in-kernel flushers,
> > > serializing them among themselves and among in-kernel flushers should be
> > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > userspace readers to make sure only a single userspace reader can compete
> > > with in-kernel flushers at a time. This takes away userspace ability to
> > > directly influence or hurt in-kernel lock contention.
> >
> > I think it would be helpful to note that the primary reason this is a
> > concern is that the spinlock is dropped during flushing under
> > contention.
> >
> > > An alternative is to remove flushing from the stats reading path
> > > completely, and rely on the periodic flusher. This should be accompanied
> > > by making the periodic flushing period tunable, and providing an
> > > interface for userspace to force a flush, following a similar model to
> > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > implementation needs to be changed because:
> > > - The cost of reading stats will be very cheap and we won't be able to
> > >   take that back easily.
> > > - There are user-visible interfaces involved.
> > >
> > > Hence, let's go with the change that's most reversible first and revisit
> > > as needed.
> > >
> > > This was tested on a machine with 256 cpus by running a synthetic test
> > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > > the stats every second (including root, top-level, and leaf cgroups --
> > > so total 251 threads). No significant regressions were observed in the
> > > total run time, which means that userspace readers are not significantly
> > > affecting in-kernel flushers:
> > >
> > > Base (mm-unstable):
> > >
> > > real  0m22.500s
> > > user  0m9.399s
> > > sys   73m41.381s
> > >
> > > real  0m22.749s
> > > user  0m15.648s
> > > sys   73m13.113s
> > >
> > > real  0m22.466s
> > > user  0m10.000s
> > > sys   73m11.933s
> > >
> > > With this patch:
> > >
> > > real  0m23.092s
> > > user  0m10.110s
> > > sys   75m42.774s
> > >
> > > real  0m22.277s
> > > user  0m10.443s
> > > sys   72m7.182s
> > >
> > > real  0m24.127s
> > > user  0m12.617s
> > > sys   78m52.765s
> > >
> > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > OK, I can live with that but I still believe that locking involved in
> > the user interface only begs for issues later on as there is no control
> > over that lock contention other than the number of processes involved.
> > As it seems that we cannot make a consensus on this concern now and this
> > should be already helping existing workloads then let's just buy some
> > more time ;)
>
> Indeed, even though the new global mutex protects the kernel from the
> userspace contention on the rstats spinlock, its current
> implementation doesn't have any protection for the lock contention
> among the userspace threads and can cause significant delays to memcg
> stats reads.
>
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second. This is a significant regression.
> In comparison, without contention, each memory.stat read only takes
> 20us-50us in the kernel.  Almost all of the extra read time is spent
> on waiting for the new mutex. The time to flush rstats only accounts
> for 10us-50us (This test creates only 1K memory cgroups and doesn't
> generate any loads other than these stat reader threads).
>
>  Here are some ideas to control the lock contention on the mutex and
> reduce both the median and tail latencies of concurrent memcg stat
> reads:
>
> - Bring back the stats_flush_threshold check in
> mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
> check provides a reasonable bound on the stats staleness while being
> able to filter out a large number of rstats flush requests, which
> reduces the contention on the mutex.
>
> - When contended, upgrade the per-memcg rstats flush in
> mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> contended flushes together.  We can wait for the ongoing flush to
> complete and eliminate repeated flush requests.

Full root memcg flush being slow is one of the issues that prompted this patch:

* https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

I don't want us to regress in this regard.

> - Wait for the mutex and the ongoing flush with a timeout.  We should
> not use busy-wait, though.  We can bail out to read the stats without
> a flush after enough wait.  A long-stalled system call is much worse
> than somewhat stale stats in the corner cases in my opinion.
>
> Wei Xu
>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > Thanks!
> >
> > > ---
> > >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> > >  1 file changed, 20 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 94d5a6751a9e..46a7abf71c73 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> > >  static void flush_memcg_stats_dwork(struct work_struct *w);
> > >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > > +static DEFINE_MUTEX(stats_user_flush_mutex);
> > >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> > >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> > >  static u64 flush_next_time;
> > > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> > >       cgroup_rstat_flush(memcg->css.cgroup);
> > >  }
> > >
> > > +/*
> > > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > > + * @memcg: memory cgroup to flush
> > > + *
> > > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > > + * readers hogging the lock.
> >
> > readers hogging the lock as do_stats_flush drops the spinlock under
> > contention.
> >
> > > + */
> > > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > > +{
> > > +     mutex_lock(&stats_user_flush_mutex);
> > > +     do_stats_flush(memcg);
> > > +     mutex_unlock(&stats_user_flush_mutex);
> > > +}
> > > +
> > >  /*
> > >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > >   *
> > --
> > Michal Hocko
> > SUSE Labs
> >
Yosry Ahmed Sept. 8, 2023, 1:11 a.m. UTC | #5
On Thu, Sep 7, 2023 at 6:03 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > > Unified flushing allows for great concurrency for paths that attempt to
> > > > flush the stats, at the expense of potential staleness and a single
> > > > flusher paying the extra cost of flushing the full tree.
> > > >
> > > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > > may be unexpected and problematic, especially when such stats are used
> > > > for critical paths such as userspace OOM handling. Additionally, a
> > > > userspace reader will occasionally pay the cost of flushing the entire
> > > > hierarchy, which also causes problems in some cases [1].
> > > >
> > > > Opt userspace reads out of unified flushing. This makes the cost of
> > > > reading the stats more predictable (proportional to the size of the
> > > > subtree), as well as the freshness of the stats. Userspace readers are
> > > > not expected to have similar concurrency to in-kernel flushers,
> > > > serializing them among themselves and among in-kernel flushers should be
> > > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > > userspace readers to make sure only a single userspace reader can compete
> > > > with in-kernel flushers at a time. This takes away userspace ability to
> > > > directly influence or hurt in-kernel lock contention.
> > >
> > > I think it would be helpful to note that the primary reason this is a
> > > concern is that the spinlock is dropped during flushing under
> > > contention.
> > >
> > > > An alternative is to remove flushing from the stats reading path
> > > > completely, and rely on the periodic flusher. This should be accompanied
> > > > by making the periodic flushing period tunable, and providing an
> > > > interface for userspace to force a flush, following a similar model to
> > > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > > implementation needs to be changed because:
> > > > - The cost of reading stats will be very cheap and we won't be able to
> > > >   take that back easily.
> > > > - There are user-visible interfaces involved.
> > > >
> > > > Hence, let's go with the change that's most reversible first and revisit
> > > > as needed.
> > > >
> > > > This was tested on a machine with 256 cpus by running a synthetic test
> > > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > > > the stats every second (including root, top-level, and leaf cgroups --
> > > > so total 251 threads). No significant regressions were observed in the
> > > > total run time, which means that userspace readers are not significantly
> > > > affecting in-kernel flushers:
> > > >
> > > > Base (mm-unstable):
> > > >
> > > > real  0m22.500s
> > > > user  0m9.399s
> > > > sys   73m41.381s
> > > >
> > > > real  0m22.749s
> > > > user  0m15.648s
> > > > sys   73m13.113s
> > > >
> > > > real  0m22.466s
> > > > user  0m10.000s
> > > > sys   73m11.933s
> > > >
> > > > With this patch:
> > > >
> > > > real  0m23.092s
> > > > user  0m10.110s
> > > > sys   75m42.774s
> > > >
> > > > real  0m22.277s
> > > > user  0m10.443s
> > > > sys   72m7.182s
> > > >
> > > > real  0m24.127s
> > > > user  0m12.617s
> > > > sys   78m52.765s
> > > >
> > > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > OK, I can live with that but I still believe that locking involved in
> > > the user interface only begs for issues later on as there is no control
> > > over that lock contention other than the number of processes involved.
> > > As it seems that we cannot make a consensus on this concern now and this
> > > should be already helping existing workloads then let's just buy some
> > > more time ;)
> >
> > Indeed, even though the new global mutex protects the kernel from the
> > userspace contention on the rstats spinlock, its current
> > implementation doesn't have any protection for the lock contention
> > among the userspace threads and can cause significant delays to memcg
> > stats reads.
> >
> > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > that spawns 10K threads, each reading its memory.stat every 100
> > milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
> > ~5% reads even exceeding 1 second. This is a significant regression.
> > In comparison, without contention, each memory.stat read only takes
> > 20us-50us in the kernel.  Almost all of the extra read time is spent
> > on waiting for the new mutex. The time to flush rstats only accounts
> > for 10us-50us (This test creates only 1K memory cgroups and doesn't
> > generate any loads other than these stat reader threads).
> >
> >  Here are some ideas to control the lock contention on the mutex and
> > reduce both the median and tail latencies of concurrent memcg stat
> > reads:


Thanks for the analysis, Wei!

I will update the patch series based on your ideas to limit the
contention on the userspace read mutex.

>
> >
> > - Bring back the stats_flush_threshold check in
> > mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
> > check provides a reasonable bound on the stats staleness while being
> > able to filter out a large number of rstats flush requests, which
> > reduces the contention on the mutex.
> >
> > - When contended, upgrade the per-memcg rstats flush in
> > mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> > contended flushes together.  We can wait for the ongoing flush to
> > complete and eliminate repeated flush requests.
>
> Full root memcg flush being slow is one of the issues that prompted this patch:
>
> * https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
>
> I don't want us to regress in this regard.


The root flush would only be a fallback when there is high concurrency
among userspace reads, which causes high contention on the mutex. In
that case, userspace reads will be slowed down by contention anyway,
which can be even worse than flush slowness since the wait is
theoretically unbounded.

I am working on a v5 now to incorporate Wei's suggestions. Would you
be able to test that and verify that there are no regressions?

>
>
> > - Wait for the mutex and the ongoing flush with a timeout.  We should
> > not use busy-wait, though.  We can bail out to read the stats without
> > a flush after enough wait.  A long-stalled system call is much worse
> > than somewhat stale stats in the corner cases in my opinion.
> >
> > Wei Xu
> >
Michal Hocko Sept. 11, 2023, 1:11 p.m. UTC | #6
On Thu 07-09-23 17:52:12, Wei Xu wrote:
[...]
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.

This is a rather extreme case, but I wouldn't call it utterly insane
though.

> Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second.

Just curious, what would the numbers look like if the mutex were
removed and those threads contended on the existing spinlock, with the
lock dropping in place and with it removed? Would you be willing to
give it a shot?
Wei Xu Sept. 11, 2023, 7:15 p.m. UTC | #7
On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-09-23 17:52:12, Wei Xu wrote:
> [...]
> > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > that spawns 10K threads, each reading its memory.stat every 100
> > milliseconds.
>
> This is rather extreme case but I wouldn't call it utterly insane
> though.
>
> > Most of memory.stat reads take 5ms-10ms in kernel, with
> > ~5% reads even exceeding 1 second.
>
> Just curious, what would numbers look like if the mutex is removed and
> those threads would be condending on the existing spinlock with lock
> dropping in place and removed. Would you be willing to give it a shot?

Without the mutex and with the spinlock only, the common read latency
of memory.stat is still 5ms-10ms in the kernel. There are very few
reads (<0.003%) going above 10ms and none above 1 second.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Sept. 11, 2023, 7:34 p.m. UTC | #8
On Mon 11-09-23 12:15:24, Wei Xu wrote:
> On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 07-09-23 17:52:12, Wei Xu wrote:
> > [...]
> > > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > > that spawns 10K threads, each reading its memory.stat every 100
> > > milliseconds.
> >
> > This is rather extreme case but I wouldn't call it utterly insane
> > though.
> >
> > > Most of memory.stat reads take 5ms-10ms in kernel, with
> > > ~5% reads even exceeding 1 second.
> >
> > Just curious, what would numbers look like if the mutex is removed and
> > those threads would be condending on the existing spinlock with lock
> > dropping in place and removed. Would you be willing to give it a shot?
> 
> Without the mutex and with the spinlock only, the common read latency
> of memory.stat is still 5ms-10ms in kernel. There are very few reads
> (<0.003%) going above 10ms and none more than 1 second.

Is this with the existing spinlock dropping and the same 10k
potentially contending readers?
Wei Xu Sept. 11, 2023, 8:01 p.m. UTC | #9
On Mon, Sep 11, 2023 at 12:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-09-23 12:15:24, Wei Xu wrote:
> > On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 07-09-23 17:52:12, Wei Xu wrote:
> > > [...]
> > > > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > > > that spawns 10K threads, each reading its memory.stat every 100
> > > > milliseconds.
> > >
> > > This is rather extreme case but I wouldn't call it utterly insane
> > > though.
> > >
> > > > Most of memory.stat reads take 5ms-10ms in kernel, with
> > > > ~5% reads even exceeding 1 second.
> > >
> > > Just curious, what would numbers look like if the mutex is removed and
> > > those threads would be condending on the existing spinlock with lock
> > > dropping in place and removed. Would you be willing to give it a shot?
> >
> > Without the mutex and with the spinlock only, the common read latency
> > of memory.stat is still 5ms-10ms in kernel. There are very few reads
> > (<0.003%) going above 10ms and none more than 1 second.
>
> Is this with the existing spinlock dropping and same 10k potentially
> contending readers?

Yes, it is the same test (10K contending readers). The kernel change
is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
so that the concurrent mem_cgroup_user_flush_stats() requests directly
contend on cgroup_rstat_lock in cgroup_rstat_flush().

> --
> Michal Hocko
> SUSE Labs
Tejun Heo Sept. 11, 2023, 8:21 p.m. UTC | #10
Hello,

On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> Yes, it is the same test (10K contending readers). The kernel change
> is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> so that the concurrent mem_cgroup_user_flush_stats() requests directly
> contend on cgroup_rstat_lock in cgroup_rstat_flush().

I don't think it'd be a good idea to twist rstat and other kernel internal
code to accommodate 10k parallel readers. If we want to support that, let's
explicitly support that by implementing better batching in the read path.
The only guarantee you need is that there has been at least one flush since
the read attempt started, so we can do sth like the following in the read
path:

1. Grab a waiter lock. Remember the current timestamp.

2. Trylock the flush mutex. If obtained, drop the waiter lock, flush. Re-grab
   the waiter lock, update the latest flush time to my start time, and wake up
   waiters on the waitqueue (maybe do custom wakeups based on start time?).

3. Release the waiter lock and sleep on the waitqueue.

4. When woken up, re-grab the waiter lock and check whether the latest flush
   timestamp is later than my start time; if so, return the latest result.
   If not, go back to #2.

Maybe the above isn't the best way to do it but you get the general idea.
When you have that many concurrent readers, most of them won't need to
actually flush.
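
A rough sketch of steps 1-4 in code (all names are hypothetical; it
assumes the flush covers the whole tree so one flusher's result can
satisfy every waiter):

  static DEFINE_MUTEX(flush_waiter_lock);
  static DEFINE_MUTEX(flush_mutex);
  static DECLARE_WAIT_QUEUE_HEAD(flush_waitq);
  static u64 last_flush_start;    /* protected by flush_waiter_lock */

  static void batched_user_flush(void)
  {
          u64 my_start;

          mutex_lock(&flush_waiter_lock);
          my_start = get_jiffies_64();    /* 1. remember our start time */
          /* 4. a flush that started at or after my_start is fresh enough */
          while (last_flush_start < my_start) {
                  u64 seen = last_flush_start;

                  /* 2. try to become the flusher */
                  if (mutex_trylock(&flush_mutex)) {
                          mutex_unlock(&flush_waiter_lock);
                          do_stats_flush(root_mem_cgroup);
                          mutex_lock(&flush_waiter_lock);
                          last_flush_start = my_start;
                          mutex_unlock(&flush_mutex);
                          wake_up_all(&flush_waitq);
                          break;
                  }
                  /* 3. sleep until the current flusher publishes its result */
                  mutex_unlock(&flush_waiter_lock);
                  wait_event(flush_waitq,
                             READ_ONCE(last_flush_start) != seen);
                  mutex_lock(&flush_waiter_lock);
          }
          mutex_unlock(&flush_waiter_lock);
  }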

Thanks.
Yosry Ahmed Sept. 11, 2023, 8:28 p.m. UTC | #11
On Mon, Sep 11, 2023 at 1:21 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
>
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers. If we want to support that, let's
> explicitly support that by implementing better batching in the read path.
> The only guarantee you need is that there has been at least one flush since
> the read attempt started, so we can do sth like the following in the read
> path:
>
> 1. Grab a waiter lock. Remember the current timestamp.
>
> 2. Try lock flush mutex. If obtained, drop the waiter lock, flush. Regrab
>    the waiter lock, update the latest flush time to my start time, wake up
>    waiters on the waitqueue (maybe do custom wakeups based on start time?).
>
> 3. Release the waiter lock and sleep on the waitqueue.
>
> 4. When woken up, regarb the waiter lock, compare whether the latest flush
>    timestamp is later than my start time, if so, return the latest result.
>    If not go back to #2.
>
> Maybe the above isn't the best way to do it but you get the general idea.
> When you have that many concurrent readers, most of them won't need to
> actually flush.

I am testing something conceptually similar to this, but it doesn't
depend on timestamps.

I replaced the mutex with a semaphore, and added fallback logic to
unified flushing with a timeout:

  static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
  {
          static DEFINE_SEMAPHORE(user_flush_sem, 1);

          /* Skip the flush if not enough updates have accumulated. */
          if (atomic_read(&stats_flush_order) <= STATS_FLUSH_THRESHOLD)
                  return;

          /* Flush only our subtree, but wait at most 1ms for the chance. */
          if (!down_timeout(&user_flush_sem, msecs_to_jiffies(1))) {
                  do_stats_flush(memcg);
                  up(&user_flush_sem);
          } else {
                  /* Contended: fall back to (waiting for) a unified flush. */
                  do_unified_stats_flush(true);
          }
  }

In do_unified_stats_flush(), I added a wait argument. If set, the
caller waits for any ongoing flusher before returning (but never
attempts the flush itself, so there is no contention on the underlying
rstat lock). I implemented this using completions. I am running some
tests now, but this should make sure userspace read latency is bounded
by 1ms + unified flush time. We basically attempt to flush only our
subtree; if we can't within 1ms, we fall back to unified flushing.
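
A minimal sketch of what that wait argument could look like using a
completion (a guess at the shape, not the code actually being tested):

  static DECLARE_COMPLETION(unified_flush_done);

  static void do_unified_stats_flush(bool wait)
  {
          /* Only one unified flusher at a time; everyone else either skips
           * (wait == false) or sleeps until the ongoing flush finishes.
           */
          if (atomic_xchg(&stats_unified_flush_ongoing, 1)) {
                  if (wait)
                          wait_for_completion(&unified_flush_done);
                  return;
          }

          reinit_completion(&unified_flush_done);
          do_stats_flush(root_mem_cgroup);
          atomic_set(&stats_flush_threshold, 0);
          atomic_set(&stats_unified_flush_ongoing, 0);
          complete_all(&unified_flush_done);
  }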

Another benefit I am seeing here is that I tried switching in-kernel
flushers to also use the completion in do_unified_stats_flush().
Basically, instead of skipping entirely when someone else is flushing,
they just wait for that flush to finish (without being serialized or
contending on the lock). I see no regressions in my parallel reclaim
benchmark. This should make sure no one ever skips a flush, while
still avoiding too much serialization/contention. I suspect this
should make reclaim heuristics (and other in-kernel flushers) slightly
better.

I will run Wei's benchmark next to see how userspace read latency is affected.

>
> Thanks.
>
> --
> tejun
Michal Hocko Sept. 12, 2023, 11:03 a.m. UTC | #12
On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
> 
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers.

I didn't mean to suggest optimizing for this specific scenario. I was
mostly curious whether the pathological case of unbounded high latency
due to lock dropping is easy to trigger with a huge number of readers.
It seems it is not, and the mutex might not really be needed as a
preventive measure.

> If we want to support that, let's
> explicitly support that by implementing better batching in the read path.

Well, we need to be able to handle those situations because stat files
are generally readable and we do not want unrelated workloads to
influence each other heavily through this path.

[...]

> When you have that many concurrent readers, most of them won't need to
> actually flush.

Agreed!
Yosry Ahmed Sept. 12, 2023, 11:09 a.m. UTC | #13
On Tue, Sep 12, 2023 at 4:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > > Yes, it is the same test (10K contending readers). The kernel change
> > > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > > contend on cgroup_rstat_lock in cgroup_rstat_flush().
> >
> > I don't think it'd be a good idea to twist rstat and other kernel internal
> > code to accommodate 10k parallel readers.
>
> I didn't mean to suggest optimizing for this specific scenario. I was
> mostly curious whether the pathological case of unbound high latency due
> to lock dropping is easy to trigger by huge number of readers. It seems
> it is not and the mutex might not be really needed as a prevention.
>
> > If we want to support that, let's
> > explicitly support that by implementing better batching in the read path.
>
> Well, we need to be able to handle those situations because stat files
> are generally readable and we do not want unrelated workloads to
> influence each other heavily through this path.

I am working on a complete rework of this series based on the feedback
I got from Wei and the discussions here. I think I have something
simpler and more generic that doesn't add to the number of flushing
variants we have. I am running some tests right now and will share it
as soon as I can.

It should address the high concurrency use case without adding a lot
of complexity. It basically involves a fast path where we only flush
the needed subtree if there's no contention, and a slow path where we
coalesce all flushing requests, and everyone just waits for a single
flush to complete (without spinning or contending on any locks). I am
trying to use this generic mechanism for both userspace reads and
in-kernel flushers. I am making sure in-kernel flushers do not
regress.

>
> [...]
>
> > When you have that many concurrent readers, most of them won't need to
> > actually flush.
>
> Agreed!
> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94d5a6751a9e..46a7abf71c73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,6 +588,7 @@  mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_MUTEX(stats_user_flush_mutex);
 static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
@@ -655,6 +656,21 @@  static void do_stats_flush(struct mem_cgroup *memcg)
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
+/*
+ * mem_cgroup_user_flush_stats - do a stats flush for a user read
+ * @memcg: memory cgroup to flush
+ *
+ * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
+ * the global rstat spinlock. This protects in-kernel flushers from userspace
+ * readers hogging the lock.
+ */
+static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
+{
+	mutex_lock(&stats_user_flush_mutex);
+	do_stats_flush(memcg);
+	mutex_unlock(&stats_user_flush_mutex);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -1608,7 +1624,7 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4050,7 +4066,7 @@  static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4125,7 +4141,7 @@  static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6642,7 +6658,7 @@  static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;