Message ID | 20230328221644.803272-6-yosryahmed@google.com
---|---
State | New
Series | memcg: make rstat flushing irq and sleep
On Tue, Mar 28, 2023 at 3:17 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> As Johannes notes in [1], stats_flush_lock is currently used to:
> (a) Protect updates to stats_flush_threshold.
> (b) Protect updates to flush_next_time.
> (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
>
> However:
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
>    is lockless. If the reader races with a flush, you could see this:
>
>       if (time_after(jiffies, flush_next_time))
>       spin_trylock()
>       flush_next_time = now + delay
>       flush()
>       spin_unlock()
>       spin_trylock()
>       flush_next_time = now + delay
>       flush()
>       spin_unlock()
>
>    which means we already can get flushes at a higher frequency than
>    FLUSH_TIME during races. But it isn't really a problem.
>
>    The reader could also see garbled partial updates, so it needs at
>    least READ_ONCE and WRITE_ONCE protection.
>
> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
>    factors is currently broken because of the race in 2. But the race
>    is actually harmless, all we might get is the occasional earlier
>    flush. If there is no delta, the flush won't do much. And if there
>    is, the flush is justified.
>
> So the lock can be removed altogether. However, the lock also served
> the purpose of preventing a thundering herd problem for concurrent
> flushers, see [2]. Use an atomic instead to serve the purpose of
> unifying concurrent flushers.
>
> [1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> [2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>
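For readers following the thread, the "check, then exchange" gate being acked above can be illustrated outside the kernel. The sketch below is a minimal userspace analogue using C11 atomics in place of the kernel's atomic_t API (atomic_load/atomic_exchange standing in for atomic_read()/atomic_xchg()); do_expensive_flush() is a hypothetical placeholder for flushing the whole stats tree, not a real kernel call.

```c
/*
 * Userspace sketch (C11 atomics, not kernel code) of the gate described
 * above: many callers may want to flush, but only one should do the
 * expensive work while the rest simply skip.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int flush_ongoing;        /* 0 = idle, 1 = a flush is running */

/* Hypothetical stand-in for flushing the whole stats tree. */
static void do_expensive_flush(void)
{
	puts("flushing the whole tree");
}

static bool try_flush(void)
{
	/*
	 * Cheap read first, so a herd of callers does not hammer the same
	 * cache line with read-modify-write operations; only a caller that
	 * sees the gate idle attempts the exchange, and exactly one wins.
	 */
	if (atomic_load(&flush_ongoing) ||
	    atomic_exchange(&flush_ongoing, 1))
		return false;           /* someone else is already flushing */

	do_expensive_flush();

	atomic_store(&flush_ongoing, 0);
	return true;
}

int main(void)
{
	try_flush();                    /* uncontended: this caller flushes */
	return 0;
}
```

Losing callers return immediately rather than waiting, which is acceptable here because the winner flushes the entire tree anyway, exactly as the commit message argues.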
On Tue 28-03-23 22:16:40, Yosry Ahmed wrote:
> As Johannes notes in [1], stats_flush_lock is currently used to:
> (a) Protect updates to stats_flush_threshold.
> (b) Protect updates to flush_next_time.
> (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
>
> However:
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
>    is lockless. If the reader races with a flush, you could see this:
>
>       if (time_after(jiffies, flush_next_time))
>       spin_trylock()
>       flush_next_time = now + delay
>       flush()
>       spin_unlock()
>       spin_trylock()
>       flush_next_time = now + delay
>       flush()
>       spin_unlock()
>
>    which means we already can get flushes at a higher frequency than
>    FLUSH_TIME during races. But it isn't really a problem.
>
>    The reader could also see garbled partial updates, so it needs at
>    least READ_ONCE and WRITE_ONCE protection.

Just a nit. Sounds more serious than it is actually. This would only
happen if the compiler decides to split the write.

> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
>    factors is currently broken because of the race in 2. But the race
>    is actually harmless, all we might get is the occasional earlier
>    flush. If there is no delta, the flush won't do much. And if there
>    is, the flush is justified.
>
> So the lock can be removed altogether. However, the lock also served
> the purpose of preventing a thundering herd problem for concurrent
> flushers, see [2]. Use an atomic instead to serve the purpose of
> unifying concurrent flushers.
>
> [1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> [2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ff39f78f962e..65750f8b8259 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> -static DEFINE_SPINLOCK(stats_flush_lock);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
>
> @@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>
>  static void __mem_cgroup_flush_stats(void)
>  {
> -        unsigned long flag;
> -
> -        if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> +        /*
> +         * We always flush the entire tree, so concurrent flushers can just
> +         * skip. This avoids a thundering herd problem on the rstat global lock
> +         * from memcg flushers (e.g. reclaim, refault, etc).
> +         */
> +        if (atomic_read(&stats_flush_ongoing) ||
> +            atomic_xchg(&stats_flush_ongoing, 1))
>                  return;
>
> -        flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> +        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>          cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
>          atomic_set(&stats_flush_threshold, 0);
> -        spin_unlock_irqrestore(&stats_flush_lock, flag);
> +        atomic_set(&stats_flush_ongoing, 0);
>  }
>
>  void mem_cgroup_flush_stats(void)
> @@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
>
>  void mem_cgroup_flush_stats_ratelimited(void)
>  {
> -        if (time_after64(jiffies_64, flush_next_time))
> +        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
>                  mem_cgroup_flush_stats();
>  }
>
> --
> 2.40.0.348.gf938b09366-goog
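Michal's point is that the only hazard for the lockless reader is a torn (split) access, which READ_ONCE()/WRITE_ONCE() rule out; the ordering of the race itself is, as argued above, harmless. Outside the kernel, the closest portable analogue of that pairing is a relaxed atomic load/store. The sketch below illustrates this under that assumption; flush_deadline, now(), flush() and FLUSH_PERIOD are made-up names, not the kernel's flush_next_time/FLUSH_TIME, and the ratelimit check is deliberately simplified.

```c
/*
 * Userspace analogue (C11, relaxed atomics) of the READ_ONCE/WRITE_ONCE
 * pairing in the patch: a lockless reader checks a 64-bit deadline that
 * the flusher updates, and the only property needed is that neither
 * access is torn into partial loads/stores by the compiler.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define FLUSH_PERIOD 2                        /* illustrative period, seconds */

static _Atomic uint64_t flush_deadline;       /* analogue of flush_next_time */

static uint64_t now(void)
{
	return (uint64_t)time(NULL);          /* coarse stand-in for jiffies */
}

static void flush(void)
{
	printf("flushing\n");                 /* stand-in for the real flush */
}

static void ratelimited_flush(void)
{
	/*
	 * Relaxed load/store: like READ_ONCE/WRITE_ONCE, this only rules
	 * out torn accesses; it adds no ordering, and the remaining race
	 * merely allows an occasional earlier flush.
	 */
	if (now() > atomic_load_explicit(&flush_deadline, memory_order_relaxed)) {
		atomic_store_explicit(&flush_deadline, now() + 2 * FLUSH_PERIOD,
				      memory_order_relaxed);
		flush();
	}
}

int main(void)
{
	ratelimited_flush();                  /* deadline is 0: flushes */
	ratelimited_flush();                  /* within the period: skipped */
	return 0;
}
```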
On Wed, Mar 29, 2023 at 8:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-03-23 22:16:40, Yosry Ahmed wrote:
> > As Johannes notes in [1], stats_flush_lock is currently used to:
> > (a) Protect updates to stats_flush_threshold.
> > (b) Protect updates to flush_next_time.
> > (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
> >
> > However:
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> >    is lockless. If the reader races with a flush, you could see this:
> >
> >       if (time_after(jiffies, flush_next_time))
> >       spin_trylock()
> >       flush_next_time = now + delay
> >       flush()
> >       spin_unlock()
> >       spin_trylock()
> >       flush_next_time = now + delay
> >       flush()
> >       spin_unlock()
> >
> >    which means we already can get flushes at a higher frequency than
> >    FLUSH_TIME during races. But it isn't really a problem.
> >
> >    The reader could also see garbled partial updates, so it needs at
> >    least READ_ONCE and WRITE_ONCE protection.
>
> Just a nit. Sounds more serious than it is actually. This would only
> happen if the compiler decides to split the write.

Thanks for the note, Michal. I honestly quoted Johannes here as I do
not have much expertise when it comes to this. I will add "if the
compiler decides to split the write" to the commit log if I respin.

>
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> >    factors is currently broken because of the race in 2. But the race
> >    is actually harmless, all we might get is the occasional earlier
> >    flush. If there is no delta, the flush won't do much. And if there
> >    is, the flush is justified.
> >
> > So the lock can be removed altogether. However, the lock also served
> > the purpose of preventing a thundering herd problem for concurrent
> > flushers, see [2]. Use an atomic instead to serve the purpose of
> > unifying concurrent flushers.
> >
> > [1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> > [2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> > ---
> >  mm/memcontrol.c | 18 +++++++++++-------
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ff39f78f962e..65750f8b8259 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >   */
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > -static DEFINE_SPINLOCK(stats_flush_lock);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> >
> > @@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >
> >  static void __mem_cgroup_flush_stats(void)
> >  {
> > -        unsigned long flag;
> > -
> > -        if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > +        /*
> > +         * We always flush the entire tree, so concurrent flushers can just
> > +         * skip. This avoids a thundering herd problem on the rstat global lock
> > +         * from memcg flushers (e.g. reclaim, refault, etc).
> > +         */
> > +        if (atomic_read(&stats_flush_ongoing) ||
> > +            atomic_xchg(&stats_flush_ongoing, 1))
> >                  return;
> >
> > -        flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > +        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> >          cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> >          atomic_set(&stats_flush_threshold, 0);
> > -        spin_unlock_irqrestore(&stats_flush_lock, flag);
> > +        atomic_set(&stats_flush_ongoing, 0);
> >  }
> >
> >  void mem_cgroup_flush_stats(void)
> > @@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
> >
> >  void mem_cgroup_flush_stats_ratelimited(void)
> >  {
> > -        if (time_after64(jiffies_64, flush_next_time))
> > +        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> >                  mem_cgroup_flush_stats();
> >  }
> >
> > --
> > 2.40.0.348.gf938b09366-goog
>
> --
> Michal Hocko
> SUSE Labs
```diff
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff39f78f962e..65750f8b8259 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 
 static void __mem_cgroup_flush_stats(void)
 {
-        unsigned long flag;
-
-        if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+        /*
+         * We always flush the entire tree, so concurrent flushers can just
+         * skip. This avoids a thundering herd problem on the rstat global lock
+         * from memcg flushers (e.g. reclaim, refault, etc).
+         */
+        if (atomic_read(&stats_flush_ongoing) ||
+            atomic_xchg(&stats_flush_ongoing, 1))
                 return;
 
-        flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
         cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
         atomic_set(&stats_flush_threshold, 0);
-        spin_unlock_irqrestore(&stats_flush_lock, flag);
+        atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
-        if (time_after64(jiffies_64, flush_next_time))
+        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                 mem_cgroup_flush_stats();
 }
 
```
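As a rough sanity check of the behaviour the patch is after, the hypothetical pthread harness below races several concurrent "flushers" through the same read-then-exchange gate; with the gate in place, one round of racing callers should collapse onto a single actual flush (compile with -pthread). This is a toy userspace model with invented names, not kernel code.

```c
/* Toy model of the stats_flush_ongoing gate under real concurrency. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8

static atomic_int stats_flush_ongoing;  /* the gate from the patch */
static atomic_int flushes_done;         /* how many flushes actually ran */
static pthread_barrier_t start;         /* release all threads at once */

static void *flusher(void *arg)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

	(void)arg;
	pthread_barrier_wait(&start);   /* make the callers really race */

	if (atomic_load(&stats_flush_ongoing) ||
	    atomic_exchange(&stats_flush_ongoing, 1))
		return NULL;            /* a flush is already running: skip */

	nanosleep(&delay, NULL);        /* pretend the flush is expensive */
	atomic_fetch_add(&flushes_done, 1);

	atomic_store(&stats_flush_ongoing, 0);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	pthread_barrier_init(&start, NULL, NTHREADS);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, flusher, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	pthread_barrier_destroy(&start);

	/* Almost always prints 1: the herd collapsed onto one flusher. */
	printf("flushes that ran: %d\n", atomic_load(&flushes_done));
	return 0;
}
```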