
[RFC,4/7] memcg: sleep during flushing stats in safe contexts

Message ID: 20230323040037.2389095-5-yosryahmed@google.com (mailing list archive)
State: New
Series: Make rstat flushing IRQ and sleep friendly

Commit Message

Yosry Ahmed March 23, 2023, 4 a.m. UTC
Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.

Enable choosing whether sleeping is allowed when flushing memcg stats,
and allow sleeping in safe contexts so that we do not unnecessarily
perform a lot of flushing work without being able to sleep.
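
For example, with this patch the call sites choose as follows
(illustrative; see the diff below):

  /* reading memory.stat from userspace: safe to sleep */
  mem_cgroup_flush_stats(true);

  /* mem_cgroup_wb_stats() called under a writeback spinlock: do not sleep */
  mem_cgroup_flush_stats(false);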

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  8 ++++----
 mm/memcontrol.c            | 35 ++++++++++++++++++++++-------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  3 ++-
 4 files changed, 29 insertions(+), 19 deletions(-)

Comments

Johannes Weiner March 23, 2023, 3:56 p.m. UTC | #1
On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
>  		return;
>  
>  	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> +	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);

How is it safe to call this with may_sleep=true when it's holding the
stats_flush_lock?

>  	atomic_set(&stats_flush_threshold, 0);
>  	spin_unlock(&stats_flush_lock);
>  }
Yosry Ahmed March 23, 2023, 4:01 p.m. UTC | #2
On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> >               return;
> >
> >       flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > +     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
>
> How is it safe to call this with may_sleep=true when it's holding the
> stats_flush_lock?

stats_flush_lock is always acquired with trylock; it is only used
today so that we can skip flushing if another cpu is already doing a
flush (which is not 100% correct, as that cpu may not have finished
flushing yet, but that's orthogonal here). So I think it should be
safe to sleep, as no one can be blocked waiting for this spinlock.

Perhaps it would be better semantically to replace the spinlock with
an atomic test and set, instead of having a lock that can only be used
with trylock?

>
> >       atomic_set(&stats_flush_threshold, 0);
> >       spin_unlock(&stats_flush_lock);
> >  }
Johannes Weiner March 23, 2023, 5:27 p.m. UTC | #3
On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > >               return;
> > >
> > >       flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > +     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> >
> > How is it safe to call this with may_sleep=true when it's holding the
> > stats_flush_lock?
> 
> stats_flush_lock is always acquired with trylock; it is only used
> today so that we can skip flushing if another cpu is already doing a
> flush (which is not 100% correct, as that cpu may not have finished
> flushing yet, but that's orthogonal here). So I think it should be
> safe to sleep, as no one can be blocked waiting for this spinlock.

I see. It still cannot sleep while the lock is held, though, because
preemption is disabled. Make sure you have all lock debugging on while
testing this.

> Perhaps it would be better semantically to replace the spinlock with
> an atomic test and set, instead of having a lock that can only be used
> with trylock?

It could be helpful to clarify what stats_flush_lock is protecting
first. Keep in mind that locks should protect data, not code paths.

Right now it's doing multiple things:

1. It protects updates to stats_flush_threshold
2. It protects updates to flush_next_time
3. It serializes calls to cgroup_rstat_flush() based on those ratelimits

However,

1. stats_flush_threshold is already an atomic

2. flush_next_time is not atomic. The writer is locked, but the reader
   is lockless. If the reader races with a flush, you could see this:

					if (time_after(jiffies, flush_next_time))
	spin_trylock()
        flush_next_time = now + delay
        flush()
        spin_unlock()
					spin_trylock()
					flush_next_time = now + delay
					flush()
					spin_unlock()

   which means we already can get flushes at a higher frequency than
   FLUSH_TIME during races. But it isn't really a problem.

   The reader could also see garbled partial updates, so it needs at
   least READ_ONCE and WRITE_ONCE protection.

3. Serializing cgroup_rstat_flush() calls against the ratelimit
   factors is currently broken because of the race in 2. But the race
   is actually harmless, all we might get is the occasional earlier
   flush. If there is no delta, the flush won't do much. And if there
   is, the flush is justified.

In summary, it seems to me the lock can be ditched altogether. All the
code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
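
A rough sketch of what that could look like on top of this patch
(untested, keeping the names used in the existing code):

static void __mem_cgroup_flush_stats(bool may_sleep)
{
    /* no lock: flush_next_time only needs READ_ONCE/WRITE_ONCE */
    WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
    cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
    atomic_set(&stats_flush_threshold, 0);
}

void mem_cgroup_flush_stats_delayed(bool may_sleep)
{
    if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
        mem_cgroup_flush_stats(may_sleep);
}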
Yosry Ahmed March 23, 2023, 6:07 p.m. UTC | #4
On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > >               return;
> > > >
> > > >       flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > +     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > >
> > > How is it safe to call this with may_sleep=true when it's holding the
> > > stats_flush_lock?
> >
> > stats_flush_lock is always acquired with trylock; it is only used
> > today so that we can skip flushing if another cpu is already doing a
> > flush (which is not 100% correct, as that cpu may not have finished
> > flushing yet, but that's orthogonal here). So I think it should be
> > safe to sleep, as no one can be blocked waiting for this spinlock.
>
> I see. It still cannot sleep while the lock is held, though, because
> preemption is disabled. Make sure you have all lock debugging on while
> testing this.

Thanks for pointing this out, will do.

>
> > Perhaps it would be better semantically to replace the spinlock with
> > an atomic test and set, instead of having a lock that can only be used
> > with trylock?
>
> It could be helpful to clarify what stats_flush_lock is protecting
> first. Keep in mind that locks should protect data, not code paths.
>
> Right now it's doing multiple things:
>
> 1. It protects updates to stats_flush_threshold
> 2. It protects updates to flush_next_time
> 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
>
> However,
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
>    is lockless. If the reader races with a flush, you could see this:
>
>                                         if (time_after(jiffies, flush_next_time))
>         spin_trylock()
>         flush_next_time = now + delay
>         flush()
>         spin_unlock()
>                                         spin_trylock()
>                                         flush_next_time = now + delay
>                                         flush()
>                                         spin_unlock()
>
>    which means we already can get flushes at a higher frequency than
>    FLUSH_TIME during races. But it isn't really a problem.
>
>    The reader could also see garbled partial updates, so it needs at
>    least READ_ONCE and WRITE_ONCE protection.
>
> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
>    factors is currently broken because of the race in 2. But the race
>    is actually harmless, all we might get is the occasional earlier
>    flush. If there is no delta, the flush won't do much. And if there
>    is, the flush is justified.
>
> In summary, it seems to me the lock can be ditched altogether. All the
> code needs is READ_ONCE/WRITE_ONCE around flush_next_time.

Thanks a lot for this analysis. I agree that the lock can be removed
with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
of the lock that we are missing here.

I think one other purpose of the lock is avoiding a thundering herd
problem on cgroup_rstat_lock, particularly from reclaim context, as
mentioned in the log of commit aa48e47e3906 ("memcg: infrastructure
to flush memcg stats").

While testing, I did notice that removing this lock indeed causes a
thundering herd problem if we have a lot of concurrent reclaimers. The
trylock makes sure we abort immediately if someone else is flushing --
which is not ideal because that flusher might have just started, and
we may end up reading stale data anyway.

This is why I suggested replacing the lock with an atomic and doing
something like this if we want to maintain the current behavior:

static void __mem_cgroup_flush_stats(void)
{
    ...
    if (atomic_xchg(&ongoing_flush, 1))
        return;
    ...
    atomic_set(&ongoing_flush, 0);
}

Alternatively, if we want to change the behavior and wait for the
concurrent flusher to finish flushing, we can maybe spin until
ongoing_flush goes back to 0 and then return:

static void __mem_cgroup_flush_stats(void)
{
    ...
    if (atomic_xchg(&ongoing_flush, 1)) {
        /* wait until the ongoing flusher finishes to get updated stats */
        while (atomic_read(&ongoing_flush)) {}
        return;
    }
    /* flush the stats ourselves */
    ...
    atomic_set(&ongoing_flush, 0);
}

WDYT?
Shakeel Butt March 23, 2023, 7:35 p.m. UTC | #5
On Thu, Mar 23, 2023 at 11:08 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > >               return;
> > > > >
> > > > >       flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > > +     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > > >
> > > > How is it safe to call this with may_sleep=true when it's holding the
> > > > stats_flush_lock?
> > >
> > > stats_flush_lock is always acquired with trylock; it is only used
> > > today so that we can skip flushing if another cpu is already doing a
> > > flush (which is not 100% correct, as that cpu may not have finished
> > > flushing yet, but that's orthogonal here). So I think it should be
> > > safe to sleep, as no one can be blocked waiting for this spinlock.
> >
> > I see. It still cannot sleep while the lock is held, though, because
> > preemption is disabled. Make sure you have all lock debugging on while
> > testing this.
>
> Thanks for pointing this out, will do.
>
> >
> > > Perhaps it would be better semantically to replace the spinlock with
> > > an atomic test and set, instead of having a lock that can only be used
> > > with trylock?
> >
> > It could be helpful to clarify what stats_flush_lock is protecting
> > first. Keep in mind that locks should protect data, not code paths.
> >
> > Right now it's doing multiple things:
> >
> > 1. It protects updates to stats_flush_threshold
> > 2. It protects updates to flush_next_time
> > 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
> >
> > However,
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> >    is lockless. If the reader races with a flush, you could see this:
> >
> >                                         if (time_after(jiffies, flush_next_time))
> >         spin_trylock()
> >         flush_next_time = now + delay
> >         flush()
> >         spin_unlock()
> >                                         spin_trylock()
> >                                         flush_next_time = now + delay
> >                                         flush()
> >                                         spin_unlock()
> >
> >    which means we already can get flushes at a higher frequency than
> >    FLUSH_TIME during races. But it isn't really a problem.
> >
> >    The reader could also see garbled partial updates, so it needs at
> >    least READ_ONCE and WRITE_ONCE protection.
> >
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> >    factors is currently broken because of the race in 2. But the race
> >    is actually harmless, all we might get is the occasional earlier
> >    flush. If there is no delta, the flush won't do much. And if there
> >    is, the flush is justified.
> >
> > In summary, it seems to me the lock can be ditched altogether. All the
> > code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
>
> Thanks a lot for this analysis. I agree that the lock can be removed
> with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
> of the lock that we are missing here.
>
> I think one other purpose of the lock is avoiding a thundering herd
> problem on cgroup_rstat_lock, particularly from reclaim context, as
> mentioned in the log of commit aa48e47e3906 ("memcg: infrastructure
> to flush memcg stats").
>
> While testing, I did notice that removing this lock indeed causes a
> thundering herd problem if we have a lot of concurrent reclaimers. The
> trylock makes sure we abort immediately if someone else is flushing --
> which is not ideal because that flusher might have just started, and
> we may end up reading stale data anyway.
>
> This is why I suggested replacing the lock with an atomic and doing
> something like this if we want to maintain the current behavior:
>
> static void __mem_cgroup_flush_stats(void)
> {
>     ...
>     if (atomic_xchg(&ongoing_flush, 1))
>         return;
>     ...
>     atomic_set(&ongoing_flush, 0);
> }
>
> Alternatively, if we want to change the behavior and wait for the
> concurrent flusher to finish flushing, we can maybe spin until
> ongoing_flush goes back to 0 and then return:
>
> static void __mem_cgroup_flush_stats(void)
> {
>     ...
>     if (atomic_xchg(&ongoing_flush, 1)) {
>         /* wait until the ongoing flusher finishes to get updated stats */
>         while (atomic_read(&ongoing_flush)) {}
>         return;
>     }
>     /* flush the stats ourselves */
>     ...
>     atomic_set(&ongoing_flush, 0);
> }
>
> WDYT?

I would go with your first approach, i.e. no spinning.

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..0c7b286f2caf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1036,8 +1036,8 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_delayed(void);
+void mem_cgroup_flush_stats(bool may_sleep);
+void mem_cgroup_flush_stats_delayed(bool may_sleep);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1531,11 +1531,11 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_flush_stats(bool may_sleep)
 {
 }
 
-static inline void mem_cgroup_flush_stats_delayed(void)
+static inline void mem_cgroup_flush_stats_delayed(bool may_sleep)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72cd44f88d97..39a9c7a978ae 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,7 +634,7 @@  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void __mem_cgroup_flush_stats(void)
+static void __mem_cgroup_flush_stats(bool may_sleep)
 {
 	/*
 	 * This lock can be acquired from interrupt context, but we only acquire
@@ -644,26 +644,26 @@  static void __mem_cgroup_flush_stats(void)
 		return;
 
 	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
+	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock(&stats_flush_lock);
 }
 
-void mem_cgroup_flush_stats(void)
+void mem_cgroup_flush_stats(bool may_sleep)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		__mem_cgroup_flush_stats();
+		__mem_cgroup_flush_stats(may_sleep);
 }
 
-void mem_cgroup_flush_stats_delayed(void)
+void mem_cgroup_flush_stats_delayed(bool may_sleep)
 {
 	if (time_after64(jiffies_64, flush_next_time))
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats(may_sleep);
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
-	__mem_cgroup_flush_stats();
+	__mem_cgroup_flush_stats(true);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1570,7 +1570,7 @@  static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(true);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -3671,7 +3671,11 @@  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
-		mem_cgroup_flush_stats();
+		/*
+		 * mem_cgroup_threshold() calls here from irqsafe context.
+		 * Don't sleep.
+		 */
+		mem_cgroup_flush_stats(false);
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
@@ -4014,7 +4018,7 @@  static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(true);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4090,7 +4094,7 @@  static int memcg_stat_show(struct seq_file *m, void *v)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(true);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4594,7 +4598,12 @@  void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	/*
+	 * wb_writeback() takes a spinlock and calls
+	 * wb_over_bg_thresh()->mem_cgroup_wb_stats().
+	 * Do not sleep.
+	 */
+	mem_cgroup_flush_stats(false);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6596,7 +6605,7 @@  static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(true);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..59d1830d08ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@  static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(false);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index 00c6f4d9d9be..042eabbb43f6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,8 @@  void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats_delayed();
+	/* Do not sleep with RCU lock held */
+	mem_cgroup_flush_stats_delayed(false);
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if