
[RFC,1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT

Message ID 20211222114111.2206248-2-bigeasy@linutronix.de (mailing list archive)
State New
Series mm/memcg: Address PREEMPT_RT problems instead of disabling it.

Commit Message

Sebastian Andrzej Siewior Dec. 22, 2021, 11:41 a.m. UTC
The per-CPU counters are modified with non-atomic operations. Their
consistency is ensured by disabling interrupts for the update.
This breaks on PREEMPT_RT because some sections additionally
acquire a spinlock_t lock (which becomes sleeping and must not be
acquired with disabled interrupts). Another problem is that
mem_cgroup_swapout() expects to be invoked with disabled interrupts
because the caller has to acquire a spinlock_t which is acquired with
disabled interrupts. Since spinlock_t never disables interrupts on
PREEMPT_RT the interrupts are never disabled at this point.

The code is never called from in_irq() context on PREEMPT_RT therefore
disabling preemption during the update is sufficient on PREEMPT_RT. The
sections with disabled preemption must exclude memcg_check_events() so
that spinlock_t locks can still be acquired (for instance in
eventfd_signal()).

Don't disable interrupts during updates of the per-CPU variables,
instead use shorter sections which disable preemption.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 14 deletions(-)
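
For orientation, the scheme described above boils down to the following
shape. This is only a condensed sketch distilled from the uncharge_batch()
hunk below (the hunks are authoritative); the local names memcg, pgpgout,
nr_pages and nid stand in for the ug-> fields used there:

	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* Unchanged: interrupts stay disabled around updates and events. */
		local_irq_save(flags);
		__count_memcg_events(memcg, PGPGOUT, pgpgout);
		__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
		memcg_check_events(memcg, nid);
		local_irq_restore(flags);
	} else {
		/* Short preempt-off section only for the non-atomic per-CPU RMW. */
		preempt_disable();
		__count_memcg_events(memcg, PGPGOUT, pgpgout);
		__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
		preempt_enable();

		/*
		 * Outside the section: may acquire spinlock_t locks, which
		 * sleep on PREEMPT_RT (e.g. via eventfd_signal()).
		 */
		memcg_check_events(memcg, nid);
	}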

Comments

Waiman Long Dec. 23, 2021, 2:31 a.m. UTC | #1
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The per-CPU counters are modified with non-atomic operations. Their
> consistency is ensured by disabling interrupts for the update.
> This breaks on PREEMPT_RT because some sections additionally
> acquire a spinlock_t lock (which becomes sleeping and must not be
> acquired with disabled interrupts). Another problem is that
> mem_cgroup_swapout() expects to be invoked with disabled interrupts
> because the caller has to acquire a spinlock_t which is acquired with
> disabled interrupts. Since spinlock_t never disables interrupts on
> PREEMPT_RT the interrupts are never disabled at this point.
>
> The code is never called from in_irq() context on PREEMPT_RT therefore

How do you guarantee that these percpu update functions won't be called 
in in_irq() context for PREEMPT_RT? Do you think we should add a 
WARN_ON_ONCE(in_irq()) just to be sure?

Cheers,
Longman
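
(For illustration only: a hypothetical sketch of the assertion Waiman
suggests, applied to one update site as modified by the patch below; it is
not part of the posted series.)

	void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
	{
		if (mem_cgroup_disabled())
			return;

		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
			/* Document the assumption: no hard-irq callers on RT. */
			WARN_ON_ONCE(in_irq());
			preempt_disable();
		}

		__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
		memcg_rstat_updated(memcg);

		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			preempt_enable();
	}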
Sebastian Andrzej Siewior Dec. 23, 2021, 7:34 a.m. UTC | #2
On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > The per-CPU counters are modified with non-atomic operations. Their
> > consistency is ensured by disabling interrupts for the update.
> > This breaks on PREEMPT_RT because some sections additionally
> > acquire a spinlock_t lock (which becomes sleeping and must not be
> > acquired with disabled interrupts). Another problem is that
> > mem_cgroup_swapout() expects to be invoked with disabled interrupts
> > because the caller has to acquire a spinlock_t which is acquired with
> > disabled interrupts. Since spinlock_t never disables interrupts on
> > PREEMPT_RT the interrupts are never disabled at this point.
> > 
> > The code is never called from in_irq() context on PREEMPT_RT therefore
> 
> How do you guarantee that these percpu update functions won't be called in
> in_irq() context for PREEMPT_RT? Do you think we should add a
> WARN_ON_ONCE(in_irq()) just to be sure?

There are no invocations of the memory allocator (neither malloc() nor
free()) from in_irq() context on RT, and the memory allocator itself
(both SLUB and the page allocator) uses sleeping locks. That means
invocations while in_atomic() are bad. All interrupt handlers are
force-threaded. Those which are not (like timers, per-CPU interrupts or
those which explicitly asked not to be force-threaded) are limited in
what they may do as they can't invoke anything that takes a sleeping
lock. Lockdep or CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
The other counters are protected the same way, see
  c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")
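
The referenced vmstat change follows the same pattern; roughly like the
sketch below (an approximation of the idea, not a literal quote of that
commit's hunks):

	void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
				   long delta)
	{
		struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
		s8 __percpu *p = pcp->vm_stat_diff + item;
		long x, t;

		/*
		 * On PREEMPT_RT irq-off is not guaranteed here, so the
		 * per-CPU RMW is protected by disabling preemption.
		 */
		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			preempt_disable();

		x = delta + __this_cpu_read(*p);
		t = __this_cpu_read(pcp->stat_threshold);
		if (unlikely(abs(x) > t)) {
			zone_page_state_add(x, zone, item);
			x = 0;
		}
		__this_cpu_write(*p, x);

		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			preempt_enable();
	}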

> Cheers,
> Longman

Sebastian
Waiman Long Dec. 23, 2021, 4:01 p.m. UTC | #3
On 12/23/21 02:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> The per-CPU counters are modified with non-atomic operations. Their
>>> consistency is ensured by disabling interrupts for the update.
>>> This breaks on PREEMPT_RT because some sections additionally
>>> acquire a spinlock_t lock (which becomes sleeping and must not be
>>> acquired with disabled interrupts). Another problem is that
>>> mem_cgroup_swapout() expects to be invoked with disabled interrupts
>>> because the caller has to acquire a spinlock_t which is acquired with
>>> disabled interrupts. Since spinlock_t never disables interrupts on
>>> PREEMPT_RT the interrupts are never disabled at this point.
>>>
>>> The code is never called from in_irq() context on PREEMPT_RT therefore
>> How do you guarantee that these percpu update functions won't be called in
>> in_irq() context for PREEMPT_RT? Do you think we should add a
>> WARN_ON_ONCE(in_irq()) just to be sure?
> There are no invocations of the memory allocator (neither malloc() nor
> free()) from in_irq() context on RT, and the memory allocator itself
> (both SLUB and the page allocator) uses sleeping locks. That means
> invocations while in_atomic() are bad. All interrupt handlers are
> force-threaded. Those which are not (like timers, per-CPU interrupts or
> those which explicitly asked not to be force-threaded) are limited in
> what they may do as they can't invoke anything that takes a sleeping
> lock. Lockdep or CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
> The other counters are protected the same way, see
>    c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")

Thanks for the explanation as I am less familiar with the other
PREEMPT_RT-specific changes.

Cheers,
Longman
Michal Koutný Jan. 5, 2022, 2:16 p.m. UTC | #4
On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> The sections with disabled preemption must exclude
> memcg_check_events() so that spinlock_t locks can still be acquired
> (for instance in eventfd_signal()).

The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
it the same way on !PREEMPT_RT too (and have just one variant of the
block)?

(Actually, it doesn't seem to me that memcg_check_events() can be
extracted like this from the preempt-disabled block since
mem_cgroup_event_ratelimit() relies on a similar RMW pattern.
Things would be simpler if PREEMPT_RT didn't allow the threshold event
handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)
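
(Concretely, the RMW in question, condensed from the pre-patch
mem_cgroup_event_ratelimit() visible in the hunk below: two per-CPU reads
followed by a non-atomic write-back, so an update that slips in between
could be lost without the surrounding irq-off (!PREEMPT_RT) or preempt-off
(PREEMPT_RT) protection.)

	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
	if ((long)(next - val) < 0) {
		/* ... pick the next threshold for 'target' ... */
		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
	}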

Thanks,
Michal
Sebastian Andrzej Siewior Jan. 13, 2022, 1:08 p.m. UTC | #5
On 2022-01-05 15:16:53 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The sections with disabled preemption must exclude
> > memcg_check_events() so that spinlock_t locks can still be acquired
> > (for instance in eventfd_signal()).
> 
> The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
> per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
> it the same way on !PREEMPT_RT too (and have just one variant of the
> block)?

That would mean that mem_cgroup_event_ratelimit() needs a
local_irq_save(). If that is okay then sure I can move it that way.

> (Actually, it doesn't seem to me that memcg_check_events() can be
> extracted like this from the preempt-disabled block since
> mem_cgroup_event_ratelimit() relies on a similar RMW pattern.
> Things would be simpler if PREEMPT_RT didn't allow the threshold event
> handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)

I added a preempt-disable() section restricted to RT to
mem_cgroup_event_ratelimit(). I had to exclude memcg_check_events() from
the block because of the spinlock_t locks involved down the road
(eventfd_signal() for instance).

I remember Michal (Hocko) suggested excluding/rejecting soft limit but I
didn't know where exactly or what the implications would be. In this
block here I just followed the replacement of irq-off with preempt-off
for RT.

> Thanks,
> Michal

Sebastian
Michal Koutný Jan. 13, 2022, 2:48 p.m. UTC | #6
On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> I added a preempt-disable() section restricted to RT to
> mem_cgroup_event_ratelimit().

Oh, I missed that one.

(Then the decoupling of mem_cgroup_event_ratelimit() like that also
makes some more sense.)
> That would mean that mem_cgroup_event_ratelimit() needs a
> local_irq_save(). If that is okay then sure I can move it that way.

Whatever avoids the twisted code :-)

---

> I remember Michal (Hocko) suggested excluding/rejecting soft limit but I
> didn't know where exactly or what the implications would be. In this
> block here I just followed the replacement of irq-off with preempt-off
> for RT.

Both soft limit and (these) event notifications are v1 features. Soft
limit itself is even considered rather a misfeature. I guess the
implications would not be many since PREEMPT_RT+memcg users would be
new(?) so they should rather start with v2 anyway.

One way to disable it would be to reject writes into
memory.soft_limit_in_bytes or cgroup.event_control + documentation of
that.
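
(A hypothetical sketch of that, not part of the posted series; the helper
name is made up and the intended call sites — the write handlers of
memory.soft_limit_in_bytes and cgroup.event_control in mm/memcontrol.c —
are assumptions:)

	/* Made-up helper illustrating the suggestion above. */
	static inline bool memcg_v1_soft_limit_and_events_allowed(void)
	{
		/*
		 * The soft limit tree updates and the threshold/eventfd
		 * notifications are driven from memcg_check_events();
		 * refusing to configure them keeps that machinery idle
		 * on PREEMPT_RT.
		 */
		return !IS_ENABLED(CONFIG_PREEMPT_RT);
	}

	/*
	 * Hypothetically called at the top of the respective write handlers:
	 *
	 *	if (!memcg_v1_soft_limit_and_events_allowed())
	 *		return -EOPNOTSUPP;
	 */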

Michal
Sebastian Andrzej Siewior Jan. 14, 2022, 9:09 a.m. UTC | #7
On 2022-01-13 15:48:03 [+0100], Michal Koutný wrote:
> On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > I added a preempt-disable() section restricted to RT to
> > mem_cgroup_event_ratelimit().
> 
> Oh, I missed that one.
> 
> (Then the decoupling of mem_cgroup_event_ratelimit() like that also
> makes some more sense.)
> > That would mean that mem_cgroup_event_ratelimit() needs a
> > local_irq_save(). If that is okay then sure I can move it that way.
> 
> Whatever avoids the twisted code :-)

Okay. So let me do that.

> ---
> 
> > I remember Michal (Hocko) suggested excluding/rejecting soft limit but I
> > didn't know where exactly or what the implications would be. In this
> > block here I just followed the replacement of irq-off with preempt-off
> > for RT.
> 
> Both soft limit and (these) event notifications are v1 features. Soft
> limit itself is even considered rather a misfeature. I guess the
> implications would not be many since PREEMPT_RT+memcg users would be
> new(?) so they should rather start with v2 anyway.

People often migrate to RT so they take whatever they have. In general I
would like to keep RT & !RT in sync unless there are reasons to do it
differently.

> One way to disable it would be to reject writes into
> memory.soft_limit_in_bytes or cgroup.event_control + documentation of
> that.

So avoiding these two also avoids memcg_check_events()?
Right now it does not look like a big obstacle. It is the same pattern
I'm following for the per-CPU RMW. If it is, I could avoid the writes
and #if 0 out the other function for RT.
If I remove memcg_check_events() from the equation then we could keep
the explicit irq-off regions (plus add a few). The only thing that would
look odd then is that we disable interrupts for the RMW operation in
some places and preemption in others (the __count_memcg_events()
invocations in swap.c and vmscan.c).

Are there plans to remove v1 or is this part of "we must not break
userland"?

> Michal

Sebastian

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..d2687d5ed544b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,8 +671,14 @@  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -699,6 +705,9 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -706,6 +715,9 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /**
@@ -788,8 +800,13 @@  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -810,6 +827,9 @@  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 int nr_pages)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
 		__count_memcg_events(memcg, PGPGIN, 1);
@@ -819,12 +839,19 @@  static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	}
 
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				       enum mem_cgroup_events_target target)
 {
 	unsigned long val, next;
+	bool ret = false;
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 
 	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
 	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
@@ -841,9 +868,11 @@  static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 			break;
 		}
 		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
-		return true;
+		ret = true;
 	}
-	return false;
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	return ret;
 }
 
 /*
@@ -5645,12 +5674,14 @@  static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 	nid = folio_nid(folio);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(to, nr_pages);
 	memcg_check_events(to, nid);
 	mem_cgroup_charge_statistics(from, -nr_pages);
 	memcg_check_events(from, nid);
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out_unlock:
 	folio_unlock(folio);
 out:
@@ -6670,10 +6701,12 @@  static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	css_get(&memcg->css);
 	commit_charge(folio, memcg);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(folio));
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out:
 	return ret;
 }
@@ -6785,11 +6818,20 @@  static void uncharge_batch(const struct uncharge_gather *ug)
 		memcg_oom_recover(ug->memcg);
 	}
 
-	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		local_irq_save(flags);
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		memcg_check_events(ug->memcg, ug->nid);
+		local_irq_restore(flags);
+	} else {
+		preempt_disable();
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		preempt_enable();
+
+		memcg_check_events(ug->memcg, ug->nid);
+	}
 
 	/* drop reference from uncharge_folio */
 	css_put(&ug->memcg->css);
@@ -6930,10 +6972,12 @@  void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	css_get(&memcg->css);
 	commit_charge(new, memcg);
 
-	local_irq_save(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(flags);
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7157,8 +7201,10 @@  void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * i_pages lock which is taken with interrupts-off. It is
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
+	 * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+	 * variables are synchronised by keeping preemption disabled.
 	 */
-	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled());
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
 	memcg_check_events(memcg, page_to_nid(page));