| Message ID | 20211222114111.2206248-2-bigeasy@linutronix.de (mailing list archive) |
|---|---|
| State | New |
| Series | mm/memcg: Address PREEMPT_RT problems instead of disabling it. |
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The per-CPU counters are modified with the non-atomic modifier. The
> consistency is ensured by disabling interrupts for the update.
> This breaks on PREEMPT_RT because some sections additionally
> acquire a spinlock_t lock (which becomes sleeping and must not be
> acquired with disabled interrupts). Another problem is that
> mem_cgroup_swapout() expects to be invoked with disabled interrupts
> because the caller has to acquire a spinlock_t which is acquired with
> disabled interrupts. Since spinlock_t never disables interrupts on
> PREEMPT_RT the interrupts are never disabled at this point.
>
> The code is never called from in_irq() context on PREEMPT_RT therefore

How do you guarantee that these percpu update functions won't be called in
in_irq() context for PREEMPT_RT? Do you think we should add a
WARN_ON_ONCE(in_irq()) just to be sure?

Cheers,
Longman
On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > The per-CPU counters are modified with the non-atomic modifier. The
> > consistency is ensured by disabling interrupts for the update.
> > This breaks on PREEMPT_RT because some sections additionally
> > acquire a spinlock_t lock (which becomes sleeping and must not be
> > acquired with disabled interrupts). Another problem is that
> > mem_cgroup_swapout() expects to be invoked with disabled interrupts
> > because the caller has to acquire a spinlock_t which is acquired with
> > disabled interrupts. Since spinlock_t never disables interrupts on
> > PREEMPT_RT the interrupts are never disabled at this point.
> >
> > The code is never called from in_irq() context on PREEMPT_RT therefore
>
> How do you guarantee that these percpu update functions won't be called in
> in_irq() context for PREEMPT_RT? Do you think we should add a
> WARN_ON_ONCE(in_irq()) just to be sure?

There are no invocations of the memory allocator (neither malloc() nor
free()) on RT and the memory allocator itself (SLUB and the
page-allocator, so both) has sleeping locks. That means invocations
in in_atomic() context are bad. All interrupt handlers are
force-threaded. Those which are not (like timer or per-CPU interrupts,
or those which explicitly asked not to be force-threaded) are limited
in what they can do as they can't invoke anything that has a sleeping
lock. Lockdep or CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
The other counters are protected the same way, see
c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")

> Cheers,
> Longman

Sebastian
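
A minimal sketch of the rule Sebastian describes, with made-up helper names (memcg_stats_guard()/memcg_stats_unguard() are not part of the posted patch): on !PREEMPT_RT the per-CPU updates keep relying on the caller having interrupts disabled, on PREEMPT_RT they rely on disabled preemption instead, and a debug assertion can flag the in_irq() case Waiman asks about.

```c
#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/lockdep.h>
#include <linux/preempt.h>

/* Hypothetical guard around the per-CPU memcg counter updates. */
static inline void memcg_stats_guard(void)
{
        if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
                /* A non-threaded (hard) interrupt handler would break the scheme. */
                WARN_ON_ONCE(in_irq());
                preempt_disable();
        } else {
                /* The !RT callers already run with interrupts disabled. */
                lockdep_assert_irqs_disabled();
        }
}

static inline void memcg_stats_unguard(void)
{
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
                preempt_enable();
}
```

Wrapping the __this_cpu_*() sections in such a pair would provide the WARN_ON_ONCE(in_irq()) check on RT, while lockdep and CONFIG_DEBUG_ATOMIC_SLEEP catch the remaining misuse.
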
On 12/23/21 02:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> The per-CPU counters are modified with the non-atomic modifier. The
>>> consistency is ensured by disabling interrupts for the update.
>>> This breaks on PREEMPT_RT because some sections additionally
>>> acquire a spinlock_t lock (which becomes sleeping and must not be
>>> acquired with disabled interrupts). Another problem is that
>>> mem_cgroup_swapout() expects to be invoked with disabled interrupts
>>> because the caller has to acquire a spinlock_t which is acquired with
>>> disabled interrupts. Since spinlock_t never disables interrupts on
>>> PREEMPT_RT the interrupts are never disabled at this point.
>>>
>>> The code is never called from in_irq() context on PREEMPT_RT therefore
>> How do you guarantee that these percpu update functions won't be called in
>> in_irq() context for PREEMPT_RT? Do you think we should add a
>> WARN_ON_ONCE(in_irq()) just to be sure?
> There are no invocations of the memory allocator (neither malloc() nor
> free()) on RT and the memory allocator itself (SLUB and the
> page-allocator, so both) has sleeping locks. That means invocations
> in in_atomic() context are bad. All interrupt handlers are
> force-threaded. Those which are not (like timer or per-CPU interrupts,
> or those which explicitly asked not to be force-threaded) are limited
> in what they can do as they can't invoke anything that has a sleeping
> lock. Lockdep or CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
> The other counters are protected the same way, see
> c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")

Thanks for the explanation, as I am less familiar with other
PREEMPT_RT-specific changes.

Cheers,
Longman
On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> The sections with disabled preemption must exclude
> memcg_check_events() so that spinlock_t locks can still be acquired
> (for instance in eventfd_signal()).

The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
it the same way on !PREEMPT_RT too (and have just one variant of the
block)?

(Actually, it doesn't seem to me that memcg_check_events() can be
extracted like this from the preempt-disabled block since
mem_cgroup_event_ratelimit() relies on a similar RMW pattern.
Things would be simpler if PREEMPT_RT didn't allow the threshold event
handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)

Thanks,
Michal
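
For reference, a condensed sketch of the read-modify-write Michal is pointing at, based on the mem_cgroup_event_ratelimit() hunk in the diff below; the helper name is invented, the single THRESHOLDS_EVENTS_TARGET stands in for the switch on target, and the per-CPU struct fields only exist inside mm/memcontrol.c. If the task is preempted, or the same code runs from an interrupt on the same CPU, between the reads and the write-back, one of the updates is lost, which is why the block has to sit inside an irq-off or preempt-off section.

```c
/* Condensed sketch of the per-CPU RMW inside mem_cgroup_event_ratelimit(). */
static bool memcg_event_rmw_sketch(struct mem_cgroup *memcg,
                                   enum mem_cgroup_events_target target)
{
        unsigned long val, next;

        /* Two per-CPU reads ... */
        val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
        next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
        if ((long)(next - val) < 0) {
                /* ... followed by a write-back: the window that must not race. */
                __this_cpu_write(memcg->vmstats_percpu->targets[target],
                                 val + THRESHOLDS_EVENTS_TARGET);
                return true;
        }
        return false;
}
```
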
On 2022-01-05 15:16:53 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The sections with disabled preemption must exclude
> > memcg_check_events() so that spinlock_t locks can still be acquired
> > (for instance in eventfd_signal()).
>
> The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
> per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
> it the same way on !PREEMPT_RT too (and have just one variant of the
> block)?

That would mean that mem_cgroup_event_ratelimit() needs a
local_irq_save(). If that is okay then sure I can move it that way.

> (Actually, it doesn't seem to me that memcg_check_events() can be
> extracted like this from the preempt-disabled block since
> mem_cgroup_event_ratelimit() relies on a similar RMW pattern.
> Things would be simpler if PREEMPT_RT didn't allow the threshold event
> handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)

I added a preempt-disable() section restricted to RT to
mem_cgroup_event_ratelimit(). I had to exclude memcg_check_events() from
the block because of the spinlock_t locks involved down the road
(eventfd_signal() for instance).
I remember Michal (Hocko) suggested excluding/rejecting soft limit but
I didn't know where exactly and its implications. In this block here I
just followed the replacement of irq-off with preempt-off for RT.

> Thanks,
> Michal

Sebastian
On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> I added a preempt-disable() section restricted to RT to
> mem_cgroup_event_ratelimit().

Oh, I missed that one.

(Then the decoupling of such mem_cgroup_event_ratelimit() also makes
some more sense.)

> That would mean that mem_cgroup_event_ratelimit() needs a
> local_irq_save(). If that is okay then sure I can move it that way.

Whatever avoids the twisted code :-)

---

> I remember Michal (Hocko) suggested excluding/rejecting soft limit but
> I didn't know where exactly and its implications. In this block here I
> just followed the replacement of irq-off with preempt-off for RT.

Both soft limit and (these) event notifications are v1 features. Soft
limit itself is even considered a misfeature. I guess the
implications would not be many since PREEMPT_RT+memcg users would be
new(?) so should rather start with v2 anyway.

One way to disable it would be to reject writes into
memory.soft_limit_in_bytes or cgroup.event_control + documentation of
that.

Michal
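
A sketch of what rejecting those writes might look like; this is purely illustrative, no such patch exists in this thread, and the helper name is invented. The cgroup1 write handlers behind memory.soft_limit_in_bytes and cgroup.event_control would bail out early on PREEMPT_RT, leaving memcg_check_events() with nothing to signal there.

```c
#include <linux/errno.h>
#include <linux/kconfig.h>

/* Hypothetical early check for the cgroup1 soft limit / eventfd interfaces. */
static int memcg_v1_event_interface_allowed(void)
{
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
                return -EOPNOTSUPP;
        return 0;
}
```

Each affected write handler would call this before doing any work, and the restriction would then need to be documented, as Michal notes.
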
On 2022-01-13 15:48:03 [+0100], Michal Koutný wrote:
> On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > I added a preempt-disable() section restricted to RT to
> > mem_cgroup_event_ratelimit().
>
> Oh, I missed that one.
>
> (Then the decoupling of such mem_cgroup_event_ratelimit() also makes
> some more sense.)
> > That would mean that mem_cgroup_event_ratelimit() needs a
> > local_irq_save(). If that is okay then sure I can move it that way.
>
> Whatever avoids the twisted code :-)

Okay. So let me do that.

> ---
>
> > I remember Michal (Hocko) suggested excluding/rejecting soft limit but
> > I didn't know where exactly and its implications. In this block here I
> > just followed the replacement of irq-off with preempt-off for RT.
>
> Both soft limit and (these) event notifications are v1 features. Soft
> limit itself is even considered a misfeature. I guess the
> implications would not be many since PREEMPT_RT+memcg users would be
> new(?) so should rather start with v2 anyway.

People often migrate to RT so they take whatever they have. In general I
would like to keep RT & !RT in sync unless there are reasons to do it
differently.

> One way to disable it would be to reject writes into
> memory.soft_limit_in_bytes or cgroup.event_control + documentation of
> that.

So avoiding these two also avoids memcg_check_events()? Right now it
does not look like a big obstacle. It is the same pattern I'm following
for the per-CPU RMW. If it is, I could avoid the writes and #if 0 the
other function for RT.
If I remove memcg_check_events() from the equation then we could keep
the explicit irq-off regions (plus add a few). The only thing that would
look odd then is that we disable interrupts for the RMW operation and
preemption in other places (__count_memcg_events() invocations in swap.c
and vmscan.c).

Are there plans to remove v1 or is this part of "we must not break
userland"?

> Michal

Sebastian
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..d2687d5ed544b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,8 +671,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
         if (mem_cgroup_disabled())
                 return;

+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_disable();
+
         __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
         memcg_rstat_updated(memcg);
+
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_enable();
 }

 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -699,6 +705,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
         memcg = pn->memcg;

+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_disable();
+
         /* Update memcg */
         __this_cpu_add(memcg->vmstats_percpu->state[idx], val);

@@ -706,6 +715,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
         __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);

         memcg_rstat_updated(memcg);
+
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_enable();
 }

 /**
@@ -788,8 +800,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
         if (mem_cgroup_disabled())
                 return;

+        if (IS_ENABLED(PREEMPT_RT))
+                preempt_disable();
+
         __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
         memcg_rstat_updated(memcg);
+        if (IS_ENABLED(PREEMPT_RT))
+                preempt_enable();
 }

 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -810,6 +827,9 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
                                          int nr_pages)
 {
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_disable();
+
         /* pagein of a big page is an event. So, ignore page size */
         if (nr_pages > 0)
                 __count_memcg_events(memcg, PGPGIN, 1);
@@ -819,12 +839,19 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
         }

         __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_enable();
 }

 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
                                        enum mem_cgroup_events_target target)
 {
         unsigned long val, next;
+        bool ret = false;
+
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_disable();

         val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
         next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
@@ -841,9 +868,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
                         break;
                 }
                 __this_cpu_write(memcg->vmstats_percpu->targets[target], next);
-                return true;
+                ret = true;
         }
-        return false;
+        if (IS_ENABLED(CONFIG_PREEMPT_RT))
+                preempt_enable();
+        return ret;
 }

 /*
@@ -5645,12 +5674,14 @@ static int mem_cgroup_move_account(struct page *page,
         ret = 0;
         nid = folio_nid(folio);

-        local_irq_disable();
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_disable();
         mem_cgroup_charge_statistics(to, nr_pages);
         memcg_check_events(to, nid);
         mem_cgroup_charge_statistics(from, -nr_pages);
         memcg_check_events(from, nid);
-        local_irq_enable();
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_enable();
 out_unlock:
         folio_unlock(folio);
 out:
@@ -6670,10 +6701,12 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
         css_get(&memcg->css);
         commit_charge(folio, memcg);

-        local_irq_disable();
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_disable();
         mem_cgroup_charge_statistics(memcg, nr_pages);
         memcg_check_events(memcg, folio_nid(folio));
-        local_irq_enable();
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_enable();
 out:
         return ret;
 }
@@ -6785,11 +6818,20 @@ static void uncharge_batch(const struct uncharge_gather *ug)
                 memcg_oom_recover(ug->memcg);
         }

-        local_irq_save(flags);
-        __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-        __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-        memcg_check_events(ug->memcg, ug->nid);
-        local_irq_restore(flags);
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+                local_irq_save(flags);
+                __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+                __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+                memcg_check_events(ug->memcg, ug->nid);
+                local_irq_restore(flags);
+        } else {
+                preempt_disable();
+                __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+                __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+                preempt_enable();
+
+                memcg_check_events(ug->memcg, ug->nid);
+        }

         /* drop reference from uncharge_folio */
         css_put(&ug->memcg->css);
@@ -6930,10 +6972,12 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
         css_get(&memcg->css);
         commit_charge(new, memcg);

-        local_irq_save(flags);
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_save(flags);
         mem_cgroup_charge_statistics(memcg, nr_pages);
         memcg_check_events(memcg, folio_nid(new));
-        local_irq_restore(flags);
+        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                local_irq_restore(flags);
 }

 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7157,8 +7201,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
          * i_pages lock which is taken with interrupts-off. It is
          * important here to have the interrupts disabled because it is the
          * only synchronisation we have for updating the per-CPU variables.
+         * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+         * variables are synchronised by keeping preemption disabled.
          */
-        VM_BUG_ON(!irqs_disabled());
+        VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled());
         mem_cgroup_charge_statistics(memcg, -nr_entries);
         memcg_check_events(memcg, page_to_nid(page));
The per-CPU counters are modified with the non-atomic modifier. The
consistency is ensured by disabling interrupts for the update.
This breaks on PREEMPT_RT because some sections additionally
acquire a spinlock_t lock (which becomes sleeping and must not be
acquired with disabled interrupts). Another problem is that
mem_cgroup_swapout() expects to be invoked with disabled interrupts
because the caller has to acquire a spinlock_t which is acquired with
disabled interrupts. Since spinlock_t never disables interrupts on
PREEMPT_RT the interrupts are never disabled at this point.

The code is never called from in_irq() context on PREEMPT_RT therefore
disabling preemption during the update is sufficient on PREEMPT_RT.
The sections with disabled preemption must exclude
memcg_check_events() so that spinlock_t locks can still be acquired
(for instance in eventfd_signal()).

Don't disable interrupts during updates of the per-CPU variables;
instead use shorter sections which disable preemption.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 14 deletions(-)