diff mbox series

[1/1] mm/vmstat: Protect per cpu variables with preempt disable on RT

Message ID 20210805160019.1137-2-mgorman@techsingularity.net (mailing list archive)
State New
Headers show
Series Protect vmstats on PREEMPT_RT | expand

Commit Message

Mel Gorman Aug. 5, 2021, 4 p.m. UTC
From: Ingo Molnar <mingo@elte.hu>

Disable preemption on -RT for the vmstat code. On vanila the code runs
in IRQ-off regions while on -RT it may not when stats are updated under
a local_lock. "preempt_disable" ensures that the same resources is not
updated in parallel due to preemption.

This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly "allowed
to be to be racy" so there is no need to protect them from preemption. Only
the accurate page stats that are updated by a read-modify-write need
protection. This patch also differs in that a preempt_[en|dis]able_rt
helper is not used. As vmstat is the only user of the helper, it was
suggested that it be open-coded in vmstat.c instead of risking the helper
being used in unnecessary contexts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

Comments

Andrew Morton Aug. 5, 2021, 11:22 p.m. UTC | #1
On Thu,  5 Aug 2021 17:00:19 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> Disable preemption on -RT for the vmstat code. On vanila the code runs
> in IRQ-off regions while on -RT it may not when stats are updated under
> a local_lock. "preempt_disable" ensures that the same resources is not
> updated in parallel due to preemption.
> 
> This patch differs from the preempt-rt version where __count_vm_event and
> __count_vm_events are also protected. The counters are explicitly "allowed
> to be to be racy" so there is no need to protect them from preemption. Only
> the accurate page stats that are updated by a read-modify-write need
> protection. This patch also differs in that a preempt_[en|dis]able_rt
> helper is not used. As vmstat is the only user of the helper, it was
> suggested that it be open-coded in vmstat.c instead of risking the helper
> being used in unnecessary contexts.
> 
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -319,6 +319,16 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  	long x;
>  	long t;
>  
> +	/*
> +	 * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
> +	 * atomicity is provided by IRQs being disabled -- either explicitly
> +	 * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
> +	 * CPU migrations and preemption potentially corrupts a counter so
> +	 * disable preemption.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		preempt_disable();

This is so obvious I expect it has been discussed, but...  why not

static inline void preempt_disable_if_rt(void)
{
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_disable();
}

?
Vlastimil Babka Aug. 6, 2021, 7:50 a.m. UTC | #2
On 8/6/21 1:22 AM, Andrew Morton wrote:
> On Thu,  5 Aug 2021 17:00:19 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
>> From: Ingo Molnar <mingo@elte.hu>
>> 
>> Disable preemption on -RT for the vmstat code. On vanila the code runs
>> in IRQ-off regions while on -RT it may not when stats are updated under
>> a local_lock. "preempt_disable" ensures that the same resources is not
>> updated in parallel due to preemption.
>> 
>> This patch differs from the preempt-rt version where __count_vm_event and
>> __count_vm_events are also protected. The counters are explicitly "allowed
>> to be to be racy" so there is no need to protect them from preemption. Only
>> the accurate page stats that are updated by a read-modify-write need
>> protection. This patch also differs in that a preempt_[en|dis]able_rt
>> helper is not used. As vmstat is the only user of the helper, it was
>> suggested that it be open-coded in vmstat.c instead of risking the helper
>> being used in unnecessary contexts.
>> 
>> ...
>>
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -319,6 +319,16 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>>  	long x;
>>  	long t;
>>  
>> +	/*
>> +	 * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
>> +	 * atomicity is provided by IRQs being disabled -- either explicitly
>> +	 * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
>> +	 * CPU migrations and preemption potentially corrupts a counter so
>> +	 * disable preemption.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>> +		preempt_disable();
> 
> This is so obvious I expect it has been discussed, but...  why not
> 
> static inline void preempt_disable_if_rt(void)
> {
> 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> 		preempt_disable();
> }

Yeah v1 introduced similar more generic ones in include/linux/preempt.h and
Thomas didn't like that. I guess it would be more acceptable when confined to
mm/vmstat.c
Mel Gorman Aug. 6, 2021, 8:44 a.m. UTC | #3
On Thu, Aug 05, 2021 at 04:22:06PM -0700, Andrew Morton wrote:
> On Thu,  5 Aug 2021 17:00:19 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > Disable preemption on -RT for the vmstat code. On vanila the code runs
> > in IRQ-off regions while on -RT it may not when stats are updated under
> > a local_lock. "preempt_disable" ensures that the same resources is not
> > updated in parallel due to preemption.
> > 
> > This patch differs from the preempt-rt version where __count_vm_event and
> > __count_vm_events are also protected. The counters are explicitly "allowed
> > to be to be racy" so there is no need to protect them from preemption. Only
> > the accurate page stats that are updated by a read-modify-write need
> > protection. This patch also differs in that a preempt_[en|dis]able_rt
> > helper is not used. As vmstat is the only user of the helper, it was
> > suggested that it be open-coded in vmstat.c instead of risking the helper
> > being used in unnecessary contexts.
> > 
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -319,6 +319,16 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> >  	long x;
> >  	long t;
> >  
> > +	/*
> > +	 * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
> > +	 * atomicity is provided by IRQs being disabled -- either explicitly
> > +	 * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
> > +	 * CPU migrations and preemption potentially corrupts a counter so
> > +	 * disable preemption.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		preempt_disable();
> 
> This is so obvious I expect it has been discussed, but...  why not
> 
> static inline void preempt_disable_if_rt(void)
> {
> 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> 		preempt_disable();
> }
> 
> ?
> 

The changelog briefly mentions it "also differs in that a
preempt_[en|dis]able_rt" helper was not used. It is preferred that the RT
helper does not exist and potentially get reused in other contexts that
could have a different solution. Hence, it's open-coded for mm/vmstat.c
even though it looks awkward. Obviously the helper could be in mm/vmstat.c
but the only name that made sense was preempt_[en|dis]able_rt and that
would likely get promoted to a common header for some reason.

The vmstat counters are "special" in that they have to be fast, an
accurate counter must be available cheaply and they are updated from a
mix of IRQ-disabled and local_lock_irq-disabled sections where the latter
only disables CPU migrations (but not preemption) on PREEMPT_RT. It's
not a special case that should be encouraged but is somewhat justified
given how often vmstats get updated and its performance requirements.

The alternative would be to convert vmstat counters to percpu_counters.
It also takes care to protect from IRQ and preempt contexts based on
comments in the code and functionally is very similar to vmstat. However,
based on how it works, I think it would incur a performance regression
as well as having a larger memory footprint. The use of raw IRQ-safe
spinlocks risks parallel update scaling issues that vmstat avoids with
with rmw, cmpxchg and atomics depending on context combined combined with
workqueues to accumulate per-cpu values. Converting to percpu_counters
and then modifying the implementation to be similar to vmstat might work
but would also be high risk with some significant complexities such as
dealing with vmstat shepherd.

Hence, I think the open-coded is justified if somewhat clumsy so indicate
this is a special case we're willing to tolerate but also clumsy enough
that someone trying to copy it will be forced to think heavily about
their problem. The only change I'd like to make to the patch is to

s/See __mod_node_page_state/See __mod_zone_page_state/

which is based on an stupid typo compounded by cut&paste as noted by 
Daniel Vacek.
Vlastimil Babka Aug. 6, 2021, 12:38 p.m. UTC | #4
On 8/5/21 6:00 PM, Mel Gorman wrote:
> From: Ingo Molnar <mingo@elte.hu>
> 
> Disable preemption on -RT for the vmstat code. On vanila the code runs
> in IRQ-off regions while on -RT it may not when stats are updated under
> a local_lock. "preempt_disable" ensures that the same resources is not
> updated in parallel due to preemption.
> 
> This patch differs from the preempt-rt version where __count_vm_event and
> __count_vm_events are also protected. The counters are explicitly "allowed
> to be to be racy" so there is no need to protect them from preemption. Only
> the accurate page stats that are updated by a read-modify-write need
> protection. This patch also differs in that a preempt_[en|dis]able_rt
> helper is not used. As vmstat is the only user of the helper, it was
> suggested that it be open-coded in vmstat.c instead of risking the helper
> being used in unnecessary contexts.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Sebastian Andrzej Siewior Aug. 31, 2021, 4:45 p.m. UTC | #5
On 2021-08-05 17:00:19 [+0100], Mel Gorman wrote:
>  mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)

The version in RT also covered the functions
  __count_vm_event()
  __count_vm_events()

in include/linux/vmstat.h. Were they dropped by mistake or on purpose? 

Sebastian
Andrew Morton Sept. 2, 2021, 8:07 p.m. UTC | #6
On Tue, 31 Aug 2021 18:45:46 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2021-08-05 17:00:19 [+0100], Mel Gorman wrote:
> >  mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 48 insertions(+)
> 
> The version in RT also covered the functions
>   __count_vm_event()
>   __count_vm_events()
> 
> in include/linux/vmstat.h. Were they dropped by mistake or on purpose? 
> 

?
Mel Gorman Sept. 6, 2021, 1:48 p.m. UTC | #7
On Tue, Aug 31, 2021 at 06:45:46PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-08-05 17:00:19 [+0100], Mel Gorman wrote:
> >  mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 48 insertions(+)
> 
> The version in RT also covered the functions
>   __count_vm_event()
>   __count_vm_events()
> 
> in include/linux/vmstat.h. Were they dropped by mistake or on purpose? 
> 

Sorry for the long delay, this got lost in a mess of mail. It was
dropped on purpose and I tried to explain why in the changelog

	This patch differs from the preempt-rt version where
	__count_vm_event and __count_vm_events are also protected. The
	counters are explicitly "allowed to be to be racy" so there is
	no need to protect them from preemption. Only the accurate page
	stats that are updated by a read-modify-write need protection. This
	patch also differs in that a preempt_[en|dis]able_rt helper is not
	used. As vmstat is the only user of the helper, it was suggested
	that it be open-coded in vmstat.c instead of risking the helper
	being used in unnecessary contexts.

Does that answer your question?
Sebastian Andrzej Siewior Sept. 6, 2021, 2:03 p.m. UTC | #8
On 2021-09-06 14:48:20 [+0100], Mel Gorman wrote:
> Does that answer your question?

Yes, completely. Thank you.

Sebastian
diff mbox series

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index b0534e068166..2c7e7569a453 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -319,6 +319,16 @@  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 	long x;
 	long t;
 
+	/*
+	 * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
+	 * atomicity is provided by IRQs being disabled -- either explicitly
+	 * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
+	 * CPU migrations and preemption potentially corrupts a counter so
+	 * disable preemption.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -328,6 +338,9 @@  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -350,6 +363,10 @@  void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		delta >>= PAGE_SHIFT;
 	}
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -359,6 +376,9 @@  void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 EXPORT_SYMBOL(__mod_node_page_state);
 
@@ -391,6 +411,10 @@  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -399,6 +423,9 @@  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -409,6 +436,10 @@  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -417,6 +448,9 @@  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -437,6 +471,10 @@  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -445,6 +483,9 @@  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -455,6 +496,10 @@  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -463,6 +508,9 @@  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)