diff mbox series

[2/2] mm/vmstat: Protect per cpu variables with preempt disable on RT

Message ID 20210723100034.13353-3-mgorman@techsingularity.net (mailing list archive)
State New
Headers show
Series Protect vmstats on PREEMPT_RT | expand

Commit Message

Mel Gorman July 23, 2021, 10 a.m. UTC
From: Ingo Molnar <mingo@elte.hu>

Disable preemption on -RT for the vmstat code. On vanila the code runs
in IRQ-off regions while on -RT it may not when stats are updated under
a local_lock. "preempt_disable" ensures that the same resources is not
updated in parallel due to preemption.

This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly "allowed
to be to be racy" so there is no need to protect them from preemption. Only
the accurate page stats that are updated by a read-modify-write need
protection.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmstat.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Thomas Gleixner Aug. 3, 2021, 11:54 p.m. UTC | #1
Mel!

On Fri, Jul 23 2021 at 11:00, Mel Gorman wrote:
> From: Ingo Molnar <mingo@elte.hu>
>
> Disable preemption on -RT for the vmstat code. On vanila the code runs
> in IRQ-off regions while on -RT it may not when stats are updated under
> a local_lock. "preempt_disable" ensures that the same resources is not
> updated in parallel due to preemption.
>
> This patch differs from the preempt-rt version where __count_vm_event and
> __count_vm_events are also protected. The counters are explicitly "allowed
> to be to be racy" so there is no need to protect them from preemption. Only
> the accurate page stats that are updated by a read-modify-write need
> protection.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmstat.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b0534e068166..d06332c221b1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -319,6 +319,7 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  	long x;
>  	long t;
>  
> +	preempt_disable_rt();

Yes, this is smart to some extent. But in reality it's a bandaid simply
because nobody can tell which item of vmstat requires which protection.

If you go back in RT history then you will figure out that we were able
to eliminate _all_ occurences of preempt_disable_rt() except for this
one.

Even mm developers are wary about this:

 <tglx> so in vmstat.c there is this magic comment:
 <tglx>  * For use when we know that interrupts are disabled
 <tglx>  * or when we know that preemption is disabled and that
 <tglx>  * particular counter cannot be updated from interrupt context.
 <tglx> how can I know which counters need what?
 <mm_expert> I don't think there's a list, one would have to check on counter to counter basis :/ 
 <tglx> and of course there is nothing which validates that, right?
 <mm_expert> exactly

Brilliant stuff which prevents you to do any validation on this. Over
the years there have been several issues where callers had to be fixed
by analysing bug reports instead of having a proper instrumentation in
that code which would have told the developer that he got it wrong.

Of course on RT kernels the preempt_disable_rt() will serialize
everything correctly, but as we have learned over the years just
slapping _if_rt() or if_not_rt() variants of things around is most of
the time papering over the underlying problem of badly defined
protection scopes. Let's not proliferate that. As I said in the above
IRC conversation:

 <tglx> I fundamentally hate this preempt_disable_rt() muck

Thanks,

        tglx
Mel Gorman Aug. 4, 2021, 9:54 a.m. UTC | #2
On Wed, Aug 04, 2021 at 01:54:47AM +0200, Thomas Gleixner wrote:
> Mel!
> 

I hope the ! is not a sign that the patch is a wtf :)

> On Fri, Jul 23 2021 at 11:00, Mel Gorman wrote:
> > From: Ingo Molnar <mingo@elte.hu>
> >
> > Disable preemption on -RT for the vmstat code. On vanila the code runs
> > in IRQ-off regions while on -RT it may not when stats are updated under
> > a local_lock. "preempt_disable" ensures that the same resources is not
> > updated in parallel due to preemption.
> >
> > This patch differs from the preempt-rt version where __count_vm_event and
> > __count_vm_events are also protected. The counters are explicitly "allowed
> > to be to be racy" so there is no need to protect them from preemption. Only
> > the accurate page stats that are updated by a read-modify-write need
> > protection.
> >
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  mm/vmstat.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index b0534e068166..d06332c221b1 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -319,6 +319,7 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> >  	long x;
> >  	long t;
> >  
> > +	preempt_disable_rt();
> 
> Yes, this is smart to some extent. But in reality it's a bandaid simply
> because nobody can tell which item of vmstat requires which protection.
> 

They were not split correctly when first introduced unfortunately. I don't
recall why but it's very likely that it simply was not thought about very
heavily. The naming doesn't help because there is no clue to the names if
accuracy is needed for functional correctness. I only put heavy thought
into it when I was dealing with excessive overhead from accurate counters.

Most of zone_stat_item and node_stat_item *must* be accurate for
correctness (e.g. NR_FREE_PAGES drifting would be very bad) but not all.
NR_VMSCAN_IMMEDIATE does not need special protection for example, same
for NR_WRITTEN and NR_DIRTIED. Some of the WORKINGSET ones may not need
protection either but I didn't audit every single one. Either way, some
of the events could be moved out.

Commit f19298b9516c ("mm/vmstat: convert NUMA statistics to basic NUMA
counters") is an example where some counters were moved because they did
not require accuracy and were updated from fast paths. The existance of
the path allows some other counters to move relatively easily.

> If you go back in RT history then you will figure out that we were able
> to eliminate _all_ occurences of preempt_disable_rt() except for this
> one.
> 
> Even mm developers are wary about this:
> 
>  <tglx> so in vmstat.c there is this magic comment:
>  <tglx>  * For use when we know that interrupts are disabled
>  <tglx>  * or when we know that preemption is disabled and that
>  <tglx>  * particular counter cannot be updated from interrupt context.
>  <tglx> how can I know which counters need what?
>  <mm_expert> I don't think there's a list, one would have to check on counter to counter basis :/ 
>  <tglx> and of course there is nothing which validates that, right?
>  <mm_expert> exactly
> 

While I'm not "mm_expert", I agree with his/her statements. Each counter
would need to be audited and two question are asked

 o If this counter is inaccurate, does anything break?
 o If this counter is inaccurate, does it both increment and decrement
   allowing the possibility it goes negative?

The decision on that is looking at the counter and seeing if any
functional decision is made based on its value. So two examples;

	NR_VMSCAN_IMMEDIATE is a node-based counter that only every
	increments and is reported to userspace. No kernel code makes
	any decision based on its value. Therefore it's likely safe to
	move to numa_stat_item instead.

	Action: move it

	WORKINGSET_ACTIVATE_FILE is a node-based counter that is used to
	determine if a mem cgroup is potentially congested by looking at
	the ratio of cgroup to node refault rates as well as deciding if
	LRU file pages should be deactivate.  If that value drifts, the
	ratios are miscalculated and could lead to functional oddities
	and therefore must be accurate.

	Action: leave it alone

I guess it could be further split into state that must be accurate from
IRQ and non-IRQ contexts but that probably would be very fragile and
offer limited value.

> Brilliant stuff which prevents you to do any validation on this. Over
> the years there have been several issues where callers had to be fixed
> by analysing bug reports instead of having a proper instrumentation in
> that code which would have told the developer that he got it wrong.
> 

I'm not sure it could be validated at build-time but I'm just back from
holiday and may be lacking imagination.

> Of course on RT kernels the preempt_disable_rt() will serialize
> everything correctly, but as we have learned over the years just
> slapping _if_rt() or if_not_rt() variants of things around is most of
> the time papering over the underlying problem of badly defined
> protection scopes. Let's not proliferate that. As I said in the above
> IRC conversation:
> 
>  <tglx> I fundamentally hate this preempt_disable_rt() muck
> 

The issue is that even if this was properly audited and the inaccurate
and accurate counters were in the proper enums using the correct APIs, it
would still be necessary to protect the accurate counters from updates from
IRQ context. Hence, as I write this, I don't think preempt_[dis|en]able_rt
would go away and that is why I didn't continue with the series to break
out "accurate" stats
Vlastimil Babka Aug. 4, 2021, 1:42 p.m. UTC | #3
On 8/4/21 11:54 AM, Mel Gorman wrote:
> On Wed, Aug 04, 2021 at 01:54:47AM +0200, Thomas Gleixner wrote:
>> 
>>  <tglx> so in vmstat.c there is this magic comment:
>>  <tglx>  * For use when we know that interrupts are disabled
>>  <tglx>  * or when we know that preemption is disabled and that
>>  <tglx>  * particular counter cannot be updated from interrupt context.
>>  <tglx> how can I know which counters need what?
>>  <mm_expert> I don't think there's a list, one would have to check on counter to counter basis :/ 
>>  <tglx> and of course there is nothing which validates that, right?
>>  <mm_expert> exactly
>> 
> 
> While I'm not "mm_expert", I agree with his/her statements.

Phew, since you do, I can now disclose it was me.

> Each counter
> would need to be audited and two question are asked
> 
>  o If this counter is inaccurate, does anything break?
>  o If this counter is inaccurate, does it both increment and decrement
>    allowing the possibility it goes negative?
> 
> The decision on that is looking at the counter and seeing if any
> functional decision is made based on its value. So two examples;
> 
> 	NR_VMSCAN_IMMEDIATE is a node-based counter that only every
> 	increments and is reported to userspace. No kernel code makes
> 	any decision based on its value. Therefore it's likely safe to
> 	move to numa_stat_item instead.
> 
> 	Action: move it
> 
> 	WORKINGSET_ACTIVATE_FILE is a node-based counter that is used to
> 	determine if a mem cgroup is potentially congested by looking at
> 	the ratio of cgroup to node refault rates as well as deciding if
> 	LRU file pages should be deactivate.  If that value drifts, the
> 	ratios are miscalculated and could lead to functional oddities
> 	and therefore must be accurate.
> 
> 	Action: leave it alone
> 
> I guess it could be further split into state that must be accurate from
> IRQ and non-IRQ contexts but that probably would be very fragile and
> offer limited value.
> 
>> Brilliant stuff which prevents you to do any validation on this. Over
>> the years there have been several issues where callers had to be fixed
>> by analysing bug reports instead of having a proper instrumentation in
>> that code which would have told the developer that he got it wrong.
>> 
> 
> I'm not sure it could be validated at build-time but I'm just back from
> holiday and may be lacking imagination.

The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
whatnot), i.e.:

<sched_expert> what that code needs is switch(item) { case foo1: case foo2:
lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
something along those lines

>> Of course on RT kernels the preempt_disable_rt() will serialize
>> everything correctly, but as we have learned over the years just
>> slapping _if_rt() or if_not_rt() variants of things around is most of
>> the time papering over the underlying problem of badly defined
>> protection scopes. Let's not proliferate that. As I said in the above
>> IRC conversation:
>> 
>>  <tglx> I fundamentally hate this preempt_disable_rt() muck
>> 
> 
> The issue is that even if this was properly audited and the inaccurate
> and accurate counters were in the proper enums using the correct APIs, it
> would still be necessary to protect the accurate counters from updates from
> IRQ context. Hence, as I write this, I don't think preempt_[dis|en]able_rt
> would go away and that is why I didn't continue with the series to break
> out "accurate" stats
>
Mel Gorman Aug. 4, 2021, 2:23 p.m. UTC | #4
On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
> > Each counter
> > would need to be audited and two question are asked
> > 
> >  o If this counter is inaccurate, does anything break?
> >  o If this counter is inaccurate, does it both increment and decrement
> >    allowing the possibility it goes negative?
> > 
> > The decision on that is looking at the counter and seeing if any
> > functional decision is made based on its value. So two examples;
> > 
> > 	NR_VMSCAN_IMMEDIATE is a node-based counter that only every
> > 	increments and is reported to userspace. No kernel code makes
> > 	any decision based on its value. Therefore it's likely safe to
> > 	move to numa_stat_item instead.
> > 
> > 	Action: move it
> > 
> > 	WORKINGSET_ACTIVATE_FILE is a node-based counter that is used to
> > 	determine if a mem cgroup is potentially congested by looking at
> > 	the ratio of cgroup to node refault rates as well as deciding if
> > 	LRU file pages should be deactivate.  If that value drifts, the
> > 	ratios are miscalculated and could lead to functional oddities
> > 	and therefore must be accurate.
> > 
> > 	Action: leave it alone
> > 
> > I guess it could be further split into state that must be accurate from
> > IRQ and non-IRQ contexts but that probably would be very fragile and
> > offer limited value.
> > 
> >> Brilliant stuff which prevents you to do any validation on this. Over
> >> the years there have been several issues where callers had to be fixed
> >> by analysing bug reports instead of having a proper instrumentation in
> >> that code which would have told the developer that he got it wrong.
> >> 
> > 
> > I'm not sure it could be validated at build-time but I'm just back from
> > holiday and may be lacking imagination.
> 
> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
> whatnot), i.e.:
> 
> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
> something along those lines
> 

Ok, that would potentially work. It may not even need to split the stats
into different enums. Simply document which stats need protection from
IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
are disabled depending on the kernel config. I don't think it gets rid
of preempt_disable_rt unless the API was completely reworked with entry
points that describe the locking requirements. That would be tricky
because the requirements differ between kernel configurations.

This would be independent of moving stats that do not need to be 100%
accurate to the inaccurate-but-fast API.
Thomas Gleixner Aug. 5, 2021, 12:56 p.m. UTC | #5
On Wed, Aug 04 2021 at 15:23, Mel Gorman wrote:
Mel,

> On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
>> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
>> whatnot), i.e.:
>> 
>> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
>> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
>> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
>> something along those lines
>> 
> Ok, that would potentially work. It may not even need to split the stats
> into different enums. Simply document which stats need protection from
> IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
> are disabled depending on the kernel config. I don't think it gets rid
> of preempt_disable_rt unless the API was completely reworked with entry
> points that describe the locking requirements. That would be tricky
> because the requirements differ between kernel configurations.

Right. This won't get rid of the preempt disabling on RT, but I think we
should rather open code this

       if (IS_ENABLED(CONFIG_PREEMPT_RT))
       		preempt_dis/enable();

instead of proliferating these helper macros which have only one user left.

Thanks,

        tglx
Mel Gorman Aug. 5, 2021, 2:04 p.m. UTC | #6
On Thu, Aug 05, 2021 at 02:56:53PM +0200, Thomas Gleixner wrote:
> On Wed, Aug 04 2021 at 15:23, Mel Gorman wrote:
> Mel,
> 
> > On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
> >> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
> >> whatnot), i.e.:
> >> 
> >> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
> >> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
> >> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
> >> something along those lines
> >> 
> > Ok, that would potentially work. It may not even need to split the stats
> > into different enums. Simply document which stats need protection from
> > IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
> > are disabled depending on the kernel config. I don't think it gets rid
> > of preempt_disable_rt unless the API was completely reworked with entry
> > points that describe the locking requirements. That would be tricky
> > because the requirements differ between kernel configurations.
> 
> Right. This won't get rid of the preempt disabling on RT, but I think we
> should rather open code this
> 
>        if (IS_ENABLED(CONFIG_PREEMPT_RT))
>        		preempt_dis/enable();
> 
> instead of proliferating these helper macros which have only one user left.
> 

Ok, that is reasonable. I tried creating a vmstat-specific helper but the
names were misleading so I ended up with the patch below which open-codes
it as you suggest. The comment is not accurate because "locking/local_lock:
Add RT support" is not upstream but it'll eventually be accurate.

Is this ok?

--8<--
From e5b7a2ffcf55e4b4030fd54e49f5c5a1d1864ebe Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 3 Jul 2009 08:30:13 -0500
Subject: [PATCH] mm/vmstat: Protect per cpu variables with preempt disable on
 RT

Disable preemption on -RT for the vmstat code. On vanila the code runs
in IRQ-off regions while on -RT it may not when stats are updated under
a local_lock. "preempt_disable" ensures that the same resources is not
updated in parallel due to preemption.

This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly "allowed
to be to be racy" so there is no need to protect them from preemption. Only
the accurate page stats that are updated by a read-modify-write need
protection. This patch also differs in that a preempt_[en|dis]able_rt
helper is not used. As vmstat is the only user of the helper, it was
suggested that it be open-coded in vmstat.c instead of risking the helper
being used in unnecessary contexts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index b0534e068166..2c7e7569a453 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -319,6 +319,16 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 	long x;
 	long t;
 
+	/*
+	 * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
+	 * atomicity is provided by IRQs being disabled -- either explicitly
+	 * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
+	 * CPU migrations and preemption potentially corrupts a counter so
+	 * disable preemption.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -328,6 +338,9 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -350,6 +363,10 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		delta >>= PAGE_SHIFT;
 	}
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -359,6 +376,9 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 EXPORT_SYMBOL(__mod_node_page_state);
 
@@ -391,6 +411,10 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -399,6 +423,9 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -409,6 +436,10 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -417,6 +448,9 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -437,6 +471,10 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -445,6 +483,9 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -455,6 +496,10 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	/* See __mod_node_page_state */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -463,6 +508,9 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
 	}
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
Thomas Gleixner Aug. 5, 2021, 3:42 p.m. UTC | #7
On Thu, Aug 05 2021 at 15:04, Mel Gorman wrote:
> On Thu, Aug 05, 2021 at 02:56:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Aug 04 2021 at 15:23, Mel Gorman wrote:
>> Mel,
>> 
>> > On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
>> >> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
>> >> whatnot), i.e.:
>> >> 
>> >> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
>> >> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
>> >> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
>> >> something along those lines
>> >> 
>> > Ok, that would potentially work. It may not even need to split the stats
>> > into different enums. Simply document which stats need protection from
>> > IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
>> > are disabled depending on the kernel config. I don't think it gets rid
>> > of preempt_disable_rt unless the API was completely reworked with entry
>> > points that describe the locking requirements. That would be tricky
>> > because the requirements differ between kernel configurations.
>> 
>> Right. This won't get rid of the preempt disabling on RT, but I think we
>> should rather open code this
>> 
>>        if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>        		preempt_dis/enable();
>> 
>> instead of proliferating these helper macros which have only one user left.
>> 
>
> Ok, that is reasonable. I tried creating a vmstat-specific helper but the
> names were misleading so I ended up with the patch below which open-codes
> it as you suggest. The comment is not accurate because "locking/local_lock:
> Add RT support" is not upstream but it'll eventually be accurate.
>
> Is this ok?

Looks good.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Daniel Vacek Aug. 5, 2021, 6:47 p.m. UTC | #8
Hi Mel.

On Thu, Aug 5, 2021 at 4:48 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Aug 05, 2021 at 02:56:53PM +0200, Thomas Gleixner wrote:
> > On Wed, Aug 04 2021 at 15:23, Mel Gorman wrote:
> > Mel,
> >
> > > On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
> > >> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
> > >> whatnot), i.e.:
> > >>
> > >> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
> > >> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
> > >> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
> > >> something along those lines
> > >>
> > > Ok, that would potentially work. It may not even need to split the stats
> > > into different enums. Simply document which stats need protection from
> > > IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
> > > are disabled depending on the kernel config. I don't think it gets rid
> > > of preempt_disable_rt unless the API was completely reworked with entry
> > > points that describe the locking requirements. That would be tricky
> > > because the requirements differ between kernel configurations.
> >
> > Right. This won't get rid of the preempt disabling on RT, but I think we
> > should rather open code this
> >
> >        if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >                       preempt_dis/enable();
> >
> > instead of proliferating these helper macros which have only one user left.
> >
>
> Ok, that is reasonable. I tried creating a vmstat-specific helper but the
> names were misleading so I ended up with the patch below which open-codes
> it as you suggest. The comment is not accurate because "locking/local_lock:
> Add RT support" is not upstream but it'll eventually be accurate.
>
> Is this ok?
>
> --8<--
> From e5b7a2ffcf55e4b4030fd54e49f5c5a1d1864ebe Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 3 Jul 2009 08:30:13 -0500
> Subject: [PATCH] mm/vmstat: Protect per cpu variables with preempt disable on
>  RT
>
> Disable preemption on -RT for the vmstat code. On vanila the code runs
> in IRQ-off regions while on -RT it may not when stats are updated under
> a local_lock. "preempt_disable" ensures that the same resources is not
> updated in parallel due to preemption.
>
> This patch differs from the preempt-rt version where __count_vm_event and
> __count_vm_events are also protected. The counters are explicitly "allowed
> to be to be racy" so there is no need to protect them from preemption. Only
> the accurate page stats that are updated by a read-modify-write need
> protection. This patch also differs in that a preempt_[en|dis]able_rt
> helper is not used. As vmstat is the only user of the helper, it was
> suggested that it be open-coded in vmstat.c instead of risking the helper
> being used in unnecessary contexts.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmstat.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b0534e068166..2c7e7569a453 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -319,6 +319,16 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>         long x;
>         long t;
>
> +       /*
> +        * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
> +        * atomicity is provided by IRQs being disabled -- either explicitly
> +        * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
> +        * CPU migrations and preemption potentially corrupts a counter so
> +        * disable preemption.
> +        */
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         x = delta + __this_cpu_read(*p);
>
>         t = __this_cpu_read(pcp->stat_threshold);
> @@ -328,6 +338,9 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>                 x = 0;
>         }
>         __this_cpu_write(*p, x);
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>  EXPORT_SYMBOL(__mod_zone_page_state);
>
> @@ -350,6 +363,10 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>                 delta >>= PAGE_SHIFT;
>         }
>
> +       /* See __mod_node_page_state */

__mod_zone_page_state?

> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         x = delta + __this_cpu_read(*p);
>
>         t = __this_cpu_read(pcp->stat_threshold);
> @@ -359,6 +376,9 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>                 x = 0;
>         }
>         __this_cpu_write(*p, x);
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>  EXPORT_SYMBOL(__mod_node_page_state);
>
> @@ -391,6 +411,10 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>         s8 __percpu *p = pcp->vm_stat_diff + item;
>         s8 v, t;
>
> +       /* See __mod_node_page_state */

ditto.

> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         v = __this_cpu_inc_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v > t)) {
> @@ -399,6 +423,9 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>                 zone_page_state_add(v + overstep, zone, item);
>                 __this_cpu_write(*p, -overstep);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>
>  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
> @@ -409,6 +436,10 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>
>         VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
>
> +       /* See __mod_node_page_state */

""

> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         v = __this_cpu_inc_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v > t)) {
> @@ -417,6 +448,9 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>                 node_page_state_add(v + overstep, pgdat, item);
>                 __this_cpu_write(*p, -overstep);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>
>  void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
> @@ -437,6 +471,10 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>         s8 __percpu *p = pcp->vm_stat_diff + item;
>         s8 v, t;
>
> +       /* See __mod_node_page_state */

...

> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         v = __this_cpu_dec_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v < - t)) {
> @@ -445,6 +483,9 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>                 zone_page_state_add(v - overstep, zone, item);
>                 __this_cpu_write(*p, overstep);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>
>  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
> @@ -455,6 +496,10 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>
>         VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
>
> +       /* See __mod_node_page_state */

and one more time here

--nX

> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_disable();
> +
>         v = __this_cpu_dec_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v < - t)) {
> @@ -463,6 +508,9 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>                 node_page_state_add(v - overstep, pgdat, item);
>                 __this_cpu_write(*p, overstep);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               preempt_enable();
>  }
>
>  void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
diff mbox series

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index b0534e068166..d06332c221b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -319,6 +319,7 @@  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 	long x;
 	long t;
 
+	preempt_disable_rt();
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -328,6 +329,7 @@  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	preempt_enable_rt();
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -350,6 +352,7 @@  void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		delta >>= PAGE_SHIFT;
 	}
 
+	preempt_disable_rt();
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -359,6 +362,7 @@  void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	preempt_enable_rt();
 }
 EXPORT_SYMBOL(__mod_node_page_state);
 
@@ -391,6 +395,7 @@  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	preempt_disable_rt();
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -399,6 +404,7 @@  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	preempt_enable_rt();
 }
 
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -409,6 +415,7 @@  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	preempt_disable_rt();
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -417,6 +424,7 @@  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	preempt_enable_rt();
 }
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -437,6 +445,7 @@  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
+	preempt_disable_rt();
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -445,6 +454,7 @@  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
 	}
+	preempt_enable_rt();
 }
 
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -455,6 +465,7 @@  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
+	preempt_disable_rt();
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -463,6 +474,7 @@  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
 	}
+	preempt_enable_rt();
 }
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)