
[v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Message ID Yin7hDxdt0s/x+fp@fuller.cnet (mailing list archive)
State New
Series [v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Commit Message

Marcelo Tosatti March 10, 2022, 1:22 p.m. UTC
On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute
on such a CPU (such as work threads) will not
be scheduled, which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees.

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable  [bh_lru_lock()]
rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>

---
 
 v5: changelog improvements  		(Andrew Morton)
 v4: improve comment clarity, mention synchronize_rcu guarantees
     on v5.1				(Andrew Morton /
						 Paul E. McKenney)
 v3: update stale comment		(Nicolas Saenz Julienne)
 v2: rt_spin_lock calls rcu_read_lock, no need
 to add it before local_lock on swap.c	(Nicolas Saenz Julienne)
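
The reader/writer pattern described in the commit message can be
sketched in user space with liburcu. This is an illustration only, with
placeholder names and comments; the real change is the mm/swap.c hunk in
the Patch section at the bottom of this page. Build with
"cc sketch.c -lurcu -lpthread".

/*
 * Sketch only: user-space analogue of the lru_disable_count pattern,
 * using liburcu. Placeholder names, not the mm/swap.c code.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <urcu.h>

static atomic_int disable_count;

static void *reader(void *arg)
{
	(void)arg;
	rcu_register_thread();

	rcu_read_lock();	/* stands in for preempt_disable()/local_lock() */
	if (atomic_load(&disable_count) == 0) {
		/* fast path: would add the page to the per-CPU pagevec */
	} else {
		/* slow path: would bypass the pagevec */
	}
	rcu_read_unlock();

	rcu_unregister_thread();
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reader, NULL);

	/* writer side, as in lru_cache_disable() */
	atomic_fetch_add(&disable_count, 1);
	synchronize_rcu();	/* readers that could have seen 0 have finished */
	/* now it is safe to drain the pagevecs */

	pthread_join(t, NULL);
	return 0;
}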

Comments

Andrew Morton March 11, 2022, 2:23 a.m. UTC | #1
On Thu, 10 Mar 2022 10:22:12 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:

> 
> On systems that run FIFO:1 applications that busy loop,
> any SCHED_OTHER task that attempts to execute
> on such a CPU (such as work threads) will not
> be scheduled, which leads to system hangs.
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> To fix this, replace the usage of work items with synchronize_rcu,
> which provides the same guarantees.
> 
> Readers of lru_disable_count are protected by either disabling
> preemption or rcu_read_lock:
> 
> preempt_disable, local_irq_disable  [bh_lru_lock()]
> rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
> preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]
> 
> Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> preempt_disable() regions of code. So any CPU which sees
> lru_disable_count = 0 will have exited the critical
> section when synchronize_rcu() returns.

Permitting a realtime thread to hang the entire system warrants a
-stable backport, I think.  That's just rude.

I'm inclined to send this upstream for 5.18-rc1, with that -stable tag.

But if agreeable, how far can we backport this?  Paul, do we know which
kernel version(s) have the desired synchronize_rcu() behaviour?

Now, we don't want -stable people backporting this into kernels where
synchronize_rcu() doesn't do what we want it to do.  So a sneaky thing
we could do is to identify the change which added the desired
synchronize_rcu() behaviour and make this patch Fixes:thatpatch.  That
should prevent people from backporting it too far.
Sebastian Andrzej Siewior March 11, 2022, 8:35 a.m. UTC | #2
+ sched division

On 2022-03-10 18:23:26 [-0800], Andrew Morton wrote:
> On Thu, 10 Mar 2022 10:22:12 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On systems that run FIFO:1 applications that busy loop,
> > any SCHED_OTHER task that attempts to execute
> > on such a CPU (such as work threads) will not
> > be scheduled, which leads to system hangs.
…
> Permitting a realtime thread to hang the entire system warrants a
> -stable backport, I think.  That's just rude.

I'm not sure that someone isn't willingly breaking the system here. Based
on my experience, a thread with an elevated priority (that is, FIFO, RR
or DL) should not hog the CPU. A normal user (!root && !CAP_SYS_NICE)
can't increase the priority of the task.
To avoid a system hangup there is sched_rt_runtime_us, which ensures
that all RT threads are throttled once the RT class exceeds a certain
amount of runtime. This has been relaxed a little on systems with more
CPUs so that the RT runtime can be shared, but this sharing
(RT_RUNTIME_SHARE) has been disabled by default a while ago. That safety
switch (sched_rt_runtime_us) can be disabled, and usually is disabled on
RT systems, since the RT tasks usually run longer, especially in corner
cases.
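
For reference, sched_rt_runtime_us and its companion sched_rt_period_us
live under /proc/sys/kernel/; the following minimal sketch (an
illustration, not code from this thread) prints the budget they define.
The defaults are 950000us of RT runtime per 1000000us period, and -1
disables the throttling.

/* Sketch: print the RT throttling budget from the procfs knobs. */
#include <stdio.h>

static long read_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f || fscanf(f, "%ld", &val) != 1)
		val = -2;	/* read failure, distinct from the -1 "disabled" value */
	if (f)
		fclose(f);
	return val;
}

int main(void)
{
	long runtime = read_knob("/proc/sys/kernel/sched_rt_runtime_us");
	long period  = read_knob("/proc/sys/kernel/sched_rt_period_us");

	if (runtime == -1)
		printf("RT throttling disabled (sched_rt_runtime_us = -1)\n");
	else if (runtime >= 0 && period > 0)
		printf("RT class may use %ld of every %ld us (%.1f%%)\n",
		       runtime, period, 100.0 * runtime / period);
	else
		printf("could not read the scheduling knobs\n");
	return 0;
}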

People often isolate CPUs and have busy-loop tasks running with
SCHED_OTHER; given that there is nothing else going on there, there will
be no preemption. In this case, the worker would preempt the task.
In this scenario I _can_ understand that one wants to avoid that
preemption and try things differently, like this patch suggests. We can
even offload RCU threads from isolated CPUs.
But I wouldn't say this requires a backport, because there is a way for
an RT thread that claims 100% of the CPU to break the system.

Sebastian
Andrew Morton March 12, 2022, 12:40 a.m. UTC | #3
On Fri, 11 Mar 2022 09:35:49 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> + sched division
> 
> On 2022-03-10 18:23:26 [-0800], Andrew Morton wrote:
> > On Thu, 10 Mar 2022 10:22:12 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > On systems that run FIFO:1 applications that busy loop,
> > > any SCHED_OTHER task that attempts to execute
> > > on such a CPU (such as work threads) will not
> > > be scheduled, which leads to system hangs.
> …
> > 
> > Permitting a realtime thread to hang the entire system warrants a
> > -stable backport, I think.  That's just rude.
> 
> I'm not sure if someone is not willingly breaking the system. Based on
> my experience, a thread with an elevated priority (that FIFO, RR or DL)
> should not hog the CPU. A normal user (!root && !CAP_SYS_NICE) can't
> increase the priority of the task.
> To avoid a system hangup there is sched_rt_runtime_us which ensures that
> all RT threads are throttled once the RT class exceed a certain amount
> of runtime. This has been relaxed a little on systems with more CPUs so
> that the RT runtime can be shared but this sharing (RT_RUNTIME_SHARE)
> has been disabled by default a while ago. That safe switch
> (sched_rt_runtime_us) can be disabled and is usually disabled on RT
> system since the RT tasks usually run longer especially in corner cases.

Does all this apply if the kernel is non-preemptible?

Marcelo, do you know how the offending system bypassed the failsafe?
Marcelo Tosatti March 12, 2022, 8:39 p.m. UTC | #4
On Fri, Mar 11, 2022 at 09:35:49AM +0100, Sebastian Andrzej Siewior wrote:
> + sched division
> 
> On 2022-03-10 18:23:26 [-0800], Andrew Morton wrote:
> > On Thu, 10 Mar 2022 10:22:12 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > On systems that run FIFO:1 applications that busy loop,
> > > any SCHED_OTHER task that attempts to execute
> > > on such a CPU (such as work threads) will not
> > > be scheduled, which leads to system hangs.
> …
> > 
> > Permitting a realtime thread to hang the entire system warrants a
> > -stable backport, I think.  That's just rude.
> 
> I'm not sure if someone is not willingly breaking the system. Based on
> my experience, a thread with an elevated priority (that FIFO, RR or DL)
> should not hog the CPU. A normal user (!root && !CAP_SYS_NICE) can't
> increase the priority of the task.
> To avoid a system hangup there is sched_rt_runtime_us which ensures that
> all RT threads are throttled once the RT class exceed a certain amount
> of runtime. This has been relaxed a little on systems with more CPUs so
> that the RT runtime can be shared but this sharing (RT_RUNTIME_SHARE)
> has been disabled by default a while ago. That safe switch
> (sched_rt_runtime_us) can be disabled and is usually disabled on RT
> system since the RT tasks usually run longer especially in corner cases.

Sebastian,

Certain classes of applications appear to benefit from very low latency
(or rather a very low, down to zero, task interruption length). For
example, 5G RAN processing might require a maximum interruption of
< 10us.

IIRC the resolution of sched_rt_runtime_us being too high led to the
creation of stalld:

https://github.com/bristot/stalld

The stalld program (which stands for 'stall daemon') is a mechanism to
prevent the starvation of operating system threads in a Linux system.
The premise is to start up on a housekeeping cpu (one that is not used
for real-application purposes) and to periodically monitor the state of
each thread in the system, looking for a thread that has been on a run
queue (i.e. ready to run) for a specified length of time without being
run. This condition is usually hit when the thread is on the same cpu
as a high-priority cpu-intensive task and therefore is being given no
opportunity to run.

When a thread is judged to be starving, stalld changes that thread to
use the SCHED_DEADLINE policy and gives the thread a small slice of time
for that cpu (specified on the command line). The thread then runs and
when that timeslice is used, the thread is then returned to its original
scheduling policy and stalld then continues to monitor thread states.

---
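
Illustration only (this is not stalld's code): the boost-then-restore
idea amounts to something like the sketch below, built around the raw
sched_setattr() syscall. struct dl_attr, set_attr() and
boost_for_one_slice() are made-up names, the 10us/1s budget matches the
example discussed next, and admission-control details (affinity, and the
CAP_SYS_NICE privilege this needs) are glossed over.

/*
 * Simplified sketch of boosting a starving task to SCHED_DEADLINE for a
 * tiny slice and then restoring SCHED_OTHER. Not stalld's actual code.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct dl_attr {		/* same layout as the kernel's struct sched_attr */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

static int set_attr(pid_t pid, struct dl_attr *attr)
{
	return syscall(SYS_sched_setattr, pid, attr, 0);
}

/* Give a starving task a 10us slice per 1s period, then put it back. */
static void boost_for_one_slice(pid_t starving)
{
	struct dl_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 10 * 1000;		/* 10us */
	attr.sched_deadline = 1000 * 1000 * 1000;	/* 1s */
	attr.sched_period   = 1000 * 1000 * 1000;	/* 1s */
	if (set_attr(starving, &attr))
		perror("boost to SCHED_DEADLINE");

	sleep(1);	/* let the task consume its slice */

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;	/* restore SCHED_OTHER (assumed original policy) */
	if (set_attr(starving, &attr))
		perror("restore SCHED_OTHER");
}

int main(int argc, char **argv)
{
	if (argc == 2)
		boost_for_one_slice((pid_t)atoi(argv[1]));
	return 0;
}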

Even configuring a time slice of 10us per second for the lower priority
tasks that stalld schedules, you'd still get an interruption longer than
10us to the "RT" high priority application.

So our impression is that for such low latency requirements it's
better to avoid any lower priority work at all on the isolated CPUs.

Yes, there are still cases where work queues are required to do certain
things... we have been trying to reduce such cases.

> People often isolate CPUs and have busy-loop tasks running with
> SCHED_OTHER given that there is nothing else going on there will be no
> preemption. In this case, the worker would preempt the task.
> In this scenario I _can_ understand that one wants to avoid that
> preemption and try things differently like this patch suggests. We can
> even offload RCU thread from isolated CPUs.
> But I wouldn't say this requires a backport because there is way for a
> RT thread, that claims 100% of the CPU, to break the system.

On RHEL-8 we had patches (from the -RT tree) to avoid workqueues for
flushing memory. So for RHEL-8 -> RHEL-9 the lack of those patches
is a regression (workqueues are used and the flush is not performed
remotely).

But one might still see the pagevecs not being empty at
__lru_add_drain_all:

                if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
                    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
                    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
                    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
                    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
                    need_activate_page_drain(cpu) ||
                    has_bh_in_lru(cpu, NULL)) {
                        INIT_WORK(work, lru_add_drain_per_cpu);
                        queue_work_on(cpu, mm_percpu_wq, work);
                        __cpumask_set_cpu(cpu, &has_work);
                }

Which needs fixing for a complete solution.
Hillf Danton March 13, 2022, 9:23 a.m. UTC | #5
On Sat, 12 Mar 2022 17:39:40 -0300 Marcelo Tosatti wrote:
> On Fri, Mar 11, 2022 at 09:35:49AM +0100, Sebastian Andrzej Siewior wrote:
> > + sched division
> > 
> > On 2022-03-10 18:23:26 [-0800], Andrew Morton wrote:
> > > On Thu, 10 Mar 2022 10:22:12 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > > On systems that run FIFO:1 applications that busy loop,
> > > > any SCHED_OTHER task that attempts to execute
> > > > on such a CPU (such as work threads) will not
> > > > be scheduled, which leads to system hangs.
> > …
> > > 
> > > Permitting a realtime thread to hang the entire system warrants a
> > > -stable backport, I think.  That's just rude.
> > 
> > I'm not sure if someone is not willingly breaking the system. Based on
> > my experience, a thread with an elevated priority (that FIFO, RR or DL)
> > should not hog the CPU. A normal user (!root && !CAP_SYS_NICE) can't
> > increase the priority of the task.
> > To avoid a system hangup there is sched_rt_runtime_us which ensures that
> > all RT threads are throttled once the RT class exceed a certain amount
> > of runtime. This has been relaxed a little on systems with more CPUs so
> > that the RT runtime can be shared but this sharing (RT_RUNTIME_SHARE)
> > has been disabled by default a while ago. That safe switch
> > (sched_rt_runtime_us) can be disabled and is usually disabled on RT
> > system since the RT tasks usually run longer especially in corner cases.
> 
> Sebastian,
> 
> Certain classes of applications appear to benefit from very low latency 
> (or rather very low, to zero, task interruption length). For example,
> 5G RAN processing might require maximum interruption of < 10us.
> 
> IIRC the resolution of sched_rt_runtime_us being too high led to the
> creation of stalld:
> 
> https://github.com/bristot/stalld
> 
> The stalld program (which stands for 'stall daemon') is a mechanism to
> prevent the starvation of operating system threads in a Linux system.
> The premise is to start up on a housekeeping cpu (one that is not used
> for real-application purposes) and to periodically monitor the state of
> each thread in the system, looking for a thread that has been on a run
> queue (i.e. ready to run) for a specifed length of time without being
> run. This condition is usually hit when the thread is on the same cpu
> as a high-priority cpu-intensive task and therefore is being given no
> opportunity to run.
> 
> When a thread is judged to be starving, stalld changes that thread to
> use the SCHED_DEADLINE policy and gives the thread a small slice of time
> for that cpu (specified on the command line). The thread then runs and
> when that timeslice is used, the thread is then returned to its original
> scheduling policy and stalld then continues to monitor thread states.
> 
> ---
> 
> Configuring 10us per second for time slice of lower priority tasks 
> that stalld schedules, you'd still get a higher interruption than 
> 10us to the "RT" high priority application.
> 
> So our impression seems to be that for such low latency requirements its
> better to avoid any lower priority work at all on the isolated CPUs.

I wonder if draining the lru cache is not a part of isolation.

> 
> Yes, there are still cases where work queues are required to do certain
> things... we have been trying to reduce such cases.

If workqueues are still required on the isolated CPUs, then how do you
work around the system hang mentioned at the beginning of your commit
log?

Hillf
> 
> > People often isolate CPUs and have busy-loop tasks running with
> > SCHED_OTHER given that there is nothing else going on there will be no
> > preemption. In this case, the worker would preempt the task.
> > In this scenario I _can_ understand that one wants to avoid that
> > preemption and try things differently like this patch suggests. We can
> > even offload RCU thread from isolated CPUs.
> > But I wouldn't say this requires a backport because there is way for a
> > RT thread, that claims 100% of the CPU, to break the system.
> 
> On RHEL-8 we had patches (from -RT tree) to avoid workqueues for 
> flushing memory. So for RHEL-8 -> RHEL-9 the lack of the patch above
> is a regression (and the usage of workqueues and not performing
> the flush remotely).
> 
> But one might still see the pagevecs not being empty at
> __lru_add_drain_all:
> 
>                 if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>                     data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>                     pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>                     pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
>                     pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
>                     need_activate_page_drain(cpu) ||
>                     has_bh_in_lru(cpu, NULL)) {
>                         INIT_WORK(work, lru_add_drain_per_cpu);
>                         queue_work_on(cpu, mm_percpu_wq, work);
>                         __cpumask_set_cpu(cpu, &has_work);
>                 }
> 
> Which needs fixing for a complete solution.
Borislav Petkov March 31, 2022, 1:52 p.m. UTC | #6
On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> 
> On systems that run FIFO:1 applications that busy loop,
> any SCHED_OTHER task that attempts to execute
> on such a CPU (such as work threads) will not
> be scheduled, which leads to system hangs.
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> To fix this, replace the usage of work items with synchronize_rcu,
> which provides the same guarantees.
> 
> Readers of lru_disable_count are protected by either disabling
> preemption or rcu_read_lock:
> 
> preempt_disable, local_irq_disable  [bh_lru_lock()]
> rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
> preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]
> 
> Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> preempt_disable() regions of code. So any CPU which sees
> lru_disable_count = 0 will have exited the critical
> section when synchronize_rcu() returns.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Acked-by: Minchan Kim <minchan@kernel.org>

Someone pointed me at this:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom

which says this one causes a performance regression with stress-ng's
NUMA test...
Marcelo Tosatti April 28, 2022, 6 p.m. UTC | #7
On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> > 
> > On systems that run FIFO:1 applications that busy loop,
> > any SCHED_OTHER task that attempts to execute
> > on such a CPU (such as work threads) will not
> > be scheduled, which leads to system hangs.
> > 
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> > 
> > To fix this, replace the usage of work items with synchronize_rcu,
> > which provides the same guarantees.
> > 
> > Readers of lru_disable_count are protected by either disabling
> > preemption or rcu_read_lock:
> > 
> > preempt_disable, local_irq_disable  [bh_lru_lock()]
> > rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
> > preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]
> > 
> > Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> > preempt_disable() regions of code. So any CPU which sees
> > lru_disable_count = 0 will have exited the critical
> > section when synchronize_rcu() returns.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> > Acked-by: Minchan Kim <minchan@kernel.org>
> 
> Someone pointed me at this:
> 
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
> 
> which says this one causes a performance regression with stress-ng's
> NUMA test...

Michael,

This is probably do_migrate_pages that is taking too long due to
synchronize_rcu().

Switching to synchronize_rcu_expedited() should probably fix it...
Can you give it a try, please?

diff --git a/mm/swap.c b/mm/swap.c
index bceff0cb559c..04a8bbf9817a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -879,7 +879,7 @@ void lru_cache_disable(void)
 	 * lru_disable_count = 0 will have exited the critical
 	 * section when synchronize_rcu() returns.
 	 */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
Andrew Morton May 28, 2022, 9:18 p.m. UTC | #8
On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> > > 
> ...
>
> > 
> > Someone pointed me at this:
> > 
> > https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
> > 
> > which says this one causes a performance regression with stress-ng's
> > NUMA test...
> 
> Michael,
> 
> This is probably do_migrate_pages that is taking too long due to
> synchronize_rcu().
> 
> Switching to synchronize_rcu_expedited() should probably fix it...
> Can you give it a try, please?

I guess not.

Is anyone else able to demonstrate a stress-ng performance regression
due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
one-liner?

> diff --git a/mm/swap.c b/mm/swap.c
> index bceff0cb559c..04a8bbf9817a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>  	 * lru_disable_count = 0 will have exited the critical
>  	 * section when synchronize_rcu() returns.
>  	 */
> -	synchronize_rcu();
> +	synchronize_rcu_expedited();
>  #ifdef CONFIG_SMP
>  	__lru_add_drain_all(true);
>  #else
> 
>
Michael Larabel May 28, 2022, 10:54 p.m. UTC | #9
On 5/28/22 16:18, Andrew Morton wrote:
> On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
>> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
>>> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>> ...
>>
>>> Someone pointed me at this:
>>>
>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>>>
>>> which says this one causes a performance regression with stress-ng's
>>> NUMA test...
>> Michael,
>>
>> This is probably do_migrate_pages that is taking too long due to
>> synchronize_rcu().
>>
>> Switching to synchronize_rcu_expedited() should probably fix it...
>> Can you give it a try, please?
> I guess not.
>
> Is anyone else able to demonstrate a stress-ng performance regression
> due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
> one-liner?


Apologies, I don't believe I got the email previously (or it ended up
in spam or was otherwise overlooked), so I just noticed this thread now...

I have the system around and will work on verifying it can still
reproduce and can then test the patch; I should be able to get to it
tomorrow.

Thanks and sorry about the delay.

Michael



>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index bceff0cb559c..04a8bbf9817a 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>>   	 * lru_disable_count = 0 will have exited the critical
>>   	 * section when synchronize_rcu() returns.
>>   	 */
>> -	synchronize_rcu();
>> +	synchronize_rcu_expedited();
>>   #ifdef CONFIG_SMP
>>   	__lru_add_drain_all(true);
>>   #else
>>
>>
Michael Larabel May 29, 2022, 12:48 a.m. UTC | #10
On 5/28/22 17:54, Michael Larabel wrote:
> On 5/28/22 16:18, Andrew Morton wrote:
>> On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti 
>> <mtosatti@redhat.com> wrote:
>>
>>> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
>>>> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>>> ...
>>>
>>>> Someone pointed me at this:
>>>>
>>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom 
>>>>
>>>>
>>>> which says this one causes a performance regression with stress-ng's
>>>> NUMA test...
>>> Michael,
>>>
>>> This is probably do_migrate_pages that is taking too long due to
>>> synchronize_rcu().
>>>
>>> Switching to synchronize_rcu_expedited() should probably fix it...
>>> Can you give it a try, please?
>> I guess not.
>>
>> Is anyone else able to demonstrate a stress-ng performance regression
>> due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
>> one-liner?
>
>
> Apologies I don't believe I got the email previously (or if it ended 
> up in spam or otherwise overlooked) so just noticed this thread now...
>
> I have the system around and will work on verifying it can reproduce 
> still and can then test the patch, should be able to get it tomorrow.
>
> Thanks and sorry about the delay.
>
> Michael
>
>

I had a chance to look at it today. I was able to reproduce the
regression still on that 5950X system going from v5.17 to v5.18 (using a
newer stress-ng benchmark and other system changes since the prior
tests). I confirmed it also still showed slower as of today's Git.

I can confirm with Marcelo's patch below that the stress-ng NUMA 
performance is back to the v5.17 level of performance (actually, faster) 
and certainly not like what I was seeing on v5.18 or Git to this point.

So all seems to be good with that one-liner for the stress-ng NUMA test
case. All the system details and results for those interested are
documented @ https://openbenchmarking.org/result/2205284-PTS-NUMAREGR17
but it basically amounts to:

     Stress-NG 0.14
     Test: NUMA
     Bogo Ops/s > Higher Is Better
     v5.17: 412.88
     v5.18: 49.33
     20220528 Git: 49.66
     20220528 Git + sched-rcu-exped patch: 468.81

Apologies again about the delay / not seeing the email thread earlier.

Thanks,

Michael


Tested-by: Michael Larabel <Michael@MichaelLarabel.com>



>
>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index bceff0cb559c..04a8bbf9817a 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>>>        * lru_disable_count = 0 will have exited the critical
>>>        * section when synchronize_rcu() returns.
>>>        */
>>> -    synchronize_rcu();
>>> +    synchronize_rcu_expedited();
>>>   #ifdef CONFIG_SMP
>>>       __lru_add_drain_all(true);
>>>   #else
>>>
>>>
Thorsten Leemhuis June 19, 2022, 12:14 p.m. UTC | #11
Hi, this is your Linux kernel regression tracker.

On 29.05.22 02:48, Michael Larabel wrote:
> On 5/28/22 17:54, Michael Larabel wrote:
>> On 5/28/22 16:18, Andrew Morton wrote:
>>> On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti
>>> <mtosatti@redhat.com> wrote:
>>>> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
>>>>> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>>>>> Someone pointed me at this:
>>>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>>>>>
>>>>> which says this one causes a performance regression with stress-ng's
>>>>> NUMA test...
>>>>
>>>> This is probably do_migrate_pages that is taking too long due to
>>>> synchronize_rcu().
>>>>
>>>> Switching to synchronize_rcu_expedited() should probably fix it...
>>>> Can you give it a try, please?
>>> I guess not.
>>>
>>> Is anyone else able to demonstrate a stress-ng performance regression
>>> due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
>>> one-liner?
>>
>> Apologies I don't believe I got the email previously (or if it ended
>> up in spam or otherwise overlooked) so just noticed this thread now...
>>
>> I have the system around and will work on verifying it can reproduce
>> still and can then test the patch, should be able to get it tomorrow.
>>
>> Thanks and sorry about the delay.
> 
> Had a chance to look at it today still. I was able to reproduce the
> regression still on that 5950X system going from v5.17 to v5.18 (using
> newer stress-ng benchmark and other system changes since the prior
> tests). Confirmed it also still showed slower as of today's Git.
> 
> I can confirm with Marcelo's patch below that the stress-ng NUMA
> performance is back to the v5.17 level of performance (actually, faster)
> and certainly not like what I was seeing on v5.18 or Git to this point.
> 
> So all seems to be good with that one-liner for the stress-ng NUMA test
> case. All the system details and results for those interested is
> documented @ https://openbenchmarking.org/result/2205284-PTS-NUMAREGR17
> but basically amounts to:
> 
>     Stress-NG 0.14
>     Test: NUMA
>     Bogo Ops/s > Higher Is Better
>     v5.17: 412.88
>     v5.18: 49.33
>     20220528 Git: 49.66
>     20220528 Git + sched-rcu-exped patch: 468.81
> 
> Apologies again about the delay / not seeing the email thread earlier.
>
> Thanks,
> 
> Michael
> 
> Tested-by: Michael Larabel <Michael@MichaelLarabel.com>

Andrew, is there a reason why this patch afaics isn't mainlined yet and
is lingering in linux-next for so long? Three weeks ago Michael confirmed
that this patch fixes a regression, and a few days later Stefan confirmed
that his problem was solved as well:
https://lore.kernel.org/regressions/79bb603e-37cb-d1dd-1e12-7ce28d7cfdae@i2se.com/

Reminder: unless there are good reasons, it shouldn't take this long to
mainline a fix like this, for the reasons explained in
https://docs.kernel.org/process/handling-regressions.html

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>> index bceff0cb559c..04a8bbf9817a 100644
>>>> --- a/mm/swap.c
>>>> +++ b/mm/swap.c
>>>> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>>>>        * lru_disable_count = 0 will have exited the critical
>>>>        * section when synchronize_rcu() returns.
>>>>        */
>>>> -    synchronize_rcu();
>>>> +    synchronize_rcu_expedited();
>>>>   #ifdef CONFIG_SMP
>>>>       __lru_add_drain_all(true);
>>>>   #else
>>>>
>>>>
Andrew Morton June 22, 2022, 12:15 a.m. UTC | #12
On Sun, 19 Jun 2022 14:14:03 +0200 Thorsten Leemhuis <regressions@leemhuis.info> wrote:

> Andrew, is there a reason why this patch afaics isn't mainlined yet and
> lingering in linux-next for so long? 

I didn't bother doing a hotfixes merge last week because there wasn't anything
very urgent-looking in there.  I'll be putting together a pull request later this week.

Patch

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@  inline void __lru_add_drain_all(bool force_all_cpus)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (force_all_cpus ||
-		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@  atomic_t lru_disable_count = ATOMIC_INIT(0);
 void lru_cache_disable(void)
 {
 	atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
 	/*
-	 * lru_add_drain_all in the force mode will schedule draining on
-	 * all online CPUs so any calls of lru_cache_disabled wrapped by
-	 * local_lock or preemption disabled would be ordered by that.
-	 * The atomic operation doesn't need to have stronger ordering
-	 * requirements because that is enforced by the scheduling
-	 * guarantees.
+	 * Readers of lru_disable_count are protected by either disabling
+	 * preemption or rcu_read_lock:
+	 *
+	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
+	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
+	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
+	 *
+	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+	 * preempt_disable() regions of code. So any CPU which sees
+	 * lru_disable_count = 0 will have exited the critical
+	 * section when synchronize_rcu() returns.
 	 */
+	synchronize_rcu();
+#ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
 	lru_add_and_bh_lrus_drain();