
[v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Message ID YhUKRzEKxMvlGQ5n@fuller.cnet (mailing list archive)
State New
Headers show
Series [v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu | expand

Commit Message

Marcelo Tosatti Feb. 22, 2022, 4:07 p.m. UTC
On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing lower-priority tasks on those
CPUs is undesirable, since that will either hang the system
or cause a longer interruption of the FIFO task due to the
lower-priority task running with very small scheduling slices.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
pagevec during the migration temporarily") relies on 
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch updates in lru_cache_disable()).
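
For illustration, the ordering being relied upon can be sketched as
follows. This is a simplified example, not the actual mm/swap.c code;
the example_* names and the drain helper are invented:

#include <linux/atomic.h>
#include <linux/pagevec.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/rcupdate.h>

static atomic_t example_disable_count = ATOMIC_INIT(0);
static DEFINE_PER_CPU(struct pagevec, example_pvec);

static void example_drain(struct pagevec *pvec)
{
	/* The real code would move the batched pages to the LRU here. */
	pagevec_reinit(pvec);
}

/*
 * Reader side: pagevec users only touch the per-CPU state inside a
 * preemption-disabled region (local_lock, or rcu_read_lock taken by
 * rt_spin_lock on PREEMPT_RT), so synchronize_rcu() waits for them.
 */
static void example_cache_add(struct page *page)
{
	struct pagevec *pvec;

	preempt_disable();		/* stands in for local_lock() */
	pvec = this_cpu_ptr(&example_pvec);
	if (atomic_read(&example_disable_count))
		example_drain(pvec);
	else
		pagevec_add(pvec, page);
	preempt_enable();
}

/*
 * Writer side: once synchronize_rcu() returns, every reader section
 * that started before the atomic_inc() has finished, and every later
 * reader section is guaranteed to observe the raised count.
 */
void example_cache_disable(void)
{
	atomic_inc(&example_disable_count);
	synchronize_rcu();
}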

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938]  __schedule+0x21b/0x5b0
[ 1873.243941]  schedule+0x43/0xe0
[ 1873.243943]  schedule_timeout+0x14d/0x190
[ 1873.243946]  ? resched_curr+0x20/0xe0
[ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
[ 1873.243958]  wait_for_completion+0x84/0xe0
[ 1873.243962]  __flush_work.isra.0+0x146/0x200
[ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
[ 1873.243978]  do_migrate_pages+0x3d/0x2d0
[ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
[ 1873.243992]  ? pick_next_task+0xb30/0xbd0
[ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005]  ? __switch_to+0x12f/0x510
[ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016]  process_one_work+0x1e0/0x410
[ 1873.244019]  worker_thread+0x50/0x3b0
[ 1873.244022]  ? process_one_work+0x410/0x410
[ 1873.244024]  kthread+0x173/0x190
[ 1873.244027]  ? set_kthread_struct+0x40/0x40
[ 1873.244031]  ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v3: update stale comment			(Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c		(Nicolas Saenz Julienne)

Comments

Nicolas Saenz Julienne Feb. 22, 2022, 4:25 p.m. UTC | #1
On Tue, 2022-02-22 at 13:07 -0300, Marcelo Tosatti wrote:
> On systems that run FIFO:1 applications that busy loop 
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task 
> with very small sched slices).
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
> 
> Fixes:
> 
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938]  __schedule+0x21b/0x5b0
> [ 1873.243941]  schedule+0x43/0xe0
> [ 1873.243943]  schedule_timeout+0x14d/0x190
> [ 1873.243946]  ? resched_curr+0x20/0xe0
> [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958]  wait_for_completion+0x84/0xe0
> [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005]  ? __switch_to+0x12f/0x510
> [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016]  process_one_work+0x1e0/0x410
> [ 1873.244019]  worker_thread+0x50/0x3b0
> [ 1873.244022]  ? process_one_work+0x410/0x410
> [ 1873.244024]  kthread+0x173/0x190
> [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> [ 1873.244031]  ret_from_fork+0x1f/0x30
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Regards,
Andrew Morton March 4, 2022, 1:03 a.m. UTC | #2
(Question for paulmck below, please)

On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:

> 
> On systems that run FIFO:1 applications that busy loop 
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task 
> with very small sched slices).
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
> 
> Fixes:
> 
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938]  __schedule+0x21b/0x5b0
> [ 1873.243941]  schedule+0x43/0xe0
> [ 1873.243943]  schedule_timeout+0x14d/0x190
> [ 1873.243946]  ? resched_curr+0x20/0xe0
> [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958]  wait_for_completion+0x84/0xe0
> [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005]  ? __switch_to+0x12f/0x510
> [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016]  process_one_work+0x1e0/0x410
> [ 1873.244019]  worker_thread+0x50/0x3b0
> [ 1873.244022]  ? process_one_work+0x410/0x410
> [ 1873.244024]  kthread+0x173/0x190
> [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> [ 1873.244031]  ret_from_fork+0x1f/0x30
> 
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> -		if (force_all_cpus ||
> -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

This change appears to be "don't queue work on CPUs which don't have
any work to do".  Correct?  This isn't changelogged?

> @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>  void lru_cache_disable(void)
>  {
>  	atomic_inc(&lru_disable_count);
> +	synchronize_rcu();
>  #ifdef CONFIG_SMP
>  	/*
> -	 * lru_add_drain_all in the force mode will schedule draining on
> -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> -	 * local_lock or preemption disabled would be ordered by that.
> -	 * The atomic operation doesn't need to have stronger ordering
> -	 * requirements because that is enforced by the scheduling
> -	 * guarantees.
> +	 * synchronize_rcu() waits for preemption disabled
> +	 * and RCU read side critical sections.
> +	 * For the users of lru_disable_count:
> +	 *
> +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> +	 *
> +	 * so any calls of lru_cache_disabled wrapped by local_lock or
> +	 * preemption disabled would be ordered by that.
>  	 */
>  	__lru_add_drain_all(true);
>  #else

Does this also work with CONFIG_TINY_RCU?

This seems abusive of synchronize_rcu().  None of this code uses RCU,
but it so happens that synchronize_rcu() happily provides the desired
effects.  Changes in RCU's happy side-effects might break this. 
Perhaps a formal API function which does whatever-you-want-it-to-do
would be better.
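
For illustration only, such a wrapper might be nothing more than the
sketch below; the name is invented, and today it would simply map to
synchronize_rcu():

	/* Hypothetical API making the intended semantic explicit. */
	static inline void synchronize_preempt_disabled_sections(void)
	{
		synchronize_rcu();
	}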

And...  I really don't understand the fix.  What is it about
synchronize_rcu() which guarantees that a work function which is queued
on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
userspace?
Paul E. McKenney March 4, 2022, 1:49 a.m. UTC | #3
On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
> 
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > 
> > On systems that run FIFO:1 applications that busy loop 
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task 
> > with very small sched slices).
> > 
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> > pagevec during the migration temporarily") relies on 
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> > 
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> > 
> > Fixes:
> > 
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938]  __schedule+0x21b/0x5b0
> > [ 1873.243941]  schedule+0x43/0xe0
> > [ 1873.243943]  schedule_timeout+0x14d/0x190
> > [ 1873.243946]  ? resched_curr+0x20/0xe0
> > [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958]  wait_for_completion+0x84/0xe0
> > [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> > [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005]  ? __switch_to+0x12f/0x510
> > [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016]  process_one_work+0x1e0/0x410
> > [ 1873.244019]  worker_thread+0x50/0x3b0
> > [ 1873.244022]  ? process_one_work+0x410/0x410
> > [ 1873.244024]  kthread+0x173/0x190
> > [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> > [ 1873.244031]  ret_from_fork+0x1f/0x30
> > 
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> >  	for_each_online_cpu(cpu) {
> >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >  
> > -		if (force_all_cpus ||
> > -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> >  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> 
> This change appears to be "don't queue work on CPUs which don't have
> any work to do".  Correct?  This isn't changelogged?
> 
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> >  void lru_cache_disable(void)
> >  {
> >  	atomic_inc(&lru_disable_count);
> > +	synchronize_rcu();
> >  #ifdef CONFIG_SMP
> >  	/*
> > -	 * lru_add_drain_all in the force mode will schedule draining on
> > -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> > -	 * local_lock or preemption disabled would be ordered by that.
> > -	 * The atomic operation doesn't need to have stronger ordering
> > -	 * requirements because that is enforced by the scheduling
> > -	 * guarantees.
> > +	 * synchronize_rcu() waits for preemption disabled
> > +	 * and RCU read side critical sections.
> > +	 * For the users of lru_disable_count:
> > +	 *
> > +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> > +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> > +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> > +	 *
> > +	 * so any calls of lru_cache_disabled wrapped by local_lock or
> > +	 * preemption disabled would be ordered by that.
> >  	 */
> >  	__lru_add_drain_all(true);
> >  #else
> 
> Does this also work with CONFIG_TINY_RCU?
> 
> This seems abusive of synchronize_rcu().  None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects.  Changes in RCU's happy side-effects might break this. 
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.

I don't claim to understand the full lru_cache_disable() use case, but
since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
regions of code.  In contrast, back in the old days, you had to use
synchronize_sched() to wait on preempt_disable() regions, even in
CONFIG_PREEMPT=y kernels.  So if the comment is accurate, it is OK.

Just be careful what you backport past v5.1...

> And...  I really don't understand the fix.  What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

I don't understand this part, either.

							Thanx, Paul
Marcelo Tosatti March 4, 2022, 3:08 p.m. UTC | #4
On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > (Question for paulmck below, please)
> > 
> > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> > > 
> > > On systems that run FIFO:1 applications that busy loop 
> > > on isolated CPUs, executing tasks on such CPUs under
> > > lower priority is undesired (since that will either
> > > hang the system, or cause longer interruption to the
> > > FIFO task due to execution of lower priority task 
> > > with very small sched slices).
> > > 
> > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> > > pagevec during the migration temporarily") relies on 
> > > queueing work items on all online CPUs to ensure visibility
> > > of lru_disable_count.
> > > 
> > > However, its possible to use synchronize_rcu which will provide the same
> > > guarantees (see comment this patch modifies on lru_cache_disable).
> > > 
> > > Fixes:
> > > 
> > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > [ 1873.243936] Call Trace:
> > > [ 1873.243938]  __schedule+0x21b/0x5b0
> > > [ 1873.243941]  schedule+0x43/0xe0
> > > [ 1873.243943]  schedule_timeout+0x14d/0x190
> > > [ 1873.243946]  ? resched_curr+0x20/0xe0
> > > [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> > > [ 1873.243958]  wait_for_completion+0x84/0xe0
> > > [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> > > [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> > > [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> > > [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> > > [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> > > [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> > > [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> > > [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> > > [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> > > [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> > > [ 1873.244005]  ? __switch_to+0x12f/0x510
> > > [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> > > [ 1873.244016]  process_one_work+0x1e0/0x410
> > > [ 1873.244019]  worker_thread+0x50/0x3b0
> > > [ 1873.244022]  ? process_one_work+0x410/0x410
> > > [ 1873.244024]  kthread+0x173/0x190
> > > [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> > > [ 1873.244031]  ret_from_fork+0x1f/0x30
> > > 
> > > ...
> > >
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > >  	for_each_online_cpu(cpu) {
> > >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > >  
> > > -		if (force_all_cpus ||
> > > -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > >  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> > 
> > This change appears to be "don't queue work on CPUs which don't have
> > any work to do".  Correct?  This isn't changelogged?

It's replaced by synchronize_rcu(), and that is mentioned in the changelog:

"However, its possible to use synchronize_rcu which will provide the same
 guarantees (see comment this patch modifies on lru_cache_disable)."

Will resend -v4 with a more verbose changelog.


> > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > >  void lru_cache_disable(void)
> > >  {
> > >  	atomic_inc(&lru_disable_count);
> > > +	synchronize_rcu();
> > >  #ifdef CONFIG_SMP
> > >  	/*
> > > -	 * lru_add_drain_all in the force mode will schedule draining on
> > > -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > -	 * local_lock or preemption disabled would be ordered by that.
> > > -	 * The atomic operation doesn't need to have stronger ordering
> > > -	 * requirements because that is enforced by the scheduling
> > > -	 * guarantees.
> > > +	 * synchronize_rcu() waits for preemption disabled
> > > +	 * and RCU read side critical sections.
> > > +	 * For the users of lru_disable_count:
> > > +	 *
> > > +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> > > +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> > > +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> > > +	 *
> > > +	 * so any calls of lru_cache_disabled wrapped by local_lock or
> > > +	 * preemption disabled would be ordered by that.
> > >  	 */
> > >  	__lru_add_drain_all(true);
> > >  #else
> > 
> > Does this also work with CONFIG_TINY_RCU?
> > 
> > This seems abusive of synchronize_rcu().  None of this code uses RCU,
> > but it so happens that synchronize_rcu() happily provides the desired
> > effects.  Changes in RCU's happy side-effects might break this. 
> > Perhaps a formal API function which does whatever-you-want-it-to-do
> > would be better.
> 
> I don't claim to understand the full lru_cache_disable() use case, but
> since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> regions of code.  In contrast, back in the old days, you had to use
> synchronize_sched() to wait on preempt_disable() regions, even in
> CONFIG_PREEMPT=y kernels.  So if the comment is accurate, it is OK.

OK, will add an additional comment regarding v5.1.

> Just be careful what you backport past v5.1...
> 
> > And...  I really don't understand the fix.  What is it about
> > synchronize_rcu() which guarantees that a work function which is queued
> > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > userspace?
> 
> I don't understand this part, either.


All CPUs should see lru_disable_count (and therefore stop adding pages
to their per-CPU LRU pvecs); otherwise the page migration bug fixed
by d479960e44f27e0e52ba31b21740b703c538027c can occur.

To do this, the commit above ("mm: disable LRU 
pagevec during the migration temporarily") relies on 
queueing work items on all online CPUs to ensure visibility
of lru_disable_count:

 */
+void lru_cache_disable(void)
+{
+       atomic_inc(&lru_disable_count);
+#ifdef CONFIG_SMP
+       /*
+        * lru_add_drain_all in the force mode will schedule draining on
+        * all online CPUs so any calls of lru_cache_disabled wrapped by
+        * local_lock or preemption disabled would be ordered by that.
+        * The atomic operation doesn't need to have stronger ordering
+        * requirements because that is enforeced by the scheduling
+        * guarantees.
+        */
+       __lru_add_drain_all(true);
+#else


CPU-0					CPU-1

					
					local_lock(&lru_pvecs.lock);
                			pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
					add page to per-CPU LRU pvec
					if atomic_read(lru_disable_count) != 0
						flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

					local_unlock(&lru_pvec.lock)
lru_add_drain_all(force_all_cpus=true)

However, queueing the work items disturbs isolated CPUs. To avoid that,
it's possible to use synchronize_rcu() instead:

CPU-0                                   CPU-1


                                        local_lock(&lru_pvecs.lock);
                                        pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
                                        add page to per-CPU LRU pvec
                                        if atomic_read(lru_disable_count) != 0
                                                flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

                                        local_unlock(&lru_pvec.lock)
synchronize_rcu()

This waits for all preemption-disabled (or IRQ-disabled) sections to
complete, therefore ensuring visibility of lru_disable_count.
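
The same ordering can be demonstrated with a small user-space analogy
built on liburcu. This is only an illustration of the guarantee, not
kernel code; it assumes liburcu's classic unprefixed API (<urcu.h>,
link with -lurcu -lpthread), and all names below are made up:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <urcu.h>

static atomic_int disable_count;
static atomic_int batched;	/* stands in for a per-CPU LRU pvec */

static void *reader(void *arg)
{
	(void)arg;
	rcu_register_thread();
	for (int i = 0; i < 1000000; i++) {
		rcu_read_lock();	/* analogue of preemption disabled */
		if (atomic_load(&disable_count))
			atomic_store(&batched, 0);	/* "flush the pvec" */
		else
			atomic_fetch_add(&batched, 1);	/* "add page to pvec" */
		rcu_read_unlock();
	}
	rcu_unregister_thread();
	return NULL;
}

int main(void)
{
	pthread_t t;

	rcu_register_thread();
	pthread_create(&t, NULL, reader, NULL);
	atomic_fetch_add(&disable_count, 1);	/* atomic_inc(lru_disable_count) */
	synchronize_rcu();
	/*
	 * From here on, every reader critical section observes
	 * disable_count != 0 and flushes instead of batching.
	 */
	pthread_join(t, NULL);
	printf("batched after disable: %d\n", atomic_load(&batched));
	rcu_unregister_thread();
	return 0;
}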
Marcelo Tosatti March 4, 2022, 3:11 p.m. UTC | #5
On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
> 
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > 
> > On systems that run FIFO:1 applications that busy loop 
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task 
> > with very small sched slices).
> > 
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> > pagevec during the migration temporarily") relies on 
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> > 
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> > 
> > Fixes:
> > 
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938]  __schedule+0x21b/0x5b0
> > [ 1873.243941]  schedule+0x43/0xe0
> > [ 1873.243943]  schedule_timeout+0x14d/0x190
> > [ 1873.243946]  ? resched_curr+0x20/0xe0
> > [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958]  wait_for_completion+0x84/0xe0
> > [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> > [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005]  ? __switch_to+0x12f/0x510
> > [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016]  process_one_work+0x1e0/0x410
> > [ 1873.244019]  worker_thread+0x50/0x3b0
> > [ 1873.244022]  ? process_one_work+0x410/0x410
> > [ 1873.244024]  kthread+0x173/0x190
> > [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> > [ 1873.244031]  ret_from_fork+0x1f/0x30
> > 
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> >  	for_each_online_cpu(cpu) {
> >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >  
> > -		if (force_all_cpus ||
> > -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> >  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> 
> This change appears to be "don't queue work on CPUs which don't have
> any work to do".  Correct?  This isn't changelogged?
> 
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> >  void lru_cache_disable(void)
> >  {
> >  	atomic_inc(&lru_disable_count);
> > +	synchronize_rcu();
> >  #ifdef CONFIG_SMP
> >  	/*
> > -	 * lru_add_drain_all in the force mode will schedule draining on
> > -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> > -	 * local_lock or preemption disabled would be ordered by that.
> > -	 * The atomic operation doesn't need to have stronger ordering
> > -	 * requirements because that is enforced by the scheduling
> > -	 * guarantees.
> > +	 * synchronize_rcu() waits for preemption disabled
> > +	 * and RCU read side critical sections.
> > +	 * For the users of lru_disable_count:
> > +	 *
> > +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> > +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> > +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> > +	 *
> > +	 * so any calls of lru_cache_disabled wrapped by local_lock or
> > +	 * preemption disabled would be ordered by that.
> >  	 */
> >  	__lru_add_drain_all(true);
> >  #else
> 
> Does this also work with CONFIG_TINY_RCU?
> 
> This seems abusive of synchronize_rcu().  None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects.  Changes in RCU's happy side-effects might break this. 
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.
> 
> And...  I really don't understand the fix.  What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

It does not. synchronize_rcu() replaces the queueing of the work
functions as the mechanism for ensuring visibility of lru_disable_count.
Paul E. McKenney March 4, 2022, 4:02 p.m. UTC | #6
On Fri, Mar 04, 2022 at 12:08:46PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> > On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > > (Question for paulmck below, please)
> > > 
> > > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > 
> > > > 
> > > > On systems that run FIFO:1 applications that busy loop 
> > > > on isolated CPUs, executing tasks on such CPUs under
> > > > lower priority is undesired (since that will either
> > > > hang the system, or cause longer interruption to the
> > > > FIFO task due to execution of lower priority task 
> > > > with very small sched slices).
> > > > 
> > > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> > > > pagevec during the migration temporarily") relies on 
> > > > queueing work items on all online CPUs to ensure visibility
> > > > of lru_disable_count.
> > > > 
> > > > However, its possible to use synchronize_rcu which will provide the same
> > > > guarantees (see comment this patch modifies on lru_cache_disable).
> > > > 
> > > > Fixes:
> > > > 
> > > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > > [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> > > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> > > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > > [ 1873.243936] Call Trace:
> > > > [ 1873.243938]  __schedule+0x21b/0x5b0
> > > > [ 1873.243941]  schedule+0x43/0xe0
> > > > [ 1873.243943]  schedule_timeout+0x14d/0x190
> > > > [ 1873.243946]  ? resched_curr+0x20/0xe0
> > > > [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> > > > [ 1873.243958]  wait_for_completion+0x84/0xe0
> > > > [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> > > > [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> > > > [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> > > > [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> > > > [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> > > > [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> > > > [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> > > > [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> > > > [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> > > > [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> > > > [ 1873.244005]  ? __switch_to+0x12f/0x510
> > > > [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> > > > [ 1873.244016]  process_one_work+0x1e0/0x410
> > > > [ 1873.244019]  worker_thread+0x50/0x3b0
> > > > [ 1873.244022]  ? process_one_work+0x410/0x410
> > > > [ 1873.244024]  kthread+0x173/0x190
> > > > [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> > > > [ 1873.244031]  ret_from_fork+0x1f/0x30
> > > > 
> > > > ...
> > > >
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > > >  	for_each_online_cpu(cpu) {
> > > >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > > >  
> > > > -		if (force_all_cpus ||
> > > > -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > > +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > >  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > > >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > > >  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> > > 
> > > This change appears to be "don't queue work on CPUs which don't have
> > > any work to do".  Correct?  This isn't changelogged?
> 
> Its replaced by synchronize_rcu, and its mentioned in the changelog:
> 
> "However, its possible to use synchronize_rcu which will provide the same
>  guarantees (see comment this patch modifies on lru_cache_disable)."
> 
> Will resend -v4 with a more verbose changelog.
> 
> 
> > > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > > >  void lru_cache_disable(void)
> > > >  {
> > > >  	atomic_inc(&lru_disable_count);
> > > > +	synchronize_rcu();
> > > >  #ifdef CONFIG_SMP
> > > >  	/*
> > > > -	 * lru_add_drain_all in the force mode will schedule draining on
> > > > -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > > -	 * local_lock or preemption disabled would be ordered by that.
> > > > -	 * The atomic operation doesn't need to have stronger ordering
> > > > -	 * requirements because that is enforced by the scheduling
> > > > -	 * guarantees.
> > > > +	 * synchronize_rcu() waits for preemption disabled
> > > > +	 * and RCU read side critical sections.
> > > > +	 * For the users of lru_disable_count:
> > > > +	 *
> > > > +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> > > > +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> > > > +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> > > > +	 *
> > > > +	 * so any calls of lru_cache_disabled wrapped by local_lock or
> > > > +	 * preemption disabled would be ordered by that.
> > > >  	 */
> > > >  	__lru_add_drain_all(true);
> > > >  #else
> > > 
> > > Does this also work with CONFIG_TINY_RCU?
> > > 
> > > This seems abusive of synchronize_rcu().  None of this code uses RCU,
> > > but it so happens that synchronize_rcu() happily provides the desired
> > > effects.  Changes in RCU's happy side-effects might break this. 
> > > Perhaps a formal API function which does whatever-you-want-it-to-do
> > > would be better.
> > 
> > I don't claim to understand the full lru_cache_disable() use case, but
> > since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> > regions of code.  In contrast, back in the old days, you had to use
> > synchronize_sched() to wait on preempt_disable() regions, even in
> > CONFIG_PREEMPT=y kernels.  So if the comment is accurate, it is OK.
> 
> OK, will add an additional comment regarding v5.1.

And if someone does need to backport to a kernel version with old-style
limited synchronize_rcu() semantics, you can do this:

	synchronize_rcu_mult(call_rcu, call_rcu_sched);

This will wait concurrently for an RCU and an RCU-sched grace period.

If you want the full-up semantics, you can do this to wait on all
three flavors, also including RCU-bh:

	synchronize_rcu_mult(call_rcu, call_rcu_sched, call_rcu_bh);

> > Just be careful what you backport past v5.1...
> > 
> > > And...  I really don't understand the fix.  What is it about
> > > synchronize_rcu() which guarantees that a work function which is queued
> > > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > > userspace?
> > 
> > I don't understand this part, either.
> 
> 
> All CPUs should see lru_disable_count (and therefore not add pages
> to per-CPU LRU pvecs, otherwise the page migration bug fixed
> by d479960e44f27e0e52ba31b21740b703c538027c can occur.
> 
> To do this, the commit above ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count:
> 
>  */
> +void lru_cache_disable(void)
> +{
> +       atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> +       /*
> +        * lru_add_drain_all in the force mode will schedule draining on
> +        * all online CPUs so any calls of lru_cache_disabled wrapped by
> +        * local_lock or preemption disabled would be ordered by that.
> +        * The atomic operation doesn't need to have stronger ordering
> +        * requirements because that is enforeced by the scheduling
> +        * guarantees.
> +        */
> +       __lru_add_drain_all(true);
> +#else
> 
> 
> CPU-0					CPU-1
> 
> 					
> 					local_lock(&lru_pvecs.lock);
>                 			pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
> 					add page to per-CPU LRU pvec
> 					if atomic_read(lru_disable_count) != 0
> 						flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
> 
> 					local_unlock(&lru_pvec.lock)
> lru_add_drain_all(force_all_cpus=true)
> 
> However queueing the work items disturbs isolated CPUs. To avoid it, its
> possible to use synchronize_rcu instead:
> 
> CPU-0                                   CPU-1
> 
> 
>                                         local_lock(&lru_pvecs.lock);
>                                         pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
>                                         add page to per-CPU LRU pvec
>                                         if atomic_read(lru_disable_count) != 0
>                                                 flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
> 
>                                         local_unlock(&lru_pvec.lock)
> synchronize_rcu()
> 
> Which will wait for all preemption (or IRQ disabled) sections to
> complete, therefore ensuring visibilily of lru_disable_count.

OK, that does sound plausible.  (But please note that I am not familiar
with that code.)

							Thanx, Paul

Patch

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..abb26293e7c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@  inline void __lru_add_drain_all(bool force_all_cpus)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (force_all_cpus ||
-		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,14 +875,19 @@  atomic_t lru_disable_count = ATOMIC_INIT(0);
 void lru_cache_disable(void)
 {
 	atomic_inc(&lru_disable_count);
+	synchronize_rcu();
 #ifdef CONFIG_SMP
 	/*
-	 * lru_add_drain_all in the force mode will schedule draining on
-	 * all online CPUs so any calls of lru_cache_disabled wrapped by
-	 * local_lock or preemption disabled would be ordered by that.
-	 * The atomic operation doesn't need to have stronger ordering
-	 * requirements because that is enforced by the scheduling
-	 * guarantees.
+	 * synchronize_rcu() waits for preemption disabled
+	 * and RCU read side critical sections.
+	 * For the users of lru_disable_count:
+	 *
+	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
+	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
+	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
+	 *
+	 * so any calls of lru_cache_disabled wrapped by local_lock or
+	 * preemption disabled would be ordered by that.
 	 */
 	__lru_add_drain_all(true);
 #else