
[v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Message ID YiI+a9gTr/UBCf0X@fuller.cnet (mailing list archive)
State New
Series [v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Commit Message

Marcelo Tosatti March 4, 2022, 4:29 p.m. UTC
On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing tasks on such CPUs at a
lower priority is undesirable, since that will either
hang the system or cause a longer interruption of the
FIFO task while the lower-priority task runs with very
small scheduling slices.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
pagevec during the migration temporarily") relies on 
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch modifies in lru_cache_disable()).
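
A minimal userspace model of the guarantee being relied on — a sketch using
liburcu (assumed installed; build with "gcc model.c -lurcu -lpthread"), not
kernel code, with disable_count, reader() and the messages invented for
illustration: once the updater increments the counter and synchronize_rcu()
returns, any reader that could still have observed the old value of zero has
left its read-side critical section.

/* Simplified model: the reader stands in for a code path that samples
 * lru_disable_count inside a preempt-disabled or RCU read-side section;
 * the main thread stands in for lru_cache_disable(). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <urcu.h>		/* liburcu, default (memb) flavor */

static atomic_int disable_count;	/* models lru_disable_count */

static void *reader(void *arg)
{
	rcu_register_thread();

	rcu_read_lock();	/* models preempt_disable()/rcu_read_lock() */
	if (atomic_load(&disable_count) == 0)
		printf("reader: caching enabled, would use per-CPU pagevec\n");
	else
		printf("reader: caching disabled, would bypass pagevec\n");
	rcu_read_unlock();

	rcu_unregister_thread();
	return NULL;
}

int main(void)
{
	pthread_t t;

	rcu_register_thread();
	pthread_create(&t, NULL, reader, NULL);

	atomic_fetch_add(&disable_count, 1);	/* models atomic_inc() */
	synchronize_rcu();	/* waits for any reader that may have seen 0 */
	printf("updater: readers that saw 0 are done; safe to drain\n");

	pthread_join(t, NULL);
	rcu_unregister_thread();
	return 0;
}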

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938]  __schedule+0x21b/0x5b0
[ 1873.243941]  schedule+0x43/0xe0
[ 1873.243943]  schedule_timeout+0x14d/0x190
[ 1873.243946]  ? resched_curr+0x20/0xe0
[ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
[ 1873.243958]  wait_for_completion+0x84/0xe0
[ 1873.243962]  __flush_work.isra.0+0x146/0x200
[ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
[ 1873.243978]  do_migrate_pages+0x3d/0x2d0
[ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
[ 1873.243992]  ? pick_next_task+0xb30/0xbd0
[ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005]  ? __switch_to+0x12f/0x510
[ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016]  process_one_work+0x1e0/0x410
[ 1873.244019]  worker_thread+0x50/0x3b0
[ 1873.244022]  ? process_one_work+0x410/0x410
[ 1873.244024]  kthread+0x173/0x190
[ 1873.244027]  ? set_kthread_struct+0x40/0x40
[ 1873.244031]  ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
 
 ---
 
 v4: improve comment clarity, mention synchronize_rcu guarantees
     on v5.1				(Andrew Morton /
						 Paul E. McKenney)
 v3: update stale comment			(Nicolas Saenz Julienne)
 v2: rt_spin_lock calls rcu_read_lock, no need
 to add it before local_lock on swap.c		(Nicolas Saenz Julienne)

Comments

Andrew Morton March 5, 2022, 12:35 a.m. UTC | #1
On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:

>  
> On systems that run FIFO:1 applications that busy loop 
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task 
> with very small sched slices).
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
> 
> Fixes:
> 
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> -		if (force_all_cpus ||
> -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||

Please changelog this alteration?

>  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>  void lru_cache_disable(void)
>  {
>  	atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
>  	/*
> -	 * lru_add_drain_all in the force mode will schedule draining on
> -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> -	 * local_lock or preemption disabled would be ordered by that.
> -	 * The atomic operation doesn't need to have stronger ordering
> -	 * requirements because that is enforced by the scheduling
> -	 * guarantees.
> +	 * Readers of lru_disable_count are protected by either disabling
> +	 * preemption or rcu_read_lock:
> +	 *
> +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> +	 *
> +	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> +	 * preempt_disable() regions of code. So any CPU which sees
> +	 * lru_disable_count = 0 will have exited the critical
> +	 * section when synchronize_rcu() returns.
>  	 */
> +	synchronize_rcu();
> +#ifdef CONFIG_SMP
>  	__lru_add_drain_all(true);
>  #else
>  	lru_add_and_bh_lrus_drain();
Paul E. McKenney March 5, 2022, 4:33 a.m. UTC | #2
On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>  
> On systems that run FIFO:1 applications that busy loop 
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task 
> with very small sched slices).
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
> 
> Fixes:
> 
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938]  __schedule+0x21b/0x5b0
> [ 1873.243941]  schedule+0x43/0xe0
> [ 1873.243943]  schedule_timeout+0x14d/0x190
> [ 1873.243946]  ? resched_curr+0x20/0xe0
> [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958]  wait_for_completion+0x84/0xe0
> [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005]  ? __switch_to+0x12f/0x510
> [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016]  process_one_work+0x1e0/0x410
> [ 1873.244019]  worker_thread+0x50/0x3b0
> [ 1873.244022]  ? process_one_work+0x410/0x410
> [ 1873.244024]  kthread+0x173/0x190
> [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> [ 1873.244031]  ret_from_fork+0x1f/0x30
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Given the explanation and the comments below, this does look plausible
to me.

							Thanx, Paul

>  ---
>  
>  v4: improve comment clarify, mention synchronize_rcu guarantees
>      on v5.1				(Andrew Morton /
> 						 Paul E. McKenney)
>  v3: update stale comment			(Nicolas Saenz Julienne)
>  v2: rt_spin_lock calls rcu_read_lock, no need
>  to add it before local_lock on swap.c		(Nicolas Saenz Julienne) 
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..b5ee163daa66 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> -		if (force_all_cpus ||
> -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>  void lru_cache_disable(void)
>  {
>  	atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
>  	/*
> -	 * lru_add_drain_all in the force mode will schedule draining on
> -	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> -	 * local_lock or preemption disabled would be ordered by that.
> -	 * The atomic operation doesn't need to have stronger ordering
> -	 * requirements because that is enforced by the scheduling
> -	 * guarantees.
> +	 * Readers of lru_disable_count are protected by either disabling
> +	 * preemption or rcu_read_lock:
> +	 *
> +	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
> +	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
> +	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
> +	 *
> +	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> +	 * preempt_disable() regions of code. So any CPU which sees
> +	 * lru_disable_count = 0 will have exited the critical
> +	 * section when synchronize_rcu() returns.
>  	 */
> +	synchronize_rcu();
> +#ifdef CONFIG_SMP
>  	__lru_add_drain_all(true);
>  #else
>  	lru_add_and_bh_lrus_drain();
>
Marcelo Tosatti March 7, 2022, 6:52 p.m. UTC | #3
On Fri, Mar 04, 2022 at 04:35:54PM -0800, Andrew Morton wrote:
> On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> >  
> > On systems that run FIFO:1 applications that busy loop 
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task 
> > with very small sched slices).
> > 
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> > pagevec during the migration temporarily") relies on 
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> > 
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> > 
> > Fixes:
> > 
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> >  	for_each_online_cpu(cpu) {
> >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >  
> > -		if (force_all_cpus ||
> > -		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > +		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> 
> Please changelog this alteration?

It should be now. Are you OK with this changelog?
(If not, please let me know what should be improved.)

On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute on such a CPU
(such as workqueue worker threads) will not be scheduled,
which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees:

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable  [bh_lru_lock()]
rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Fixes:
...

Thanks.
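
For context, a condensed paraphrase of the kind of reader path those three
protection classes refer to. This is a sketch written from memory, not a
verbatim copy of mm/swap.c (consult the tree for the authoritative code);
the point is that local_lock() covers both the !CONFIG_PREEMPT_RT and
CONFIG_PREEMPT_RT cases listed above.

/*
 * Sketch of a reader path.  local_lock() is preempt_disable() on
 * !CONFIG_PREEMPT_RT and a per-CPU rt_spin_lock (which takes
 * rcu_read_lock()) on CONFIG_PREEMPT_RT, so in both cases the
 * synchronize_rcu() in lru_cache_disable() waits for this region.
 */
void lru_cache_add(struct page *page)
{
	struct pagevec *pvec;

	get_page(page);
	local_lock(&lru_pvecs.lock);
	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
	/* When caching is disabled, flush immediately instead of batching. */
	if (!pagevec_add(pvec, page) || PageCompound(page) ||
	    lru_cache_disabled())
		__pagevec_lru_add(pvec);
	local_unlock(&lru_pvecs.lock);
}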
Minchan Kim March 8, 2022, 5:41 p.m. UTC | #4
On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>  
> On systems that run FIFO:1 applications that busy loop 
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task 
> with very small sched slices).
> 
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU 
> pagevec during the migration temporarily") relies on 
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
> 
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
> 
> Fixes:
> 
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927]       Tainted: G          I      --------- ---  5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0  state:D stack:    0 pid:    9 ppid:     2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938]  __schedule+0x21b/0x5b0
> [ 1873.243941]  schedule+0x43/0xe0
> [ 1873.243943]  schedule_timeout+0x14d/0x190
> [ 1873.243946]  ? resched_curr+0x20/0xe0
> [ 1873.243953]  ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958]  wait_for_completion+0x84/0xe0
> [ 1873.243962]  __flush_work.isra.0+0x146/0x200
> [ 1873.243966]  ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971]  __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978]  do_migrate_pages+0x3d/0x2d0
> [ 1873.243985]  ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989]  ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992]  ? pick_next_task+0xb30/0xbd0
> [ 1873.243995]  ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000]  ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002]  ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005]  ? __switch_to+0x12f/0x510
> [ 1873.244013]  cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016]  process_one_work+0x1e0/0x410
> [ 1873.244019]  worker_thread+0x50/0x3b0
> [ 1873.244022]  ? process_one_work+0x410/0x410
> [ 1873.244024]  kthread+0x173/0x190
> [ 1873.244027]  ? set_kthread_struct+0x40/0x40
> [ 1873.244031]  ret_from_fork+0x1f/0x30
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Looks great to me.

Acked-by: Minchan Kim <minchan@kernel.org>

While reviewing it, I found what looks like a bug: bh_lru_install()
needs to check lru_cache_disabled() under bh_lru_lock(). Let me
cook a patch.

Thanks!
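
A sketch of the kind of change Minchan describes (illustrative only, not his
actual patch; the real bh_lru_install() in fs/buffer.c has more to it than
shown here): move the lru_cache_disabled() check under bh_lru_lock(), so the
check happens inside the preempt/irq-disabled region that synchronize_rcu()
in lru_cache_disable() is guaranteed to wait for, closing the window between
the check and taking the lock.

static void bh_lru_install(struct buffer_head *bh)
{
	check_irqs_on();
	bh_lru_lock();

	/*
	 * Checking here, rather than before bh_lru_lock(), means the check
	 * is made with preemption/irqs disabled, so lru_cache_disable()'s
	 * synchronize_rcu() cannot complete while we are between the check
	 * and the installation below.
	 */
	if (lru_cache_disabled()) {
		bh_lru_unlock();
		return;
	}

	/* ... install bh into the per-CPU bh LRU as before ... */

	bh_lru_unlock();
}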

Patch

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@  inline void __lru_add_drain_all(bool force_all_cpus)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (force_all_cpus ||
-		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@  atomic_t lru_disable_count = ATOMIC_INIT(0);
 void lru_cache_disable(void)
 {
 	atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
 	/*
-	 * lru_add_drain_all in the force mode will schedule draining on
-	 * all online CPUs so any calls of lru_cache_disabled wrapped by
-	 * local_lock or preemption disabled would be ordered by that.
-	 * The atomic operation doesn't need to have stronger ordering
-	 * requirements because that is enforced by the scheduling
-	 * guarantees.
+	 * Readers of lru_disable_count are protected by either disabling
+	 * preemption or rcu_read_lock:
+	 *
+	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
+	 * rcu_read_lock		       [rt_spin_lock CONFIG_PREEMPT_RT]
+	 * preempt_disable		       [local_lock !CONFIG_PREEMPT_RT]
+	 *
+	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+	 * preempt_disable() regions of code. So any CPU which sees
+	 * lru_disable_count = 0 will have exited the critical
+	 * section when synchronize_rcu() returns.
 	 */
+	synchronize_rcu();
+#ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
 	lru_add_and_bh_lrus_drain();
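
Finally, a hypothetical caller-side sketch (not taken from the tree;
example_isolate_range() and do_isolate_range() are invented names) showing
how the disable/enable pair brackets work that must not race with pages
parked in per-CPU pagevecs:

static int example_isolate_range(unsigned long start, unsigned long end)
{
	int ret;

	/* After this patch: atomic_inc(), synchronize_rcu(), then drain. */
	lru_cache_disable();

	ret = do_isolate_range(start, end);	/* hypothetical placeholder */

	/* Re-enable per-CPU LRU caching once the sensitive work is done. */
	lru_cache_enable();

	return ret;
}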