
[v2] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

Message ID 20240118021842.290665-1-chenzhongjin@huawei.com (mailing list archive)
State Handled Elsewhere
Delegated to: Masami Hiramatsu

Commit Message

Chen Zhongjin Jan. 18, 2024, 2:18 a.m. UTC
There is a deadlock scenario in kprobe_optimizer():

pid A				pid B			pid C
kprobe_optimizer()		do_exit()		perf_kprobe_init()
mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu	kernel_wait4()
				// waiting pid C exit

To avoid this deadlock, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also guarantee
that all preempted tasks have scheduled, but it does not wait on tasks_rcu_exit_srcu.

Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
---
v1 -> v2: Add Fixes tag
---
 arch/Kconfig     | 2 +-
 kernel/kprobes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
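
The circular wait in the diagram above can be checked mechanically. The
following is a minimal userspace sketch, not kernel code: the PID_A/PID_B/PID_C
labels, the waits_for[] table, and on_wait_cycle() are all hypothetical names
introduced for illustration. It assumes each task blocks on at most one other
task, which matches the three-task scenario in the commit message.

```c
#include <assert.h>

/* Hypothetical labels for the three tasks in the diagram. */
enum { PID_A, PID_B, PID_C, NR_TASKS };

/*
 * waits_for[x] is the task that x is blocked on, or -1 if x can run.
 * Returns 1 if following the chain from @start leads back to @start,
 * which means @start is part of a circular wait.
 */
static int on_wait_cycle(const int waits_for[NR_TASKS], int start)
{
	int cur = start;
	int steps;

	assert(start >= 0 && start < NR_TASKS);

	for (steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_for[cur];
		if (cur < 0)
			return 0;	/* chain ends at a task that can run */
		if (cur == start)
			return 1;	/* came back around: circular wait */
	}
	return 0;	/* a cycle exists elsewhere, but @start is not on it */
}
```

With the table { PID_B, PID_C, PID_A } (A waits for B to leave its
tasks_rcu_exit_srcu section, B waits in kernel_wait4() for C to exit, C waits
for the kprobe_mutex held by A), on_wait_cycle() reports a cycle for every
task. If A's entry is set to -1, the chain terminates and no cycle is reported.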

Comments

Steven Rostedt Jan. 18, 2024, 2:26 a.m. UTC | #1
On Thu, 18 Jan 2024 02:18:42 +0000
Chen Zhongjin <chenzhongjin@huawei.com> wrote:

> There is a deadlock scenario in kprobe_optimizer():
> 
> pid A				pid B			pid C
> kprobe_optimizer()		do_exit()		perf_kprobe_init()
> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu	kernel_wait4()
> 				// waiting pid C exit
> 
> To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> 

Did lockdep detect this? If not, we should fix that.

I'm also wondering whether we should find another solution, as this seems
more of a workaround than a fix.

> Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> ---
> v1 -> v2: Add Fixes tag
> ---
>  arch/Kconfig     | 2 +-
>  kernel/kprobes.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index f4b210ab0612..dc6a18854017 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
>  config OPTPROBES
>  	def_bool y
>  	depends on KPROBES && HAVE_OPTPROBES
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RUDE_RCU

Is this still a bug if PREEMPTION is not enabled?

-- Steve

>  
>  config KPROBES_ON_FTRACE
>  	def_bool y
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index d5a0ee40bf66..09056ae50c58 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
>  	 * Note that on non-preemptive kernel, this is transparently converted
>  	 * to synchronoze_sched() to wait for all interrupts to have completed.
>  	 */
> -	synchronize_rcu_tasks();
> +	synchronize_rcu_tasks_rude();
>  
>  	/* Step 3: Optimize kprobes after quiesence period */
>  	do_optimize_kprobes();
Paul E. McKenney Jan. 18, 2024, 2:44 p.m. UTC | #2
On Wed, Jan 17, 2024 at 09:26:46PM -0500, Steven Rostedt wrote:
> On Thu, 18 Jan 2024 02:18:42 +0000
> Chen Zhongjin <chenzhongjin@huawei.com> wrote:
> 
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> > pid A				pid B			pid C
> > kprobe_optimizer()		do_exit()		perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu	kernel_wait4()
> > 				// waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> > that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> > 
> 
> Did lockdep detect this? If not, we should fix that.
> 
> I'm also thinking if we should find another solution, as this seems more of
> a work around than a fix.

My suggestion is at 526b12e4-4bb0-47b1-bece-66b47bfc0a92@paulmck-laptop.

Better suggestions are of course welcome.  ;-)

> > Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> > ---
> > v1 -> v2: Add Fixes tag
> > ---
> >  arch/Kconfig     | 2 +-
> >  kernel/kprobes.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index f4b210ab0612..dc6a18854017 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
> >  config OPTPROBES
> >  	def_bool y
> >  	depends on KPROBES && HAVE_OPTPROBES
> > -	select TASKS_RCU if PREEMPTION
> > +	select TASKS_RUDE_RCU
> 
> Is this still a bug if PREEMPTION is not enabled?

Both "select" clauses would be needed for this patch, if I understand
correctly.

							Thanx, Paul

> -- Steve
> 
> >  
> >  config KPROBES_ON_FTRACE
> >  	def_bool y
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index d5a0ee40bf66..09056ae50c58 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
> >  	 * Note that on non-preemptive kernel, this is transparently converted
> >  	 * to synchronoze_sched() to wait for all interrupts to have completed.
> >  	 */
> > -	synchronize_rcu_tasks();
> > +	synchronize_rcu_tasks_rude();
> >  
> >  	/* Step 3: Optimize kprobes after quiesence period */
> >  	do_optimize_kprobes();
>
Paul E. McKenney Jan. 19, 2024, 10:53 a.m. UTC | #3
On Thu, Jan 18, 2024 at 06:44:54AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 17, 2024 at 09:26:46PM -0500, Steven Rostedt wrote:
> > On Thu, 18 Jan 2024 02:18:42 +0000
> > Chen Zhongjin <chenzhongjin@huawei.com> wrote:
> > 
> > > There is a deadlock scenario in kprobe_optimizer():
> > > 
> > > pid A				pid B			pid C
> > > kprobe_optimizer()		do_exit()		perf_kprobe_init()
> > > mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> > > synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> > > // waiting tasks_rcu_exit_srcu	kernel_wait4()
> > > 				// waiting pid C exit
> > > 
> > > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> > > that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> > > 
> > 
> > Did lockdep detect this? If not, we should fix that.
> > 
> > I'm also thinking if we should find another solution, as this seems more of
> > a work around than a fix.
> 
> My suggestion is at 526b12e4-4bb0-47b1-bece-66b47bfc0a92@paulmck-laptop.
> 
> Better suggestions are of course welcome.  ;-)
> 
> > > Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> > > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> > > ---
> > > v1 -> v2: Add Fixes tag
> > > ---
> > >  arch/Kconfig     | 2 +-
> > >  kernel/kprobes.c | 2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index f4b210ab0612..dc6a18854017 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
> > >  config OPTPROBES
> > >  	def_bool y
> > >  	depends on KPROBES && HAVE_OPTPROBES
> > > -	select TASKS_RCU if PREEMPTION
> > > +	select TASKS_RUDE_RCU
> > 
> > Is this still a bug if PREEMPTION is not enabled?
> 
> Both "select" clauses would be needed for this patch, if I understand
> correctly.
> 
> 							Thanx, Paul
> 
> > -- Steve
> > 
> > >  
> > >  config KPROBES_ON_FTRACE
> > >  	def_bool y
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index d5a0ee40bf66..09056ae50c58 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
> > >  	 * Note that on non-preemptive kernel, this is transparently converted
> > >  	 * to synchronoze_sched() to wait for all interrupts to have completed.
> > >  	 */
> > > -	synchronize_rcu_tasks();
> > > +	synchronize_rcu_tasks_rude();

The full comment reads as follows:

	/*
	 * Step 2: Wait for quiesence period to ensure all potentially
	 * preempted tasks to have normally scheduled. Because optprobe
	 * may modify multiple instructions, there is a chance that Nth
	 * instruction is preempted. In that case, such tasks can return
	 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
	 * Note that on non-preemptive kernel, this is transparently converted
	 * to synchronoze_sched() to wait for all interrupts to have completed.
	 */

Except that synchronize_rcu_tasks_rude() isn't going to wait for any
preempted tasks.  It instead waits only for regions of code that have
disabled preemption (or interrupts or ...).  So either the above comment
is wrong and needs to be fixed, or this change breaks kprobes.  Last
I knew, the comment was correct.

So I still like the idea of using non-preemptability to wait for tasks
near the end of do_exit(), but unless I am missing something, this patch
as written is inserting a rare but real bug.

Steve, thoughts?

							Thanx, Paul
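
The distinction drawn above can be illustrated with a deliberately simplified
userspace model. This is a sketch under stated assumptions, not the RCU
implementation: the model_state values and the model_* functions are
hypothetical, and the model reduces each grace-period flavor to a single
"must this task be waited for?" predicate based on the description in this
message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task states for the model. */
enum model_state {
	MODEL_NONPREEMPTIBLE,		/* on a CPU with preemption disabled */
	MODEL_PREEMPTED,		/* involuntarily switched out mid-region */
	MODEL_VOLUNTARILY_BLOCKED,	/* called schedule() itself, e.g. on a mutex */
};

/* Rude model: only code running with preemption disabled is waited for. */
static bool model_rude_must_wait(enum model_state s)
{
	return s == MODEL_NONPREEMPTIBLE;
}

/* Tasks model: wait until the task has voluntarily context switched. */
static bool model_tasks_must_wait(enum model_state s)
{
	return s != MODEL_VOLUNTARILY_BLOCKED;
}

/* Count how many of the @n tasks the given predicate would wait for. */
static size_t model_count_waited(const enum model_state *states, size_t n,
				 bool (*must_wait)(enum model_state))
{
	size_t i, count = 0;

	assert(states != NULL);
	for (i = 0; i < n; i++)
		if (must_wait(states[i]))
			count++;
	return count;
}
```

In this model a MODEL_PREEMPTED task is waited for by the tasks predicate but
not by the rude predicate, which is exactly the case the kprobes comment says
must be covered.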

> > >  	/* Step 3: Optimize kprobes after quiesence period */
> > >  	do_optimize_kprobes();
> >
Masami Hiramatsu (Google) Jan. 19, 2024, 12:28 p.m. UTC | #4
On Wed, 17 Jan 2024 21:26:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 18 Jan 2024 02:18:42 +0000
> Chen Zhongjin <chenzhongjin@huawei.com> wrote:
> 
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> > pid A				pid B			pid C
> > kprobe_optimizer()		do_exit()		perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu	kernel_wait4()
> > 				// waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> > that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> > 

First of all, thanks for finding this scenario!

> 
> Did lockdep detect this? If not, we should fix that.

Can lockdep detect a deadlock that involves RCU and wait4?

> 
> I'm also thinking if we should find another solution, as this seems more of
> a work around than a fix.

Hmm, IIUC, we may need a synchronizer which will return -EBUSY if
someone starts waiting in exit_tasks_rcu_start(). Then the optimizer
can unlock the mutex and retry.

Thank you,
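
The unlock-and-retry idea above might look roughly like the following
userspace sketch. Everything here is hypothetical: no such -EBUSY-returning
synchronizer exists in the kernel as far as this thread shows, and
model_try_synchronize(), model_lock(), model_unlock(), and model_optimizer()
are stand-ins invented for illustration. The mutex is modeled as a plain flag
because the sketch is single-threaded.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for kprobe_mutex in this single-threaded model. */
static int model_mutex_held;

/* Hypothetical: how many upcoming attempts will find an exiting task blocked. */
static int model_exiters_blocked;

static void model_lock(void)
{
	assert(!model_mutex_held);
	model_mutex_held = 1;
}

static void model_unlock(void)
{
	assert(model_mutex_held);
	model_mutex_held = 0;
}

/* Hypothetical synchronizer: fails with -EBUSY instead of waiting forever. */
static int model_try_synchronize(void)
{
	if (model_exiters_blocked > 0) {
		model_exiters_blocked--;
		return -EBUSY;
	}
	return 0;
}

/*
 * Returns the number of attempts needed on success, or -EBUSY if
 * @max_tries attempts were not enough.  The mutex is dropped between
 * attempts so that a task blocked on it can make progress.
 */
static int model_optimizer(int max_tries)
{
	int attempts = 0;
	int ret;

	do {
		model_lock();
		ret = model_try_synchronize();
		/* On success, the optimization step would run here. */
		model_unlock();
		attempts++;
	} while (ret == -EBUSY && attempts < max_tries);

	return ret == 0 ? attempts : ret;
}
```

The point of the design is that the mutex is never held while the optimizer
gives up and retries, so the task waiting for the mutex (pid C in the diagram)
can acquire it, finish, and let the exiting task complete.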

> 
> > Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> > ---
> > v1 -> v2: Add Fixes tag
> > ---
> >  arch/Kconfig     | 2 +-
> >  kernel/kprobes.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index f4b210ab0612..dc6a18854017 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
> >  config OPTPROBES
> >  	def_bool y
> >  	depends on KPROBES && HAVE_OPTPROBES
> > -	select TASKS_RCU if PREEMPTION
> > +	select TASKS_RUDE_RCU
> 
> Is this still a bug if PREEMPTION is not enabled?
> 
> -- Steve
> 
> >  
> >  config KPROBES_ON_FTRACE
> >  	def_bool y
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index d5a0ee40bf66..09056ae50c58 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
> >  	 * Note that on non-preemptive kernel, this is transparently converted
> >  	 * to synchronoze_sched() to wait for all interrupts to have completed.
> >  	 */
> > -	synchronize_rcu_tasks();
> > +	synchronize_rcu_tasks_rude();
> >  
> >  	/* Step 3: Optimize kprobes after quiesence period */
> >  	do_optimize_kprobes();
>
Paul E. McKenney Jan. 19, 2024, 2:37 p.m. UTC | #5
On Thu, Jan 18, 2024 at 02:18:42AM +0000, Chen Zhongjin wrote:
> There is a deadlock scenario in kprobe_optimizer():
> 
> pid A				pid B			pid C
> kprobe_optimizer()		do_exit()		perf_kprobe_init()
> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu	kernel_wait4()
> 				// waiting pid C exit
> 
> To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> 
> Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>

Just so you know, your email ends up in gmail's spam folder.  :-/

> ---
> v1 -> v2: Add Fixes tag
> ---
>  arch/Kconfig     | 2 +-
>  kernel/kprobes.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index f4b210ab0612..dc6a18854017 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
>  config OPTPROBES
>  	def_bool y
>  	depends on KPROBES && HAVE_OPTPROBES
> -	select TASKS_RCU if PREEMPTION
> +	select TASKS_RUDE_RCU
>  
>  config KPROBES_ON_FTRACE
>  	def_bool y
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index d5a0ee40bf66..09056ae50c58 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
>  	 * Note that on non-preemptive kernel, this is transparently converted
>  	 * to synchronoze_sched() to wait for all interrupts to have completed.
>  	 */
> -	synchronize_rcu_tasks();
> +	synchronize_rcu_tasks_rude();

Again, that comment reads in full as follows:

	/*
	 * Step 2: Wait for quiesence period to ensure all potentially
	 * preempted tasks to have normally scheduled. Because optprobe
	 * may modify multiple instructions, there is a chance that Nth
	 * instruction is preempted. In that case, such tasks can return
	 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
	 * Note that on non-preemptive kernel, this is transparently converted
	 * to synchronoze_sched() to wait for all interrupts to have completed.
	 */

Please note well that first sentence.

Unless that first sentence no longer holds, this patch cannot work
because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
preempted tasks.

So how to safely break this deadlock?  Reproducing Chen Zhongjin's
diagram:

pid A				pid B			pid C
kprobe_optimizer()		do_exit()		perf_kprobe_init()
mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu	kernel_wait4()
				// waiting pid C exit

We need to stop synchronize_rcu_tasks() from waiting on tasks like
pid B that are voluntarily blocked.  One way to do that is to replace
SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
current task to this list and does ...

OK, this is getting a bit involved.  If you would like to follow along,
please feel free to look here:

https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing

							Thanx, Paul

>  	/* Step 3: Optimize kprobes after quiesence period */
>  	do_optimize_kprobes();
> -- 
> 2.25.1
>
Paul E. McKenney Jan. 20, 2024, 3:30 p.m. UTC | #6
On Fri, Jan 19, 2024 at 06:37:26AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 18, 2024 at 02:18:42AM +0000, Chen Zhongjin wrote:
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> > pid A				pid B			pid C
> > kprobe_optimizer()		do_exit()		perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu	kernel_wait4()
> > 				// waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> > that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
> > 
> > Fixes: a30b85df7d59 ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y")
> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> 
> Just so you know, your email ends up in gmail's spam folder.  :-/
> 
> > ---
> > v1 -> v2: Add Fixes tag
> > ---
> >  arch/Kconfig     | 2 +-
> >  kernel/kprobes.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index f4b210ab0612..dc6a18854017 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
> >  config OPTPROBES
> >  	def_bool y
> >  	depends on KPROBES && HAVE_OPTPROBES
> > -	select TASKS_RCU if PREEMPTION
> > +	select TASKS_RUDE_RCU
> >  
> >  config KPROBES_ON_FTRACE
> >  	def_bool y
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index d5a0ee40bf66..09056ae50c58 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
> >  	 * Note that on non-preemptive kernel, this is transparently converted
> >  	 * to synchronoze_sched() to wait for all interrupts to have completed.
> >  	 */
> > -	synchronize_rcu_tasks();
> > +	synchronize_rcu_tasks_rude();
> 
> Again, that comment reads in full as follows:
> 
> 	/*
> 	 * Step 2: Wait for quiesence period to ensure all potentially
> 	 * preempted tasks to have normally scheduled. Because optprobe
> 	 * may modify multiple instructions, there is a chance that Nth
> 	 * instruction is preempted. In that case, such tasks can return
> 	 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
> 	 * Note that on non-preemptive kernel, this is transparently converted
> 	 * to synchronoze_sched() to wait for all interrupts to have completed.
> 	 */
> 
> Please note well that first sentence.
> 
> Unless that first sentence no longer holds, this patch cannot work
> because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
> preempted tasks.
> 
> So how to safely break this deadlock?  Reproducing Chen Zhongjin's
> diagram:
> 
> pid A				pid B			pid C
> kprobe_optimizer()		do_exit()		perf_kprobe_init()
> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu	kernel_wait4()
> 				// waiting pid C exit
> 
> We need to stop synchronize_rcu_tasks() from waiting on tasks like
> pid B that are voluntarily blocked.  One way to do that is to replace
> SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
> current task to this list and does ...
> 
> OK, this is getting a bit involved.  If you would like to follow along,
> please feel free to look here:
> 
> https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing

And please see below for a prototype patch, which passes moderate
rcutorture testing.

Chen Zhongjin, does this prevent the deadlock you have been seeing?

							Thanx, Paul

------------------------------------------------------------------------

commit 113fe0eeabe7a83e87d638d44b9e1d0f9691b146
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Sat Jan 20 07:07:08 2024 -0800

    rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
    
    Holding a mutex across synchronize_rcu_tasks() and acquiring
    that same mutex in code called from do_exit() after its call to
    exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
    results in deadlock.  This is by design, because tasks that are far
    enough into do_exit() are no longer present on the tasks list, making
    it a bit difficult for RCU Tasks to find them, let alone wait on them
    to do a voluntary context switch.  However, such deadlocks are becoming
    more frequent.  In addition, lockdep currently does not detect such
    deadlocks and they can be difficult to reproduce.
    
    In addition, if a task voluntarily context switches during that time
    (for example, if it blocks acquiring a mutex), then this task is in an
    RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
    just as well take advantage of that fact.
    
    This commit therefore eliminates these deadlocks by replacing the
    SRCU-based wait for do_exit() completion with per-CPU lists of tasks
    currently exiting.  A given task will be on one of these per-CPU lists for
    the same period of time that this task would previously have been in the
    previous SRCU read-side critical section.  These lists enable RCU Tasks
    to find the tasks that have already been removed from the tasks list,
    but that must nevertheless be waited upon.
    
    The RCU Tasks grace period gathers any of these do_exit() tasks that it
    must wait on, and adds them to the list of holdouts.  Per-CPU locking
    and get_task_struct() are used to synchronize addition to and removal
    from these lists.
    
    Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
    
    Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dddd10b1b815..3fe36befb613 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -855,6 +855,8 @@ struct task_struct {
 	u8				rcu_tasks_idx;
 	int				rcu_tasks_idle_cpu;
 	struct list_head		rcu_tasks_holdout_list;
+	int				rcu_tasks_exit_cpu;
+	struct list_head		rcu_tasks_exit_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #ifdef CONFIG_TASKS_TRACE_RCU
diff --git a/init/init_task.c b/init/init_task.c
index 5727d42149c3..65f037bff457 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -152,6 +152,7 @@ struct task_struct init_task
 	.rcu_tasks_holdout = false,
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
+	.rcu_tasks_exit_list = LIST_HEAD_INIT(init_task.rcu_tasks_exit_list),
 #endif
 #ifdef CONFIG_TASKS_TRACE_RCU
 	.trc_reader_nesting = 0,
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..6bacd515e0eb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1981,6 +1981,7 @@ static inline void rcu_copy_process(struct task_struct *p)
 	p->rcu_tasks_holdout = false;
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
+	INIT_LIST_HEAD(&p->rcu_tasks_exit_list);
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_TRACE_RCU
 	p->trc_reader_nesting = 0;
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 732ad5b39946..bd4a51fd5b1f 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -32,6 +32,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @rtp_irq_work: IRQ work queue for deferred wakeups.
  * @barrier_q_head: RCU callback for barrier operation.
  * @rtp_blkd_tasks: List of tasks blocked as readers.
+ * @rtp_exit_list: List of tasks in the latter portion of do_exit().
  * @cpu: CPU number corresponding to this entry.
  * @rtpp: Pointer to the rcu_tasks structure.
  */
@@ -46,6 +47,7 @@ struct rcu_tasks_percpu {
 	struct irq_work rtp_irq_work;
 	struct rcu_head barrier_q_head;
 	struct list_head rtp_blkd_tasks;
+	struct list_head rtp_exit_list;
 	int cpu;
 	struct rcu_tasks *rtpp;
 };
@@ -144,8 +146,6 @@ static struct rcu_tasks rt_name =							\
 }
 
 #ifdef CONFIG_TASKS_RCU
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
 /* Report delay in synchronize_srcu() completion in rcu_tasks_postscan(). */
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused);
@@ -275,6 +275,8 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
 		rtpcp->rtpp = rtp;
 		if (!rtpcp->rtp_blkd_tasks.next)
 			INIT_LIST_HEAD(&rtpcp->rtp_blkd_tasks);
+		if (!rtpcp->rtp_exit_list.next)
+			INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
 	}
 
 	pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
@@ -851,10 +853,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 //	number of voluntary context switches, and add that task to the
 //	holdout list.
 // rcu_tasks_postscan():
-//	Invoke synchronize_srcu() to ensure that all tasks that were
-//	in the process of exiting (and which thus might not know to
-//	synchronize with this RCU Tasks grace period) have completed
-//	exiting.
+//	Gather per-CPU lists of tasks in do_exit() to ensure that all
+//	tasks that were in the process of exiting (and which thus might
+//	not know to synchronize with this RCU Tasks grace period) have
+//	completed exiting.  The synchronize_rcu() in rcu_tasks_postgp()
+//	will take care of any tasks stuck in the non-preemptible region
+//	of do_exit() following its call to exit_tasks_rcu_stop().
 // check_all_holdout_tasks(), repeatedly until holdout list is empty:
 //	Scans the holdout list, attempting to identify a quiescent state
 //	for each task on the list.  If there is a quiescent state, the
@@ -867,8 +871,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 //	with interrupts disabled.
 //
 // For each exiting task, the exit_tasks_rcu_start() and
-// exit_tasks_rcu_finish() functions begin and end, respectively, the SRCU
-// read-side critical sections waited for by rcu_tasks_postscan().
+// exit_tasks_rcu_finish() functions add and remove, respectively, the
+// current task to a per-CPU list of tasks that rcu_tasks_postscan() must
+// wait on.  This is necessary because rcu_tasks_postscan() must wait on
+// tasks that have already been removed from the global list of tasks.
 //
 // Pre-grace-period update-side code is ordered before the grace
 // via the raw_spin_lock.*rcu_node().  Pre-grace-period read-side code
@@ -932,9 +938,13 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 	}
 }
 
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
+
 /* Processing between scanning taskslist and draining the holdout list. */
 static void rcu_tasks_postscan(struct list_head *hop)
 {
+	int cpu;
 	int rtsi = READ_ONCE(rcu_task_stall_info);
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU)) {
@@ -948,9 +958,9 @@ static void rcu_tasks_postscan(struct list_head *hop)
 	 * this, divide the fragile exit path part in two intersecting
 	 * read side critical sections:
 	 *
-	 * 1) An _SRCU_ read side starting before calling exit_notify(),
-	 *    which may remove the task from the tasklist, and ending after
-	 *    the final preempt_disable() call in do_exit().
+	 * 1) A task_struct list addition before calling exit_notify(),
+	 *    which may remove the task from the tasklist, with the
+	 *    removal after the final preempt_disable() call in do_exit().
 	 *
 	 * 2) An _RCU_ read side starting with the final preempt_disable()
 	 *    call in do_exit() and ending with the final call to schedule()
@@ -959,7 +969,18 @@ static void rcu_tasks_postscan(struct list_head *hop)
 	 * This handles the part 1). And postgp will handle part 2) with a
 	 * call to synchronize_rcu().
 	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	for_each_possible_cpu(cpu) {
+		unsigned long flags;
+		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
+		struct task_struct *t;
+
+		raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
+			if (list_empty(&t->rcu_tasks_holdout_list))
+				rcu_tasks_pertask(t, hop);
+		raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	}
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU))
 		del_timer_sync(&tasks_rcu_exit_srcu_stall_timer);
@@ -1027,7 +1048,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	 *
 	 * In addition, this synchronize_rcu() waits for exiting tasks
 	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
 	 * enforcing the whole region before tasklist removal until
 	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
 	 * read side critical section.
@@ -1035,9 +1055,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	synchronize_rcu();
 }
 
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
-
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused)
 {
 #ifndef CONFIG_TINY_RCU
@@ -1147,25 +1164,45 @@ struct task_struct *get_rcu_tasks_gp_kthread(void)
 EXPORT_SYMBOL_GPL(get_rcu_tasks_gp_kthread);
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Protect against tasklist scan blind spot while the task is exiting and
+ * may be removed from the tasklist.  Do this by adding the task to yet
+ * another list.
  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_start(void)
 {
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
+	struct task_struct *t = current;
+
+	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
+	get_task_struct(t);
+	preempt_disable();
+	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
+	t->rcu_tasks_exit_cpu = smp_processor_id();
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	if (!rtpcp->rtp_exit_list.next)
+		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
+	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	preempt_enable();
 }
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Remove the task from the "yet another list" because do_exit() is now
+ * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
  */
-void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_stop(void)
 {
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
 	struct task_struct *t = current;
 
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
+	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
+	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	list_del_init(&t->rcu_tasks_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	put_task_struct(t);
 }
 
 /*
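
The per-CPU exit-list scheme described in the commit message above can be
modeled in userspace as follows. This is a minimal single-threaded sketch, not
the patch itself: struct model_task, MODEL_NR_CPUS, and the model_* functions
are hypothetical names, the per-CPU locking is omitted, a plain counter stands
in for get_task_struct()/put_task_struct(), and the gather step marks every
listed task as a holdout rather than applying the checks that
rcu_tasks_pertask() performs.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

struct model_task {
	int exit_cpu;			/* CPU whose list holds this task */
	int refcount;			/* stands in for get/put_task_struct() */
	int is_holdout;			/* already gathered by a grace period? */
	struct model_task *prev, *next;	/* links on a per-CPU exit list */
};

/* One circular list head per CPU (only the link fields are used). */
static struct model_task model_exit_list[MODEL_NR_CPUS];

static void model_init(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		model_exit_list[cpu].prev = &model_exit_list[cpu];
		model_exit_list[cpu].next = &model_exit_list[cpu];
	}
}

/* Model of exit_tasks_rcu_start(): pin the task and list it on @cpu. */
static void model_exit_start(struct model_task *t, int cpu)
{
	struct model_task *head;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	assert(t->next == NULL);	/* must not already be listed */
	head = &model_exit_list[cpu];
	t->refcount++;
	t->exit_cpu = cpu;		/* remembered: the task may migrate later */
	t->next = head->next;
	t->prev = head;
	head->next->prev = t;
	head->next = t;
}

/* Model of exit_tasks_rcu_stop(): unlist the task and drop the pin. */
static void model_exit_stop(struct model_task *t)
{
	assert(t->next != NULL);	/* must currently be listed */
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->prev = t->next = NULL;
	t->refcount--;
}

/* Model of the gather loop: returns how many tasks were newly marked. */
static int model_postscan(void)
{
	int gathered = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		struct model_task *head = &model_exit_list[cpu];

		for (struct model_task *t = head->next; t != head; t = t->next) {
			if (!t->is_holdout) {
				t->is_holdout = 1;
				gathered++;
			}
		}
	}
	return gathered;
}
```

In the real patch the recorded CPU number selects which per-CPU lock to take
on removal; in this lock-free model it is kept only to mirror that structure.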
Chen Zhongjin Jan. 27, 2024, 10:09 a.m. UTC | #7
On 2024/1/20 23:30, Paul E. McKenney wrote:

Hi Paul,
This patch works for my reproducer test case.

Just a small question, if you don't mind: this problem exists on LTS
versions, but we have had struct rcu_tasks_percpu only after v5.17, so we
can't backport this patch to LTS 5.10 or 4.19, which are still under maintenance.
If you have any idea or plan for applying this patch to older versions,
please tell me, thanks very much!

Anyway, it's ok to apply this patch to mainline.

Best,
Chen

>> Again, that comment reads in full as follows:
>>
>> 	/*
>> 	 * Step 2: Wait for quiesence period to ensure all potentially
>> 	 * preempted tasks to have normally scheduled. Because optprobe
>> 	 * may modify multiple instructions, there is a chance that Nth
>> 	 * instruction is preempted. In that case, such tasks can return
>> 	 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
>> 	 * Note that on non-preemptive kernel, this is transparently converted
>> 	 * to synchronoze_sched() to wait for all interrupts to have completed.
>> 	 */
>>
>> Please note well that first sentence.
>>
>> Unless that first sentence no longer holds, this patch cannot work
>> because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
>> preempted tasks.
>>
>> So how to safely break this deadlock?  Reproducing Chen Zhongjin's
>> diagram:
>>
>> pid A				pid B			pid C
>> kprobe_optimizer()		do_exit()		perf_kprobe_init()
>> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
>> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
>> // waiting tasks_rcu_exit_srcu	kernel_wait4()
>> 				// waiting pid C exit
>>
>> We need to stop synchronize_rcu_tasks() from waiting on tasks like
>> pid B that are voluntarily blocked.  One way to do that is to replace
>> SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
>> current task to this list and does ...
>>
>> OK, this is getting a bit involved.  If you would like to follow along,
>> please feel free to look here:
>>
>> https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing
> 
> And please see below for a prototype patch, which passes moderate
> rcutorture testing.
> 
> Chen Zhongjin, does this prevent the deadlock you have been seeing?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 113fe0eeabe7a83e87d638d44b9e1d0f9691b146
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Sat Jan 20 07:07:08 2024 -0800
> 
>      rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
>      
>      Holding a mutex across synchronize_rcu_tasks() and acquiring
>      that same mutex in code called from do_exit() after its call to
>      exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
>      results in deadlock.  This is by design, because tasks that are far
>      enough into do_exit() are no longer present on the tasks list, making
>      it a bit difficult for RCU Tasks to find them, let alone wait on them
>      to do a voluntary context switch.  However, such deadlocks are becoming
>      more frequent.  In addition, lockdep currently does not detect such
>      deadlocks and they can be difficult to reproduce.
>      
>      In addition, if a task voluntarily context switches during that time
>      (for example, if it blocks acquiring a mutex), then this task is in an
>      RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
>      just as well take advantage of that fact.
>      
>      This commit therefore eliminates these deadlocks by replacing the
>      SRCU-based wait for do_exit() completion with per-CPU lists of tasks
>      currently exiting.  A given task will be on one of these per-CPU lists for
>      the same period of time that this task would previously have been in the
>      previous SRCU read-side critical section.  These lists enable RCU Tasks
>      to find the tasks that have already been removed from the tasks list,
>      but that must nevertheless be waited upon.
>      
>      The RCU Tasks grace period gathers any of these do_exit() tasks that it
>      must wait on, and adds them to the list of holdouts.  Per-CPU locking
>      and get_task_struct() are used to synchronize addition to and removal
>      from these lists.
>      
>      Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
>      
>      Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
>      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dddd10b1b815..3fe36befb613 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -855,6 +855,8 @@ struct task_struct {
>   	u8				rcu_tasks_idx;
>   	int				rcu_tasks_idle_cpu;
>   	struct list_head		rcu_tasks_holdout_list;
> +	int				rcu_tasks_exit_cpu;
> +	struct list_head		rcu_tasks_exit_list;
>   #endif /* #ifdef CONFIG_TASKS_RCU */
>   
>   #ifdef CONFIG_TASKS_TRACE_RCU
> diff --git a/init/init_task.c b/init/init_task.c
> index 5727d42149c3..65f037bff457 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -152,6 +152,7 @@ struct task_struct init_task
>   	.rcu_tasks_holdout = false,
>   	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
>   	.rcu_tasks_idle_cpu = -1,
> +	.rcu_tasks_exit_list = LIST_HEAD_INIT(init_task.rcu_tasks_exit_list),
>   #endif
>   #ifdef CONFIG_TASKS_TRACE_RCU
>   	.trc_reader_nesting = 0,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 10917c3e1f03..6bacd515e0eb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1981,6 +1981,7 @@ static inline void rcu_copy_process(struct task_struct *p)
>   	p->rcu_tasks_holdout = false;
>   	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
>   	p->rcu_tasks_idle_cpu = -1;
> +	INIT_LIST_HEAD(&p->rcu_tasks_exit_list);
>   #endif /* #ifdef CONFIG_TASKS_RCU */
>   #ifdef CONFIG_TASKS_TRACE_RCU
>   	p->trc_reader_nesting = 0;
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 732ad5b39946..bd4a51fd5b1f 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -32,6 +32,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
>    * @rtp_irq_work: IRQ work queue for deferred wakeups.
>    * @barrier_q_head: RCU callback for barrier operation.
>    * @rtp_blkd_tasks: List of tasks blocked as readers.
> + * @rtp_exit_list: List of tasks in the latter portion of do_exit().
>    * @cpu: CPU number corresponding to this entry.
>    * @rtpp: Pointer to the rcu_tasks structure.
>    */
> @@ -46,6 +47,7 @@ struct rcu_tasks_percpu {
>   	struct irq_work rtp_irq_work;
>   	struct rcu_head barrier_q_head;
>   	struct list_head rtp_blkd_tasks;
> +	struct list_head rtp_exit_list;
>   	int cpu;
>   	struct rcu_tasks *rtpp;
>   };
> @@ -144,8 +146,6 @@ static struct rcu_tasks rt_name =							\
>   }
>   
>   #ifdef CONFIG_TASKS_RCU
> -/* Track exiting tasks in order to allow them to be waited for. */
> -DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
>   
>   /* Report delay in synchronize_srcu() completion in rcu_tasks_postscan(). */
>   static void tasks_rcu_exit_srcu_stall(struct timer_list *unused);
> @@ -275,6 +275,8 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
>   		rtpcp->rtpp = rtp;
>   		if (!rtpcp->rtp_blkd_tasks.next)
>   			INIT_LIST_HEAD(&rtpcp->rtp_blkd_tasks);
> +		if (!rtpcp->rtp_exit_list.next)
> +			INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
>   	}
>   
>   	pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
> @@ -851,10 +853,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
>   //	number of voluntary context switches, and add that task to the
>   //	holdout list.
>   // rcu_tasks_postscan():
> -//	Invoke synchronize_srcu() to ensure that all tasks that were
> -//	in the process of exiting (and which thus might not know to
> -//	synchronize with this RCU Tasks grace period) have completed
> -//	exiting.
> +//	Gather per-CPU lists of tasks in do_exit() to ensure that all
> +//	tasks that were in the process of exiting (and which thus might
> +//	not know to synchronize with this RCU Tasks grace period) have
> +//	completed exiting.  The synchronize_rcu() in rcu_tasks_postgp()
> +//	will take care of any tasks stuck in the non-preemptible region
> +//	of do_exit() following its call to exit_tasks_rcu_stop().
>   // check_all_holdout_tasks(), repeatedly until holdout list is empty:
>   //	Scans the holdout list, attempting to identify a quiescent state
>   //	for each task on the list.  If there is a quiescent state, the
> @@ -867,8 +871,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
>   //	with interrupts disabled.
>   //
>   // For each exiting task, the exit_tasks_rcu_start() and
> -// exit_tasks_rcu_finish() functions begin and end, respectively, the SRCU
> -// read-side critical sections waited for by rcu_tasks_postscan().
> +// exit_tasks_rcu_finish() functions add and remove, respectively, the
> +// current task to a per-CPU list of tasks that rcu_tasks_postscan() must
> +// wait on.  This is necessary because rcu_tasks_postscan() must wait on
> +// tasks that have already been removed from the global list of tasks.
>   //
>   // Pre-grace-period update-side code is ordered before the grace
>   // via the raw_spin_lock.*rcu_node().  Pre-grace-period read-side code
> @@ -932,9 +938,13 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
>   	}
>   }
>   
> +void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
> +DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
> +
>   /* Processing between scanning taskslist and draining the holdout list. */
>   static void rcu_tasks_postscan(struct list_head *hop)
>   {
> +	int cpu;
>   	int rtsi = READ_ONCE(rcu_task_stall_info);
>   
>   	if (!IS_ENABLED(CONFIG_TINY_RCU)) {
> @@ -948,9 +958,9 @@ static void rcu_tasks_postscan(struct list_head *hop)
>   	 * this, divide the fragile exit path part in two intersecting
>   	 * read side critical sections:
>   	 *
> -	 * 1) An _SRCU_ read side starting before calling exit_notify(),
> -	 *    which may remove the task from the tasklist, and ending after
> -	 *    the final preempt_disable() call in do_exit().
> +	 * 1) A task_struct list addition before calling exit_notify(),
> +	 *    which may remove the task from the tasklist, with the
> +	 *    removal after the final preempt_disable() call in do_exit().
>   	 *
>   	 * 2) An _RCU_ read side starting with the final preempt_disable()
>   	 *    call in do_exit() and ending with the final call to schedule()
> @@ -959,7 +969,18 @@ static void rcu_tasks_postscan(struct list_head *hop)
>   	 * This handles the part 1). And postgp will handle part 2) with a
>   	 * call to synchronize_rcu().
>   	 */
> -	synchronize_srcu(&tasks_rcu_exit_srcu);
> +
> +	for_each_possible_cpu(cpu) {
> +		unsigned long flags;
> +		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
> +		struct task_struct *t;
> +
> +		raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> +		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
> +			if (list_empty(&t->rcu_tasks_holdout_list))
> +				rcu_tasks_pertask(t, hop);
> +		raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> +	}
>   
>   	if (!IS_ENABLED(CONFIG_TINY_RCU))
>   		del_timer_sync(&tasks_rcu_exit_srcu_stall_timer);
> @@ -1027,7 +1048,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
>   	 *
>   	 * In addition, this synchronize_rcu() waits for exiting tasks
>   	 * to complete their final preempt_disable() region of execution,
> -	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
>   	 * enforcing the whole region before tasklist removal until
>   	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
>   	 * read side critical section.
> @@ -1035,9 +1055,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
>   	synchronize_rcu();
>   }
>   
> -void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
> -DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
> -
>   static void tasks_rcu_exit_srcu_stall(struct timer_list *unused)
>   {
>   #ifndef CONFIG_TINY_RCU
> @@ -1147,25 +1164,45 @@ struct task_struct *get_rcu_tasks_gp_kthread(void)
>   EXPORT_SYMBOL_GPL(get_rcu_tasks_gp_kthread);
>   
>   /*
> - * Contribute to protect against tasklist scan blind spot while the
> - * task is exiting and may be removed from the tasklist. See
> - * corresponding synchronize_srcu() for further details.
> + * Protect against tasklist scan blind spot while the task is exiting and
> + * may be removed from the tasklist.  Do this by adding the task to yet
> + * another list.
>    */
> -void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
> +void exit_tasks_rcu_start(void)
>   {
> -	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
> +	unsigned long flags;
> +	struct rcu_tasks_percpu *rtpcp;
> +	struct task_struct *t = current;
> +
> +	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
> +	get_task_struct(t);
> +	preempt_disable();
> +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
> +	t->rcu_tasks_exit_cpu = smp_processor_id();
> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> +	if (!rtpcp->rtp_exit_list.next)
> +		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
> +	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> +	preempt_enable();
>   }
>   
>   /*
> - * Contribute to protect against tasklist scan blind spot while the
> - * task is exiting and may be removed from the tasklist. See
> - * corresponding synchronize_srcu() for further details.
> + * Remove the task from the "yet another list" because do_exit() is now
> + * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
>    */
> -void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
> +void exit_tasks_rcu_stop(void)
>   {
> +	unsigned long flags;
> +	struct rcu_tasks_percpu *rtpcp;
>   	struct task_struct *t = current;
>   
> -	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
> +	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
> +	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> +	list_del_init(&t->rcu_tasks_exit_list);
> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> +	put_task_struct(t);
>   }
>   
>   /*
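
The gathering loop that the quoted patch adds to rcu_tasks_postscan() can be sketched in userspace as follows. This is a simplified model under stated assumptions: the `model_*` names and `MODEL_NR_CPUS` are hypothetical, per-CPU locking is left out, and every gathered task is put on the holdout list unconditionally, whereas the real rcu_tasks_pertask() first decides whether the task needs to be waited on at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 2 /* hypothetical CPU count for this model */

/* Circular doubly linked list in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static inline void list_add_front(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Stand-in for the two task_struct linkages used by the scan. */
struct model_task {
	struct list_head exit_list;    /* on a per-CPU list while in do_exit() */
	struct list_head holdout_list; /* on the holdout list while waited on */
};

/* Recover the task from a pointer to its exit_list member. */
#define model_task_of(ptr) \
	((struct model_task *)((char *)(ptr) - offsetof(struct model_task, exit_list)))

struct list_head model_exit_lists[MODEL_NR_CPUS];

void model_setup(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		list_init(&model_exit_lists[cpu]);
}

void model_task_init(struct model_task *t)
{
	list_init(&t->exit_list);
	list_init(&t->holdout_list);
}

/*
 * Models the loop added to rcu_tasks_postscan(): walk every per-CPU exit
 * list and add each exiting task to the holdouts, skipping tasks that the
 * earlier tasklist scan already queued.  Returns the number gathered.
 */
int model_postscan(struct list_head *holdouts)
{
	int gathered = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		struct list_head *head = &model_exit_lists[cpu];

		for (struct list_head *p = head->next; p != head; p = p->next) {
			struct model_task *t = model_task_of(p);

			if (list_is_empty(&t->holdout_list)) {
				list_add_front(&t->holdout_list, holdouts);
				gathered++;
			}
		}
	}
	return gathered;
}
```

The `list_is_empty(&t->holdout_list)` check is what makes the scan safe to combine with the earlier tasklist scan: a task that is still on the tasklist and also on an exit list is queued only once.
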
Paul E. McKenney Feb. 1, 2024, 1:47 p.m. UTC | #8
On Sat, Jan 27, 2024 at 06:09:05PM +0800, Chen Zhongjin wrote:
> On 2024/1/20 23:30, Paul E. McKenney wrote:

(Apologies for the delay, despite my attempts to make it otherwise,
your email still got dumped into my spam folder.)

> Hi Paul,
> This patch works for my reproducer test case.

Thank you!!!

> Just a small question, if you don't mind: this problem also exists on LTS
> versions, but struct rcu_tasks_percpu has only been available since v5.17,
> so we can't backport this patch as-is to LTS 5.10 or 4.19, which are still
> under maintenance. If you have any idea or plan for applying this patch to
> older versions, please let me know. Thanks very much!

It should be possible to hand-apply the patch.  Or to backport additional
patches to make this one apply cleanly.   Note that in v4.19, the code
is in kernel/rcu/update.c rather than its new home in kernel/rcu/tasks.h.

> Anyway, it's ok to apply this patch to mainline.

May I have your Tested-by?

							Thanx, Paul

Chen Zhongjin Feb. 5, 2024, 6:46 a.m. UTC | #9
On 2024/2/1 21:47, Paul E. McKenney wrote:
> On Sat, Jan 27, 2024 at 06:09:05PM +0800, Chen Zhongjin wrote:
>> On 2024/1/20 23:30, Paul E. McKenney wrote:
> 
> (Apologies for the delay, despite my attempts to make it otherwise,
> your email still got dumped into my spam folder.)
> 
Sorry about that. I'll fix this before I next send mail to the mailing list.

>> Hi Paul,
>> This patch works for my reproducer test case.
> 
> Thank you!!!
> 
>> Just a small question, if you don't mind: this problem also exists on LTS
>> versions, but struct rcu_tasks_percpu has only been available since v5.17,
>> so we can't backport this patch as-is to LTS 5.10 or 4.19, which are still
>> under maintenance. If you have any idea or plan for applying this patch to
>> older versions, please let me know. Thanks very much!
> 
> It should be possible to hand-apply the patch.  Or to backport additional
> patches to make this one apply cleanly.   Note that in v4.19, the code
> is in kernel/rcu/update.c rather than its new home in kernel/rcu/tasks.h.
> 
>> Anyway, it's ok to apply this patch to mainline.
> 
> May I have your Tested-by?
> 
> 							Thanx, Paul
Of course.
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>

Best,
Chen
> 
>> Best,
>> Chen
>>
>>>> Again, that comment reads in full as follows:
>>>>
>>>> 	/*
>>>> 	 * Step 2: Wait for quiesence period to ensure all potentially
>>>> 	 * preempted tasks to have normally scheduled. Because optprobe
>>>> 	 * may modify multiple instructions, there is a chance that Nth
>>>> 	 * instruction is preempted. In that case, such tasks can return
>>>> 	 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
>>>> 	 * Note that on non-preemptive kernel, this is transparently converted
>>>> 	 * to synchronoze_sched() to wait for all interrupts to have completed.
>>>> 	 */
>>>>
>>>> Please note well that first sentence.
>>>>
>>>> Unless that first sentence no longer holds, this patch cannot work
>>>> because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
>>>> preempted tasks.
>>>>
>>>> So how to safely break this deadlock?  Reproducing Chen Zhongjin's
>>>> diagram:
>>>>
>>>> pid A				pid B			pid C
>>>> kprobe_optimizer()		do_exit()		perf_kprobe_init()
>>>> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
>>>> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
>>>> // waiting tasks_rcu_exit_srcu	kernel_wait4()
>>>> 				// waiting pid C exit
>>>>
>>>> We need to stop synchronize_rcu_tasks() from waiting on tasks like
>>>> pid B that are voluntarily blocked.  One way to do that is to replace
>>>> SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
>>>> current task to this list and does ...
>>>>
>>>> OK, this is getting a bit involved.  If you would like to follow along,
>>>> please feel free to look here:
>>>>
>>>> https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing
>>>
>>> And please see below for a prototype patch, which passes moderate
>>> rcutorture testing.
>>>
>>> Chen Zhongjin, does this prevent the deadlock you have been seeing?
>>>
>>> 							Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> commit 113fe0eeabe7a83e87d638d44b9e1d0f9691b146
>>> Author: Paul E. McKenney <paulmck@kernel.org>
>>> Date:   Sat Jan 20 07:07:08 2024 -0800
>>>
>>>       rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
>>>
>>>       Holding a mutex across synchronize_rcu_tasks() and acquiring
>>>       that same mutex in code called from do_exit() after its call to
>>>       exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
>>>       results in deadlock.  This is by design, because tasks that are far
>>>       enough into do_exit() are no longer present on the tasks list, making
>>>       it a bit difficult for RCU Tasks to find them, let alone wait on them
>>>       to do a voluntary context switch.  However, such deadlocks are becoming
>>>       more frequent.  In addition, lockdep currently does not detect such
>>>       deadlocks and they can be difficult to reproduce.
>>>
>>>       In addition, if a task voluntarily context switches during that time
>>>       (for example, if it blocks acquiring a mutex), then this task is in an
>>>       RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
>>>       just as well take advantage of that fact.
>>>
>>>       This commit therefore eliminates these deadlocks by replacing the
>>>       SRCU-based wait for do_exit() completion with per-CPU lists of tasks
>>>       currently exiting.  A given task will be on one of these per-CPU lists for
>>>       the same period of time that this task would previously have been in the
>>>       previous SRCU read-side critical section.  These lists enable RCU Tasks
>>>       to find the tasks that have already been removed from the tasks list,
>>>       but that must nevertheless be waited upon.
>>>
>>>       The RCU Tasks grace period gathers any of these do_exit() tasks that it
>>>       must wait on, and adds them to the list of holdouts.  Per-CPU locking
>>>       and get_task_struct() are used to synchronize addition to and removal
>>>       from these lists.
>>>
>>>       Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
>>>       Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
>>>       Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index dddd10b1b815..3fe36befb613 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -855,6 +855,8 @@ struct task_struct {
>>>    	u8				rcu_tasks_idx;
>>>    	int				rcu_tasks_idle_cpu;
>>>    	struct list_head		rcu_tasks_holdout_list;
>>> +	int				rcu_tasks_exit_cpu;
>>> +	struct list_head		rcu_tasks_exit_list;
>>>    #endif /* #ifdef CONFIG_TASKS_RCU */
>>>    #ifdef CONFIG_TASKS_TRACE_RCU
>>> diff --git a/init/init_task.c b/init/init_task.c
>>> index 5727d42149c3..65f037bff457 100644
>>> --- a/init/init_task.c
>>> +++ b/init/init_task.c
>>> @@ -152,6 +152,7 @@ struct task_struct init_task
>>>    	.rcu_tasks_holdout = false,
>>>    	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
>>>    	.rcu_tasks_idle_cpu = -1,
>>> +	.rcu_tasks_exit_list = LIST_HEAD_INIT(init_task.rcu_tasks_exit_list),
>>>    #endif
>>>    #ifdef CONFIG_TASKS_TRACE_RCU
>>>    	.trc_reader_nesting = 0,
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 10917c3e1f03..6bacd515e0eb 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -1981,6 +1981,7 @@ static inline void rcu_copy_process(struct task_struct *p)
>>>    	p->rcu_tasks_holdout = false;
>>>    	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
>>>    	p->rcu_tasks_idle_cpu = -1;
>>> +	INIT_LIST_HEAD(&p->rcu_tasks_exit_list);
>>>    #endif /* #ifdef CONFIG_TASKS_RCU */
>>>    #ifdef CONFIG_TASKS_TRACE_RCU
>>>    	p->trc_reader_nesting = 0;
>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>>> index 732ad5b39946..bd4a51fd5b1f 100644
>>> --- a/kernel/rcu/tasks.h
>>> +++ b/kernel/rcu/tasks.h
>>> @@ -32,6 +32,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
>>>     * @rtp_irq_work: IRQ work queue for deferred wakeups.
>>>     * @barrier_q_head: RCU callback for barrier operation.
>>>     * @rtp_blkd_tasks: List of tasks blocked as readers.
>>> + * @rtp_exit_list: List of tasks in the latter portion of do_exit().
>>>     * @cpu: CPU number corresponding to this entry.
>>>     * @rtpp: Pointer to the rcu_tasks structure.
>>>     */
>>> @@ -46,6 +47,7 @@ struct rcu_tasks_percpu {
>>>    	struct irq_work rtp_irq_work;
>>>    	struct rcu_head barrier_q_head;
>>>    	struct list_head rtp_blkd_tasks;
>>> +	struct list_head rtp_exit_list;
>>>    	int cpu;
>>>    	struct rcu_tasks *rtpp;
>>>    };
>>> @@ -144,8 +146,6 @@ static struct rcu_tasks rt_name =							\
>>>    }
>>>    #ifdef CONFIG_TASKS_RCU
>>> -/* Track exiting tasks in order to allow them to be waited for. */
>>> -DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
>>>    /* Report delay in synchronize_srcu() completion in rcu_tasks_postscan(). */
>>>    static void tasks_rcu_exit_srcu_stall(struct timer_list *unused);
>>> @@ -275,6 +275,8 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
>>>    		rtpcp->rtpp = rtp;
>>>    		if (!rtpcp->rtp_blkd_tasks.next)
>>>    			INIT_LIST_HEAD(&rtpcp->rtp_blkd_tasks);
>>> +		if (!rtpcp->rtp_exit_list.next)
>>> +			INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
>>>    	}
>>>    	pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
>>> @@ -851,10 +853,12 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
>>>    //	number of voluntary context switches, and add that task to the
>>>    //	holdout list.
>>>    // rcu_tasks_postscan():
>>> -//	Invoke synchronize_srcu() to ensure that all tasks that were
>>> -//	in the process of exiting (and which thus might not know to
>>> -//	synchronize with this RCU Tasks grace period) have completed
>>> -//	exiting.
>>> +//	Gather per-CPU lists of tasks in do_exit() to ensure that all
>>> +//	tasks that were in the process of exiting (and which thus might
>>> +//	not know to synchronize with this RCU Tasks grace period) have
>>> +//	completed exiting.  The synchronize_rcu() in rcu_tasks_postgp()
>>> +//	will take care of any tasks stuck in the non-preemptible region
>>> +//	of do_exit() following its call to exit_tasks_rcu_stop().
>>>    // check_all_holdout_tasks(), repeatedly until holdout list is empty:
>>>    //	Scans the holdout list, attempting to identify a quiescent state
>>>    //	for each task on the list.  If there is a quiescent state, the
>>> @@ -867,8 +871,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
>>>    //	with interrupts disabled.
>>>    //
>>>    // For each exiting task, the exit_tasks_rcu_start() and
>>> -// exit_tasks_rcu_finish() functions begin and end, respectively, the SRCU
>>> -// read-side critical sections waited for by rcu_tasks_postscan().
>>> +// exit_tasks_rcu_finish() functions add and remove, respectively, the
>>> +// current task to a per-CPU list of tasks that rcu_tasks_postscan() must
>>> +// wait on.  This is necessary because rcu_tasks_postscan() must wait on
>>> +// tasks that have already been removed from the global list of tasks.
>>>    //
>>>    // Pre-grace-period update-side code is ordered before the grace
>>>    // via the raw_spin_lock.*rcu_node().  Pre-grace-period read-side code
>>> @@ -932,9 +938,13 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
>>>    	}
>>>    }
>>> +void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
>>> +DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
>>> +
>>>    /* Processing between scanning taskslist and draining the holdout list. */
>>>    static void rcu_tasks_postscan(struct list_head *hop)
>>>    {
>>> +	int cpu;
>>>    	int rtsi = READ_ONCE(rcu_task_stall_info);
>>>    	if (!IS_ENABLED(CONFIG_TINY_RCU)) {
>>> @@ -948,9 +958,9 @@ static void rcu_tasks_postscan(struct list_head *hop)
>>>    	 * this, divide the fragile exit path part in two intersecting
>>>    	 * read side critical sections:
>>>    	 *
>>> -	 * 1) An _SRCU_ read side starting before calling exit_notify(),
>>> -	 *    which may remove the task from the tasklist, and ending after
>>> -	 *    the final preempt_disable() call in do_exit().
>>> +	 * 1) A task_struct list addition before calling exit_notify(),
>>> +	 *    which may remove the task from the tasklist, with the
>>> +	 *    removal after the final preempt_disable() call in do_exit().
>>>    	 *
>>>    	 * 2) An _RCU_ read side starting with the final preempt_disable()
>>>    	 *    call in do_exit() and ending with the final call to schedule()
>>> @@ -959,7 +969,18 @@ static void rcu_tasks_postscan(struct list_head *hop)
>>>    	 * This handles the part 1). And postgp will handle part 2) with a
>>>    	 * call to synchronize_rcu().
>>>    	 */
>>> -	synchronize_srcu(&tasks_rcu_exit_srcu);
>>> +
>>> +	for_each_possible_cpu(cpu) {
>>> +		unsigned long flags;
>>> +		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
>>> +		struct task_struct *t;
>>> +
>>> +		raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
>>> +		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
>>> +			if (list_empty(&t->rcu_tasks_holdout_list))
>>> +				rcu_tasks_pertask(t, hop);
>>> +		raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
>>> +	}
>>>    	if (!IS_ENABLED(CONFIG_TINY_RCU))
>>>    		del_timer_sync(&tasks_rcu_exit_srcu_stall_timer);
>>> @@ -1027,7 +1048,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
>>>    	 *
>>>    	 * In addition, this synchronize_rcu() waits for exiting tasks
>>>    	 * to complete their final preempt_disable() region of execution,
>>> -	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
>>>    	 * enforcing the whole region before tasklist removal until
>>>    	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
>>>    	 * read side critical section.
>>> @@ -1035,9 +1055,6 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
>>>    	synchronize_rcu();
>>>    }
>>> -void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
>>> -DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
>>> -
>>>    static void tasks_rcu_exit_srcu_stall(struct timer_list *unused)
>>>    {
>>>    #ifndef CONFIG_TINY_RCU
>>> @@ -1147,25 +1164,45 @@ struct task_struct *get_rcu_tasks_gp_kthread(void)
>>>    EXPORT_SYMBOL_GPL(get_rcu_tasks_gp_kthread);
>>>    /*
>>> - * Contribute to protect against tasklist scan blind spot while the
>>> - * task is exiting and may be removed from the tasklist. See
>>> - * corresponding synchronize_srcu() for further details.
>>> + * Protect against tasklist scan blind spot while the task is exiting and
>>> + * may be removed from the tasklist.  Do this by adding the task to yet
>>> + * another list.
>>>     */
>>> -void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
>>> +void exit_tasks_rcu_start(void)
>>>    {
>>> -	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
>>> +	unsigned long flags;
>>> +	struct rcu_tasks_percpu *rtpcp;
>>> +	struct task_struct *t = current;
>>> +
>>> +	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
>>> +	get_task_struct(t);
>>> +	preempt_disable();
>>> +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
>>> +	t->rcu_tasks_exit_cpu = smp_processor_id();
>>> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
>>> +	if (!rtpcp->rtp_exit_list.next)
>>> +		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
>>> +	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
>>> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
>>> +	preempt_enable();
>>>    }
>>>    /*
>>> - * Contribute to protect against tasklist scan blind spot while the
>>> - * task is exiting and may be removed from the tasklist. See
>>> - * corresponding synchronize_srcu() for further details.
>>> + * Remove the task from the "yet another list" because do_exit() is now
>>> + * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
>>>     */
>>> -void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
>>> +void exit_tasks_rcu_stop(void)
>>>    {
>>> +	unsigned long flags;
>>> +	struct rcu_tasks_percpu *rtpcp;
>>>    	struct task_struct *t = current;
>>> -	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
>>> +	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
>>> +	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
>>> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
>>> +	list_del_init(&t->rcu_tasks_exit_list);
>>> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
>>> +	put_task_struct(t);
>>>    }
>>>    /*

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..dc6a18854017 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -104,7 +104,7 @@  config STATIC_CALL_SELFTEST
 config OPTPROBES
 	def_bool y
 	depends on KPROBES && HAVE_OPTPROBES
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RUDE_RCU
 
 config KPROBES_ON_FTRACE
 	def_bool y
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d5a0ee40bf66..09056ae50c58 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -623,7 +623,7 @@  static void kprobe_optimizer(struct work_struct *work)
 	 * Note that on non-preemptive kernel, this is transparently converted
 	 * to synchronoze_sched() to wait for all interrupts to have completed.
 	 */
-	synchronize_rcu_tasks();
+	synchronize_rcu_tasks_rude();
 
 	/* Step 3: Optimize kprobes after quiesence period */
 	do_optimize_kprobes();
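
For readers following the thread, the per-CPU exit-list scheme described in the prototype rcu-tasks patch quoted above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `NR_CPUS`, `struct percpu_exit`, `exit_start()`, `exit_stop()` and `postscan_gather()` are hypothetical names, a pthread mutex stands in for the per-CPU raw spinlock, a boolean stands in for membership on the holdout list, and every gathered task is treated as a holdout (the kernel's rcu_tasks_pertask() is more selective). Reference counting via get_task_struct() is omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for the sketch */

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static bool list_is_empty(const struct list_node *n) { return n->next == n; }

static void list_add_head(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_reinit(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

struct task {
	int exit_cpu;               /* which per-CPU list the task is on */
	bool holdout;               /* stand-in for holdout-list membership */
	struct list_node exit_node; /* linkage on the per-CPU exit list */
};

struct percpu_exit {
	pthread_mutex_t lock;       /* stand-in for the per-CPU raw spinlock */
	struct list_node exit_list; /* tasks in the latter part of exit */
};

static struct percpu_exit exit_lists[NR_CPUS];

static void exit_lists_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		pthread_mutex_init(&exit_lists[cpu].lock, NULL);
		list_init(&exit_lists[cpu].exit_list);
	}
}

static void task_init(struct task *t)
{
	t->exit_cpu = -1;
	t->holdout = false;
	list_init(&t->exit_node);
}

/* Analogue of exit_tasks_rcu_start(): publish the exiting task. */
static void exit_start(struct task *t, int cpu)
{
	struct percpu_exit *pe = &exit_lists[cpu];

	assert(list_is_empty(&t->exit_node));
	t->exit_cpu = cpu; /* remembered so removal finds the same list */
	pthread_mutex_lock(&pe->lock);
	list_add_head(&t->exit_node, &pe->exit_list);
	pthread_mutex_unlock(&pe->lock);
}

/* Analogue of exit_tasks_rcu_stop(): the task no longer needs tracking. */
static void exit_stop(struct task *t)
{
	struct percpu_exit *pe = &exit_lists[t->exit_cpu];

	assert(!list_is_empty(&t->exit_node));
	pthread_mutex_lock(&pe->lock);
	list_del_reinit(&t->exit_node);
	pthread_mutex_unlock(&pe->lock);
}

/*
 * Analogue of the loop added to rcu_tasks_postscan(): walk every per-CPU
 * list and mark not-yet-tracked tasks as holdouts.  Returns how many
 * tasks were newly marked.
 */
static int postscan_gather(void)
{
	int gathered = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct percpu_exit *pe = &exit_lists[cpu];

		pthread_mutex_lock(&pe->lock);
		for (struct list_node *n = pe->exit_list.next;
		     n != &pe->exit_list; n = n->next) {
			struct task *t = (struct task *)
				((char *)n - offsetof(struct task, exit_node));

			if (!t->holdout) {
				t->holdout = true;
				gathered++;
			}
		}
		pthread_mutex_unlock(&pe->lock);
	}
	return gathered;
}
```

The point of the design is that the grace period only records which exiting tasks exist and then polls them as holdouts, rather than blocking until every exiting task has finished, so an exiting task that sleeps on a mutex held by the waiter can still be seen as quiescent.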