kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

Message ID 20240117061636.288412-1-chenzhongjin@huawei.com
State Superseded
Series kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

Commit Message

Chen Zhongjin Jan. 17, 2024, 6:16 a.m. UTC
There is a deadlock scenario in kprobe_optimizer():

pid A				pid B			pid C
kprobe_optimizer()		do_exit()		perf_kprobe_init()
mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu	kernel_wait4()
				// waiting pid C exit

To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() also guarantees
that all preempted tasks have been scheduled, but it does not wait on
tasks_rcu_exit_srcu. Since pid B above is blocked inside a tasks_rcu_exit_srcu
read-side section, removing that wait breaks the cycle.
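
Spelled out as an annotated sketch (illustrative only, not real code):

/*
 * A: mutex_lock(&kprobe_mutex);
 *    synchronize_rcu_tasks();       // waits for B to leave its
 *                                   //   tasks_rcu_exit_srcu read side
 * B: exit_tasks_rcu_start();        // srcu_read_lock(&tasks_rcu_exit_srcu)
 *    zap_pid_ns_processes();
 *      kernel_wait4(...);           // waits for C to exit
 * C: perf_kprobe_init();
 *      mutex_lock(&kprobe_mutex);   // waits for A: cycle closed
 */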

Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
---
 arch/Kconfig     | 2 +-
 kernel/kprobes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Andrew Morton Jan. 17, 2024, 8:31 p.m. UTC | #1
On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:

> There is a deadlock scenario in kprobe_optimizer():
> 
> pid A				pid B			pid C
> kprobe_optimizer()		do_exit()		perf_kprobe_init()
> mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu	kernel_wait4()
> 				// waiting pid C exit
> 
> To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() also guarantees
> that all preempted tasks have been scheduled, but it does not wait on tasks_rcu_exit_srcu.
> 
> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>

Thanks.  Should we backport this fix into earlier kernels?  If so, are
we able to identify a suitable Fixes: target?
Paul E. McKenney Jan. 18, 2024, 2:41 p.m. UTC | #2
On Wed, Jan 17, 2024 at 12:31:33PM -0800, Andrew Morton wrote:
> On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:
> 
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> > pid A				pid B			pid C
> > kprobe_optimizer()		do_exit()		perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)	exit_tasks_rcu_start()	mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()		zap_pid_ns_processes()	// waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu	kernel_wait4()
> > 				// waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() also guarantees
> > that all preempted tasks have been scheduled, but it does not wait on tasks_rcu_exit_srcu.

Hello, Chen Zhongjin,

Relying on the non-preemptability of the last portion of the do_exit()
function does appear to be the basis for a much better solution than
what we currently have.  So at the very least, thank you for the idea!!!
I feel a bit silly for not having thought of it myself.  ;-)

However, just invoking synchronize_rcu_tasks_rude() will be bad for both
energy efficiency and real-time response.  This is due to the fact that
synchronize_rcu_tasks_rude() sends an IPI to each and every online CPU,
almost none of which will be in the non-preemptible tail of do_exit()
at any given time.  These additional unnecessary IPIs will drain battery
when they hit idle CPUs and degrade real-time response when they hit
CPUs running aggressive real-time applications.  Which might not make
people happy.
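
For context, a condensed sketch of how the rude variant forces those context
switches, paraphrased from kernel/rcu/tasks.h of this era (treat it as an
approximation; details vary by kernel version):

/* Empty work function: merely running it implies a context switch. */
static void rcu_tasks_be_rude(struct work_struct *work)
{
}

/* Wait for one rude RCU-tasks grace period. */
static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
{
	rtp->n_ipis += cpumask_weight(cpu_online_mask);
	schedule_on_each_cpu(rcu_tasks_be_rude);
}

schedule_on_each_cpu() queues the work item on every online CPU, waking idle
CPUs via IPI, which is exactly the cost described above.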

So, would you be willing to use RCU's do_exit() hooks and RCU's hook
into the scheduler (either rcu_note_context_switch() or rcu_tasks_qs(),
whichever would work better) to maintain a per-CPU variable that is a
pointer to the task in the non-preemptible tail of do_exit() if there
is one or NULL otherwise?  This would get us the deadlock-avoidance
simplicity of your underlying idea, while avoiding (almost all of)
the energy-efficiency and real-time-response issues with your patch.

This does require a bit of memory-ordering care, so if you would prefer
that I do the implementation, just let me know.

(The memory-ordering trick is to use "smp_mb(); smp_load_acquire();" to
sample the per-CPU pointer and "smp_store_release(); smp_mb();" to update it.)
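
As a minimal sketch of that suggestion, using the ordering above (the hook
names here are illustrative placeholders, not existing kernel API, and a real
implementation would also need the scheduler hook and CPU-hotplug handling):

static DEFINE_PER_CPU(struct task_struct *, rcu_exiting_task);

/* Update side: called from do_exit() around its non-preemptible tail. */
static void rcu_exit_tail_enter(struct task_struct *t)
{
	smp_store_release(this_cpu_ptr(&rcu_exiting_task), t);
	smp_mb();	/* Pairs with smp_mb() on the sampling side. */
}

static void rcu_exit_tail_leave(void)
{
	smp_store_release(this_cpu_ptr(&rcu_exiting_task), NULL);
	smp_mb();
}

/* Sampling side: which task, if any, is in the do_exit() tail on @cpu? */
static struct task_struct *rcu_exit_tail_task(int cpu)
{
	smp_mb();	/* Pairs with smp_mb() on the update side. */
	return smp_load_acquire(per_cpu_ptr(&rcu_exiting_task, cpu));
}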

							Thanx, Paul

> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
> 
> Thanks.  Should we backport this fix into earlier kernels?  If so, are
> we able to identify a suitable Fixes: target?

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..dc6a18854017 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -104,7 +104,7 @@  config STATIC_CALL_SELFTEST
 config OPTPROBES
 	def_bool y
 	depends on KPROBES && HAVE_OPTPROBES
-	select TASKS_RCU if PREEMPTION
+	select TASKS_RUDE_RCU
 
 config KPROBES_ON_FTRACE
 	def_bool y
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d5a0ee40bf66..09056ae50c58 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -623,7 +623,7 @@  static void kprobe_optimizer(struct work_struct *work)
 	 * Note that on non-preemptive kernel, this is transparently converted
 	 * to synchronoze_sched() to wait for all interrupts to have completed.
 	 */
-	synchronize_rcu_tasks();
+	synchronize_rcu_tasks_rude();
 
 	/* Step 3: Optimize kprobes after quiesence period */
 	do_optimize_kprobes();