
tracing: Refuse fprobe if RCU is not watching

Message ID 20230321020103.13494-1-laoar.shao@gmail.com (mailing list archive)
State Superseded
Series: tracing: Refuse fprobe if RCU is not watching

Commit Message

Yafang Shao March 21, 2023, 2:01 a.m. UTC
It hits the below warning on my test machine when running
selftests/bpf/test_progs,

[  702.223611] ------------[ cut here ]------------
[  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
[  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
[  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
[  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
[  702.241388] Call Trace:
[  702.241615]  <TASK>
[  702.241811]  fprobe_handler+0x22/0x30
[  702.242129]  0xffffffffc04710f7
[  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
[  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
[  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
[  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
[  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
[  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
[  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
[  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
[  702.250785]  ? preempt_count_sub+0x5/0xa0
[  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
[  702.252368]  ? preempt_count_sub+0x5/0xa0
[  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
[  702.253918]  do_syscall_64+0x16/0x90
[  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  702.255422] RIP: 0033:0x46b793

This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
enters from user mode to kernel mode, or enters from user mode to irq,
it executes preempt_count_sub before RCU begins watching, and thus this
warning is triggered.

We should not handle fprobe if RCU is not watching.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Jiri Olsa <olsajiri@gmail.com>
---
 kernel/trace/fprobe.c | 3 +++
 1 file changed, 3 insertions(+)
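
The patch itself (quoted in full in the replies below) is the following
three-line change to fprobe_handler():

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index e8143e3..fe4b248 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
 	struct fprobe *fp;
 	int bit;
 
+	if (!rcu_is_watching())
+		return;
+
 	fp = container_of(ops, struct fprobe, ops);
 	if (fprobe_disabled(fp))
 		return;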

Comments

Masami Hiramatsu (Google) March 21, 2023, 2:12 p.m. UTC | #1
On Tue, 21 Mar 2023 02:01:03 +0000
Yafang Shao <laoar.shao@gmail.com> wrote:

> It hits the below warning on my test machine when running
> selftests/bpf/test_progs,
> 
> [  702.223611] ------------[ cut here ]------------
> [  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
> [  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
> [  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
> [  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
> [  702.241388] Call Trace:
> [  702.241615]  <TASK>
> [  702.241811]  fprobe_handler+0x22/0x30
> [  702.242129]  0xffffffffc04710f7
> [  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
> [  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
> [  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
> [  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
> [  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
> [  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
> [  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
> [  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
> [  702.250785]  ? preempt_count_sub+0x5/0xa0
> [  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.252368]  ? preempt_count_sub+0x5/0xa0
> [  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.253918]  do_syscall_64+0x16/0x90
> [  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  702.255422] RIP: 0033:0x46b793
> 
> This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
> enters from user mode to kernel mode, or enters from user mode to irq,
> it executes preempt_count_sub before RCU begins watching, and thus this
> warning is triggered.
> 
> We should not handle fprobe if RCU is not watching.

Good catch!

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thank you!

> 
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: Jiri Olsa <olsajiri@gmail.com>
> ---
>  kernel/trace/fprobe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index e8143e3..fe4b248 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
>  	struct fprobe *fp;
>  	int bit;
>  
> +	if (!rcu_is_watching())
> +		return;
> +
>  	fp = container_of(ops, struct fprobe, ops);
>  	if (fprobe_disabled(fp))
>  		return;
> -- 
> 1.8.3.1
>
Steven Rostedt March 21, 2023, 2:17 p.m. UTC | #2
On Tue, 21 Mar 2023 02:01:03 +0000
Yafang Shao <laoar.shao@gmail.com> wrote:

> It hits the below warning on my test machine when running
> selftests/bpf/test_progs,
> 
> [  702.223611] ------------[ cut here ]------------
> [  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
> [  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
> [  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
> [  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
> [  702.241388] Call Trace:
> [  702.241615]  <TASK>
> [  702.241811]  fprobe_handler+0x22/0x30
> [  702.242129]  0xffffffffc04710f7
> [  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
> [  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
> [  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
> [  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
> [  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
> [  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
> [  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
> [  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
> [  702.250785]  ? preempt_count_sub+0x5/0xa0
> [  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.252368]  ? preempt_count_sub+0x5/0xa0
> [  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.253918]  do_syscall_64+0x16/0x90
> [  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  702.255422] RIP: 0033:0x46b793
> 
> This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
> enters from user mode to kernel mode, or enters from user mode to irq,
> it executes preempt_count_sub before RCU begins watching, and thus this
> warning is triggered.
> 
> We should not handle fprobe if RCU is not watching.
> 
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: Jiri Olsa <olsajiri@gmail.com>
> ---
>  kernel/trace/fprobe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index e8143e3..fe4b248 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
>  	struct fprobe *fp;
>  	int bit;
>  
> +	if (!rcu_is_watching())
> +		return;

Hmm, at least on 6.3, this should not be an issue anymore. I believe that
all locations that have ftrace callbacks should now have rcu watching?

I think we *want* a warn on when this happens.

Peter?

-- Steve


> +
>  	fp = container_of(ops, struct fprobe, ops);
>  	if (fprobe_disabled(fp))
>  		return;
Jiri Olsa March 21, 2023, 2:43 p.m. UTC | #3
On Tue, Mar 21, 2023 at 02:01:03AM +0000, Yafang Shao wrote:
> It hits the below warning on my test machine when running
> selftests/bpf/test_progs,
> 
> [  702.223611] ------------[ cut here ]------------
> [  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
> [  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
> [  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
> [  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
> [  702.241388] Call Trace:
> [  702.241615]  <TASK>
> [  702.241811]  fprobe_handler+0x22/0x30
> [  702.242129]  0xffffffffc04710f7
> [  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
> [  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
> [  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
> [  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
> [  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
> [  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
> [  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
> [  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
> [  702.250785]  ? preempt_count_sub+0x5/0xa0
> [  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.252368]  ? preempt_count_sub+0x5/0xa0
> [  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
> [  702.253918]  do_syscall_64+0x16/0x90
> [  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  702.255422] RIP: 0033:0x46b793
> 
> This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
> enters from user mode to kernel mode, or enters from user mode to irq,
> it executes preempt_count_sub before RCU begins watching, and thus this
> warning is triggered.
> 
> We should not handle fprobe if RCU is not watching.
> 
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: Jiri Olsa <olsajiri@gmail.com>

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

> ---
>  kernel/trace/fprobe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index e8143e3..fe4b248 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
>  	struct fprobe *fp;
>  	int bit;
>  
> +	if (!rcu_is_watching())
> +		return;
> +
>  	fp = container_of(ops, struct fprobe, ops);
>  	if (fprobe_disabled(fp))
>  		return;
> -- 
> 1.8.3.1
>
Peter Zijlstra March 21, 2023, 3:50 p.m. UTC | #4
On Tue, Mar 21, 2023 at 10:17:11AM -0400, Steven Rostedt wrote:
> On Tue, 21 Mar 2023 02:01:03 +0000
> Yafang Shao <laoar.shao@gmail.com> wrote:
> 
> > It hits the below warning on my test machine when running
> > selftests/bpf/test_progs,
> > 
> > [  702.223611] ------------[ cut here ]------------
> > [  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
> > [  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
> > [  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
> > [  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
> > [  702.241388] Call Trace:
> > [  702.241615]  <TASK>
> > [  702.241811]  fprobe_handler+0x22/0x30
> > [  702.242129]  0xffffffffc04710f7
> > [  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
> > [  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
> > [  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
> > [  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
> > [  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
> > [  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
> > [  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
> > [  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
> > [  702.250785]  ? preempt_count_sub+0x5/0xa0
> > [  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
> > [  702.252368]  ? preempt_count_sub+0x5/0xa0
> > [  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
> > [  702.253918]  do_syscall_64+0x16/0x90
> > [  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [  702.255422] RIP: 0033:0x46b793
> > 
> > This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
> > enters from user mode to kernel mode, or enters from user mode to irq,
> > it executes preempt_count_sub before RCU begins watching, and thus this
> > warning is triggered.
> > 
> > We should not handle fprobe if RCU is not watching.
> > 
> > Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Cc: Jiri Olsa <olsajiri@gmail.com>
> > ---
> >  kernel/trace/fprobe.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > index e8143e3..fe4b248 100644
> > --- a/kernel/trace/fprobe.c
> > +++ b/kernel/trace/fprobe.c
> > @@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
> >  	struct fprobe *fp;
> >  	int bit;
> >  
> > +	if (!rcu_is_watching())
> > +		return;
> 
> Hmm, at least on 6.3, this should not be an issue anymore. I believe that
> all locations that have ftrace callbacks should now have rcu watching?
> 
> I think we *want* a warn on when this happens.
> 
> Peter?

You always want a warning, because silently dropping stuff is very poor
form. But yes, we must not enter tracing when RCU isn't watching; that's
a fundamental failure and should be fixed.
Yafang Shao March 23, 2023, 5:59 a.m. UTC | #5
On Tue, Mar 21, 2023 at 10:17 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 21 Mar 2023 02:01:03 +0000
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > It hits the below warning on my test machine when running
> > selftests/bpf/test_progs,
> >
> > [  702.223611] ------------[ cut here ]------------
> > [  702.224168] RCU not on for: preempt_count_sub+0x0/0xa0
> > [  702.224770] WARNING: CPU: 14 PID: 5267 at include/linux/trace_recursion.h:162 fprobe_handler.part.0+0x1b8/0x1c0
> > [  702.231740] CPU: 14 PID: 5267 Comm: main_amd64 Kdump: loaded Tainted: G           O       6.2.0+ #584
> > [  702.233169] RIP: 0010:fprobe_handler.part.0+0x1b8/0x1c0
> > [  702.241388] Call Trace:
> > [  702.241615]  <TASK>
> > [  702.241811]  fprobe_handler+0x22/0x30
> > [  702.242129]  0xffffffffc04710f7
> > [  702.242417] RIP: 0010:preempt_count_sub+0x5/0xa0
> > [  702.242809] Code: c8 50 68 94 42 0e b5 48 cf e9 f9 fd ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 4b cd 38 0b <55> 8b 0d 9c d0 cf 02 48 89 e5 85 c9 75 1b 65 8b 05 be 78 f4 4a 89
> > [  702.244752] RSP: 0018:ffffaf6187d27f10 EFLAGS: 00000082 ORIG_RAX: 0000000000000000
> > [  702.245801] RAX: 000000000000000e RBX: 0000000001b6ab72 RCX: 0000000000000000
> > [  702.246804] RDX: 0000000000000000 RSI: ffffffffb627967d RDI: 0000000000000001
> > [  702.247801] RBP: ffffaf6187d27f30 R08: 0000000000000000 R09: 0000000000000000
> > [  702.248786] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ca
> > [  702.249782] R13: ffffaf6187d27f58 R14: 0000000000000000 R15: 0000000000000000
> > [  702.250785]  ? preempt_count_sub+0x5/0xa0
> > [  702.251540]  ? syscall_enter_from_user_mode+0x96/0xc0
> > [  702.252368]  ? preempt_count_sub+0x5/0xa0
> > [  702.253104]  ? syscall_enter_from_user_mode+0x96/0xc0
> > [  702.253918]  do_syscall_64+0x16/0x90
> > [  702.254613]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [  702.255422] RIP: 0033:0x46b793
> >
> > This issue happens under CONFIG_CONTEXT_TRACKING_USER=y. When a task
> > enters from user mode to kernel mode, or enters from user mode to irq,
> > it executes preempt_count_sub before RCU begins watching, and thus this
> > warning is triggered.
> >
> > We should not handle fprobe if RCU is not watching.
> >
> > Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Cc: Jiri Olsa <olsajiri@gmail.com>
> > ---
> >  kernel/trace/fprobe.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > index e8143e3..fe4b248 100644
> > --- a/kernel/trace/fprobe.c
> > +++ b/kernel/trace/fprobe.c
> > @@ -27,6 +27,9 @@ static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
> >       struct fprobe *fp;
> >       int bit;
> >
> > +     if (!rcu_is_watching())
> > +             return;
>
> Hmm, at least on 6.3, this should not be an issue anymore. I believe that
> all locations that have ftrace callbacks should now have rcu watching?
>

Hi Steven,

I have verified the latest linux-trace tree,
    git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/core

The result of "uname -r" is "6.3.0-rc3+".
This issue still exists, and after applying this patch it disappears.
It can be reproduced with a simple bpf program as follows,
    SEC("kprobe.multi/preempt_count_sub")
    int fprobe_test()
    {
        return 0;
    }
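
(For context: a minimal libbpf loader for the snippet above might look
like the sketch below. The object file name and the libbpf 1.0+ error
conventions are assumptions for illustration, not something taken from
this thread.)

    /* repro.c -- hypothetical loader for the reproducer above */
    #include <stdio.h>
    #include <bpf/libbpf.h>

    int main(void)
    {
    	struct bpf_object *obj;
    	struct bpf_program *prog;
    	struct bpf_link *link;

    	obj = bpf_object__open_file("repro.bpf.o", NULL);
    	if (!obj)
    		return 1;
    	if (bpf_object__load(obj))
    		return 1;

    	prog = bpf_object__find_program_by_name(obj, "fprobe_test");
    	if (!prog)
    		return 1;

    	/* SEC("kprobe.multi/...") attaches via a kprobe_multi link,
    	 * which is what ends up invoking fprobe_handler(). */
    	link = bpf_program__attach(prog);
    	if (!link)
    		return 1;

    	getchar();	/* keep the probe attached until enter is pressed */

    	bpf_link__destroy(link);
    	bpf_object__close(obj);
    	return 0;
    }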

> I think we *want* a warn on when this happens.
>
> Peter?
>
Steven Rostedt March 23, 2023, 12:39 p.m. UTC | #6
On Thu, 23 Mar 2023 13:59:34 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> I have verified the latest linux-trace tree,
>     git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> trace/core
> 
> The result of "uname -r" is "6.3.0-rc3+".
> This issue still exists, and after applying this patch it disappears.
> It can be reproduced with a simple bpf program as follows,
>     SEC("kprobe.multi/preempt_count_sub")
>     int fprobe_test()
>     {
>         return 0;
>     }

Still your patch is hiding a bug, not fixing one.

Can you apply this patch and see if the bug goes away?

-- Steve


diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d4afa8508a80..3a7909ed5498 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -96,6 +96,7 @@ static inline void user_exit_irqoff(void) { }
 static inline int exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline int ct_state(void) { return -1; }
+static inline int __ct_state(void) { return -1; }
 static __always_inline bool context_tracking_guest_enter(void) { return false; }
 static inline void context_tracking_guest_exit(void) { }
 #define CT_WARN_ON(cond) do { } while (0)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 846add8394c4..1314894d2efa 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -21,7 +21,7 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	arch_enter_from_user_mode(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
-	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	CT_WARN_ON(__ct_state() != CONTEXT_USER);
 	user_exit_irqoff();
 
 	instrumentation_begin();
Yafang Shao March 24, 2023, 2:51 a.m. UTC | #7
On Thu, Mar 23, 2023 at 8:39 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 23 Mar 2023 13:59:34 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > I have verified the latest linux-trace tree,
> >     git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> > trace/core
> >
> > The result of "uname -r" is "6.3.0-rc3+".
> > This issue still exists, and after applying this patch it disappears.
> > It can be reproduced with a simple bpf program as follows,
> >     SEC("kprobe.multi/preempt_count_sub")
> >     int fprobe_test()
> >     {
> >         return 0;
> >     }
>
> Still your patch is hiding a bug, not fixing one.
>
> Can you apply this patch and see if the bug goes away?
>

I have verified that the bug goes away after applying this patch.
Thanks for the fix.
Steven Rostedt March 24, 2023, 3:01 a.m. UTC | #8
On Fri, 24 Mar 2023 10:51:49 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> On Thu, Mar 23, 2023 at 8:39 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 23 Mar 2023 13:59:34 +0800
> > Yafang Shao <laoar.shao@gmail.com> wrote:
> >  
> > > I have verified the latest linux-trace tree,
> > >     git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> > > trace/core
> > >
> > > The result of "uname -r" is "6.3.0-rc3+".
> > > This issue still exists, and after applying this patch it disappears.
> > > It can be reproduced with a simple bpf program as follows,
> > >     SEC("kprobe.multi/preempt_count_sub")
> > >     int fprobe_test()
> > >     {
> > >         return 0;
> > >     }  
> >
> > Still your patch is hiding a bug, not fixing one.
> >
> > Can you apply this patch and see if the bug goes away?
> >  
> 
> I have verified that the bug goes away after applying this patch.
> Thanks for the fix.
> 

It's in tip:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=f87d28673b71b35b248231a2086f9404afbb7f28

Hopefully it will make it to mainline soon.

-- Steve
Yafang Shao April 9, 2023, 5:32 a.m. UTC | #9
On Fri, Mar 24, 2023 at 11:01 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 24 Mar 2023 10:51:49 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > On Thu, Mar 23, 2023 at 8:39 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Thu, 23 Mar 2023 13:59:34 +0800
> > > Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > > I have verified the latest linux-trace tree,
> > > >     git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> > > > trace/core
> > > >
> > > > The result of "uname -r" is "6.3.0-rc3+".
> > > > This issue still exists, and after applying this patch it disappears.
> > > > It can be reproduced with a simple bpf program as follows,
> > > >     SEC("kprobe.multi/preempt_count_sub")
> > > >     int fprobe_test()
> > > >     {
> > > >         return 0;
> > > >     }
> > >
> > > Still your patch is hiding a bug, not fixing one.
> > >
> > > Can you apply this patch and see if the bug goes away?
> > >
> >
> > I have verified that the bug goes away after applying this patch.
> > Thanks for the fix.
> >
>
> It's in tip:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=f87d28673b71b35b248231a2086f9404afbb7f28
>
> Hopefully it will make it to mainline soon.
>

Hi Steven,

When I was trying to attach fentry to preempt_count_{sub,add}, the
kernel just crashed. The crash info as follows,

[  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf
(stack is 0000000046a46a15..00000000537e7b28)
[  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
[  867.843071] RIP: 0010:exc_int3+0x6/0xe0
[  867.843078] Code: e9 a6 fe ff ff e8 6a 3d 00 00 66 2e 0f 1f 84 00
00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89
e5 41 55 <41> 54 49 89 fc e8 10 11 00 00 85 c0 75 31 4c 89 e7 41 f6 84
24 88
[  867.843080] RSP: 0018:ffffaaac44f1c000 EFLAGS: 00010093
[  867.843083] RAX: ffffaaac44f1c018 RBX: 0000000000000000 RCX: ffffffff98e0102d
[  867.843085] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffaaac44f1c018
[  867.843086] RBP: ffffaaac44f1c008 R08: 0000000000000000 R09: 0000000000000000
[  867.843087] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  867.843089] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  867.843092] FS:  00007f8af6fbe740(0000) GS:ffff96d77f800000(0000)
knlGS:0000000000000000
[  867.843094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  867.843096] CR2: ffffaaac44f1bff8 CR3: 0000000105a9c002 CR4: 0000000000770ee0
[  867.843097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  867.843098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  867.843099] PKRU: 55555554
[  867.843100] Call Trace:
[  867.843101]  <TASK>
[  867.843104]  asm_exc_int3+0x3a/0x40
[  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
[  867.843112] Code: c7 c7 40 06 ff 9a 48 89 e5 e8 8b c6 1d 00 5d c3
cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
90 90 cc <1f> 44 00 00 55 8b 0d 2c 60 f0 02 48 89 e5 85 c9 75 1b 65 8b
05 4e
[  867.843113] RSP: 0018:ffffaaac44f1c0f0 EFLAGS: 00000002
[  867.843115] RAX: 0000000000000001 RBX: ffff96d77f82c380 RCX: 0000000000000000
[  867.843116] RDX: 0000000000000000 RSI: ffffffff9947d6fd RDI: 0000000000000001
[  867.843117] RBP: ffffaaac44f1c108 R08: 0000000000000020 R09: 0000000000000000
[  867.843118] R10: 0000000000000000 R11: 0000000040000000 R12: ffff96c886c3c000
[  867.843119] R13: 0000000000000009 R14: ffff96c880050000 R15: ffff96c8800504b8
[  867.843128]  ? preempt_count_sub+0x1/0xa0
[  867.843131]  ? migrate_disable+0x77/0x90
[  867.843135]  __bpf_prog_enter_recur+0x17/0x90
[  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
[  867.843154]  ? preempt_count_sub+0x1/0xa0
[  867.843157]  preempt_count_sub+0x5/0xa0
[  867.843159]  ? migrate_enable+0xac/0xf0
[  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
...
[  867.843788]  preempt_count_sub+0x5/0xa0
[  867.843793]  ? migrate_enable+0xac/0xf0
[  867.843829]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228
(stack is 00000000b23e2bc4..000000006d95af35)
[  867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924
(stack is 00000000ffd69623..0000000014eb594c)
[  867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0
(stack is 00000000034b6438..0000000078d1bcec)
[  867.843842]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.843845] BUG: IRQ stack guard page was hit at 00000000eb4e0327
(stack is 0000000051fdb66e..000000000b8aec75)
[  867.843868] BUG: IRQ stack guard page was hit at 00000000bc042b0f
(stack is 00000000c47a7d1a..00000000dbf95de4)
[  867.843879]  preempt_count_sub+0x5/0xa0
[  867.843883]  ? migrate_enable+0xac/0xf0
[  867.843884] BUG: IRQ stack guard page was hit at 0000000057cdfb36
(stack is 00000000891fe30a..00000000ec0deb36)
[  867.843885] BUG: IRQ stack guard page was hit at 00000000fe8f1d98
(stack is 000000001dd7c502..0000000012a36ba8)
[  867.843896] BUG: IRQ stack guard page was hit at 000000000142218d
(stack is 000000003183818c..00000000e07aa409)
[  867.843905]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843908] BUG: IRQ stack guard page was hit at 00000000cb520206
(stack is 00000000c91ca3c1..000000009babc41a)
[  867.843916]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.843922] BUG: IRQ stack guard page was hit at 00000000f11fb13e
(stack is 000000001783e178..0000000081ea69f6)
[  867.843928] BUG: IRQ stack guard page was hit at 000000004595335f
(stack is 00000000fb0c1e0a..00000000709006eb)
[  867.843928] BUG: IRQ stack guard page was hit at 000000002e4f1f60
(stack is 0000000058872888..00000000a19cf709)
[  867.843935] BUG: IRQ stack guard page was hit at 000000008565ee1f
(stack is 000000001fb0ab41..000000005f350279)
[  867.843935] BUG: IRQ stack guard page was hit at 00000000f5b6f09c
(stack is 00000000c559990d..000000009d9c1743)
[  867.843936] BUG: IRQ stack guard page was hit at 000000007e7f0f94
(stack is 00000000fdf8aa59..000000007bc6ea3c)
[  867.843938]  preempt_count_sub+0x5/0xa0
[  867.843941] BUG: IRQ stack guard page was hit at 0000000059318181
(stack is 0000000090d665e0..000000003fd542c0)
[  867.843944]  ? migrate_enable+0xac/0xf0
[  867.843954] BUG: IRQ stack guard page was hit at 00000000ad5083b1
(stack is 00000000461d91de..000000007d780ece)
[  867.843957]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843956] BUG: IRQ stack guard page was hit at 00000000d3cb4856
(stack is 000000003c68a653..0000000008d44434)
[  867.843958] BUG: IRQ stack guard page was hit at 00000000603028eb
(stack is 0000000004727704..00000000c9a13aea)
[  867.843962] BUG: IRQ stack guard page was hit at 000000002ec0341b
(stack is 000000005380b682..00000000ebf37ef3)
[  867.843962] BUG: IRQ stack guard page was hit at 00000000b1fdeb3b
(stack is 000000005165c122..000000009bb21a76)
[  867.843963]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.843965] BUG: IRQ stack guard page was hit at 00000000b346fffe
(stack is 00000000a66e8ebf..00000000dc1301b4)
[  867.843967] BUG: IRQ stack guard page was hit at 0000000091f94ea6
(stack is 0000000027387f08..00000000e26754af)
[  867.843970] BUG: IRQ stack guard page was hit at 000000003db0b944
(stack is 00000000461caa85..00000000ec9e1206)
[  867.843969] BUG: IRQ stack guard page was hit at 00000000805a1c05
(stack is 000000008babf572..0000000050ea07c6)
[  867.843971]  preempt_count_sub+0x5/0xa0
[  867.843971] BUG: IRQ stack guard page was hit at 00000000bc064bef
(stack is 000000000af91875..00000000199ce5b2)
[  867.843974]  ? migrate_enable+0xac/0xf0
[  867.843975] BUG: IRQ stack guard page was hit at 00000000ca074a82
(stack is 0000000044474ecd..00000000a7810d3e)
[  867.843976] BUG: IRQ stack guard page was hit at 00000000fa662e1f
(stack is 00000000664c1004..0000000058f78c52)
[  867.843975] BUG: IRQ stack guard page was hit at 000000003fb16f99
(stack is 00000000e1751c78..0000000068b06279)
[  867.843978] BUG: IRQ stack guard page was hit at 000000002e4315e3
(stack is 00000000daa8a157..00000000cfd7e703)
[  867.843979]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843979] BUG: IRQ stack guard page was hit at 000000009404a418
(stack is 0000000029b3d6a5..00000000b47c3671)
[  867.843983]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.843990]  preempt_count_sub+0x5/0xa0
[  867.843992]  ? migrate_enable+0xac/0xf0
[  867.843997]  __bpf_prog_exit_recur+0x2d/0x40
...
[  867.845387]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.845394]  preempt_count_sub+0x5/0xa0
[  867.845396]  ? migrate_enable+0xac/0xf0
[  867.845401]  __bpf_prog_exit_recur+0x2d/0x40
[  867.845404]  bpf_trampoline_6442468108_0+0x55/0x1000
[  867.845411]  preempt_count_sub+0x5/0xa0
[  867.845413]  ? flush_tlb_mm_range+0x11b/0x190
[  867.845417]  ? __pfx_preempt_count_sub+0x10/0x10
[  867.845421]  __text_poke+0x2c0/0x460
[  867.845426]  ? __pfx_text_poke_memcpy+0x10/0x10
[  867.845431]  text_poke_bp_batch+0x8c/0x2d0
[  867.845434]  ? __static_call_text_end+0x96a68/0x1ca748
[  867.845440]  text_poke_finish+0x1f/0x40
[  867.845442]  ftrace_replace_code+0x1a1/0x1f0
[  867.845448]  ftrace_modify_all_code+0xc6/0x180
[  867.845454]  arch_ftrace_update_code+0x9/0x10
[  867.845456]  ftrace_startup+0xd7/0x190
[  867.845459]  ? 0xffffffffc0556000
[  867.845468]  register_ftrace_direct_multi+0x22e/0x250
[  867.845475]  ? 0xffffffffc0556000
[  867.845478]  bpf_trampoline_update+0x28a/0x4c0
[  867.845481]  ? __pfx_preempt_count_sub+0x10/0x10
[  867.845486]  __bpf_trampoline_link_prog+0xd3/0x1d0
[  867.845490]  bpf_trampoline_link_prog+0x2c/0x50
[  867.845494]  bpf_tracing_prog_attach+0x3ab/0x5d0
[  867.845498]  ? trace_hardirqs_off+0x10/0x20
[  867.845508]  link_create+0x11d/0x260
[  867.845510]  ? security_bpf+0x32/0x50
[  867.845518]  __sys_bpf+0x6e6/0xdb0
[  867.845530]  __x64_sys_bpf+0x1a/0x30
[  867.845532]  do_syscall_64+0x38/0x90
[  867.845536]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  867.845538] RIP: 0033:0x7f8af60f8e29
[  867.845542] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89
01 48
[  867.845543] RSP: 002b:00007ffcf51a64e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000141
[  867.845546] RAX: ffffffffffffffda RBX: 00007ffcf51a65d0 RCX: 00007f8af60f8e29
[  867.845547] RDX: 0000000000000030 RSI: 00007ffcf51a6500 RDI: 000000000000001c
[  867.845549] RBP: 0000000000000018 R08: 0000000000000020 R09: 0000000000000000
[  867.845550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[  867.845551] R13: 0000000000000006 R14: 0000000000922010 R15: 0000000000000000
[  867.845561]  </TASK>

The reason is that migrate_disable() is called before entering the bpf prog
and migrate_enable() is called after the bpf prog exits. In
migrate_disable(), preempt_count_{add,sub} will be called, so a bpf prog
attached to those functions ends up recursing endlessly. We can't avoid
calling preempt_count_{add,sub} in this path, so we have to hide them
from ftrace so that they can't be traced.

So I think we'd better fix it with the below change, what do you think ?

---
 kernel/sched/core.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e0..b049a07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5758,7 +5758,7 @@ static inline void preempt_latency_start(int val)
  }
 }

-void preempt_count_add(int val)
+void notrace preempt_count_add(int val)
 {
 #ifdef CONFIG_DEBUG_PREEMPT
  /*
@@ -5778,7 +5778,6 @@ void preempt_count_add(int val)
  preempt_latency_start(val);
 }
 EXPORT_SYMBOL(preempt_count_add);
-NOKPROBE_SYMBOL(preempt_count_add);

 /*
  * If the value passed in equals to the current preempt count
@@ -5790,7 +5789,7 @@ static inline void preempt_latency_stop(int val)
  trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
 }

-void preempt_count_sub(int val)
+void notrace preempt_count_sub(int val)
 {
 #ifdef CONFIG_DEBUG_PREEMPT
  /*
@@ -5810,7 +5809,6 @@ void preempt_count_sub(int val)
  __preempt_count_sub(val);
 }
 EXPORT_SYMBOL(preempt_count_sub);
-NOKPROBE_SYMBOL(preempt_count_sub);

 #else
 static inline void preempt_latency_start(int val) { }
Steven Rostedt April 9, 2023, 11:55 a.m. UTC | #10
On Sun, 9 Apr 2023 13:32:12 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:
> 
> Hi Steven,
> 
> When I was trying to attach fentry to preempt_count_{sub,add}, the
> kernel just crashed. The crash info as follows,
> 
> [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf
> (stack is 0000000046a46a15..00000000537e7b28)
> [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
> [  867.843071] RIP: 0010:exc_int3+0x6/0xe0
> [  867.843078] Code: e9 a6 fe ff ff e8 6a 3d 00 00 66 2e 0f 1f 84 00
> 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89
> e5 41 55 <41> 54 49 89 fc e8 10 11 00 00 85 c0 75 31 4c 89 e7 41 f6 84
> 24 88
> [  867.843080] RSP: 0018:ffffaaac44f1c000 EFLAGS: 00010093
> [  867.843083] RAX: ffffaaac44f1c018 RBX: 0000000000000000 RCX: ffffffff98e0102d
> [  867.843085] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffaaac44f1c018
> [  867.843086] RBP: ffffaaac44f1c008 R08: 0000000000000000 R09: 0000000000000000
> [  867.843087] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  867.843089] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  867.843092] FS:  00007f8af6fbe740(0000) GS:ffff96d77f800000(0000)
> knlGS:0000000000000000
> [  867.843094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  867.843096] CR2: ffffaaac44f1bff8 CR3: 0000000105a9c002 CR4: 0000000000770ee0
> [  867.843097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  867.843098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  867.843099] PKRU: 55555554
> [  867.843100] Call Trace:
> [  867.843101]  <TASK>
> [  867.843104]  asm_exc_int3+0x3a/0x40
> [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
> [  867.843112] Code: c7 c7 40 06 ff 9a 48 89 e5 e8 8b c6 1d 00 5d c3
> cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> 90 90 cc <1f> 44 00 00 55 8b 0d 2c 60 f0 02 48 89 e5 85 c9 75 1b 65 8b
> 05 4e
> [  867.843113] RSP: 0018:ffffaaac44f1c0f0 EFLAGS: 00000002
> [  867.843115] RAX: 0000000000000001 RBX: ffff96d77f82c380 RCX: 0000000000000000
> [  867.843116] RDX: 0000000000000000 RSI: ffffffff9947d6fd RDI: 0000000000000001
> [  867.843117] RBP: ffffaaac44f1c108 R08: 0000000000000020 R09: 0000000000000000
> [  867.843118] R10: 0000000000000000 R11: 0000000040000000 R12: ffff96c886c3c000
> [  867.843119] R13: 0000000000000009 R14: ffff96c880050000 R15: ffff96c8800504b8
> [  867.843128]  ? preempt_count_sub+0x1/0xa0
> [  867.843131]  ? migrate_disable+0x77/0x90
> [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
> [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
> [  867.843154]  ? preempt_count_sub+0x1/0xa0
> [  867.843157]  preempt_count_sub+0x5/0xa0
> [  867.843159]  ? migrate_enable+0xac/0xf0
> [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
> [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
> ...
> [  867.843788]  preempt_count_sub+0x5/0xa0
[..]
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89
> 01 48
> [  867.845543] RSP: 002b:00007ffcf51a64e8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000141
> [  867.845546] RAX: ffffffffffffffda RBX: 00007ffcf51a65d0 RCX: 00007f8af60f8e29
> [  867.845547] RDX: 0000000000000030 RSI: 00007ffcf51a6500 RDI: 000000000000001c
> [  867.845549] RBP: 0000000000000018 R08: 0000000000000020 R09: 0000000000000000
> [  867.845550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
> [  867.845551] R13: 0000000000000006 R14: 0000000000922010 R15: 0000000000000000
> [  867.845561]  </TASK>
> 
> The reason is that we will call migrate_disable before entering bpf prog
> and call migrate_enable after bpf prog exits. In
> migrate_disable, preempt_count_{add,sub} will be called, so the bpf prog
> will end up with dead looping there. We can't avoid calling
> preempt_count_{add,sub} in this procedure, so we have to hide them
> from ftrace, then they can't be traced.
> 
> So I think we'd better fix it with below change,  what do you think ?

Sounds like a bug in BPF. ftrace has recursion protection (see
ftrace_test_recursion_trylock()).

> 
> ---
>  kernel/sched/core.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af017e0..b049a07 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5758,7 +5758,7 @@ static inline void preempt_latency_start(int val)
>   }
>  }
> 
> -void preempt_count_add(int val)
> +void notrace preempt_count_add(int val)
>  {
>  #ifdef CONFIG_DEBUG_PREEMPT
>   /*
> @@ -5778,7 +5778,6 @@ void preempt_count_add(int val)
>   preempt_latency_start(val);
>  }
>  EXPORT_SYMBOL(preempt_count_add);
> -NOKPROBE_SYMBOL(preempt_count_add);
> 
>  /*
>   * If the value passed in equals to the current preempt count
> @@ -5790,7 +5789,7 @@ static inline void preempt_latency_stop(int val)
>   trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
>  }
> 
> -void preempt_count_sub(int val)
> +void notrace preempt_count_sub(int val)

NACK!

I attach to these functions all the time, I'm not going to make them
hidden because one new user of ftrace is broken.

The fix is either to block bpf from attaching to these functions
(without stopping all other users that can safely trace it)

Or use something similar to the ftrace_test_recursion_trylock() that
prevents this and can be wrapped around the migrate_disable/enable()
callers.

Or maybe we can make migrate_disable() use preempt_disable_notrace() if
it's proven not to disable preemption for a long time.
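
(For reference, the usual recursion-guard pattern around an ftrace
callback looks roughly like the sketch below. This is a generic
illustration, not code from this thread, and where the
migrate_disable()/migrate_enable() pair would sit is an assumption.)

static void my_callback(unsigned long ip, unsigned long parent_ip,
			struct ftrace_ops *ops, struct ftrace_regs *fregs)
{
	int bit;

	/* Returns a negative value when this context on this CPU is
	 * already inside a callback, i.e. we are recursing. */
	bit = ftrace_test_recursion_trylock(ip, parent_ip);
	if (bit < 0)
		return;

	migrate_disable();
	/* ... run the handler / bpf program ... */
	migrate_enable();

	ftrace_test_recursion_unlock(bit);
}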

-- Steve


>  {
>  #ifdef CONFIG_DEBUG_PREEMPT
>   /*
> @@ -5810,7 +5809,6 @@ void preempt_count_sub(int val)
>   __preempt_count_sub(val);
>  }
>  EXPORT_SYMBOL(preempt_count_sub);
> -NOKPROBE_SYMBOL(preempt_count_sub);
> 
>  #else
>  static inline void preempt_latency_start(int val) { }
Yafang Shao April 9, 2023, 12:45 p.m. UTC | #11
On Sun, Apr 9, 2023 at 7:55 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 9 Apr 2023 13:32:12 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Hi Steven,
> >
> > When I was trying to attach fentry to preempt_count_{sub,add}, the
> > kernel just crashed. The crash info as follows,
> >
> > [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf
> > (stack is 0000000046a46a15..00000000537e7b28)
> > [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> > [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
> > [  867.843071] RIP: 0010:exc_int3+0x6/0xe0
> > [  867.843078] Code: e9 a6 fe ff ff e8 6a 3d 00 00 66 2e 0f 1f 84 00
> > 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89
> > e5 41 55 <41> 54 49 89 fc e8 10 11 00 00 85 c0 75 31 4c 89 e7 41 f6 84
> > 24 88
> > [  867.843080] RSP: 0018:ffffaaac44f1c000 EFLAGS: 00010093
> > [  867.843083] RAX: ffffaaac44f1c018 RBX: 0000000000000000 RCX: ffffffff98e0102d
> > [  867.843085] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffaaac44f1c018
> > [  867.843086] RBP: ffffaaac44f1c008 R08: 0000000000000000 R09: 0000000000000000
> > [  867.843087] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [  867.843089] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [  867.843092] FS:  00007f8af6fbe740(0000) GS:ffff96d77f800000(0000)
> > knlGS:0000000000000000
> > [  867.843094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  867.843096] CR2: ffffaaac44f1bff8 CR3: 0000000105a9c002 CR4: 0000000000770ee0
> > [  867.843097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  867.843098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  867.843099] PKRU: 55555554
> > [  867.843100] Call Trace:
> > [  867.843101]  <TASK>
> > [  867.843104]  asm_exc_int3+0x3a/0x40
> > [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
> > [  867.843112] Code: c7 c7 40 06 ff 9a 48 89 e5 e8 8b c6 1d 00 5d c3
> > cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > 90 90 cc <1f> 44 00 00 55 8b 0d 2c 60 f0 02 48 89 e5 85 c9 75 1b 65 8b
> > 05 4e
> > [  867.843113] RSP: 0018:ffffaaac44f1c0f0 EFLAGS: 00000002
> > [  867.843115] RAX: 0000000000000001 RBX: ffff96d77f82c380 RCX: 0000000000000000
> > [  867.843116] RDX: 0000000000000000 RSI: ffffffff9947d6fd RDI: 0000000000000001
> > [  867.843117] RBP: ffffaaac44f1c108 R08: 0000000000000020 R09: 0000000000000000
> > [  867.843118] R10: 0000000000000000 R11: 0000000040000000 R12: ffff96c886c3c000
> > [  867.843119] R13: 0000000000000009 R14: ffff96c880050000 R15: ffff96c8800504b8
> > [  867.843128]  ? preempt_count_sub+0x1/0xa0
> > [  867.843131]  ? migrate_disable+0x77/0x90
> > [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
> > [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
> > [  867.843154]  ? preempt_count_sub+0x1/0xa0
> > [  867.843157]  preempt_count_sub+0x5/0xa0
> > [  867.843159]  ? migrate_enable+0xac/0xf0
> > [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
> > [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
> > ...
> > [  867.843788]  preempt_count_sub+0x5/0xa0
> [..]
> > 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> > 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89
> > 01 48
> > [  867.845543] RSP: 002b:00007ffcf51a64e8 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000141
> > [  867.845546] RAX: ffffffffffffffda RBX: 00007ffcf51a65d0 RCX: 00007f8af60f8e29
> > [  867.845547] RDX: 0000000000000030 RSI: 00007ffcf51a6500 RDI: 000000000000001c
> > [  867.845549] RBP: 0000000000000018 R08: 0000000000000020 R09: 0000000000000000
> > [  867.845550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
> > [  867.845551] R13: 0000000000000006 R14: 0000000000922010 R15: 0000000000000000
> > [  867.845561]  </TASK>
> >
> > The reason is that we will call migrate_disable before entering bpf prog
> > and call migrate_enable after bpf prog exits. In
> > migrate_disable, preempt_count_{add,sub} will be called, so the bpf prog
> > will end up with dead looping there. We can't avoid calling
> > preempt_count_{add,sub} in this procedure, so we have to hide them
> > from ftrace, then they can't be traced.
> >
> > So I think we'd better fix it with below change,  what do you think ?
>
> Sounds like a bug in BPF. ftrace has recursion protection (see
> ftrace_test_recursion_trylock()).
>

The bpf trampoline (a.k.a. fentry) uses bpf_prog->active to avoid
recursion, but migrate_disable() is called before bpf_prog->active is
checked, because bpf_prog->active is a percpu value and migration must be
disabled before it can be safely accessed.
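
(Roughly, the enter path looks like the sketch below, which is why the
->active check comes too late. This is simplified from memory, not
verbatim kernel source.)

/* Simplified sketch of the trampoline enter path: migrate_disable()
 * runs before prog->active is checked, so a program attached to
 * preempt_count_{add,sub} recurses before the per-CPU guard can trip. */
static u64 notrace bpf_prog_enter_sketch(struct bpf_prog *prog)
{
	rcu_read_lock();
	migrate_disable();		/* -> preempt_count_add()/_sub() */

	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1))
		return 0;		/* recursion guard, but it fires too late */

	return 1;			/* non-zero means: run the bpf prog */
}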

> >
> > ---
> >  kernel/sched/core.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index af017e0..b049a07 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5758,7 +5758,7 @@ static inline void preempt_latency_start(int val)
> >   }
> >  }
> >
> > -void preempt_count_add(int val)
> > +void notrace preempt_count_add(int val)
> >  {
> >  #ifdef CONFIG_DEBUG_PREEMPT
> >   /*
> > @@ -5778,7 +5778,6 @@ void preempt_count_add(int val)
> >   preempt_latency_start(val);
> >  }
> >  EXPORT_SYMBOL(preempt_count_add);
> > -NOKPROBE_SYMBOL(preempt_count_add);
> >
> >  /*
> >   * If the value passed in equals to the current preempt count
> > @@ -5790,7 +5789,7 @@ static inline void preempt_latency_stop(int val)
> >   trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> >  }
> >
> > -void preempt_count_sub(int val)
> > +void notrace preempt_count_sub(int val)
>
> NACK!
>
> I attach to these functions all the time, I'm not going to make them
> hidden because one new user of ftrace is broken.
>

Understood.

> The fix is either to block bpf from attaching to these functions
> (without stopping all other users that can safely trace it)
>
> Or use something similar to the ftrace_test_recursion_trylock() that
> prevents this and can be wrapped around the migrate_disable/enable()
> callers.
>
> Or maybe we can make migrate_disable() use preempt_disable_notrace() if
> it's proven not to take disable preemption for a long time.
>

I didn't notice preempt_{disable,enable}_notrace before :(
Seems that is a better solution. I have verified that it really works.
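
(The change verified here is not included in this thread; presumably it
is something along the lines of the sketch below, inside
migrate_disable()/migrate_enable() in kernel/sched/core.c.)

	/* Hedged sketch only -- the actual verified patch is not shown
	 * in this thread. The idea from Steve's suggestion is to switch
	 * the preempt count updates in migrate_disable()/migrate_enable()
	 * to the notrace variants: */
	preempt_disable_notrace();	/* instead of preempt_disable() */
	/* ... pin/unpin the task to the current CPU ... */
	preempt_enable_notrace();	/* instead of preempt_enable() */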

BTW, why can we attach fentry to preempt_count_sub(), but can't attach
it to migrate_disable() ?
migrate_disable() isn't hidden from ftrace either...
Masami Hiramatsu (Google) April 9, 2023, 1:54 p.m. UTC | #12
On Sun, 9 Apr 2023 20:45:39 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> On Sun, Apr 9, 2023 at 7:55 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Sun, 9 Apr 2023 13:32:12 +0800
> > Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Hi Steven,
> > >
> > > When I was trying to attach fentry to preempt_count_{sub,add}, the
> > > kernel just crashed. The crash info as follows,
> > >
> > > [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf
> > > (stack is 0000000046a46a15..00000000537e7b28)
> > > [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> > > [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
> > > [  867.843071] RIP: 0010:exc_int3+0x6/0xe0
> > > [  867.843078] Code: e9 a6 fe ff ff e8 6a 3d 00 00 66 2e 0f 1f 84 00
> > > 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89
> > > e5 41 55 <41> 54 49 89 fc e8 10 11 00 00 85 c0 75 31 4c 89 e7 41 f6 84
> > > 24 88
> > > [  867.843080] RSP: 0018:ffffaaac44f1c000 EFLAGS: 00010093
> > > [  867.843083] RAX: ffffaaac44f1c018 RBX: 0000000000000000 RCX: ffffffff98e0102d
> > > [  867.843085] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffaaac44f1c018
> > > [  867.843086] RBP: ffffaaac44f1c008 R08: 0000000000000000 R09: 0000000000000000
> > > [  867.843087] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > [  867.843089] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > [  867.843092] FS:  00007f8af6fbe740(0000) GS:ffff96d77f800000(0000)
> > > knlGS:0000000000000000
> > > [  867.843094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  867.843096] CR2: ffffaaac44f1bff8 CR3: 0000000105a9c002 CR4: 0000000000770ee0
> > > [  867.843097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  867.843098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  867.843099] PKRU: 55555554
> > > [  867.843100] Call Trace:
> > > [  867.843101]  <TASK>
> > > [  867.843104]  asm_exc_int3+0x3a/0x40
> > > [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
> > > [  867.843112] Code: c7 c7 40 06 ff 9a 48 89 e5 e8 8b c6 1d 00 5d c3
> > > cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > > 90 90 cc <1f> 44 00 00 55 8b 0d 2c 60 f0 02 48 89 e5 85 c9 75 1b 65 8b
> > > 05 4e
> > > [  867.843113] RSP: 0018:ffffaaac44f1c0f0 EFLAGS: 00000002
> > > [  867.843115] RAX: 0000000000000001 RBX: ffff96d77f82c380 RCX: 0000000000000000
> > > [  867.843116] RDX: 0000000000000000 RSI: ffffffff9947d6fd RDI: 0000000000000001
> > > [  867.843117] RBP: ffffaaac44f1c108 R08: 0000000000000020 R09: 0000000000000000
> > > [  867.843118] R10: 0000000000000000 R11: 0000000040000000 R12: ffff96c886c3c000
> > > [  867.843119] R13: 0000000000000009 R14: ffff96c880050000 R15: ffff96c8800504b8
> > > [  867.843128]  ? preempt_count_sub+0x1/0xa0
> > > [  867.843131]  ? migrate_disable+0x77/0x90
> > > [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
> > > [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
> > > [  867.843154]  ? preempt_count_sub+0x1/0xa0
> > > [  867.843157]  preempt_count_sub+0x5/0xa0
> > > [  867.843159]  ? migrate_enable+0xac/0xf0
> > > [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
> > > [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
> > > ...
> > > [  867.843788]  preempt_count_sub+0x5/0xa0
> > [..]
> > > 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> > > 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89
> > > 01 48
> > > [  867.845543] RSP: 002b:00007ffcf51a64e8 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000141
> > > [  867.845546] RAX: ffffffffffffffda RBX: 00007ffcf51a65d0 RCX: 00007f8af60f8e29
> > > [  867.845547] RDX: 0000000000000030 RSI: 00007ffcf51a6500 RDI: 000000000000001c
> > > [  867.845549] RBP: 0000000000000018 R08: 0000000000000020 R09: 0000000000000000
> > > [  867.845550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
> > > [  867.845551] R13: 0000000000000006 R14: 0000000000922010 R15: 0000000000000000
> > > [  867.845561]  </TASK>
> > >
> > > The reason is that we will call migrate_disable before entering bpf prog
> > > and call migrate_enable after bpf prog exits. In
> > > migrate_disable, preempt_count_{add,sub} will be called, so the bpf prog
> > > will end up with dead looping there. We can't avoid calling
> > > preempt_count_{add,sub} in this procedure, so we have to hide them
> > > from ftrace, then they can't be traced.
> > >
> > > So I think we'd better fix it with below change,  what do you think ?
> >
> > Sounds like a bug in BPF. ftrace has recursion protection (see
> > ftrace_test_recursion_trylock()).
> >
> 
> bpf trampoline (AKA. fentry) uses bpf_prog->active to avoid the
> recursion, but migrate_disable() is called before checking
> bpf_prog->active, because bpf_prog->active is a percpu value.

You can use ftrace_test_recursion_trylock() before using migrate_disable()
if you ensure the callback is only for fentry. Can you prepare a
fentry-specific callback?

Thank you,
Yafang Shao April 9, 2023, 2:44 p.m. UTC | #13
On Sun, Apr 9, 2023 at 9:54 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Sun, 9 Apr 2023 20:45:39 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > On Sun, Apr 9, 2023 at 7:55 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Sun, 9 Apr 2023 13:32:12 +0800
> > > Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Hi Steven,
> > > >
> > > > When I was trying to attach fentry to preempt_count_{sub,add}, the
> > > > kernel just crashed. The crash info as follows,
> > > >
> > > > [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf
> > > > (stack is 0000000046a46a15..00000000537e7b28)
> > > > [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> > > > [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
> > > > [  867.843071] RIP: 0010:exc_int3+0x6/0xe0
> > > > [  867.843078] Code: e9 a6 fe ff ff e8 6a 3d 00 00 66 2e 0f 1f 84 00
> > > > 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89
> > > > e5 41 55 <41> 54 49 89 fc e8 10 11 00 00 85 c0 75 31 4c 89 e7 41 f6 84
> > > > 24 88
> > > > [  867.843080] RSP: 0018:ffffaaac44f1c000 EFLAGS: 00010093
> > > > [  867.843083] RAX: ffffaaac44f1c018 RBX: 0000000000000000 RCX: ffffffff98e0102d
> > > > [  867.843085] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffaaac44f1c018
> > > > [  867.843086] RBP: ffffaaac44f1c008 R08: 0000000000000000 R09: 0000000000000000
> > > > [  867.843087] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > [  867.843089] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > [  867.843092] FS:  00007f8af6fbe740(0000) GS:ffff96d77f800000(0000)
> > > > knlGS:0000000000000000
> > > > [  867.843094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  867.843096] CR2: ffffaaac44f1bff8 CR3: 0000000105a9c002 CR4: 0000000000770ee0
> > > > [  867.843097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [  867.843098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [  867.843099] PKRU: 55555554
> > > > [  867.843100] Call Trace:
> > > > [  867.843101]  <TASK>
> > > > [  867.843104]  asm_exc_int3+0x3a/0x40
> > > > [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
> > > > [  867.843112] Code: c7 c7 40 06 ff 9a 48 89 e5 e8 8b c6 1d 00 5d c3
> > > > cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > > > 90 90 cc <1f> 44 00 00 55 8b 0d 2c 60 f0 02 48 89 e5 85 c9 75 1b 65 8b
> > > > 05 4e
> > > > [  867.843113] RSP: 0018:ffffaaac44f1c0f0 EFLAGS: 00000002
> > > > [  867.843115] RAX: 0000000000000001 RBX: ffff96d77f82c380 RCX: 0000000000000000
> > > > [  867.843116] RDX: 0000000000000000 RSI: ffffffff9947d6fd RDI: 0000000000000001
> > > > [  867.843117] RBP: ffffaaac44f1c108 R08: 0000000000000020 R09: 0000000000000000
> > > > [  867.843118] R10: 0000000000000000 R11: 0000000040000000 R12: ffff96c886c3c000
> > > > [  867.843119] R13: 0000000000000009 R14: ffff96c880050000 R15: ffff96c8800504b8
> > > > [  867.843128]  ? preempt_count_sub+0x1/0xa0
> > > > [  867.843131]  ? migrate_disable+0x77/0x90
> > > > [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
> > > > [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
> > > > [  867.843154]  ? preempt_count_sub+0x1/0xa0
> > > > [  867.843157]  preempt_count_sub+0x5/0xa0
> > > > [  867.843159]  ? migrate_enable+0xac/0xf0
> > > > [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
> > > > [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
> > > > ...
> > > > [  867.843788]  preempt_count_sub+0x5/0xa0
> > > [..]
> > > > 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> > > > 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89
> > > > 01 48
> > > > [  867.845543] RSP: 002b:00007ffcf51a64e8 EFLAGS: 00000246 ORIG_RAX:
> > > > 0000000000000141
> > > > [  867.845546] RAX: ffffffffffffffda RBX: 00007ffcf51a65d0 RCX: 00007f8af60f8e29
> > > > [  867.845547] RDX: 0000000000000030 RSI: 00007ffcf51a6500 RDI: 000000000000001c
> > > > [  867.845549] RBP: 0000000000000018 R08: 0000000000000020 R09: 0000000000000000
> > > > [  867.845550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
> > > > [  867.845551] R13: 0000000000000006 R14: 0000000000922010 R15: 0000000000000000
> > > > [  867.845561]  </TASK>
> > > >
> > > > The reason is that we call migrate_disable before entering the bpf prog
> > > > and call migrate_enable after the bpf prog exits. In
> > > > migrate_disable, preempt_count_{add,sub} will be called, so the bpf prog
> > > > will end up in endless recursion there. We can't avoid calling
> > > > preempt_count_{add,sub} in this procedure, so we have to hide them
> > > > from ftrace so that they can't be traced.
> > > >
> > > > So I think we'd better fix it with the change below. What do you think?
> > >
> > > Sounds like a bug in BPF. ftrace has recursion protection (see
> > > ftrace_test_recursion_trylock()).
> > >
> >
> > The bpf trampoline (a.k.a. fentry) uses bpf_prog->active to avoid
> > recursion, but migrate_disable() is called before checking
> > bpf_prog->active, because bpf_prog->active is a percpu value and must
> > be read with migration disabled.
>
> You can use ftrace_test_recursion_trylock() before using migrate_disable()
> if you ensure the callback is only for fentry. Can you prepare a fentry
> specific callback?
>

fentry uses both ftrace-based and text-poke-based probes.
ftrace_test_recursion_trylock() can avoid the recursion in the
ftrace-based probe, but I am not sure if we can split bpf_trampoline_enter
into two callbacks, one for ftrace and another for text poke. I will
analyze it. Thanks for your suggestion.
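
For reference, the enter path being discussed looks roughly like the
following (a simplified sketch of __bpf_prog_enter() as of v6.2, with the
run-ctx bookkeeping trimmed; not the exact upstream code):

	u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
	{
		rcu_read_lock();
		migrate_disable();	/* calls preempt_count_{add,sub} in this config */

		/* prog->active is a percpu counter, read with migration disabled */
		if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
			/* recursion detected, but migrate_disable() already ran */
			return 0;
		}
		return bpf_prog_start_time();
	}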
Steven Rostedt April 9, 2023, 5:25 p.m. UTC | #14
On Sun, 9 Apr 2023 20:45:39 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:
> 
> I didn't notice preempt_{disable,enable}_notrace before :(
> Seems that is a better solution. I have verified that it really works.

Great.

> 
> BTW, why can we attach fentry to preempt_count_sub(), but can't attach
> it to migrate_disable() ?
> migrate_disable() isn't hidden from ftrace either...

You can't?

 # trace-cmd -p function -l migrate_disable
 # trace-cmd show
          <idle>-0       [000] ..s2. 153664.937829: migrate_disable <-bpf_prog_run_clear_cb
          <idle>-0       [000] ..s2. 153664.937834: migrate_disable <-bpf_prog_run_clear_cb
          <idle>-0       [000] ..s2. 153664.937835: migrate_disable <-bpf_prog_run_clear_cb
          <idle>-0       [000] ..s2. 153665.769555: migrate_disable <-bpf_prog_run_clear_cb
          <idle>-0       [000] ..s2. 153665.772109: migrate_disable <-bpf_prog_run_clear_cb

I think bpf prevents it. Perhaps that's another solution:

See 35e3815fa8102 ("bpf: Add deny list of btf ids check for tracing programs")

-- Steve
Steven Rostedt April 10, 2023, 2:02 a.m. UTC | #15
On Sun, 9 Apr 2023 22:44:37 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> > You can use ftrace_test_recursion_trylock() before using migrate_disable()
> > if you ensure the callback is only for fentry. Can you prepare a fentry
> > specific callback?
> >  
> 
> fentry uses both ftrace-based probe and text-poke-based probe.
> ftrace_test_recursion_trylock() can avoid the recursion in the
> ftrace-based probe. Not sure if we can split bpf_trampoline_enter into
> two callbacks, one for ftrace and another for text poke. I will
> analyze it. Thanks for your suggestion.

ftrace_test_recursion_trylock() works with anything. What it basically
does is make sure that you do not pass through the same location twice
in the same context.

That is, it sets a specific bit for the context it is in (normal,
softirq, irq or NMI). If you call it again in the same context, it will
fail (recursion detected).

That is,

  normal:
    ftrace_test_recursion_trylock(); <- OK

    softirq:
       ftrace_test_recursion_trylock() <- OK

       hard-irq:
          ftrace_test_recursion_trylock() <- OK
          [ recursion happens ]
          ftrace_test_recursion_trylock() <- FAIL!

That's because it is perfectly fine for the same code path to enter
ftrace code in different contexts, but if it happens in the same
context, that has to be due to something calling back into itself.

Thus if you had:

	bit = ftrace_test_recursion_trylock();
	if (bit < 0)
		return;
	migrate_disable();
	ftrace_test_recursion_unlock(bit);

It would prevent migrate_disable from recursing.

    bpf_func()
	bit = ftrace_test_recursion_trylock() <<- returns 1
	migrate_disable()
	    preempt_count_add()
		bpf_func()
		    bit = ftrace_test_recursion_trylock() <<- returns -1
		    if (bit < 0)
			return;

It would prevent the crash.

-- Steve
Yafang Shao April 10, 2023, 4:28 a.m. UTC | #16
On Mon, Apr 10, 2023 at 1:25 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 9 Apr 2023 20:45:39 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > I didn't notice preempt_{disable,enable}_notrace before :(
> > Seems that is a better solution. I have verified that it really works.
>
> Great.
>
> >
> > BTW, why can we attach fentry to preempt_count_sub(), but can't attach
> > it to migrate_disable() ?
> > migrate_disable() isn't hidden from ftrace either...
>
> You can't?
>
>  # trace-cmd -p function -l migrate_disable
>  # trace-cmd show
>           <idle>-0       [000] ..s2. 153664.937829: migrate_disable <-bpf_prog_run_clear_cb
>           <idle>-0       [000] ..s2. 153664.937834: migrate_disable <-bpf_prog_run_clear_cb
>           <idle>-0       [000] ..s2. 153664.937835: migrate_disable <-bpf_prog_run_clear_cb
>           <idle>-0       [000] ..s2. 153665.769555: migrate_disable <-bpf_prog_run_clear_cb
>           <idle>-0       [000] ..s2. 153665.772109: migrate_disable <-bpf_prog_run_clear_cb
>
> I think bpf prevents it. Perhaps that's another solution:
>
> See 35e3815fa8102 ("bpf: Add deny list of btf ids check for tracing programs")
>

Right, it is because of the BTF_ID(). I think we should introduce
migrate_{disable,enable}_notrace instead, because we only want to
prevent attaching to the migrate_{disable,enable} calls used on the
bpf prog enter and exit path, while attaching at other call sites is safe.
Yafang Shao April 10, 2023, 5:36 a.m. UTC | #17
On Mon, Apr 10, 2023 at 10:02 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 9 Apr 2023 22:44:37 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > > You can use ftrace_test_recursion_trylock() before using migrate_disable()
> > > if you ensure the callback is only for fentry. Can you prepare a fentry
> > > specific callback?
> > >
> >
> > fentry uses both ftrace-based probe and text-poke-based probe.
> > ftrace_test_recursion_trylock() can avoid the recursion in the
> > ftrace-based probe. Not sure if we can split bpf_trampoline_enter into
> > two callbacks, one for ftrace and another for text poke. I will
> > analyze it. Thanks for your suggestion.
>
> ftrace_test_recusion_trylock() works with anything. What it basically
> is doing is to make sure that you do not pass that same location in the
> same context.
>
> That is, it sets a specific bit to which context it is in (normal,
> softirq, irq or NMI). If you call it again in the same context, it will
> return false (recursion detected).
>
> That is,
>
>   normal:
>     ftrace_test_recursion_trylock(); <- OK
>
>     softirq:
>        ftrace_test_recursion_trylock() <- OK
>
>        hard-irq:
>           ftrace_test_recusion_trylock() <- OK
>           [ recusion happens ]
>           ftrace_test_recursion_trylock() <- FAIL!
>
> That's because it is perfectly fine for the same code path to enter
> ftrace code in different contexts, but if it happens in the same
> context, that has to be due something calling back into itself.
>
> Thus if you had:
>
>         bit = ftrace_test_recursion_trylock();
>         if (bit < 0)
>                 return;
>         migrate_disable();
>         ftrace_test_test_recursion_unlock(bit);
>
> It would prevent migrate_disable from recursing.
>
>     bpf_func()
>         bit = ftrace_test_recursion_trylock() <<- returns 1
>         migrate_disable()
>             preempt_count_add()
>                 bpf_func()
>                     bit = ftrace_test_recusion_trylock() <<- returns -1
>                     if (bit < 0)
>                         return;
>
> It would prevent the crash.
>

Many thanks for the detailed explanation.
I think ftrace_test_recursion_trylock() can't be applied to migrate_enable().
If we change as follows to prevent migrate_enable() from recursing,

         bit = ftrace_test_recursion_trylock();
         if (bit < 0)
                 return;
         migrate_enable();
         ftrace_test_recursion_unlock(bit);

We have called migrate_disable() before, so if we don't call
migrate_enable() it will cause other issues.
Steven Rostedt April 10, 2023, 10:30 a.m. UTC | #18
On Mon, 10 Apr 2023 13:36:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> Many thanks for the detailed explanation.
> I think ftrace_test_recursion_trylock() can't apply to migreate_enable().
> If we change as follows to prevent migrate_enable() from recursing,
> 
>          bit = ftrace_test_recursion_trylock();
>          if (bit < 0)
>                  return;
>          migrate_enable();
>          ftrace_test_recursion_unlock(bit);
> 
> We have called migrate_disable() before, so if we don't call
> migrate_enable() it will cause other issues.

Right. Because you called migrate_disable() before (and protected it
with ftrace_test_recursion_trylock()), the second call is guaranteed
to succeed!

[1]	bit = ftrace_test_recursion_trylock();
	if (bit < 0)
		return;
	migrate_disable();
	ftrace_test_recursion_unlock(bit);

	[..]

[2]	ftrace_test_recursion_trylock();
	migrate_enable();
	ftrace_test_recursion_unlock(bit);

You don't even need to read the bit again, because it will be the same
as the first call [1]. That's because it returns the recursion level
you are in. A function will have the same recursion level throughout
the function (as long as it always calls ftrace_test_recursion_unlock()
in between).

	bpf_func()
	    bit = ftrace_test_recursion_trylock(); <-- OK
	    migrate_disable();
	    ftrace_test_recursion_unlock(bit);

	    [..]

	    ftrace_test_recursion_trylock(); <<-- guaranteed to be OK
	    migrate_enable() <<<-- gets traced

		bpf_func()
		    bit = ftrace_test_recursion_trylock() <-- FAILED
		    if (bit < 0)
			return;

	    ftrace_test_recursion_unlock(bit);


See, still works!

The migrate_enable() will never be called without the migrate_disable()
as the migrate_disable() only gets called when not being recursed.

Note, the ftrace_test_recursion_*() code needs to be updated because it
currently disables preemption, which it doesn't have to do, and that
can cause migrate_disable() to do something different. It only disables
preemption because there was a time when it needed to, but it no longer
does. The users of it will need to be audited, though, to make sure they
don't rely on the side effect of it disabling preemption.

-- Steve
Yafang Shao April 10, 2023, 1:56 p.m. UTC | #19
On Mon, Apr 10, 2023 at 6:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 10 Apr 2023 13:36:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Many thanks for the detailed explanation.
> > I think ftrace_test_recursion_trylock() can't apply to migreate_enable().
> > If we change as follows to prevent migrate_enable() from recursing,
> >
> >          bit = ftrace_test_recursion_trylock();
> >          if (bit < 0)
> >                  return;
> >          migrate_enable();
> >          ftrace_test_recursion_unlock(bit);
> >
> > We have called migrate_disable() before, so if we don't call
> > migrate_enable() it will cause other issues.
>
> Right. Because you called migrate_disable() before (and protected it
> with the ftrace_test_recursion_trylock(), the second call is guaranteed
> to succeed!
>
> [1]     bit = ftrace_test_recursion_trylock();
>         if (bit < 0)
>                 return;
>         migrate_disable();
>         ftrace_test_recursion_trylock(bit);
>
>         [..]
>
> [2]     ftrace_test_recursion_trylock();
>         migrate_enable();
>         ftrace_test_recursion_trylock(bit);
>
> You don't even need to read the bit again, because it will be the same
> as the first call [1]. That's because it returns the recursion level
> you are in. A function will have the same recursion level through out
> the function (as long as it always calls ftrace_test_recursion_unlock()
> between them).
>
>         bpf_func()
>             bit = ftrace_test_recursion_trylock(); <-- OK
>             migrate_disable();
>             ftrace_test_recursion_unlock(bit);
>
>             [..]
>
>             ftrace_test_recursion_trylock(); <<-- guaranteed to be OK
>             migrate_enable() <<<-- gets traced
>
>                 bpf_func()
>                     bit = ftrace_test_recursion_trylock() <-- FAILED
>                     if (bit < 0)
>                         return;
>
>             ftrace_test_recursion_unlock(bit);
>
>
> See, still works!
>
> The migrate_enable() will never be called without the migrate_disable()
> as the migrate_disable() only gets called when not being recursed.
>

Thanks for your explanation again.
The BPF trampoline is a little special. It consists of three parts, as follows:

    ret = __bpf_prog_enter();
    if (ret)
        prog->bpf_func();
     __bpf_prog_exit();

migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
in __bpf_prog_exit():

    ret = __bpf_prog_enter();
                migrate_disable();
    if (ret)
        prog->bpf_func();
     __bpf_prog_exit();
          migrate_enable();

That is, if we haven't executed migrate_disable() in
__bpf_prog_enter(), we must not execute migrate_enable() in
__bpf_prog_exit().
Can ftrace_test_recursion_trylock() be applied to this pattern?

> Note, the ftrace_test_recursion_*() code needs to be updated because it
> currently does disable preemption, which it doesn't have to. And that
> can cause migrate_disable() to do something different. It only disabled
> preemption, as there was a time that it needed to, but now it doesn't.
> But the users of it will need to be audited to make sure that they
> don't need the side effect of it disabling preemption.
>

Disabling preemption is not expected by bpf progs, so I think we should
change it.
Steven Rostedt April 10, 2023, 2:12 p.m. UTC | #20
On Mon, 10 Apr 2023 21:56:16 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> Thanks for your explanation again.
> BPF trampoline is a little special. It includes three parts, as follows,
> 
>     ret = __bpf_prog_enter();
>     if (ret)
>         prog->bpf_func();
>      __bpf_prog_exit();
> 
> migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
> in __bpf_prog_exit():
> 
>     ret = __bpf_prog_enter();
>                 migrate_disable();
>     if (ret)
>         prog->bpf_func();
>      __bpf_prog_exit();
>           migrate_enable();
> 
> That said, if we haven't executed migrate_disable() in
> __bpf_prog_enter(), we shouldn't execute migrate_enable() in
> __bpf_prog_exit().
> Can ftrace_test_recursion_trylock() be applied to this pattern ?

Yes, it can! And in this case you would need to not call migrate_enable(),
because if the ftrace_test_recursion_trylock() failed, it would prevent
migrate_disable() from being called (and should not let bpf_func() be
called either). Then __bpf_prog_exit() would need to know not to call
migrate_enable(), which it can tell by checking the return value of
ftrace_test_recursion_trylock(): it will give the same value as the one
taken before migrate_disable().


> 
> > Note, the ftrace_test_recursion_*() code needs to be updated because it
> > currently does disable preemption, which it doesn't have to. And that
> > can cause migrate_disable() to do something different. It only disabled
> > preemption, as there was a time that it needed to, but now it doesn't.
> > But the users of it will need to be audited to make sure that they
> > don't need the side effect of it disabling preemption.
> >  
> 
> disabling preemption is not expected by bpf prog, so I think we should
> change it.

The disabling of preemption was just done because every place that used it
happened to also disable preemption. So it was just a clean up, not a
requirement. Although the documentation said it did disable preemption :-/

 See ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")

I think I can add a ftrace_test_recursion_try_acquire() and release() that
does the same thing without disabling preemption. That way, we don't need
to revert that patch, and can use that instead.

-- Steve
Yafang Shao April 10, 2023, 2:20 p.m. UTC | #21
On Mon, Apr 10, 2023 at 10:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 10 Apr 2023 21:56:16 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Thanks for your explanation again.
> > BPF trampoline is a little special. It includes three parts, as follows,
> >
> >     ret = __bpf_prog_enter();
> >     if (ret)
> >         prog->bpf_func();
> >      __bpf_prog_exit();
> >
> > migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
> > in __bpf_prog_exit():
> >
> >     ret = __bpf_prog_enter();
> >                 migrate_disable();
> >     if (ret)
> >         prog->bpf_func();
> >      __bpf_prog_exit();
> >           migrate_enable();
> >
> > That said, if we haven't executed migrate_disable() in
> > __bpf_prog_enter(), we shouldn't execute migrate_enable() in
> > __bpf_prog_exit().
> > Can ftrace_test_recursion_trylock() be applied to this pattern ?
>
> Yes, it can! And in this you would need to not call migrate_enable()
> because if the trace_recursion_trylock() failed, it would prevent
> migrate_disable() from being called (and should not let the bpf_func() from
> being called either. And then the migrate_enable in __bpf_prog_exit() would
> need to know not to call migrate_enable() which checking the return value
> of ftrace_test_recursion_trylock() would give the same value as what the
> one before migrate_disable() had.
>

That needs some changes in invoke_bpf_prog() in
arch/${ARCH}/net/bpf_jit_comp.c for each architecture,
but I will give it a try. We can then remove bpf_prog->active, which
will be a good cleanup as well.

>
> >
> > > Note, the ftrace_test_recursion_*() code needs to be updated because it
> > > currently does disable preemption, which it doesn't have to. And that
> > > can cause migrate_disable() to do something different. It only disabled
> > > preemption, as there was a time that it needed to, but now it doesn't.
> > > But the users of it will need to be audited to make sure that they
> > > don't need the side effect of it disabling preemption.
> > >
> >
> > disabling preemption is not expected by bpf prog, so I think we should
> > change it.
>
> The disabling of preemption was just done because every place that used it
> happened to also disable preemption. So it was just a clean up, not a
> requirement. Although the documentation said it did disable preemption :-/
>
>  See ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")
>
> I think I can add a ftrace_test_recursion_try_aquire() and release() that
> is does the same thing without preemption. That way, we don't need to
> revert that patch, and use that instead.
>
> -- Steve
Steven Rostedt April 10, 2023, 2:31 p.m. UTC | #22
On Mon, 10 Apr 2023 10:12:24 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The disabling of preemption was just done because every place that used it
> happened to also disable preemption. So it was just a clean up, not a
> requirement. Although the documentation said it did disable preemption :-/
> 
>  See ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")
> 
> I think I can add a ftrace_test_recursion_try_aquire() and release() that
> is does the same thing without preemption. That way, we don't need to
> revert that patch, and use that instead.

This patch adds:

   test_recursion_try_acquire() and test_recursion_release()

I called them "acquire" and "release" so that "lock" can keep implying that
preemption is disabled, while this pair only protects against recursion (and
I removed "ftrace" from the name, as it is meant for non-ftrace uses, to
which I may give a different set of recursion bits).

Note, the reason to pass in ip and parent_ip (_THIS_IP_ and _RET_IP_) is
for debugging purposes. They *should* be optimized out, as everything is
__always_inline or macros, and those are only used if
CONFIG_FTRACE_RECORD_RECURSION is enabled.

-- Steve


diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..80de2ee7b4c3 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -150,9 +150,6 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define trace_warn_on_no_rcu(ip)	false
 #endif
 
-/*
- * Preemption is promised to be disabled when return bit >= 0.
- */
 static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
 							int start)
 {
@@ -182,18 +179,11 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
 	val |= 1 << bit;
 	current->trace_recursion = val;
 	barrier();
-
-	preempt_disable_notrace();
-
 	return bit;
 }
 
-/*
- * Preemption will be enabled (if it was previously enabled).
- */
 static __always_inline void trace_clear_recursion(int bit)
 {
-	preempt_enable_notrace();
 	barrier();
 	trace_recursion_clear(bit);
 }
@@ -205,12 +195,18 @@ static __always_inline void trace_clear_recursion(int bit)
  * tracing recursed in the same context (normal vs interrupt),
  *
  * Returns: -1 if a recursion happened.
- *           >= 0 if no recursion.
+ *           >= 0 if no recursion and preemption will be disabled.
  */
 static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
 							 unsigned long parent_ip)
 {
-	return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
+	int bit;
+
+	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
+	if (unlikely(bit < 0))
+		return -1;
+	preempt_disable_notrace();
+	return bit;
 }
 
 /**
@@ -220,6 +216,33 @@ static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
  * This is used at the end of a ftrace callback.
  */
 static __always_inline void ftrace_test_recursion_unlock(int bit)
+{
+	preempt_enable_notrace();
+	trace_clear_recursion(bit);
+}
+
+/**
+ * test_recursion_try_acquire - tests for recursion in same context
+ *
+ * This will detect recursion of a function.
+ *
+ * Returns: -1 if a recursion happened.
+ *           >= 0 if no recursion
+ */
+static __always_inline int test_recursion_try_acquire(unsigned long ip,
+						      unsigned long parent_ip)
+{
+	return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
+}
+
+/**
+ * test_recursion_release - called after a success of test_recursion_try_acquire()
+ * @bit: The return of a successful test_recursion_try_acquire()
+ *
+ * This releases the recursion lock taken by a non-negative return call
+ * by test_recursion_try_acquire().
+ */
+static __always_inline void test_recursion_release(int bit)
 {
 	trace_clear_recursion(bit);
 }
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 0feea145bb29..ff8172ba48b0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7644,6 +7644,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 	if (bit < 0)
 		return;
 
+	preempt_disable();
 	do_for_each_ftrace_op(op, ftrace_ops_list) {
 		/* Stub functions don't need to be called nor tested */
 		if (op->flags & FTRACE_OPS_FL_STUB)
@@ -7665,6 +7666,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 		}
 	} while_for_each_ftrace_op(op);
 out:
+	preempt_enable();
 	trace_clear_recursion(bit);
 }
Jiri Olsa April 10, 2023, 9:35 p.m. UTC | #23
On Mon, Apr 10, 2023 at 10:20:31PM +0800, Yafang Shao wrote:
> On Mon, Apr 10, 2023 at 10:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Mon, 10 Apr 2023 21:56:16 +0800
> > Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > > Thanks for your explanation again.
> > > BPF trampoline is a little special. It includes three parts, as follows,
> > >
> > >     ret = __bpf_prog_enter();
> > >     if (ret)
> > >         prog->bpf_func();
> > >      __bpf_prog_exit();
> > >
> > > migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
> > > in __bpf_prog_exit():
> > >
> > >     ret = __bpf_prog_enter();
> > >                 migrate_disable();
> > >     if (ret)
> > >         prog->bpf_func();
> > >      __bpf_prog_exit();
> > >           migrate_enable();
> > >
> > > That said, if we haven't executed migrate_disable() in
> > > __bpf_prog_enter(), we shouldn't execute migrate_enable() in
> > > __bpf_prog_exit().
> > > Can ftrace_test_recursion_trylock() be applied to this pattern ?
> >
> > Yes, it can! And in this you would need to not call migrate_enable()
> > because if the trace_recursion_trylock() failed, it would prevent
> > migrate_disable() from being called (and should not let the bpf_func() from
> > being called either. And then the migrate_enable in __bpf_prog_exit() would
> > need to know not to call migrate_enable() which checking the return value
> > of ftrace_test_recursion_trylock() would give the same value as what the
> > one before migrate_disable() had.
> >
> 
> That needs some changes in invoke_bpf_prog() in files
> arch/${ARCH}/net/bpf_jit_comp.c.
> But I will have a try. We can then remove the bpf_prog->active, that
> will be a good cleanup as well.

I was wondering if it's worth the effort to do that just to be able to attach
a bpf prog to preempt_count_add/sub, and was going to suggest adding them to
btf_id_deny, as Steven pointed out earlier, as a possible solution,

but if this might turn out to be an alternative to prog->active, that'd be great

jirka

> 
> >
> > >
> > > > Note, the ftrace_test_recursion_*() code needs to be updated because it
> > > > currently does disable preemption, which it doesn't have to. And that
> > > > can cause migrate_disable() to do something different. It only disabled
> > > > preemption, as there was a time that it needed to, but now it doesn't.
> > > > But the users of it will need to be audited to make sure that they
> > > > don't need the side effect of it disabling preemption.
> > > >
> > >
> > > disabling preemption is not expected by bpf prog, so I think we should
> > > change it.
> >
> > The disabling of preemption was just done because every place that used it
> > happened to also disable preemption. So it was just a clean up, not a
> > requirement. Although the documentation said it did disable preemption :-/
> >
> >  See ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")
> >
> > I think I can add a ftrace_test_recursion_try_aquire() and release() that
> > is does the same thing without preemption. That way, we don't need to
> > revert that patch, and use that instead.
> >
> > -- Steve
> 
> 
> 
> -- 
> Regards
> Yafang
Yafang Shao April 12, 2023, 2:30 p.m. UTC | #24
On Mon, Apr 10, 2023 at 10:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 10 Apr 2023 10:12:24 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > The disabling of preemption was just done because every place that used it
> > happened to also disable preemption. So it was just a clean up, not a
> > requirement. Although the documentation said it did disable preemption :-/
> >
> >  See ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")
> >
> > I think I can add a ftrace_test_recursion_try_aquire() and release() that
> > is does the same thing without preemption. That way, we don't need to
> > revert that patch, and use that instead.
>
> This patch adds a:
>
>    test_recursion_try_acquire() and test_recursion_release()
>
> I called it "acquire" and "release" so that "lock" can imply preemption being
> disabled, and this only protects against recursion (and I removed "ftrace"
> in the name, as it is meant for non-ftrace uses, which I may give it a
> different set of recursion bits).
>
> Note, the reason to pass in ip, and parent_ip (_THIS_IP_ and _RET_IP_) is
> for debugging purposes. They *should* be optimized out, as everything is
> __always_inline or macros, and those are only used if
> CONFIG_FTRACE_RECORD_RECURSION is enabled.
>
> -- Steve
>

Thanks for the explanation.

>
> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index d48cd92d2364..80de2ee7b4c3 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -150,9 +150,6 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
>  # define trace_warn_on_no_rcu(ip)      false
>  #endif
>
> -/*
> - * Preemption is promised to be disabled when return bit >= 0.
> - */
>  static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
>                                                         int start)
>  {
> @@ -182,18 +179,11 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
>         val |= 1 << bit;
>         current->trace_recursion = val;
>         barrier();
> -
> -       preempt_disable_notrace();
> -
>         return bit;
>  }
>
> -/*
> - * Preemption will be enabled (if it was previously enabled).
> - */
>  static __always_inline void trace_clear_recursion(int bit)
>  {
> -       preempt_enable_notrace();
>         barrier();
>         trace_recursion_clear(bit);
>  }
> @@ -205,12 +195,18 @@ static __always_inline void trace_clear_recursion(int bit)
>   * tracing recursed in the same context (normal vs interrupt),
>   *
>   * Returns: -1 if a recursion happened.
> - *           >= 0 if no recursion.
> + *           >= 0 if no recursion and preemption will be disabled.
>   */
>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
>                                                          unsigned long parent_ip)
>  {
> -       return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
> +       int bit;
> +
> +       bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
> +       if (unlikely(bit < 0))
> +               return -1;
> +       preempt_disable_notrace();
> +       return bit;
>  }
>
>  /**
> @@ -220,6 +216,33 @@ static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
>   * This is used at the end of a ftrace callback.
>   */
>  static __always_inline void ftrace_test_recursion_unlock(int bit)
> +{
> +       preempt_enable_notrace();
> +       trace_clear_recursion(bit);
> +}
> +
> +/**
> + * test_recursion_try_acquire - tests for recursion in same context
> + *
> + * This will detect recursion of a function.
> + *
> + * Returns: -1 if a recursion happened.
> + *           >= 0 if no recursion
> + */
> +static __always_inline int test_recursion_try_acquire(unsigned long ip,
> +                                                     unsigned long parent_ip)
> +{
> +       return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START);
> +}
> +
> +/**
> + * test_recursion_release - called after a success of test_recursion_try_acquire()
> + * @bit: The return of a successful test_recursion_try_acquire()
> + *
> + * This releases the recursion lock taken by a non-negative return call
> + * by test_recursion_try_acquire().
> + */
> +static __always_inline void test_recursion_release(int bit)
>  {
>         trace_clear_recursion(bit);
>  }
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 0feea145bb29..ff8172ba48b0 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -7644,6 +7644,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>         if (bit < 0)
>                 return;
>
> +       preempt_disable();
>         do_for_each_ftrace_op(op, ftrace_ops_list) {
>                 /* Stub functions don't need to be called nor tested */
>                 if (op->flags & FTRACE_OPS_FL_STUB)
> @@ -7665,6 +7666,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>                 }
>         } while_for_each_ftrace_op(op);
>  out:
> +       preempt_enable();
>         trace_clear_recursion(bit);
>  }
>

Great!
Yafang Shao April 12, 2023, 2:37 p.m. UTC | #25
On Tue, Apr 11, 2023 at 5:35 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Apr 10, 2023 at 10:20:31PM +0800, Yafang Shao wrote:
> > On Mon, Apr 10, 2023 at 10:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Mon, 10 Apr 2023 21:56:16 +0800
> > > Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > > Thanks for your explanation again.
> > > > BPF trampoline is a little special. It includes three parts, as follows,
> > > >
> > > >     ret = __bpf_prog_enter();
> > > >     if (ret)
> > > >         prog->bpf_func();
> > > >      __bpf_prog_exit();
> > > >
> > > > migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
> > > > in __bpf_prog_exit():
> > > >
> > > >     ret = __bpf_prog_enter();
> > > >                 migrate_disable();
> > > >     if (ret)
> > > >         prog->bpf_func();
> > > >      __bpf_prog_exit();
> > > >           migrate_enable();
> > > >
> > > > That said, if we haven't executed migrate_disable() in
> > > > __bpf_prog_enter(), we shouldn't execute migrate_enable() in
> > > > __bpf_prog_exit().
> > > > Can ftrace_test_recursion_trylock() be applied to this pattern ?
> > >
> > > Yes, it can! And in this you would need to not call migrate_enable()
> > > because if the trace_recursion_trylock() failed, it would prevent
> > > migrate_disable() from being called (and should not let the bpf_func() from
> > > being called either. And then the migrate_enable in __bpf_prog_exit() would
> > > need to know not to call migrate_enable() which checking the return value
> > > of ftrace_test_recursion_trylock() would give the same value as what the
> > > one before migrate_disable() had.
> > >
> >
> > That needs some changes in invoke_bpf_prog() in files
> > arch/${ARCH}/net/bpf_jit_comp.c.
> > But I will have a try. We can then remove the bpf_prog->active, that
> > will be a good cleanup as well.
>
> I was wondering if it's worth the effort to do that just to be able to attach
> bpf prog to preempt_count_add/sub and was going to suggest to add them to
> btf_id_deny as Steven pointed out earlier as possible solution
>
> but if that might turn out as alternative to prog->active, that'd be great
>

I think we can do it in two steps:
1. Fix this crash by adding preempt_count_{sub,add} to the btf_id deny
   list (see the sketch below).
   The stable kernels may need this fix, so we'd better keep it simple
   so that it can be backported easily.

2. Replace prog->active with the new
   test_recursion_try_{acquire,release} introduced by Steven.
   That's an improvement, and we can do it in a separate patchset.

WDYT?
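
For step 1, the deny-list addition would presumably look something like this
in kernel/bpf/verifier.c (a sketch only; the existing entries shown are
abbreviated and the exact config guards around the new ones may differ):

	BTF_SET_START(btf_id_deny)
	BTF_ID_UNUSED
	#ifdef CONFIG_SMP
	BTF_ID(func, migrate_disable)
	BTF_ID(func, migrate_enable)
	#endif
	/* new entries for this fix */
	BTF_ID(func, preempt_count_add)
	BTF_ID(func, preempt_count_sub)
	BTF_SET_END(btf_id_deny)

check_attach_btf_id() already rejects tracing programs whose attach target
is in this set, so adding the two IDs should be enough.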

BTW, maybe we need to add a new fentry test case that attaches to all
available FUNCs parsed from /sys/kernel/btf/vmlinux.
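
A rough userspace sketch of enumerating those candidate functions (this only
lists the BTF FUNCs; an actual selftest would still have to set each name as
the fentry attach target and load/attach per function, which is what makes it
slow without multi-attach support, as noted in the reply below):

	#include <stdio.h>
	#include <bpf/btf.h>

	int main(void)
	{
		struct btf *btf = btf__load_vmlinux_btf();	/* parses /sys/kernel/btf/vmlinux */
		int i, n;

		if (!btf)
			return 1;

		n = btf__type_cnt(btf);
		for (i = 1; i < n; i++) {
			const struct btf_type *t = btf__type_by_id(btf, i);

			/* BTF_KIND_FUNC entries are the candidate fentry targets */
			if (!btf_is_func(t))
				continue;
			printf("%s\n", btf__name_by_offset(btf, t->name_off));
		}
		btf__free(btf);
		return 0;
	}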
Jiri Olsa April 12, 2023, 7:04 p.m. UTC | #26
On Wed, Apr 12, 2023 at 10:37:07PM +0800, Yafang Shao wrote:
> On Tue, Apr 11, 2023 at 5:35 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 10:20:31PM +0800, Yafang Shao wrote:
> > > On Mon, Apr 10, 2023 at 10:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > On Mon, 10 Apr 2023 21:56:16 +0800
> > > > Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > > Thanks for your explanation again.
> > > > > BPF trampoline is a little special. It includes three parts, as follows,
> > > > >
> > > > >     ret = __bpf_prog_enter();
> > > > >     if (ret)
> > > > >         prog->bpf_func();
> > > > >      __bpf_prog_exit();
> > > > >
> > > > > migrate_disable() is called in __bpf_prog_enter() and migrate_enable()
> > > > > in __bpf_prog_exit():
> > > > >
> > > > >     ret = __bpf_prog_enter();
> > > > >                 migrate_disable();
> > > > >     if (ret)
> > > > >         prog->bpf_func();
> > > > >      __bpf_prog_exit();
> > > > >           migrate_enable();
> > > > >
> > > > > That said, if we haven't executed migrate_disable() in
> > > > > __bpf_prog_enter(), we shouldn't execute migrate_enable() in
> > > > > __bpf_prog_exit().
> > > > > Can ftrace_test_recursion_trylock() be applied to this pattern ?
> > > >
> > > > Yes, it can! And in this you would need to not call migrate_enable()
> > > > because if the trace_recursion_trylock() failed, it would prevent
> > > > migrate_disable() from being called (and should not let the bpf_func() from
> > > > being called either. And then the migrate_enable in __bpf_prog_exit() would
> > > > need to know not to call migrate_enable() which checking the return value
> > > > of ftrace_test_recursion_trylock() would give the same value as what the
> > > > one before migrate_disable() had.
> > > >
> > >
> > > That needs some changes in invoke_bpf_prog() in files
> > > arch/${ARCH}/net/bpf_jit_comp.c.
> > > But I will have a try. We can then remove the bpf_prog->active, that
> > > will be a good cleanup as well.
> >
> > I was wondering if it's worth the effort to do that just to be able to attach
> > bpf prog to preempt_count_add/sub and was going to suggest to add them to
> > btf_id_deny as Steven pointed out earlier as possible solution
> >
> > but if that might turn out as alternative to prog->active, that'd be great
> >
> 
> I think we can do it in two steps,
> 1. Fix this crash by adding preempt_count_{sub,add} into btf_id deny list.
>    The stable kernel may need this fix, so we'd better make it
> simpler, then it can be backported easily.
> 
> 2. Replace prog->active with the new
> test_recursion_try_{acquire,release} introduced by Steven
>    That's an improvement. We can do it in a separate patchset.
> 
> WDYT?

sounds good

> 
> BTW, maybe we need to add a new fentry test case to attach all
> available FUNCs parsed from /sys/kernel/btf/vmlinux.

that might be tricky because we don't have multi trampoline attach
support at the moment, so it will take forever

jirka
diff mbox series

Patch

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index e8143e3..fe4b248 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -27,6 +27,9 @@  static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
 	struct fprobe *fp;
 	int bit;
 
+	if (!rcu_is_watching())
+		return;
+
 	fp = container_of(ops, struct fprobe, ops);
 	if (fprobe_disabled(fp))
 		return;