diff mbox series

tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release

Message ID 20240820130001.124768-1-tglozar@redhat.com (mailing list archive)
State New
Headers show
Series tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release | expand

Commit Message

Tomas Glozar Aug. 20, 2024, 1 p.m. UTC
From: Tomas Glozar <tglozar@redhat.com>

When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
NULL pointer dereference can be triggered by sending consequent SIGINT
and SIGTERM signals to the workload process. That then causes
timerlat_fd_release to be called twice in a row, and the second time,
hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
dereference.

This can be reproduced using rtla:
```
$ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
 kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
[1] 1675
[1]+  Aborted (SIGTERM)      rtla timerlat top -u -q
[1] 1688
client_loop: send disconnect: Broken pipe
```
triggering the bug:
```
BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted
6.10.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
RIP: 0010:hrtimer_active+0xd/0x50
RSP: 0018:ffffa86641567cc0 EFLAGS: 00010286
RAX: 000000000002e2c0 RBX: ffff994c6bf2e2c8 RCX: ffff994b0911ac18
RDX: 0000000000000000 RSI: ffff994b02f10700 RDI: ffff994c6bf2e2c8
RBP: ffff994c6bf2e340 R08: ffff994b158f7400 R09: ffff994b0911ac18
R10: 0000000000000010 R11: ffff994b00d40f00 R12: ffff994c6bf2e2c8
R13: ffff994b02049b20 R14: ffff994b011806c0 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff994c6bf00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000139020006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
  <TASK>
  ? __die+0x24/0x70
  ? page_fault_oops+0x75/0x170
  ? mt_destroy_walk.isra.0+0x2b3/0x320
  ? exc_page_fault+0x70/0x160
  ? asm_exc_page_fault+0x26/0x30
  ? hrtimer_active+0xd/0x50
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x48/0xe0
  __fput+0xed/0x2c0
  task_work_run+0x59/0x90
  do_exit+0x275/0x4b0
  do_group_exit+0x30/0x80
  get_signal+0x917/0x960
  ? vfs_read+0xb7/0x340
  arch_do_signal_or_restart+0x29/0xf0
  ? syscall_exit_to_user_mode+0x70/0x1f0
  ? syscall_exit_work+0xf3/0x120
  syscall_exit_to_user_mode+0x1a0/0x1f0
  do_syscall_64+0x89/0x160
  ? clear_bhb_loop+0x25/0x80
  ? clear_bhb_loop+0x25/0x80
  ? clear_bhb_loop+0x25/0x80
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f75790fd9ec
...
  </TASK>
```

Fix the NULL pointer dereference by checking tlat_var->kthread for zero
first in timerlat_fd_release, before calling hrtimer_cancel.
tlat_var->kthread is always non-zero unless the entire tlat_var is zero,
since it is set to the TID of the userspace workload in timerlat_fd_open
under a mutex.

Cc: stable@vger.kernel.org
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Co-developed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 kernel/trace/trace_osnoise.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Steven Rostedt Aug. 21, 2024, 8:03 p.m. UTC | #1
On Tue, 20 Aug 2024 15:00:01 +0200
tglozar@redhat.com wrote:

> From: Tomas Glozar <tglozar@redhat.com>
> 
> When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
> NULL pointer dereference can be triggered by sending consequent SIGINT
> and SIGTERM signals to the workload process. That then causes
> timerlat_fd_release to be called twice in a row, and the second time,
> hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
> dereference.
> 
> This can be reproduced using rtla:
> ```
> $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
>  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
> [1] 1675
> [1]+  Aborted (SIGTERM)      rtla timerlat top -u -q
> [1] 1688
> client_loop: send disconnect: Broken pipe
> ```
> triggering the bug:

I'm able to reproduce this with the above. Unfortunately, I can still
reproduce it after applying this patch :-(

Looking at the code, the logic for handling the kthread seems off. I'll
spend a little time to see if I can figure it out.

-- Steve
Tomas Glozar Aug. 22, 2024, 9:32 a.m. UTC | #2
st 21. 8. 2024 v 22:02 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
>
> I'm able to reproduce this with the above. Unfortunately, I can still
> reproduce it after applying this patch :-(
>

Thank you for looking at this. I was at first not too sure about
whether this is the proper fix, but after some discussion with Luis
(in CC), we have come to the conclusion that the double-close of the
timerlat_fd might be a possible explanation, and this patch worked for
both of us. Are you reproducing the same bug (NULL pointer dereference
in hrtimer_active) with the patch? IIUC that should not happen anymore
since the patch explicitly checks for zero in the hrtimer structure.

I have caught however a different panic in addition to the one
reported above while testing "rtla: Support idle state disabling via
libcpupower in timerlat" on an EL9 RT kernel:

BUG: kernel NULL pointer dereference, address: 0000000000000014
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G        W
-------  ---  5.14.0-452.el9.x86_64+rt #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
RIP: 0010:task_dump_owner+0x3d/0x100
RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580
RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80
R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c
R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8
FS:  00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? proc_pid_make_inode+0xa0/0x110
? __die_body.cold+0x8/0xd
? page_fault_oops+0x140/0x180
? do_user_addr_fault+0x61/0x690
? kvm_read_and_reset_apf_flags+0x45/0x60
? exc_page_fault+0x65/0x180
? asm_exc_page_fault+0x22/0x30
? task_dump_owner+0x3d/0x100
? task_dump_owner+0x36/0x100
proc_pid_make_inode+0xa0/0x110
proc_pid_instantiate+0x21/0xb0
proc_pid_lookup+0x95/0x170
proc_root_lookup+0x1d/0x50
__lookup_slow+0x9c/0x150
walk_component+0x158/0x1d0
link_path_walk.part.0.constprop.0+0x24e/0x3c0
? path_init+0x326/0x4d0
path_openat+0xb1/0x280
do_filp_open+0xb2/0x160
? migrate_enable+0xd5/0x150
? rt_spin_unlock+0x13/0x40
do_sys_openat2+0x96/0xd0
__x64_sys_openat+0x53/0xa0
...
</TASK>

This was preceded by a WARN:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 6 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Not tainted
5.14.0-452.el9.x86_64+rt #1
RIP: 0010:refcount_warn_saturate+0x74/0x110
Call Trace:
<TASK>
[   78.184877]  proc_pid_lookup+0x161/0x170
[   78.184883]  proc_root_lookup+0x1d/0x50
[   78.184890]  __lookup_slow+0x9c/0x150
[   78.184899]  walk_component+0x158/0x1d0
[   78.184908]  link_path_walk.part.0.constprop.0+0x24e/0x3c0
[   78.184915]  ? path_init+0x326/0x4d0
[   78.184922]  path_openat+0xb1/0x280
[   78.184926]  do_filp_open+0xb2/0x160
[   78.184934]  ? migrate_enable+0xd5/0x150
[   78.184942]  ? rt_spin_unlock+0x13/0x40
[   78.184950]  do_sys_openat2+0x96/0xd0
[   78.184958]  __x64_sys_openat+0x53/0xa0
[   78.184964]  do_syscall_64+0x5c/0xf0
[   78.185011]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
...
</TASK>

> Looking at the code, the logic for handling the kthread seems off. I'll
> spend a little time to see if I can figure it out.
>

Yeah, it seems there might be multiple bugs in the user workload
handling, the other NULL pointer dereference and refcount warning
above might be related (but I have yet to reproduce it on an upstream
kernel). I'm also going to look at the code and will post any findings
here.



Tomas
Luis Claudio R. Goncalves Aug. 22, 2024, 11:20 a.m. UTC | #3
On Wed, Aug 21, 2024 at 04:03:16PM -0400, Steven Rostedt wrote:
> On Tue, 20 Aug 2024 15:00:01 +0200
> tglozar@redhat.com wrote:
> 
> > From: Tomas Glozar <tglozar@redhat.com>
> > 
> > When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
> > NULL pointer dereference can be triggered by sending consequent SIGINT
> > and SIGTERM signals to the workload process. That then causes
> > timerlat_fd_release to be called twice in a row, and the second time,
> > hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
> > dereference.
> > 
> > This can be reproduced using rtla:
> > ```
> > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> >  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
> > [1] 1675
> > [1]+  Aborted (SIGTERM)      rtla timerlat top -u -q
> > [1] 1688
> > client_loop: send disconnect: Broken pipe
> > ```
> > triggering the bug:
> 
> I'm able to reproduce this with the above. Unfortunately, I can still
> reproduce it after applying this patch :-(

We were able to mitigate the problem (triggered by that command line) simply
by handling SIGTERM in the userspace tool the same way it handles SIGINT. That
was one of the factors that helped the "closing the file descriptor twice"
theory.

Would you mind sharing the backtrace you got? That would also help us
investigating.

> Looking at the code, the logic for handling the kthread seems off. I'll
> spend a little time to see if I can figure it out.

You mean the 

+	if (!tlat_var->kthread) {
+		/* the fd has been closed already */

bit or the kthread handling in rtla itself?

As Tomas already said, thank you for testing and reviewing the suggested fix!

Luis
Steven Rostedt Aug. 22, 2024, 2:32 p.m. UTC | #4
On Thu, 22 Aug 2024 11:32:07 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> st 21. 8. 2024 v 22:02 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> >
> > I'm able to reproduce this with the above. Unfortunately, I can still
> > reproduce it after applying this patch :-(
> >  
> 
> Thank you for looking at this. I was at first not too sure about
> whether this is the proper fix, but after some discussion with Luis
> (in CC), we have come to the conclusion that the double-close of the
> timerlat_fd might be a possible explanation, and this patch worked for
> both of us. Are you reproducing the same bug (NULL pointer dereference
> in hrtimer_active) with the patch? IIUC that should not happen anymore
> since the patch explicitly checks for zero in the hrtimer structure.

There isn't a double close. But there are two bugs and you did sorta fix
one of them.

> 
> I have caught however a different panic in addition to the one
> reported above while testing "rtla: Support idle state disabling via
> libcpupower in timerlat" on an EL9 RT kernel:
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000014
> CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G        W
> -------  ---  5.14.0-452.el9.x86_64+rt #1
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
> 04/01/2014
> RIP: 0010:task_dump_owner+0x3d/0x100
> RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580
> RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80
> R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c
> R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8
> FS:  00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? show_trace_log_lvl+0x1c4/0x2df
> ? show_trace_log_lvl+0x1c4/0x2df
> ? proc_pid_make_inode+0xa0/0x110
> ? __die_body.cold+0x8/0xd
> ? page_fault_oops+0x140/0x180
> ? do_user_addr_fault+0x61/0x690
> ? kvm_read_and_reset_apf_flags+0x45/0x60
> ? exc_page_fault+0x65/0x180
> ? asm_exc_page_fault+0x22/0x30
> ? task_dump_owner+0x3d/0x100
> ? task_dump_owner+0x36/0x100
> proc_pid_make_inode+0xa0/0x110
> proc_pid_instantiate+0x21/0xb0
> proc_pid_lookup+0x95/0x170
> proc_root_lookup+0x1d/0x50
> __lookup_slow+0x9c/0x150
> walk_component+0x158/0x1d0
> link_path_walk.part.0.constprop.0+0x24e/0x3c0
> ? path_init+0x326/0x4d0
> path_openat+0xb1/0x280
> do_filp_open+0xb2/0x160
> ? migrate_enable+0xd5/0x150
> ? rt_spin_unlock+0x13/0x40
> do_sys_openat2+0x96/0xd0
> __x64_sys_openat+0x53/0xa0
> ...


> Yeah, it seems there might be multiple bugs in the user workload
> handling, the other NULL pointer dereference and refcount warning
> above might be related (but I have yet to reproduce it on an upstream
> kernel). I'm also going to look at the code and will post any findings
> here.

Yes that is the second bug and it is related to the that this addresses.

-- Steve
Steven Rostedt Aug. 22, 2024, 2:34 p.m. UTC | #5
On Thu, 22 Aug 2024 08:20:52 -0300
"Luis Claudio R. Goncalves" <lgoncalv@redhat.com> wrote:

> You mean the 
> 
> +	if (!tlat_var->kthread) {
> +		/* the fd has been closed already */
> 
> bit or the kthread handling in rtla itself?
> 
> As Tomas already said, thank you for testing and reviewing the suggested fix!

The close only happens once, but the kthread does get freed elsewhere.

I used my new persistent ring buffer to debug this! I'll be posting some
more patches that allow trace_printk() and friends to use the persistent
ring buffer (or any instance) soon.

For this bug, I would recommend instead doing:

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..53de719f35cb 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2579,7 +2587,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
 	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
 	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
-	hrtimer_cancel(&tlat_var->timer);
+	if (tlat_var->kthread)
+		hrtimer_cancel(&tlat_var->timer);
 	memset(tlat_var, 0, sizeof(*tlat_var));
 
 	osn_var->sampling = 0;

-- Steve
Steven Rostedt Aug. 22, 2024, 2:37 p.m. UTC | #6
On Thu, 22 Aug 2024 10:32:02 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > Yeah, it seems there might be multiple bugs in the user workload
> > handling, the other NULL pointer dereference and refcount warning
> > above might be related (but I have yet to reproduce it on an upstream
> > kernel). I'm also going to look at the code and will post any findings
> > here.  
> 
> Yes that is the second bug and it is related to the that this addresses.

There's nothing protecting the clearing of the kthreads and calling
put_task_struct(). Here's the fix to the second bug:

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..53de719f35cb 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2106,7 +2106,9 @@ static int osnoise_cpu_init(unsigned int cpu)
  */
 static int osnoise_cpu_die(unsigned int cpu)
 {
+	mutex_lock(&interface_lock);
 	stop_kthread(cpu);
+	mutex_unlock(&interface_lock);
 	return 0;
 }
 
@@ -2239,8 +2241,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
 	 */
 	mutex_lock(&trace_types_lock);
 	running = osnoise_has_registered_instances();
-	if (running)
+	if (running) {
+		mutex_lock(&interface_lock);
 		stop_per_cpu_kthreads();
+		mutex_unlock(&interface_lock);
+	}
 
 	mutex_lock(&interface_lock);
 	/*
@@ -2355,8 +2360,11 @@ osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
 	 */
 	mutex_lock(&trace_types_lock);
 	running = osnoise_has_registered_instances();
-	if (running)
+	if (running) {
+		mutex_lock(&interface_lock);
 		stop_per_cpu_kthreads();
+		mutex_unlock(&interface_lock);
+	}
 
 	mutex_lock(&interface_lock);
 	/*
@@ -2951,7 +2960,9 @@ static void osnoise_workload_stop(void)
 	 */
 	barrier();
 
+	mutex_lock(&interface_lock);
 	stop_per_cpu_kthreads();
+	mutex_unlock(&interface_lock);
 
 	osnoise_unhook_events();
 }


With both of these fixes, the bug goes away.

I'll add this fix (after enabling lockdep and making sure I didn't screw up
the locking). Can you resend this patch with just not calling cancel if
kthread is NULL. No need to exit out early. I still like to make sure the
clean up happens, and not assume it will already be done.

-- Steve
Steven Rostedt Aug. 23, 2024, 4:54 p.m. UTC | #7
On Tue, 20 Aug 2024 15:00:01 +0200
tglozar@redhat.com wrote:

> From: Tomas Glozar <tglozar@redhat.com>
> 
> When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
> NULL pointer dereference can be triggered by sending consequent SIGINT
> and SIGTERM signals to the workload process. That then causes
> timerlat_fd_release to be called twice in a row, and the second time,
> hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
> dereference.
> 
> This can be reproduced using rtla:
> ```
> $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
>  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
> [1] 1675
> [1]+  Aborted (SIGTERM)      rtla timerlat top -u -q
> [1] 1688
> client_loop: send disconnect: Broken pipe
> ```
> triggering the bug:
> ```
> BUG: kernel NULL pointer dereference, address: 0000000000000010
> Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted
> 6.10.0-rc2+ #1
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
> 04/01/2014
> RIP: 0010:hrtimer_active+0xd/0x50
> RSP: 0018:ffffa86641567cc0 EFLAGS: 00010286
> RAX: 000000000002e2c0 RBX: ffff994c6bf2e2c8 RCX: ffff994b0911ac18
> RDX: 0000000000000000 RSI: ffff994b02f10700 RDI: ffff994c6bf2e2c8
> RBP: ffff994c6bf2e340 R08: ffff994b158f7400 R09: ffff994b0911ac18
> R10: 0000000000000010 R11: ffff994b00d40f00 R12: ffff994c6bf2e2c8
> R13: ffff994b02049b20 R14: ffff994b011806c0 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff994c6bf00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 0000000139020006 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   ? __die+0x24/0x70
>   ? page_fault_oops+0x75/0x170
>   ? mt_destroy_walk.isra.0+0x2b3/0x320
>   ? exc_page_fault+0x70/0x160
>   ? asm_exc_page_fault+0x26/0x30
>   ? hrtimer_active+0xd/0x50
>   hrtimer_cancel+0x15/0x40
>   timerlat_fd_release+0x48/0xe0
>   __fput+0xed/0x2c0
>   task_work_run+0x59/0x90
>   do_exit+0x275/0x4b0
>   do_group_exit+0x30/0x80
>   get_signal+0x917/0x960
>   ? vfs_read+0xb7/0x340
>   arch_do_signal_or_restart+0x29/0xf0
>   ? syscall_exit_to_user_mode+0x70/0x1f0
>   ? syscall_exit_work+0xf3/0x120
>   syscall_exit_to_user_mode+0x1a0/0x1f0
>   do_syscall_64+0x89/0x160
>   ? clear_bhb_loop+0x25/0x80
>   ? clear_bhb_loop+0x25/0x80
>   ? clear_bhb_loop+0x25/0x80
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   RIP: 0033:0x7f75790fd9ec
> ...
>   </TASK>
> ```
> 
> Fix the NULL pointer dereference by checking tlat_var->kthread for zero
> first in timerlat_fd_release, before calling hrtimer_cancel.
> tlat_var->kthread is always non-zero unless the entire tlat_var is zero,
> since it is set to the TID of the userspace workload in timerlat_fd_open
> under a mutex.
> 

So debugging this a bit more (with my persistent ring buffer!), I realize
this is just fixing a symptom and not the real issue (and I added my other
patch which fixes the kthread issue).

I added trace_printk() for every place the tlat->timer gets initialized,
started, canceled and memset to zero (via the memset() function). Those
looked like this:

	for_each_cpu(cpu, cpu_online_mask) {
		tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
		trace_printk("memset 0 %px\n", &tlat_var->timer);
		memset(tlat_var, 0, sizeof(*tlat_var));
	}

	trace_printk("start timer %px for %px\n", &tlat->timer, tlat->kthread);
	hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);

 etc.


And here's the result I found:

           <...>-909     [005] ....1    69.830509: timerlat_fd_release: cancel timer ffff88823c6b5b28 for ffff88810dae6780
           <...>-909     [005] ....1    69.830511: timerlat_fd_release: memset 0 ffff88823c6b5b28
           <...>-902     [001] ....1    69.831296: timerlat_fd_release: cancel timer ffff88823c4b5b28 for ffff88810dae5640
           <...>-902     [001] ....1    69.831300: timerlat_fd_release: memset 0 ffff88823c4b5b28
           <...>-901     [000] ....1    69.831319: timerlat_fd_release: cancel timer ffff88823c435b28 for ffff888110b8d640
           <...>-901     [000] ....1    69.831323: timerlat_fd_release: memset 0 ffff88823c435b28
           <...>-905     [003] ....1    69.831371: timerlat_fd_release: cancel timer ffff88823c5b5b28 for ffff88810dae0000
           <...>-905     [003] ....1    69.831375: timerlat_fd_release: memset 0 ffff88823c5b5b28
           <...>-904     [002] ....1    69.831390: timerlat_fd_release: cancel timer ffff88823c535b28 for ffff88810dae2280
           <...>-904     [002] ....1    69.831394: timerlat_fd_release: memset 0 ffff88823c535b28
           <...>-907     [004] ....1    69.831442: timerlat_fd_release: cancel timer ffff88823c635b28 for ffff88810dae33c0
           <...>-907     [004] ....1    69.831446: timerlat_fd_release: memset 0 ffff88823c635b28
           <...>-911     [007] ....1    69.832689: timerlat_fd_release: cancel timer ffff88823c7b5b28 for ffff88810dae1140
           <...>-911     [007] ....1    69.832693: timerlat_fd_release: memset 0 ffff88823c7b5b28
           <...>-916     [005] .....    69.838927: osnoise_workload_start: memset 0 ffff88823c435b28
           <...>-916     [005] .....    69.838931: osnoise_workload_start: memset 0 ffff88823c4b5b28
           <...>-916     [005] .....    69.838932: osnoise_workload_start: memset 0 ffff88823c535b28
           <...>-916     [005] .....    69.838933: osnoise_workload_start: memset 0 ffff88823c5b5b28
           <...>-916     [005] .....    69.838934: osnoise_workload_start: memset 0 ffff88823c635b28
           <...>-916     [005] .....    69.838935: osnoise_workload_start: memset 0 ffff88823c6b5b28
           <...>-916     [005] .....    69.838936: osnoise_workload_start: memset 0 ffff88823c735b28
           <...>-916     [005] .....    69.838938: osnoise_workload_start: memset 0 ffff88823c7b5b28
           <...>-910     [006] ....1    69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000
           <...>-910     [006] d...1    69.841134: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
           <...>-910     [006] d...1    69.844649: console: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
           <...>-910     [006] d...1    69.846887: console: CPU: 6 UID: 0 PID: 910 Comm: timerlatu/6 Not tainted 6.11.0-rc3-test-00027-g014f473a3416-dirty #128 70de8528c323e5e87113ad73b5320a6d112a9ae6
           <...>-910     [006] d...1    69.851008: console: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
           <...>-910     [006] d...1    69.853750: console: RIP: 0010:hrtimer_active+0x58/0x300
           <...>-910     [006] d...1    69.855118: console: Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
           <...>-910     [006] d...1    69.860562: console: RSP: 0018:ffff88810c7a7718 EFLAGS: 00010202
           <...>-910     [006] d...1    69.860569: console:
           <...>-910     [006] d...1    69.862099: console: RAX: 0000000000000002 RBX: ffff88823c735b28 RCX: ffffed10478e6b6b
           <...>-910     [006] d...1    69.864181: console: RDX: dffffc0000000000 RSI: ffffed10478e6b6c RDI: ffff88823c735b28
           <...>-910     [006] d...1    69.866262: console: RBP: 0000000000000000 R08: ffff88823c735b58 R09: ffff88823c735b60
           <...>-910     [006] d...1    69.868359: console: R10: ffff88810dae4fe7 R11: ffffed102027af0a R12: ffff88823c735b28
           <...>-910     [006] d...1    69.870459: console: R13: ffffffffa14f8d8b R14: 0000000000000010 R15: ffff88823c735b28
           <...>-910     [006] d...1    69.872576: console: FS:  0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
           <...>-910     [006] d...1    69.874764: console: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
           <...>-910     [006] d...1    69.876332: console: CR2: 00007fc741ad5f90 CR3: 00000001f729e006 CR4: 0000000000170ef0
           <...>-910     [006] d...1    69.878275: console: Call Trace:
           <...>-910     [006] d...1    69.878954: console:  <TASK>

The timer that crashed (and the other timers would eventually too) was
ffff88823c735b28, and just looking at that timer from the above:

           <...>-910     [006] ....1    69.827589: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
           <...>-910     [006] ....1    69.828557: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
           <...>-910     [006] ....1    69.829529: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
           <...>-916     [005] .....    69.838936: osnoise_workload_start: memset 0 ffff88823c735b28
           <...>-910     [006] ....1    69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000

It was the memset that cleared it and then calling hrtimer_cancel() on that
broke it. But notice something else here. The timer is never canceled!
That's a real bug.

This wasn't caused by the file descriptor closing twice and
timerlat_fd_release called again. There's no such thing otherwise there
would be lots of broken code in the kernel.

I added a trace_dump_stack():

	for_each_cpu(cpu, cpu_online_mask) {
		tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
		trace_printk("memset 0 %px\n", &tlat_var->timer);
		trace_dump_stack(0);
		memset(tlat_var, 0, sizeof(*tlat_var));
	}

And have this:

           <...>-924     [006] .....    42.574419: osnoise_workload_start: memset 0 ffff88823c7b5b28
           <...>-924     [006] .....    42.574423: <stack trace>
 => __ftrace_trace_stack
 => osnoise_workload_start
 => timerlat_tracer_start
 => rb_simple_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
           <...>-913     [007] ....1    42.580553: timerlat_fd_release: cancel timer ffff88823c7b5b28 for 0000000000000000
           <...>-913     [007] d...1    42.580599: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI


From your test case:

> $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
>  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done

The "kill -INT $PID" caused the write to osnoise_workload_start(), and the
after 1ms you do the "kill -TERM $PID" that kill the process which closes
the file descriptor right after the reset.

The real fix here looks to be:

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..400a72cd6ab5 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -265,6 +265,8 @@ static inline void tlat_var_reset(void)
 	 */
 	for_each_cpu(cpu, cpu_online_mask) {
 		tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
+		if (tlat_var->kthread)
+			hrtimer_cancel(&tlat_var->timer);
 		memset(tlat_var, 0, sizeof(*tlat_var));
 	}
 }
@@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
 	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
 	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
-	hrtimer_cancel(&tlat_var->timer);
+	if (tlat_var->kthread)
+		hrtimer_cancel(&tlat_var->timer);
 	memset(tlat_var, 0, sizeof(*tlat_var));
 
 	osn_var->sampling = 0;

I'll make this into a real patch and send it out.

-- Steve
Steven Rostedt Aug. 23, 2024, 6:52 p.m. UTC | #8
On Fri, 23 Aug 2024 12:54:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> >  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done  
> 
> The "kill -INT $PID" caused the write to osnoise_workload_start(), and the
> after 1ms you do the "kill -TERM $PID" that kill the process which closes
> the file descriptor right after the reset.
> 
> The real fix here looks to be:
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 66a871553d4a..400a72cd6ab5 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -265,6 +265,8 @@ static inline void tlat_var_reset(void)
>  	 */
>  	for_each_cpu(cpu, cpu_online_mask) {
>  		tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
> +		if (tlat_var->kthread)
> +			hrtimer_cancel(&tlat_var->timer);
>  		memset(tlat_var, 0, sizeof(*tlat_var));
>  	}
>  }
> @@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
>  	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>  	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
>  
> -	hrtimer_cancel(&tlat_var->timer);
> +	if (tlat_var->kthread)
> +		hrtimer_cancel(&tlat_var->timer);
>  	memset(tlat_var, 0, sizeof(*tlat_var));
>  
>  	osn_var->sampling = 0;
> 
> I'll make this into a real patch and send it out.

Egad, I don't think this is even good enough. I noticed this in the trace
(adding kthread to the memset trace_printk):

           <...>-916     [003] .....   134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000
           <...>-916     [003] .....   134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000
           <...>-916     [003] .....   134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000
           <...>-916     [003] .....   134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640
           <...>-916     [003] .....   134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000

Before the reset, all but one of the tlat->kthread is NULL. Then it dawned
on me that this is a global per CPU variable. It gets initialized when the
tracer starts. If another program is has the timerlat fd open when the
tracer ends, the tracer starts again, and you close the fd, it will cancel
the hrtimer for the new task.

I think there needs to be some ref counting here, that keeps the tracer
from starting again if there's still files opened.

This looks to be a bigger problem than I have time to work on it for now.
I'll just apply the mutex patch for the kthreads, but this bug is going to
take a bit more work in solving.

-- Steve
Tomas Glozar Aug. 26, 2024, 1:01 p.m. UTC | #9
pá 23. 8. 2024 v 20:51 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
>
> Egad, I don't think this is even good enough. I noticed this in the trace
> (adding kthread to the memset trace_printk):
>
>            <...>-916     [003] .....   134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000
>            <...>-916     [003] .....   134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000
>            <...>-916     [003] .....   134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000
>            <...>-916     [003] .....   134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000
>            <...>-916     [003] .....   134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000
>            <...>-916     [003] .....   134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000
>            <...>-916     [003] .....   134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640
>            <...>-916     [003] .....   134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000
>
> Before the reset, all but one of the tlat->kthread is NULL. Then it dawned
> on me that this is a global per CPU variable. It gets initialized when the
> tracer starts. If another program is has the timerlat fd open when the
> tracer ends, the tracer starts again, and you close the fd, it will cancel
> the hrtimer for the new task.
>
> I think there needs to be some ref counting here, that keeps the tracer
> from starting again if there's still files opened.
>

The timerlat fd is not supposed to be open when the tracer ends/starts
again, since osnoise_workload_stop() calls stop_kthread(), which in
turn calls kill_pid() to SIGKILL the user workload, which will also
close the file descriptor. Only one PID per CPU should have the
timerlat fd open at one moment, since timerlat_fd_open() will refuse
to open if tlat->pid is set. It appears that this is somehow bypassed
and osnoise_workload_start() happens before closing the fd, not sure
why.

> This looks to be a bigger problem than I have time to work on it for now.
> I'll just apply the mutex patch for the kthreads, but this bug is going to
> take a bit more work in solving.
>

Yeah, unfortunately the issue looks more complicated now after looking
at the traces you posted, I will probably have to do more tracing to
see what is actually happening here. Thank you again for helping us
with this and also for the patch for the mutex.

Tomas
Steven Rostedt Aug. 26, 2024, 5:26 p.m. UTC | #10
On Mon, 26 Aug 2024 15:01:24 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> > Before the reset, all but one of the tlat->kthread is NULL. Then it dawned
> > on me that this is a global per CPU variable. It gets initialized when the
> > tracer starts. If another program is has the timerlat fd open when the
> > tracer ends, the tracer starts again, and you close the fd, it will cancel
> > the hrtimer for the new task.
> >
> > I think there needs to be some ref counting here, that keeps the tracer
> > from starting again if there's still files opened.
> >  
> 
> The timerlat fd is not supposed to be open when the tracer ends/starts
> again, since osnoise_workload_stop() calls stop_kthread(), which in
> turn calls kill_pid() to SIGKILL the user workload, which will also
> close the file descriptor. Only one PID per CPU should have the
> timerlat fd open at one moment, since timerlat_fd_open() will refuse
> to open if tlat->pid is set. It appears that this is somehow bypassed
> and osnoise_workload_start() happens before closing the fd, not sure
> why.
> 
> > This looks to be a bigger problem than I have time to work on it for now.
> > I'll just apply the mutex patch for the kthreads, but this bug is going to
> > take a bit more work in solving.
> >  
> 
> Yeah, unfortunately the issue looks more complicated now after looking
> at the traces you posted, I will probably have to do more tracing to
> see what is actually happening here. Thank you again for helping us
> with this and also for the patch for the mutex.

Yeah, I think I finally found the real issue. I don't think we need the ref
counting. The problem is the creating and killing of the threads via the
start and stop callbacks. That's not their purpose. The purpose of stop
and start callbacks is when tracing_on is set to off and back on again. I
think this is what is racing with the close.

Anyway, the start and stop should probably just pause the threads and not
kill them an start them again. That is, the osnoise_workload_start() should
be called by the init callbacks and the osnoise_workload_stop should be
called by reset callback.

The start and stop callbacks should just pause and restart the the threads.

-- Steve
Tomas Glozar Aug. 27, 2024, 2:34 p.m. UTC | #11
po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
>
> Yeah, I think I finally found the real issue. I don't think we need the ref
> counting. The problem is the creating and killing of the threads via the
> start and stop callbacks. That's not their purpose. The purpose of stop
> and start callbacks is when tracing_on is set to off and back on again. I
> think this is what is racing with the close.
>

I believe I discovered the exact mechanism of the bug. Since rtla is
killed with SIGTERM, it leaves the timerlat tracer active and its
timerlatu processes running. The rtla from the next iteration of the
while loop in the reproducer deactivates the timerlat tracer,
triggering the killing of the timerlatu processes. However, this
killing is asynchronous, and while the processes are gradually dying
one by one and closing their fds, the rtla from the next iteration is
already activating the timerlat tracer again. One of the timerlatu
processes does not die fast enough to release the fd before the
tlat_var is cleared in osnoise_workload_start, triggering the panic.

You can see the bug in action in this detailed trace I got today, with
the use of in-kernel trace_printk together with bpftrace and
bpf_trace_printk and captured using ftrace_dump_on_oops (running the
same reproducer on a 8 CPU KVM system):
[  103.227679]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=0,
tlat_var=0000000034996efa
[  103.227844]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=1,
tlat_var=0000000062d1f520
[  103.228009]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=2,
tlat_var=0000000002e262d8
[  103.228173]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=3,
tlat_var=00000000b9d3a280
[  103.228349]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=4,
tlat_var=000000000fba7a6e
[  103.228510]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=5,
tlat_var=000000008ed0b87f
[  103.228670]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=6,
tlat_var=00000000a02fe267
[  103.228831]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=7,
tlat_var=000000008e1d5095
[  103.228992]     rtla-2327      3...11 98256551us :
bpf_trace_printk: rtla pid 2325 forks into 2328
[  103.229156]     rtla-2327      3...11 98256873us :
bpf_trace_printk: rtla pid 2325 forks into 2329
[  103.229325]     rtla-2327      3...11 98257124us :
bpf_trace_printk: rtla pid 2325 forks into 2330
[  103.229489]     rtla-2327      3...11 98257353us :
bpf_trace_printk: rtla pid 2325 forks into 2331
[  103.229651]     rtla-2327      3...11 98257574us :
bpf_trace_printk: rtla pid 2325 forks into 2332
[  103.229810]     rtla-2327      3...11 98257801us :
bpf_trace_printk: rtla pid 2325 forks into 2333
[  103.229968]     rtla-2327      3...11 98258008us :
bpf_trace_printk: rtla pid 2325 forks into 2334
[  103.230124]     rtla-2327      3...11 98258216us :
bpf_trace_printk: rtla pid 2325 forks into 2335
[  103.230293]     bash-2094      0d..31 102987914us :
bpf_trace_printk: rtla pid 2325 received signal 2
[  103.230452]     bash-2094      0d..31 102987915us :
bpf_trace_printk: ...from bash pid 2094
[  103.230609]     bash-2094      0d..31 102990000us :
bpf_trace_printk: rtla pid 2325 received signal 15
[  103.230766]     bash-2094      0d..31 102990001us :
bpf_trace_printk: ...from bash pid 2094
[  103.230922]     rtla-2325      2...21 102990010us :
bpf_trace_printk: rtla pid 2325 exiting
[  103.231079]     rtla-2327      3...21 102990503us :
bpf_trace_printk: rtla pid 2327 exiting
[  103.231242]     rtla-2337      3d..31 103232021us :
bpf_trace_printk: timerlatu/0 pid 2328 received signal 9
[  103.231399]     rtla-2337      3d..31 103232022us :
bpf_trace_printk: ...from rtla pid 2337
[  103.231554]     rtla-2337      3d..31 103232024us :
bpf_trace_printk: timerlatu/1 pid 2329 received signal 9
[  103.231711]     rtla-2337      3d..31 103232025us :
bpf_trace_printk: ...from rtla pid 2337
[  103.231866]     rtla-2337      3d..31 103232029us :
bpf_trace_printk: timerlatu/2 pid 2330 received signal 9
[  103.232022]     rtla-2337      3d..31 103232030us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232178]     rtla-2337      3dN.31 103232031us :
bpf_trace_printk: timerlatu/3 pid 2331 received signal 9
[  103.232347]     rtla-2337      3dN.31 103232031us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232506]     rtla-2337      3dN.31 103232034us :
bpf_trace_printk: timerlatu/4 pid 2332 received signal 9
[  103.232685]     rtla-2337      3dN.31 103232034us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232846]     rtla-2337      3dN.31 103232039us :
bpf_trace_printk: timerlatu/5 pid 2333 received signal 9
[  103.233010]     rtla-2337      3dN.31 103232039us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233171]     rtla-2337      3dN.31 103232042us :
bpf_trace_printk: timerlatu/6 pid 2334 received signal 9
[  103.233357]     rtla-2337      3dN.31 103232042us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233517]     rtla-2337      3dN.31 103232046us :
bpf_trace_printk: timerlatu/7 pid 2335 received signal 9
[  103.233678]     rtla-2337      3dN.31 103232046us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233835] timerlat-2331      3...21 103232701us :
bpf_trace_printk: timerlatu/3 pid 2331 exiting
[  103.233995] timerlat-2329      1...21 103232704us :
bpf_trace_printk: timerlatu/1 pid 2329 exiting
[  103.234155] timerlat-2331      3....1 103232705us :
timerlat_fd_release: timerlat_fd_release: cpu=3,
tlat_var=000000000fba7a6e, osn_var->pid=2331
[  103.234357] timerlat-2329      1....1 103232706us :
timerlat_fd_release: timerlat_fd_release: cpu=1,
tlat_var=0000000002e262d8, osn_var->pid=2329
[  103.234508] timerlat-2334      6...21 103232821us :
bpf_trace_printk: timerlatu/6 pid 2334 exiting
[  103.234657] timerlat-2334      6....1 103232826us :
timerlat_fd_release: timerlat_fd_release: cpu=6,
tlat_var=000000008e1d5095, osn_var->pid=2334
[  103.234813] timerlat-2330      2...21 103232850us :
bpf_trace_printk: timerlatu/2 pid 2330 exiting
[  103.234966] timerlat-2330      2....1 103232853us :
timerlat_fd_release: timerlat_fd_release: cpu=2,
tlat_var=00000000b9d3a280, osn_var->pid=2330
[  103.235122] timerlat-2333      5...21 103232856us :
bpf_trace_printk: timerlatu/5 pid 2333 exiting
[  103.235291] timerlat-2333      5....1 103232858us :
timerlat_fd_release: timerlat_fd_release: cpu=5,
tlat_var=00000000a02fe267, osn_var->pid=2333
[  103.235452] timerlat-2332      4...21 103232861us :
bpf_trace_printk: timerlatu/4 pid 2332 exiting
[  103.235610] timerlat-2332      4....1 103232864us :
timerlat_fd_release: timerlat_fd_release: cpu=4,
tlat_var=000000008ed0b87f, osn_var->pid=2332
[  103.235769] timerlat-2335      7...21 103232867us :
bpf_trace_printk: timerlatu/7 pid 2335 exiting
[  103.235932] timerlat-2335      7....1 103232869us :
timerlat_fd_release: timerlat_fd_release: cpu=7,
tlat_var=000000003113e297, osn_var->pid=2335
[  103.236103]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=0,
tlat_var=0000000034996efa
[  103.236316]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=1,
tlat_var=0000000062d1f520
[  103.236473]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=2,
tlat_var=0000000002e262d8
[  103.236630]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=3,
tlat_var=00000000b9d3a280
[  103.236797]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=4,
tlat_var=000000000fba7a6e
[  103.236972]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=5,
tlat_var=000000008ed0b87f
[  103.237144]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=6,
tlat_var=00000000a02fe267
[  103.237345]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=7,
tlat_var=000000008e1d5095
[  103.237516] timerlat-2328      0...21 103234149us :
bpf_trace_printk: timerlatu/0 pid 2328 exiting
[  103.237690] timerlat-2328      0....1 103234152us :
timerlat_fd_release: timerlat_fd_release: cpu=0,
tlat_var=0000000062d1f520, osn_var->pid=0
[  103.237871] ---------------------------------
[  103.238053] CR2: 0000000000000010

On the kernel side, I'm using similar trace_printk debug prints as you
did; the bpftrace program used is this:
uretprobe:libc:fork
/strncmp(comm, "rtla", 5) == 0 && retval != 0/
{
        debugf("rtla pid %d forks into %d", pid, retval)
}
tracepoint:signal:signal_generate
/strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
"timerlatu", 9) == 0/ {
        debugf("%s pid %d received signal %d", args->comm, args->pid,
args->sig);
        debugf("...from %s pid %d", comm, pid);
}
tracepoint:sched:sched_process_exit
/strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
"timerlatu", 9) == 0/ {
        debugf("%s pid %d exiting", args->comm, args->pid)
}

> Anyway, the start and stop should probably just pause the threads and not
> kill them an start them again. That is, the osnoise_workload_start() should
> be called by the init callbacks and the osnoise_workload_stop should be
> called by reset callback.
>
> The start and stop callbacks should just pause and restart the the threads.
>

I'm not sure if that fits into the design of timerlat. At least for
timerlat with user workload, each activation of the timerlat tracer
implies the creation of a new user workload, not the re-activation of
a lingering one (that might be an entirely different program). A
timerlat tracer is tightly associated with its threads: if the threads
exit, the timer itself ceases to have meaning. Of course also rtla
should make sure there are no lingering timerlatu processes.

I think waiting on the threads to actually exit in stop_kthread() is
the proper solution:

/*
 * stop_kthread - stop a workload thread
 */
static void stop_kthread(unsigned int cpu){
    struct task_struct *kthread;

    kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
    if (kthread) {
        if (test_bit(OSN_WORKLOAD, &osnoise_options)) {
            kthread_stop(kthread);
        } else {
            /*
             * This is a user thread waiting on the timerlat_fd. We need
             * to close all users, and the best way to guarantee this is
             * by killing the thread. NOTE: this is a purpose specific file.
             */
            kill_pid(kthread->thread_pid, SIGKILL, 1);
            /* ^^ here wait until kthread (actually the user workload) exits */
            put_task_struct(kthread);
        }
        per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
    }

There might even be another bug, not sure if calling put_task_struct
on a task that is still exiting is sane.

Tomas
Luis Claudio R. Goncalves Aug. 29, 2024, 11:40 p.m. UTC | #12
On Tue, Aug 27, 2024 at 04:34:39PM +0200, Tomas Glozar wrote:
> po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> >
> > Yeah, I think I finally found the real issue. I don't think we need the ref
> > counting. The problem is the creating and killing of the threads via the
> > start and stop callbacks. That's not their purpose. The purpose of stop
> > and start callbacks is when tracing_on is set to off and back on again. I
> > think this is what is racing with the close.
> >
> 
> I believe I discovered the exact mechanism of the bug. Since rtla is
> killed with SIGTERM, it leaves the timerlat tracer active and its
> timerlatu processes running. The rtla from the next iteration of the
> while loop in the reproducer deactivates the timerlat tracer,
> triggering the killing of the timerlatu processes. However, this
> killing is asynchronous, and while the processes are gradually dying
> one by one and closing their fds, the rtla from the next iteration is
> already activating the timerlat tracer again. One of the timerlatu
> processes does not die fast enough to release the fd before the
> tlat_var is cleared in osnoise_workload_start, triggering the panic.

With this in mind I tried a simpler approach. I used a kernel without any
extra patches from Steven or ourselves, just used the code below. Not
saying it is a definitive solution, but as a proof of concept I have two
boxes running the loop used to reproduce the problem and so far not a
single problem. The patch:

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a1..331c69bc7b0d4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2239,8 +2239,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
 	 */
 	mutex_lock(&trace_types_lock);
 	running = osnoise_has_registered_instances();
-	if (running)
+	if (running) {
 		stop_per_cpu_kthreads();
+		mutex_unlock(&trace_types_lock);
+		return -EBUSY;
+	}
 
 	mutex_lock(&interface_lock);
 	/*
@@ -2579,7 +2582,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
 	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
 	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
-	hrtimer_cancel(&tlat_var->timer);
+	if (tlat_var->kthread)
+		hrtimer_cancel(&tlat_var->timer);
 	memset(tlat_var, 0, sizeof(*tlat_var));
 
 	osn_var->sampling = 0;
---

In short, do not start new instances if the old one still registered. And
only cancel the timer if the timerlat per cpu var has not been zeroed-out.

Again, not a final patch but a simpler approach to protect start/stop
races.

Luis

> 
> You can see the bug in action in this detailed trace I got today, with
> the use of in-kernel trace_printk together with bpftrace and
> bpf_trace_printk and captured using ftrace_dump_on_oops (running the
> same reproducer on a 8 CPU KVM system):
> [  103.227679]     rtla-2325      2..... 98242607us :
> osnoise_workload_start: tlat_var_reset: cpu=0,
> tlat_var=0000000034996efa
> [  103.227844]     rtla-2325      2..... 98242607us :
> osnoise_workload_start: tlat_var_reset: cpu=1,
> tlat_var=0000000062d1f520
> [  103.228009]     rtla-2325      2..... 98242607us :
> osnoise_workload_start: tlat_var_reset: cpu=2,
> tlat_var=0000000002e262d8
> [  103.228173]     rtla-2325      2..... 98242607us :
> osnoise_workload_start: tlat_var_reset: cpu=3,
> tlat_var=00000000b9d3a280
> [  103.228349]     rtla-2325      2..... 98242607us :
> osnoise_workload_start: tlat_var_reset: cpu=4,
> tlat_var=000000000fba7a6e
> [  103.228510]     rtla-2325      2..... 98242608us :
> osnoise_workload_start: tlat_var_reset: cpu=5,
> tlat_var=000000008ed0b87f
> [  103.228670]     rtla-2325      2..... 98242608us :
> osnoise_workload_start: tlat_var_reset: cpu=6,
> tlat_var=00000000a02fe267
> [  103.228831]     rtla-2325      2..... 98242608us :
> osnoise_workload_start: tlat_var_reset: cpu=7,
> tlat_var=000000008e1d5095
> [  103.228992]     rtla-2327      3...11 98256551us :
> bpf_trace_printk: rtla pid 2325 forks into 2328
> [  103.229156]     rtla-2327      3...11 98256873us :
> bpf_trace_printk: rtla pid 2325 forks into 2329
> [  103.229325]     rtla-2327      3...11 98257124us :
> bpf_trace_printk: rtla pid 2325 forks into 2330
> [  103.229489]     rtla-2327      3...11 98257353us :
> bpf_trace_printk: rtla pid 2325 forks into 2331
> [  103.229651]     rtla-2327      3...11 98257574us :
> bpf_trace_printk: rtla pid 2325 forks into 2332
> [  103.229810]     rtla-2327      3...11 98257801us :
> bpf_trace_printk: rtla pid 2325 forks into 2333
> [  103.229968]     rtla-2327      3...11 98258008us :
> bpf_trace_printk: rtla pid 2325 forks into 2334
> [  103.230124]     rtla-2327      3...11 98258216us :
> bpf_trace_printk: rtla pid 2325 forks into 2335
> [  103.230293]     bash-2094      0d..31 102987914us :
> bpf_trace_printk: rtla pid 2325 received signal 2
> [  103.230452]     bash-2094      0d..31 102987915us :
> bpf_trace_printk: ...from bash pid 2094
> [  103.230609]     bash-2094      0d..31 102990000us :
> bpf_trace_printk: rtla pid 2325 received signal 15
> [  103.230766]     bash-2094      0d..31 102990001us :
> bpf_trace_printk: ...from bash pid 2094
> [  103.230922]     rtla-2325      2...21 102990010us :
> bpf_trace_printk: rtla pid 2325 exiting
> [  103.231079]     rtla-2327      3...21 102990503us :
> bpf_trace_printk: rtla pid 2327 exiting
> [  103.231242]     rtla-2337      3d..31 103232021us :
> bpf_trace_printk: timerlatu/0 pid 2328 received signal 9
> [  103.231399]     rtla-2337      3d..31 103232022us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.231554]     rtla-2337      3d..31 103232024us :
> bpf_trace_printk: timerlatu/1 pid 2329 received signal 9
> [  103.231711]     rtla-2337      3d..31 103232025us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.231866]     rtla-2337      3d..31 103232029us :
> bpf_trace_printk: timerlatu/2 pid 2330 received signal 9
> [  103.232022]     rtla-2337      3d..31 103232030us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.232178]     rtla-2337      3dN.31 103232031us :
> bpf_trace_printk: timerlatu/3 pid 2331 received signal 9
> [  103.232347]     rtla-2337      3dN.31 103232031us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.232506]     rtla-2337      3dN.31 103232034us :
> bpf_trace_printk: timerlatu/4 pid 2332 received signal 9
> [  103.232685]     rtla-2337      3dN.31 103232034us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.232846]     rtla-2337      3dN.31 103232039us :
> bpf_trace_printk: timerlatu/5 pid 2333 received signal 9
> [  103.233010]     rtla-2337      3dN.31 103232039us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.233171]     rtla-2337      3dN.31 103232042us :
> bpf_trace_printk: timerlatu/6 pid 2334 received signal 9
> [  103.233357]     rtla-2337      3dN.31 103232042us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.233517]     rtla-2337      3dN.31 103232046us :
> bpf_trace_printk: timerlatu/7 pid 2335 received signal 9
> [  103.233678]     rtla-2337      3dN.31 103232046us :
> bpf_trace_printk: ...from rtla pid 2337
> [  103.233835] timerlat-2331      3...21 103232701us :
> bpf_trace_printk: timerlatu/3 pid 2331 exiting
> [  103.233995] timerlat-2329      1...21 103232704us :
> bpf_trace_printk: timerlatu/1 pid 2329 exiting
> [  103.234155] timerlat-2331      3....1 103232705us :
> timerlat_fd_release: timerlat_fd_release: cpu=3,
> tlat_var=000000000fba7a6e, osn_var->pid=2331
> [  103.234357] timerlat-2329      1....1 103232706us :
> timerlat_fd_release: timerlat_fd_release: cpu=1,
> tlat_var=0000000002e262d8, osn_var->pid=2329
> [  103.234508] timerlat-2334      6...21 103232821us :
> bpf_trace_printk: timerlatu/6 pid 2334 exiting
> [  103.234657] timerlat-2334      6....1 103232826us :
> timerlat_fd_release: timerlat_fd_release: cpu=6,
> tlat_var=000000008e1d5095, osn_var->pid=2334
> [  103.234813] timerlat-2330      2...21 103232850us :
> bpf_trace_printk: timerlatu/2 pid 2330 exiting
> [  103.234966] timerlat-2330      2....1 103232853us :
> timerlat_fd_release: timerlat_fd_release: cpu=2,
> tlat_var=00000000b9d3a280, osn_var->pid=2330
> [  103.235122] timerlat-2333      5...21 103232856us :
> bpf_trace_printk: timerlatu/5 pid 2333 exiting
> [  103.235291] timerlat-2333      5....1 103232858us :
> timerlat_fd_release: timerlat_fd_release: cpu=5,
> tlat_var=00000000a02fe267, osn_var->pid=2333
> [  103.235452] timerlat-2332      4...21 103232861us :
> bpf_trace_printk: timerlatu/4 pid 2332 exiting
> [  103.235610] timerlat-2332      4....1 103232864us :
> timerlat_fd_release: timerlat_fd_release: cpu=4,
> tlat_var=000000008ed0b87f, osn_var->pid=2332
> [  103.235769] timerlat-2335      7...21 103232867us :
> bpf_trace_printk: timerlatu/7 pid 2335 exiting
> [  103.235932] timerlat-2335      7....1 103232869us :
> timerlat_fd_release: timerlat_fd_release: cpu=7,
> tlat_var=000000003113e297, osn_var->pid=2335
> [  103.236103]     rtla-2337      3..... 103234109us :
> osnoise_workload_start: tlat_var_reset: cpu=0,
> tlat_var=0000000034996efa
> [  103.236316]     rtla-2337      3..... 103234109us :
> osnoise_workload_start: tlat_var_reset: cpu=1,
> tlat_var=0000000062d1f520
> [  103.236473]     rtla-2337      3..... 103234109us :
> osnoise_workload_start: tlat_var_reset: cpu=2,
> tlat_var=0000000002e262d8
> [  103.236630]     rtla-2337      3..... 103234110us :
> osnoise_workload_start: tlat_var_reset: cpu=3,
> tlat_var=00000000b9d3a280
> [  103.236797]     rtla-2337      3..... 103234110us :
> osnoise_workload_start: tlat_var_reset: cpu=4,
> tlat_var=000000000fba7a6e
> [  103.236972]     rtla-2337      3..... 103234110us :
> osnoise_workload_start: tlat_var_reset: cpu=5,
> tlat_var=000000008ed0b87f
> [  103.237144]     rtla-2337      3..... 103234110us :
> osnoise_workload_start: tlat_var_reset: cpu=6,
> tlat_var=00000000a02fe267
> [  103.237345]     rtla-2337      3..... 103234110us :
> osnoise_workload_start: tlat_var_reset: cpu=7,
> tlat_var=000000008e1d5095
> [  103.237516] timerlat-2328      0...21 103234149us :
> bpf_trace_printk: timerlatu/0 pid 2328 exiting
> [  103.237690] timerlat-2328      0....1 103234152us :
> timerlat_fd_release: timerlat_fd_release: cpu=0,
> tlat_var=0000000062d1f520, osn_var->pid=0
> [  103.237871] ---------------------------------
> [  103.238053] CR2: 0000000000000010
> 
> On the kernel side, I'm using similar trace_printk debug prints as you
> did; the bpftrace program used is this:
> uretprobe:libc:fork
> /strncmp(comm, "rtla", 5) == 0 && retval != 0/
> {
>         debugf("rtla pid %d forks into %d", pid, retval)
> }
> tracepoint:signal:signal_generate
> /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
> "timerlatu", 9) == 0/ {
>         debugf("%s pid %d received signal %d", args->comm, args->pid,
> args->sig);
>         debugf("...from %s pid %d", comm, pid);
> }
> tracepoint:sched:sched_process_exit
> /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
> "timerlatu", 9) == 0/ {
>         debugf("%s pid %d exiting", args->comm, args->pid)
> }
> 
> > Anyway, the start and stop should probably just pause the threads and not
> > kill them an start them again. That is, the osnoise_workload_start() should
> > be called by the init callbacks and the osnoise_workload_stop should be
> > called by reset callback.
> >
> > The start and stop callbacks should just pause and restart the the threads.
> >
> 
> I'm not sure if that fits into the design of timerlat. At least for
> timerlat with user workload, each activation of the timerlat tracer
> implies the creation of a new user workload, not the re-activation of
> a lingering one (that might be an entirely different program). A
> timerlat tracer is tightly associated with its threads: if the threads
> exit, the timer itself ceases to have meaning. Of course also rtla
> should make sure there are no lingering timerlatu processes.
> 
> I think waiting on the threads to actually exit in stop_kthread() is
> the proper solution:
> 
> /*
>  * stop_kthread - stop a workload thread
>  */
> static void stop_kthread(unsigned int cpu){
>     struct task_struct *kthread;
> 
>     kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
>     if (kthread) {
>         if (test_bit(OSN_WORKLOAD, &osnoise_options)) {
>             kthread_stop(kthread);
>         } else {
>             /*
>              * This is a user thread waiting on the timerlat_fd. We need
>              * to close all users, and the best way to guarantee this is
>              * by killing the thread. NOTE: this is a purpose specific file.
>              */
>             kill_pid(kthread->thread_pid, SIGKILL, 1);
>             /* ^^ here wait until kthread (actually the user workload) exits */
>             put_task_struct(kthread);
>         }
>         per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
>     }
> 
> There might even be another bug, not sure if calling put_task_struct
> on a task that is still exiting is sane.
> 
> Tomas
> 
---end quoted text---
Luis Claudio R. Goncalves Aug. 30, 2024, 12:31 a.m. UTC | #13
On Thu, Aug 29, 2024 at 08:40:25PM -0300, Luis Claudio R. Goncalves wrote:
> On Tue, Aug 27, 2024 at 04:34:39PM +0200, Tomas Glozar wrote:
> > po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> > >
> > > Yeah, I think I finally found the real issue. I don't think we need the ref
> > > counting. The problem is the creating and killing of the threads via the
> > > start and stop callbacks. That's not their purpose. The purpose of stop
> > > and start callbacks is when tracing_on is set to off and back on again. I
> > > think this is what is racing with the close.
> > >
> > 
> > I believe I discovered the exact mechanism of the bug. Since rtla is
> > killed with SIGTERM, it leaves the timerlat tracer active and its
> > timerlatu processes running. The rtla from the next iteration of the
> > while loop in the reproducer deactivates the timerlat tracer,
> > triggering the killing of the timerlatu processes. However, this
> > killing is asynchronous, and while the processes are gradually dying
> > one by one and closing their fds, the rtla from the next iteration is
> > already activating the timerlat tracer again. One of the timerlatu
> > processes does not die fast enough to release the fd before the
> > tlat_var is cleared in osnoise_workload_start, triggering the panic.
> 
> With this in mind I tried a simpler approach. I used a kernel without any
> extra patches from Steven or ourselves, just used the code below. Not
> saying it is a definitive solution, but as a proof of concept I have two
> boxes running the loop used to reproduce the problem and so far not a
> single problem. The patch:

Premature commemoration. One of the boxes rebooted a minute ago. Sounds
like I just diminished a bit the window for triggering the problem.
Tomorrow I will run tests with all the suggested patches and work of a
better idea, more comprehensive, to address what Tomas described above.
 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 66a871553d4a1..331c69bc7b0d4 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -2239,8 +2239,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
>  	 */
>  	mutex_lock(&trace_types_lock);
>  	running = osnoise_has_registered_instances();
> -	if (running)
> +	if (running) {
>  		stop_per_cpu_kthreads();
> +		mutex_unlock(&trace_types_lock);
> +		return -EBUSY;
> +	}
>  
>  	mutex_lock(&interface_lock);
>  	/*
> @@ -2579,7 +2582,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
>  	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>  	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
>  
> -	hrtimer_cancel(&tlat_var->timer);
> +	if (tlat_var->kthread)
> +		hrtimer_cancel(&tlat_var->timer);
>  	memset(tlat_var, 0, sizeof(*tlat_var));
>  
>  	osn_var->sampling = 0;
> ---
> 
> In short, do not start new instances if the old one still registered. And
> only cancel the timer if the timerlat per cpu var has not been zeroed-out.
> 
> Again, not a final patch but a simpler approach to protect start/stop
> races.
> 
> Luis
> 
> > 
> > You can see the bug in action in this detailed trace I got today, with
> > the use of in-kernel trace_printk together with bpftrace and
> > bpf_trace_printk and captured using ftrace_dump_on_oops (running the
> > same reproducer on a 8 CPU KVM system):
> > [  103.227679]     rtla-2325      2..... 98242607us :
> > osnoise_workload_start: tlat_var_reset: cpu=0,
> > tlat_var=0000000034996efa
> > [  103.227844]     rtla-2325      2..... 98242607us :
> > osnoise_workload_start: tlat_var_reset: cpu=1,
> > tlat_var=0000000062d1f520
> > [  103.228009]     rtla-2325      2..... 98242607us :
> > osnoise_workload_start: tlat_var_reset: cpu=2,
> > tlat_var=0000000002e262d8
> > [  103.228173]     rtla-2325      2..... 98242607us :
> > osnoise_workload_start: tlat_var_reset: cpu=3,
> > tlat_var=00000000b9d3a280
> > [  103.228349]     rtla-2325      2..... 98242607us :
> > osnoise_workload_start: tlat_var_reset: cpu=4,
> > tlat_var=000000000fba7a6e
> > [  103.228510]     rtla-2325      2..... 98242608us :
> > osnoise_workload_start: tlat_var_reset: cpu=5,
> > tlat_var=000000008ed0b87f
> > [  103.228670]     rtla-2325      2..... 98242608us :
> > osnoise_workload_start: tlat_var_reset: cpu=6,
> > tlat_var=00000000a02fe267
> > [  103.228831]     rtla-2325      2..... 98242608us :
> > osnoise_workload_start: tlat_var_reset: cpu=7,
> > tlat_var=000000008e1d5095
> > [  103.228992]     rtla-2327      3...11 98256551us :
> > bpf_trace_printk: rtla pid 2325 forks into 2328
> > [  103.229156]     rtla-2327      3...11 98256873us :
> > bpf_trace_printk: rtla pid 2325 forks into 2329
> > [  103.229325]     rtla-2327      3...11 98257124us :
> > bpf_trace_printk: rtla pid 2325 forks into 2330
> > [  103.229489]     rtla-2327      3...11 98257353us :
> > bpf_trace_printk: rtla pid 2325 forks into 2331
> > [  103.229651]     rtla-2327      3...11 98257574us :
> > bpf_trace_printk: rtla pid 2325 forks into 2332
> > [  103.229810]     rtla-2327      3...11 98257801us :
> > bpf_trace_printk: rtla pid 2325 forks into 2333
> > [  103.229968]     rtla-2327      3...11 98258008us :
> > bpf_trace_printk: rtla pid 2325 forks into 2334
> > [  103.230124]     rtla-2327      3...11 98258216us :
> > bpf_trace_printk: rtla pid 2325 forks into 2335
> > [  103.230293]     bash-2094      0d..31 102987914us :
> > bpf_trace_printk: rtla pid 2325 received signal 2
> > [  103.230452]     bash-2094      0d..31 102987915us :
> > bpf_trace_printk: ...from bash pid 2094
> > [  103.230609]     bash-2094      0d..31 102990000us :
> > bpf_trace_printk: rtla pid 2325 received signal 15
> > [  103.230766]     bash-2094      0d..31 102990001us :
> > bpf_trace_printk: ...from bash pid 2094
> > [  103.230922]     rtla-2325      2...21 102990010us :
> > bpf_trace_printk: rtla pid 2325 exiting
> > [  103.231079]     rtla-2327      3...21 102990503us :
> > bpf_trace_printk: rtla pid 2327 exiting
> > [  103.231242]     rtla-2337      3d..31 103232021us :
> > bpf_trace_printk: timerlatu/0 pid 2328 received signal 9
> > [  103.231399]     rtla-2337      3d..31 103232022us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.231554]     rtla-2337      3d..31 103232024us :
> > bpf_trace_printk: timerlatu/1 pid 2329 received signal 9
> > [  103.231711]     rtla-2337      3d..31 103232025us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.231866]     rtla-2337      3d..31 103232029us :
> > bpf_trace_printk: timerlatu/2 pid 2330 received signal 9
> > [  103.232022]     rtla-2337      3d..31 103232030us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.232178]     rtla-2337      3dN.31 103232031us :
> > bpf_trace_printk: timerlatu/3 pid 2331 received signal 9
> > [  103.232347]     rtla-2337      3dN.31 103232031us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.232506]     rtla-2337      3dN.31 103232034us :
> > bpf_trace_printk: timerlatu/4 pid 2332 received signal 9
> > [  103.232685]     rtla-2337      3dN.31 103232034us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.232846]     rtla-2337      3dN.31 103232039us :
> > bpf_trace_printk: timerlatu/5 pid 2333 received signal 9
> > [  103.233010]     rtla-2337      3dN.31 103232039us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.233171]     rtla-2337      3dN.31 103232042us :
> > bpf_trace_printk: timerlatu/6 pid 2334 received signal 9
> > [  103.233357]     rtla-2337      3dN.31 103232042us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.233517]     rtla-2337      3dN.31 103232046us :
> > bpf_trace_printk: timerlatu/7 pid 2335 received signal 9
> > [  103.233678]     rtla-2337      3dN.31 103232046us :
> > bpf_trace_printk: ...from rtla pid 2337
> > [  103.233835] timerlat-2331      3...21 103232701us :
> > bpf_trace_printk: timerlatu/3 pid 2331 exiting
> > [  103.233995] timerlat-2329      1...21 103232704us :
> > bpf_trace_printk: timerlatu/1 pid 2329 exiting
> > [  103.234155] timerlat-2331      3....1 103232705us :
> > timerlat_fd_release: timerlat_fd_release: cpu=3,
> > tlat_var=000000000fba7a6e, osn_var->pid=2331
> > [  103.234357] timerlat-2329      1....1 103232706us :
> > timerlat_fd_release: timerlat_fd_release: cpu=1,
> > tlat_var=0000000002e262d8, osn_var->pid=2329
> > [  103.234508] timerlat-2334      6...21 103232821us :
> > bpf_trace_printk: timerlatu/6 pid 2334 exiting
> > [  103.234657] timerlat-2334      6....1 103232826us :
> > timerlat_fd_release: timerlat_fd_release: cpu=6,
> > tlat_var=000000008e1d5095, osn_var->pid=2334
> > [  103.234813] timerlat-2330      2...21 103232850us :
> > bpf_trace_printk: timerlatu/2 pid 2330 exiting
> > [  103.234966] timerlat-2330      2....1 103232853us :
> > timerlat_fd_release: timerlat_fd_release: cpu=2,
> > tlat_var=00000000b9d3a280, osn_var->pid=2330
> > [  103.235122] timerlat-2333      5...21 103232856us :
> > bpf_trace_printk: timerlatu/5 pid 2333 exiting
> > [  103.235291] timerlat-2333      5....1 103232858us :
> > timerlat_fd_release: timerlat_fd_release: cpu=5,
> > tlat_var=00000000a02fe267, osn_var->pid=2333
> > [  103.235452] timerlat-2332      4...21 103232861us :
> > bpf_trace_printk: timerlatu/4 pid 2332 exiting
> > [  103.235610] timerlat-2332      4....1 103232864us :
> > timerlat_fd_release: timerlat_fd_release: cpu=4,
> > tlat_var=000000008ed0b87f, osn_var->pid=2332
> > [  103.235769] timerlat-2335      7...21 103232867us :
> > bpf_trace_printk: timerlatu/7 pid 2335 exiting
> > [  103.235932] timerlat-2335      7....1 103232869us :
> > timerlat_fd_release: timerlat_fd_release: cpu=7,
> > tlat_var=000000003113e297, osn_var->pid=2335
> > [  103.236103]     rtla-2337      3..... 103234109us :
> > osnoise_workload_start: tlat_var_reset: cpu=0,
> > tlat_var=0000000034996efa
> > [  103.236316]     rtla-2337      3..... 103234109us :
> > osnoise_workload_start: tlat_var_reset: cpu=1,
> > tlat_var=0000000062d1f520
> > [  103.236473]     rtla-2337      3..... 103234109us :
> > osnoise_workload_start: tlat_var_reset: cpu=2,
> > tlat_var=0000000002e262d8
> > [  103.236630]     rtla-2337      3..... 103234110us :
> > osnoise_workload_start: tlat_var_reset: cpu=3,
> > tlat_var=00000000b9d3a280
> > [  103.236797]     rtla-2337      3..... 103234110us :
> > osnoise_workload_start: tlat_var_reset: cpu=4,
> > tlat_var=000000000fba7a6e
> > [  103.236972]     rtla-2337      3..... 103234110us :
> > osnoise_workload_start: tlat_var_reset: cpu=5,
> > tlat_var=000000008ed0b87f
> > [  103.237144]     rtla-2337      3..... 103234110us :
> > osnoise_workload_start: tlat_var_reset: cpu=6,
> > tlat_var=00000000a02fe267
> > [  103.237345]     rtla-2337      3..... 103234110us :
> > osnoise_workload_start: tlat_var_reset: cpu=7,
> > tlat_var=000000008e1d5095
> > [  103.237516] timerlat-2328      0...21 103234149us :
> > bpf_trace_printk: timerlatu/0 pid 2328 exiting
> > [  103.237690] timerlat-2328      0....1 103234152us :
> > timerlat_fd_release: timerlat_fd_release: cpu=0,
> > tlat_var=0000000062d1f520, osn_var->pid=0
> > [  103.237871] ---------------------------------
> > [  103.238053] CR2: 0000000000000010
> > 
> > On the kernel side, I'm using similar trace_printk debug prints as you
> > did; the bpftrace program used is this:
> > uretprobe:libc:fork
> > /strncmp(comm, "rtla", 5) == 0 && retval != 0/
> > {
> >         debugf("rtla pid %d forks into %d", pid, retval)
> > }
> > tracepoint:signal:signal_generate
> > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
> > "timerlatu", 9) == 0/ {
> >         debugf("%s pid %d received signal %d", args->comm, args->pid,
> > args->sig);
> >         debugf("...from %s pid %d", comm, pid);
> > }
> > tracepoint:sched:sched_process_exit
> > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
> > "timerlatu", 9) == 0/ {
> >         debugf("%s pid %d exiting", args->comm, args->pid)
> > }
> > 
> > > Anyway, the start and stop should probably just pause the threads and not
> > > kill them an start them again. That is, the osnoise_workload_start() should
> > > be called by the init callbacks and the osnoise_workload_stop should be
> > > called by reset callback.
> > >
> > > The start and stop callbacks should just pause and restart the the threads.
> > >
> > 
> > I'm not sure if that fits into the design of timerlat. At least for
> > timerlat with user workload, each activation of the timerlat tracer
> > implies the creation of a new user workload, not the re-activation of
> > a lingering one (that might be an entirely different program). A
> > timerlat tracer is tightly associated with its threads: if the threads
> > exit, the timer itself ceases to have meaning. Of course also rtla
> > should make sure there are no lingering timerlatu processes.
> > 
> > I think waiting on the threads to actually exit in stop_kthread() is
> > the proper solution:
> > 
> > /*
> >  * stop_kthread - stop a workload thread
> >  */
> > static void stop_kthread(unsigned int cpu){
> >     struct task_struct *kthread;
> > 
> >     kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
> >     if (kthread) {
> >         if (test_bit(OSN_WORKLOAD, &osnoise_options)) {
> >             kthread_stop(kthread);
> >         } else {
> >             /*
> >              * This is a user thread waiting on the timerlat_fd. We need
> >              * to close all users, and the best way to guarantee this is
> >              * by killing the thread. NOTE: this is a purpose specific file.
> >              */
> >             kill_pid(kthread->thread_pid, SIGKILL, 1);
> >             /* ^^ here wait until kthread (actually the user workload) exits */
> >             put_task_struct(kthread);
> >         }
> >         per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
> >     }
> > 
> > There might even be another bug, not sure if calling put_task_struct
> > on a task that is still exiting is sane.
> > 
> > Tomas
> > 
> ---end quoted text---
---end quoted text---
Tomas Glozar Sept. 3, 2024, 12:47 p.m. UTC | #14
út 27. 8. 2024 v 16:34 odesílatel Tomas Glozar <tglozar@redhat.com> napsal:
>
> I think waiting on the threads to actually exit in stop_kthread() is
> the proper solution:
>
> /*
>  * stop_kthread - stop a workload thread
>  */
> static void stop_kthread(unsigned int cpu){
>     struct task_struct *kthread;
>
>     kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
>     if (kthread) {
>         if (test_bit(OSN_WORKLOAD, &osnoise_options)) {
>             kthread_stop(kthread);
>         } else {
>             /*
>              * This is a user thread waiting on the timerlat_fd. We need
>              * to close all users, and the best way to guarantee this is
>              * by killing the thread. NOTE: this is a purpose specific file.
>              */
>             kill_pid(kthread->thread_pid, SIGKILL, 1);
>             /* ^^ here wait until kthread (actually the user workload) exits */
>             put_task_struct(kthread);
>         }
>         per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
>     }
>

I tried a patch that waits on the user workload to exit when doing
osnoise_workload_stop, similarly to what we already do for the kernel
workload. I added a wait_on_completion in osnoise_workload_stop that
is completed in timerlat_fd_release, mimicking what kthread_stop()
does. But that deadlocks if the user workload process attempts to free
a file descriptor taking trace_types_lock before the timerlat one,
since the caller of osnoise_workload_stop() is also holding
trace_types_lock. There seems to be no way to prevent this deadlock,
since the user workload, unlike the kthread which has a fixed program,
is free to open anything it wants.

I suggest using the workaround suggested in
https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home
for the time being. Together with the patch that adds locks around
stopping the threads
(https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/),
this should prevent the kernel panic at least until we have a solution
for the race itself.

Tomas




Tomas
Steven Rostedt Sept. 3, 2024, 3 p.m. UTC | #15
On Tue, 3 Sep 2024 14:47:42 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> I suggest using the workaround suggested in
> https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home
> for the time being. Together with the patch that adds locks around
> stopping the threads
> (https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/),
> this should prevent the kernel panic at least until we have a solution
> for the race itself.

I'll make the first one an official patch and start running it through my
tests. We need something soon, as this is a standing bug in the kernel
where the fix needs to go to stable.

Thanks,

-- Steve
Steven Rostedt Sept. 4, 2024, 2:21 p.m. UTC | #16
On Tue, 3 Sep 2024 14:47:42 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> I suggest using the workaround suggested in
> https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home
> for the time being. Together with the patch that adds locks around
> stopping the threads
> (https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/),
> this should prevent the kernel panic at least until we have a solution
> for the race itself.

When running my tests, the second one would end up deadlocking and
triggering lockdep. I found a way to do basically the same thing with a
cpumask and no added locking. I'm currently testing it and will be sending
out a patch later today (if it passes the tests).

-- Steve
Tomas Glozar Sept. 5, 2024, 10:38 a.m. UTC | #17
st 4. 9. 2024 v 16:23 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
>
> When running my tests, the second one would end up deadlocking and
> triggering lockdep. I found a way to do basically the same thing with a
> cpumask and no added locking. I'm currently testing it and will be sending
> out a patch later today (if it passes the tests).
>
> -- Steve
>

Oh that's unfortunate. Your cpumask patch does protect from timerlat
trying to stop a user workload thread via kthread_stop, but I don't
think it prevents this race in the code from the other patch:

static int timerlat_fd_release(struct inode *inode, struct file *file)
{
    ...
    mutex_lock(&interface_lock);
    <-- this can now run at the same time as tlat_var_reset(), since
the latter is not done under interface_lock in the latest version
    ...
if (tlat_var->kthread)
        <-- if tlat_var is zeroed here, we still panic on NULL dereference
hrtimer_cancel(&tlat_var->timer);
    ...
}

Either way, the results are much better: before, the kernel panicked
in <10 iterations of the while loop of the reproducer; with the
cpumask patch and the tlat_var->kthread check patch, it has been
running on my test VM for a few hours already with no panic.

Tomas
Steven Rostedt Sept. 5, 2024, 12:31 p.m. UTC | #18
On Thu, 5 Sep 2024 12:38:37 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> st 4. 9. 2024 v 16:23 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> >
> > When running my tests, the second one would end up deadlocking and
> > triggering lockdep. I found a way to do basically the same thing with a
> > cpumask and no added locking. I'm currently testing it and will be sending
> > out a patch later today (if it passes the tests).
> >
> > -- Steve
> >  
> 
> Oh that's unfortunate. Your cpumask patch does protect from timerlat
> trying to stop a user workload thread via kthread_stop, but I don't
> think it prevents this race in the code from the other patch:
> 
> static int timerlat_fd_release(struct inode *inode, struct file *file)
> {
>     ...
>     mutex_lock(&interface_lock);
>     <-- this can now run at the same time as tlat_var_reset(), since
> the latter is not done under interface_lock in the latest version
>     ...
> if (tlat_var->kthread)
>         <-- if tlat_var is zeroed here, we still panic on NULL dereference
> hrtimer_cancel(&tlat_var->timer);
>     ...
> }

Right, but my other patch was protecting the osn variable too, which I
don't think is necessary.

> 
> Either way, the results are much better: before, the kernel panicked
> in <10 iterations of the while loop of the reproducer; with the
> cpumask patch and the tlat_var->kthread check patch, it has been
> running on my test VM for a few hours already with no panic.

I think the hrtimer change is missing this:

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 8543d941b870..8e611dcee68a 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -259,6 +259,9 @@ static inline void tlat_var_reset(void)
 {
 	struct timerlat_variables *tlat_var;
 	int cpu;
+
+	/* Synchronize with the timerlat interfaces */
+	mutex_lock(&interface_lock);
 	/*
 	 * So far, all the values are initialized as 0, so
 	 * zeroing the structure is perfect.
@@ -269,6 +272,7 @@ static inline void tlat_var_reset(void)
 			hrtimer_cancel(&tlat_var->timer);
 		memset(tlat_var, 0, sizeof(*tlat_var));
 	}
+	mutex_unlock(&interface_lock);
 }
 #else /* CONFIG_TIMERLAT_TRACER */
 #define tlat_var_reset()	do {} while (0)

I'll add it and start testing it. I haven't pushed to Linus yet.

-- Steve
diff mbox series

Patch

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..6d2b39da4dce 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2572,6 +2572,7 @@  static int timerlat_fd_release(struct inode *inode, struct file *file)
 	struct osnoise_variables *osn_var;
 	struct timerlat_variables *tlat_var;
 	long cpu = (long) file->private_data;
+	int ret = 0;
 
 	migrate_disable();
 	mutex_lock(&interface_lock);
@@ -2579,6 +2580,12 @@  static int timerlat_fd_release(struct inode *inode, struct file *file)
 	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
 	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
+	if (!tlat_var->kthread) {
+		/* the fd has been closed already */
+		ret = -EBADF;
+		goto out;
+	}
+
 	hrtimer_cancel(&tlat_var->timer);
 	memset(tlat_var, 0, sizeof(*tlat_var));
 
@@ -2593,9 +2600,10 @@  static int timerlat_fd_release(struct inode *inode, struct file *file)
 		osn_var->kthread = NULL;
 	}
 
+out:
 	mutex_unlock(&interface_lock);
 	migrate_enable();
-	return 0;
+	return ret;
 }
 #endif