Message ID | 20240820130001.124768-1-tglozar@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release | expand |
On Tue, 20 Aug 2024 15:00:01 +0200 tglozar@redhat.com wrote: > From: Tomas Glozar <tglozar@redhat.com> > > When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD), > NULL pointer dereference can be triggered by sending consequent SIGINT > and SIGTERM signals to the workload process. That then causes > timerlat_fd_release to be called twice in a row, and the second time, > hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL > dereference. > > This can be reproduced using rtla: > ``` > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done > [1] 1675 > [1]+ Aborted (SIGTERM) rtla timerlat top -u -q > [1] 1688 > client_loop: send disconnect: Broken pipe > ``` > triggering the bug: I'm able to reproduce this with the above. Unfortunately, I can still reproduce it after applying this patch :-( Looking at the code, the logic for handling the kthread seems off. I'll spend a little time to see if I can figure it out. -- Steve
st 21. 8. 2024 v 22:02 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > I'm able to reproduce this with the above. Unfortunately, I can still > reproduce it after applying this patch :-( > Thank you for looking at this. I was at first not too sure about whether this is the proper fix, but after some discussion with Luis (in CC), we have come to the conclusion that the double-close of the timerlat_fd might be a possible explanation, and this patch worked for both of us. Are you reproducing the same bug (NULL pointer dereference in hrtimer_active) with the patch? IIUC that should not happen anymore since the patch explicitly checks for zero in the hrtimer structure. I have caught however a different panic in addition to the one reported above while testing "rtla: Support idle state disabling via libcpupower in timerlat" on an EL9 RT kernel: BUG: kernel NULL pointer dereference, address: 0000000000000014 CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G W ------- --- 5.14.0-452.el9.x86_64+rt #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:task_dump_owner+0x3d/0x100 RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580 RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80 R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8 FS: 00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> ? show_trace_log_lvl+0x1c4/0x2df ? show_trace_log_lvl+0x1c4/0x2df ? proc_pid_make_inode+0xa0/0x110 ? __die_body.cold+0x8/0xd ? page_fault_oops+0x140/0x180 ? do_user_addr_fault+0x61/0x690 ? kvm_read_and_reset_apf_flags+0x45/0x60 ? exc_page_fault+0x65/0x180 ? asm_exc_page_fault+0x22/0x30 ? task_dump_owner+0x3d/0x100 ? task_dump_owner+0x36/0x100 proc_pid_make_inode+0xa0/0x110 proc_pid_instantiate+0x21/0xb0 proc_pid_lookup+0x95/0x170 proc_root_lookup+0x1d/0x50 __lookup_slow+0x9c/0x150 walk_component+0x158/0x1d0 link_path_walk.part.0.constprop.0+0x24e/0x3c0 ? path_init+0x326/0x4d0 path_openat+0xb1/0x280 do_filp_open+0xb2/0x160 ? migrate_enable+0xd5/0x150 ? rt_spin_unlock+0x13/0x40 do_sys_openat2+0x96/0xd0 __x64_sys_openat+0x53/0xa0 ... </TASK> This was preceded by a WARN: refcount_t: addition on 0; use-after-free. WARNING: CPU: 6 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110 CPU: 6 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-452.el9.x86_64+rt #1 RIP: 0010:refcount_warn_saturate+0x74/0x110 Call Trace: <TASK> [ 78.184877] proc_pid_lookup+0x161/0x170 [ 78.184883] proc_root_lookup+0x1d/0x50 [ 78.184890] __lookup_slow+0x9c/0x150 [ 78.184899] walk_component+0x158/0x1d0 [ 78.184908] link_path_walk.part.0.constprop.0+0x24e/0x3c0 [ 78.184915] ? path_init+0x326/0x4d0 [ 78.184922] path_openat+0xb1/0x280 [ 78.184926] do_filp_open+0xb2/0x160 [ 78.184934] ? migrate_enable+0xd5/0x150 [ 78.184942] ? rt_spin_unlock+0x13/0x40 [ 78.184950] do_sys_openat2+0x96/0xd0 [ 78.184958] __x64_sys_openat+0x53/0xa0 [ 78.184964] do_syscall_64+0x5c/0xf0 [ 78.185011] entry_SYSCALL_64_after_hwframe+0x6e/0x76 ... </TASK> > Looking at the code, the logic for handling the kthread seems off. I'll > spend a little time to see if I can figure it out. > Yeah, it seems there might be multiple bugs in the user workload handling, the other NULL pointer dereference and refcount warning above might be related (but I have yet to reproduce it on an upstream kernel). I'm also going to look at the code and will post any findings here. Tomas
On Wed, Aug 21, 2024 at 04:03:16PM -0400, Steven Rostedt wrote: > On Tue, 20 Aug 2024 15:00:01 +0200 > tglozar@redhat.com wrote: > > > From: Tomas Glozar <tglozar@redhat.com> > > > > When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD), > > NULL pointer dereference can be triggered by sending consequent SIGINT > > and SIGTERM signals to the workload process. That then causes > > timerlat_fd_release to be called twice in a row, and the second time, > > hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL > > dereference. > > > > This can be reproduced using rtla: > > ``` > > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ > > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done > > [1] 1675 > > [1]+ Aborted (SIGTERM) rtla timerlat top -u -q > > [1] 1688 > > client_loop: send disconnect: Broken pipe > > ``` > > triggering the bug: > > I'm able to reproduce this with the above. Unfortunately, I can still > reproduce it after applying this patch :-( We were able to mitigate the problem (triggered by that command line) simply by handling SIGTERM in the userspace tool the same way it handles SIGINT. That was one of the factors that helped the "closing the file descriptor twice" theory. Would you mind sharing the backtrace you got? That would also help us investigating. > Looking at the code, the logic for handling the kthread seems off. I'll > spend a little time to see if I can figure it out. You mean the + if (!tlat_var->kthread) { + /* the fd has been closed already */ bit or the kthread handling in rtla itself? As Tomas already said, thank you for testing and reviewing the suggested fix! Luis
On Thu, 22 Aug 2024 11:32:07 +0200 Tomas Glozar <tglozar@redhat.com> wrote: > st 21. 8. 2024 v 22:02 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > > > I'm able to reproduce this with the above. Unfortunately, I can still > > reproduce it after applying this patch :-( > > > > Thank you for looking at this. I was at first not too sure about > whether this is the proper fix, but after some discussion with Luis > (in CC), we have come to the conclusion that the double-close of the > timerlat_fd might be a possible explanation, and this patch worked for > both of us. Are you reproducing the same bug (NULL pointer dereference > in hrtimer_active) with the patch? IIUC that should not happen anymore > since the patch explicitly checks for zero in the hrtimer structure. There isn't a double close. But there are two bugs and you did sorta fix one of them. > > I have caught however a different panic in addition to the one > reported above while testing "rtla: Support idle state disabling via > libcpupower in timerlat" on an EL9 RT kernel: > > BUG: kernel NULL pointer dereference, address: 0000000000000014 > CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G W > ------- --- 5.14.0-452.el9.x86_64+rt #1 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 > 04/01/2014 > RIP: 0010:task_dump_owner+0x3d/0x100 > RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202 > RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580 > RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80 > R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c > R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8 > FS: 00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0 > PKRU: 55555554 > Call Trace: > <TASK> > ? show_trace_log_lvl+0x1c4/0x2df > ? show_trace_log_lvl+0x1c4/0x2df > ? proc_pid_make_inode+0xa0/0x110 > ? __die_body.cold+0x8/0xd > ? page_fault_oops+0x140/0x180 > ? do_user_addr_fault+0x61/0x690 > ? kvm_read_and_reset_apf_flags+0x45/0x60 > ? exc_page_fault+0x65/0x180 > ? asm_exc_page_fault+0x22/0x30 > ? task_dump_owner+0x3d/0x100 > ? task_dump_owner+0x36/0x100 > proc_pid_make_inode+0xa0/0x110 > proc_pid_instantiate+0x21/0xb0 > proc_pid_lookup+0x95/0x170 > proc_root_lookup+0x1d/0x50 > __lookup_slow+0x9c/0x150 > walk_component+0x158/0x1d0 > link_path_walk.part.0.constprop.0+0x24e/0x3c0 > ? path_init+0x326/0x4d0 > path_openat+0xb1/0x280 > do_filp_open+0xb2/0x160 > ? migrate_enable+0xd5/0x150 > ? rt_spin_unlock+0x13/0x40 > do_sys_openat2+0x96/0xd0 > __x64_sys_openat+0x53/0xa0 > ... > Yeah, it seems there might be multiple bugs in the user workload > handling, the other NULL pointer dereference and refcount warning > above might be related (but I have yet to reproduce it on an upstream > kernel). I'm also going to look at the code and will post any findings > here. Yes that is the second bug and it is related to the that this addresses. -- Steve
On Thu, 22 Aug 2024 08:20:52 -0300 "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> wrote: > You mean the > > + if (!tlat_var->kthread) { > + /* the fd has been closed already */ > > bit or the kthread handling in rtla itself? > > As Tomas already said, thank you for testing and reviewing the suggested fix! The close only happens once, but the kthread does get freed elsewhere. I used my new persistent ring buffer to debug this! I'll be posting some more patches that allow trace_printk() and friends to use the persistent ring buffer (or any instance) soon. For this bug, I would recommend instead doing: diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a..53de719f35cb 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2579,7 +2587,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); - hrtimer_cancel(&tlat_var->timer); + if (tlat_var->kthread) + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); osn_var->sampling = 0; -- Steve
On Thu, 22 Aug 2024 10:32:02 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > > Yeah, it seems there might be multiple bugs in the user workload > > handling, the other NULL pointer dereference and refcount warning > > above might be related (but I have yet to reproduce it on an upstream > > kernel). I'm also going to look at the code and will post any findings > > here. > > Yes that is the second bug and it is related to the that this addresses. There's nothing protecting the clearing of the kthreads and calling put_task_struct(). Here's the fix to the second bug: diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a..53de719f35cb 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2106,7 +2106,9 @@ static int osnoise_cpu_init(unsigned int cpu) */ static int osnoise_cpu_die(unsigned int cpu) { + mutex_lock(&interface_lock); stop_kthread(cpu); + mutex_unlock(&interface_lock); return 0; } @@ -2239,8 +2241,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf, */ mutex_lock(&trace_types_lock); running = osnoise_has_registered_instances(); - if (running) + if (running) { + mutex_lock(&interface_lock); stop_per_cpu_kthreads(); + mutex_unlock(&interface_lock); + } mutex_lock(&interface_lock); /* @@ -2355,8 +2360,11 @@ osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count, */ mutex_lock(&trace_types_lock); running = osnoise_has_registered_instances(); - if (running) + if (running) { + mutex_lock(&interface_lock); stop_per_cpu_kthreads(); + mutex_unlock(&interface_lock); + } mutex_lock(&interface_lock); /* @@ -2951,7 +2960,9 @@ static void osnoise_workload_stop(void) */ barrier(); + mutex_lock(&interface_lock); stop_per_cpu_kthreads(); + mutex_unlock(&interface_lock); osnoise_unhook_events(); } With both of these fixes, the bug goes away. I'll add this fix (after enabling lockdep and making sure I didn't screw up the locking). Can you resend this patch with just not calling cancel if kthread is NULL. No need to exit out early. I still like to make sure the clean up happens, and not assume it will already be done. -- Steve
On Tue, 20 Aug 2024 15:00:01 +0200 tglozar@redhat.com wrote: > From: Tomas Glozar <tglozar@redhat.com> > > When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD), > NULL pointer dereference can be triggered by sending consequent SIGINT > and SIGTERM signals to the workload process. That then causes > timerlat_fd_release to be called twice in a row, and the second time, > hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL > dereference. > > This can be reproduced using rtla: > ``` > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done > [1] 1675 > [1]+ Aborted (SIGTERM) rtla timerlat top -u -q > [1] 1688 > client_loop: send disconnect: Broken pipe > ``` > triggering the bug: > ``` > BUG: kernel NULL pointer dereference, address: 0000000000000010 > Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted > 6.10.0-rc2+ #1 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 > 04/01/2014 > RIP: 0010:hrtimer_active+0xd/0x50 > RSP: 0018:ffffa86641567cc0 EFLAGS: 00010286 > RAX: 000000000002e2c0 RBX: ffff994c6bf2e2c8 RCX: ffff994b0911ac18 > RDX: 0000000000000000 RSI: ffff994b02f10700 RDI: ffff994c6bf2e2c8 > RBP: ffff994c6bf2e340 R08: ffff994b158f7400 R09: ffff994b0911ac18 > R10: 0000000000000010 R11: ffff994b00d40f00 R12: ffff994c6bf2e2c8 > R13: ffff994b02049b20 R14: ffff994b011806c0 R15: 0000000000000000 > FS: 0000000000000000(0000) GS:ffff994c6bf00000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000010 CR3: 0000000139020006 CR4: 0000000000770ef0 > PKRU: 55555554 > Call Trace: > <TASK> > ? __die+0x24/0x70 > ? page_fault_oops+0x75/0x170 > ? mt_destroy_walk.isra.0+0x2b3/0x320 > ? exc_page_fault+0x70/0x160 > ? asm_exc_page_fault+0x26/0x30 > ? hrtimer_active+0xd/0x50 > hrtimer_cancel+0x15/0x40 > timerlat_fd_release+0x48/0xe0 > __fput+0xed/0x2c0 > task_work_run+0x59/0x90 > do_exit+0x275/0x4b0 > do_group_exit+0x30/0x80 > get_signal+0x917/0x960 > ? vfs_read+0xb7/0x340 > arch_do_signal_or_restart+0x29/0xf0 > ? syscall_exit_to_user_mode+0x70/0x1f0 > ? syscall_exit_work+0xf3/0x120 > syscall_exit_to_user_mode+0x1a0/0x1f0 > do_syscall_64+0x89/0x160 > ? clear_bhb_loop+0x25/0x80 > ? clear_bhb_loop+0x25/0x80 > ? clear_bhb_loop+0x25/0x80 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7f75790fd9ec > ... > </TASK> > ``` > > Fix the NULL pointer dereference by checking tlat_var->kthread for zero > first in timerlat_fd_release, before calling hrtimer_cancel. > tlat_var->kthread is always non-zero unless the entire tlat_var is zero, > since it is set to the TID of the userspace workload in timerlat_fd_open > under a mutex. > So debugging this a bit more (with my persistent ring buffer!), I realize this is just fixing a symptom and not the real issue (and I added my other patch which fixes the kthread issue). I added trace_printk() for every place the tlat->timer gets initialized, started, canceled and memset to zero (via the memset() function). Those looked like this: for_each_cpu(cpu, cpu_online_mask) { tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); trace_printk("memset 0 %px\n", &tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); } trace_printk("start timer %px for %px\n", &tlat->timer, tlat->kthread); hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD); etc. And here's the result I found: <...>-909 [005] ....1 69.830509: timerlat_fd_release: cancel timer ffff88823c6b5b28 for ffff88810dae6780 <...>-909 [005] ....1 69.830511: timerlat_fd_release: memset 0 ffff88823c6b5b28 <...>-902 [001] ....1 69.831296: timerlat_fd_release: cancel timer ffff88823c4b5b28 for ffff88810dae5640 <...>-902 [001] ....1 69.831300: timerlat_fd_release: memset 0 ffff88823c4b5b28 <...>-901 [000] ....1 69.831319: timerlat_fd_release: cancel timer ffff88823c435b28 for ffff888110b8d640 <...>-901 [000] ....1 69.831323: timerlat_fd_release: memset 0 ffff88823c435b28 <...>-905 [003] ....1 69.831371: timerlat_fd_release: cancel timer ffff88823c5b5b28 for ffff88810dae0000 <...>-905 [003] ....1 69.831375: timerlat_fd_release: memset 0 ffff88823c5b5b28 <...>-904 [002] ....1 69.831390: timerlat_fd_release: cancel timer ffff88823c535b28 for ffff88810dae2280 <...>-904 [002] ....1 69.831394: timerlat_fd_release: memset 0 ffff88823c535b28 <...>-907 [004] ....1 69.831442: timerlat_fd_release: cancel timer ffff88823c635b28 for ffff88810dae33c0 <...>-907 [004] ....1 69.831446: timerlat_fd_release: memset 0 ffff88823c635b28 <...>-911 [007] ....1 69.832689: timerlat_fd_release: cancel timer ffff88823c7b5b28 for ffff88810dae1140 <...>-911 [007] ....1 69.832693: timerlat_fd_release: memset 0 ffff88823c7b5b28 <...>-916 [005] ..... 69.838927: osnoise_workload_start: memset 0 ffff88823c435b28 <...>-916 [005] ..... 69.838931: osnoise_workload_start: memset 0 ffff88823c4b5b28 <...>-916 [005] ..... 69.838932: osnoise_workload_start: memset 0 ffff88823c535b28 <...>-916 [005] ..... 69.838933: osnoise_workload_start: memset 0 ffff88823c5b5b28 <...>-916 [005] ..... 69.838934: osnoise_workload_start: memset 0 ffff88823c635b28 <...>-916 [005] ..... 69.838935: osnoise_workload_start: memset 0 ffff88823c6b5b28 <...>-916 [005] ..... 69.838936: osnoise_workload_start: memset 0 ffff88823c735b28 <...>-916 [005] ..... 69.838938: osnoise_workload_start: memset 0 ffff88823c7b5b28 <...>-910 [006] ....1 69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000 <...>-910 [006] d...1 69.841134: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI <...>-910 [006] d...1 69.844649: console: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] <...>-910 [006] d...1 69.846887: console: CPU: 6 UID: 0 PID: 910 Comm: timerlatu/6 Not tainted 6.11.0-rc3-test-00027-g014f473a3416-dirty #128 70de8528c323e5e87113ad73b5320a6d112a9ae6 <...>-910 [006] d...1 69.851008: console: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <...>-910 [006] d...1 69.853750: console: RIP: 0010:hrtimer_active+0x58/0x300 <...>-910 [006] d...1 69.855118: console: Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f <...>-910 [006] d...1 69.860562: console: RSP: 0018:ffff88810c7a7718 EFLAGS: 00010202 <...>-910 [006] d...1 69.860569: console: <...>-910 [006] d...1 69.862099: console: RAX: 0000000000000002 RBX: ffff88823c735b28 RCX: ffffed10478e6b6b <...>-910 [006] d...1 69.864181: console: RDX: dffffc0000000000 RSI: ffffed10478e6b6c RDI: ffff88823c735b28 <...>-910 [006] d...1 69.866262: console: RBP: 0000000000000000 R08: ffff88823c735b58 R09: ffff88823c735b60 <...>-910 [006] d...1 69.868359: console: R10: ffff88810dae4fe7 R11: ffffed102027af0a R12: ffff88823c735b28 <...>-910 [006] d...1 69.870459: console: R13: ffffffffa14f8d8b R14: 0000000000000010 R15: ffff88823c735b28 <...>-910 [006] d...1 69.872576: console: FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000 <...>-910 [006] d...1 69.874764: console: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <...>-910 [006] d...1 69.876332: console: CR2: 00007fc741ad5f90 CR3: 00000001f729e006 CR4: 0000000000170ef0 <...>-910 [006] d...1 69.878275: console: Call Trace: <...>-910 [006] d...1 69.878954: console: <TASK> The timer that crashed (and the other timers would eventually too) was ffff88823c735b28, and just looking at that timer from the above: <...>-910 [006] ....1 69.827589: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500 <...>-910 [006] ....1 69.828557: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500 <...>-910 [006] ....1 69.829529: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500 <...>-916 [005] ..... 69.838936: osnoise_workload_start: memset 0 ffff88823c735b28 <...>-910 [006] ....1 69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000 It was the memset that cleared it and then calling hrtimer_cancel() on that broke it. But notice something else here. The timer is never canceled! That's a real bug. This wasn't caused by the file descriptor closing twice and timerlat_fd_release called again. There's no such thing otherwise there would be lots of broken code in the kernel. I added a trace_dump_stack(): for_each_cpu(cpu, cpu_online_mask) { tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); trace_printk("memset 0 %px\n", &tlat_var->timer); trace_dump_stack(0); memset(tlat_var, 0, sizeof(*tlat_var)); } And have this: <...>-924 [006] ..... 42.574419: osnoise_workload_start: memset 0 ffff88823c7b5b28 <...>-924 [006] ..... 42.574423: <stack trace> => __ftrace_trace_stack => osnoise_workload_start => timerlat_tracer_start => rb_simple_write => vfs_write => ksys_write => do_syscall_64 => entry_SYSCALL_64_after_hwframe <...>-913 [007] ....1 42.580553: timerlat_fd_release: cancel timer ffff88823c7b5b28 for 0000000000000000 <...>-913 [007] d...1 42.580599: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI From your test case: > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done The "kill -INT $PID" caused the write to osnoise_workload_start(), and the after 1ms you do the "kill -TERM $PID" that kill the process which closes the file descriptor right after the reset. The real fix here looks to be: diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a..400a72cd6ab5 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -265,6 +265,8 @@ static inline void tlat_var_reset(void) */ for_each_cpu(cpu, cpu_online_mask) { tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); + if (tlat_var->kthread) + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); } } @@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); - hrtimer_cancel(&tlat_var->timer); + if (tlat_var->kthread) + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); osn_var->sampling = 0; I'll make this into a real patch and send it out. -- Steve
On Fri, 23 Aug 2024 12:54:26 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ > > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done > > The "kill -INT $PID" caused the write to osnoise_workload_start(), and the > after 1ms you do the "kill -TERM $PID" that kill the process which closes > the file descriptor right after the reset. > > The real fix here looks to be: > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c > index 66a871553d4a..400a72cd6ab5 100644 > --- a/kernel/trace/trace_osnoise.c > +++ b/kernel/trace/trace_osnoise.c > @@ -265,6 +265,8 @@ static inline void tlat_var_reset(void) > */ > for_each_cpu(cpu, cpu_online_mask) { > tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); > + if (tlat_var->kthread) > + hrtimer_cancel(&tlat_var->timer); > memset(tlat_var, 0, sizeof(*tlat_var)); > } > } > @@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) > osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); > tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); > > - hrtimer_cancel(&tlat_var->timer); > + if (tlat_var->kthread) > + hrtimer_cancel(&tlat_var->timer); > memset(tlat_var, 0, sizeof(*tlat_var)); > > osn_var->sampling = 0; > > I'll make this into a real patch and send it out. Egad, I don't think this is even good enough. I noticed this in the trace (adding kthread to the memset trace_printk): <...>-916 [003] ..... 134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000 <...>-916 [003] ..... 134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000 <...>-916 [003] ..... 134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000 <...>-916 [003] ..... 134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000 <...>-916 [003] ..... 134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000 <...>-916 [003] ..... 134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000 <...>-916 [003] ..... 134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640 <...>-916 [003] ..... 134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000 Before the reset, all but one of the tlat->kthread is NULL. Then it dawned on me that this is a global per CPU variable. It gets initialized when the tracer starts. If another program is has the timerlat fd open when the tracer ends, the tracer starts again, and you close the fd, it will cancel the hrtimer for the new task. I think there needs to be some ref counting here, that keeps the tracer from starting again if there's still files opened. This looks to be a bigger problem than I have time to work on it for now. I'll just apply the mutex patch for the kthreads, but this bug is going to take a bit more work in solving. -- Steve
pá 23. 8. 2024 v 20:51 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > Egad, I don't think this is even good enough. I noticed this in the trace > (adding kthread to the memset trace_printk): > > <...>-916 [003] ..... 134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000 > <...>-916 [003] ..... 134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000 > <...>-916 [003] ..... 134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000 > <...>-916 [003] ..... 134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000 > <...>-916 [003] ..... 134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000 > <...>-916 [003] ..... 134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000 > <...>-916 [003] ..... 134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640 > <...>-916 [003] ..... 134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000 > > Before the reset, all but one of the tlat->kthread is NULL. Then it dawned > on me that this is a global per CPU variable. It gets initialized when the > tracer starts. If another program is has the timerlat fd open when the > tracer ends, the tracer starts again, and you close the fd, it will cancel > the hrtimer for the new task. > > I think there needs to be some ref counting here, that keeps the tracer > from starting again if there's still files opened. > The timerlat fd is not supposed to be open when the tracer ends/starts again, since osnoise_workload_stop() calls stop_kthread(), which in turn calls kill_pid() to SIGKILL the user workload, which will also close the file descriptor. Only one PID per CPU should have the timerlat fd open at one moment, since timerlat_fd_open() will refuse to open if tlat->pid is set. It appears that this is somehow bypassed and osnoise_workload_start() happens before closing the fd, not sure why. > This looks to be a bigger problem than I have time to work on it for now. > I'll just apply the mutex patch for the kthreads, but this bug is going to > take a bit more work in solving. > Yeah, unfortunately the issue looks more complicated now after looking at the traces you posted, I will probably have to do more tracing to see what is actually happening here. Thank you again for helping us with this and also for the patch for the mutex. Tomas
On Mon, 26 Aug 2024 15:01:24 +0200 Tomas Glozar <tglozar@redhat.com> wrote: > > Before the reset, all but one of the tlat->kthread is NULL. Then it dawned > > on me that this is a global per CPU variable. It gets initialized when the > > tracer starts. If another program is has the timerlat fd open when the > > tracer ends, the tracer starts again, and you close the fd, it will cancel > > the hrtimer for the new task. > > > > I think there needs to be some ref counting here, that keeps the tracer > > from starting again if there's still files opened. > > > > The timerlat fd is not supposed to be open when the tracer ends/starts > again, since osnoise_workload_stop() calls stop_kthread(), which in > turn calls kill_pid() to SIGKILL the user workload, which will also > close the file descriptor. Only one PID per CPU should have the > timerlat fd open at one moment, since timerlat_fd_open() will refuse > to open if tlat->pid is set. It appears that this is somehow bypassed > and osnoise_workload_start() happens before closing the fd, not sure > why. > > > This looks to be a bigger problem than I have time to work on it for now. > > I'll just apply the mutex patch for the kthreads, but this bug is going to > > take a bit more work in solving. > > > > Yeah, unfortunately the issue looks more complicated now after looking > at the traces you posted, I will probably have to do more tracing to > see what is actually happening here. Thank you again for helping us > with this and also for the patch for the mutex. Yeah, I think I finally found the real issue. I don't think we need the ref counting. The problem is the creating and killing of the threads via the start and stop callbacks. That's not their purpose. The purpose of stop and start callbacks is when tracing_on is set to off and back on again. I think this is what is racing with the close. Anyway, the start and stop should probably just pause the threads and not kill them an start them again. That is, the osnoise_workload_start() should be called by the init callbacks and the osnoise_workload_stop should be called by reset callback. The start and stop callbacks should just pause and restart the the threads. -- Steve
po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > Yeah, I think I finally found the real issue. I don't think we need the ref > counting. The problem is the creating and killing of the threads via the > start and stop callbacks. That's not their purpose. The purpose of stop > and start callbacks is when tracing_on is set to off and back on again. I > think this is what is racing with the close. > I believe I discovered the exact mechanism of the bug. Since rtla is killed with SIGTERM, it leaves the timerlat tracer active and its timerlatu processes running. The rtla from the next iteration of the while loop in the reproducer deactivates the timerlat tracer, triggering the killing of the timerlatu processes. However, this killing is asynchronous, and while the processes are gradually dying one by one and closing their fds, the rtla from the next iteration is already activating the timerlat tracer again. One of the timerlatu processes does not die fast enough to release the fd before the tlat_var is cleared in osnoise_workload_start, triggering the panic. You can see the bug in action in this detailed trace I got today, with the use of in-kernel trace_printk together with bpftrace and bpf_trace_printk and captured using ftrace_dump_on_oops (running the same reproducer on a 8 CPU KVM system): [ 103.227679] rtla-2325 2..... 98242607us : osnoise_workload_start: tlat_var_reset: cpu=0, tlat_var=0000000034996efa [ 103.227844] rtla-2325 2..... 98242607us : osnoise_workload_start: tlat_var_reset: cpu=1, tlat_var=0000000062d1f520 [ 103.228009] rtla-2325 2..... 98242607us : osnoise_workload_start: tlat_var_reset: cpu=2, tlat_var=0000000002e262d8 [ 103.228173] rtla-2325 2..... 98242607us : osnoise_workload_start: tlat_var_reset: cpu=3, tlat_var=00000000b9d3a280 [ 103.228349] rtla-2325 2..... 98242607us : osnoise_workload_start: tlat_var_reset: cpu=4, tlat_var=000000000fba7a6e [ 103.228510] rtla-2325 2..... 98242608us : osnoise_workload_start: tlat_var_reset: cpu=5, tlat_var=000000008ed0b87f [ 103.228670] rtla-2325 2..... 98242608us : osnoise_workload_start: tlat_var_reset: cpu=6, tlat_var=00000000a02fe267 [ 103.228831] rtla-2325 2..... 98242608us : osnoise_workload_start: tlat_var_reset: cpu=7, tlat_var=000000008e1d5095 [ 103.228992] rtla-2327 3...11 98256551us : bpf_trace_printk: rtla pid 2325 forks into 2328 [ 103.229156] rtla-2327 3...11 98256873us : bpf_trace_printk: rtla pid 2325 forks into 2329 [ 103.229325] rtla-2327 3...11 98257124us : bpf_trace_printk: rtla pid 2325 forks into 2330 [ 103.229489] rtla-2327 3...11 98257353us : bpf_trace_printk: rtla pid 2325 forks into 2331 [ 103.229651] rtla-2327 3...11 98257574us : bpf_trace_printk: rtla pid 2325 forks into 2332 [ 103.229810] rtla-2327 3...11 98257801us : bpf_trace_printk: rtla pid 2325 forks into 2333 [ 103.229968] rtla-2327 3...11 98258008us : bpf_trace_printk: rtla pid 2325 forks into 2334 [ 103.230124] rtla-2327 3...11 98258216us : bpf_trace_printk: rtla pid 2325 forks into 2335 [ 103.230293] bash-2094 0d..31 102987914us : bpf_trace_printk: rtla pid 2325 received signal 2 [ 103.230452] bash-2094 0d..31 102987915us : bpf_trace_printk: ...from bash pid 2094 [ 103.230609] bash-2094 0d..31 102990000us : bpf_trace_printk: rtla pid 2325 received signal 15 [ 103.230766] bash-2094 0d..31 102990001us : bpf_trace_printk: ...from bash pid 2094 [ 103.230922] rtla-2325 2...21 102990010us : bpf_trace_printk: rtla pid 2325 exiting [ 103.231079] rtla-2327 3...21 102990503us : bpf_trace_printk: rtla pid 2327 exiting [ 103.231242] rtla-2337 3d..31 103232021us : bpf_trace_printk: timerlatu/0 pid 2328 received signal 9 [ 103.231399] rtla-2337 3d..31 103232022us : bpf_trace_printk: ...from rtla pid 2337 [ 103.231554] rtla-2337 3d..31 103232024us : bpf_trace_printk: timerlatu/1 pid 2329 received signal 9 [ 103.231711] rtla-2337 3d..31 103232025us : bpf_trace_printk: ...from rtla pid 2337 [ 103.231866] rtla-2337 3d..31 103232029us : bpf_trace_printk: timerlatu/2 pid 2330 received signal 9 [ 103.232022] rtla-2337 3d..31 103232030us : bpf_trace_printk: ...from rtla pid 2337 [ 103.232178] rtla-2337 3dN.31 103232031us : bpf_trace_printk: timerlatu/3 pid 2331 received signal 9 [ 103.232347] rtla-2337 3dN.31 103232031us : bpf_trace_printk: ...from rtla pid 2337 [ 103.232506] rtla-2337 3dN.31 103232034us : bpf_trace_printk: timerlatu/4 pid 2332 received signal 9 [ 103.232685] rtla-2337 3dN.31 103232034us : bpf_trace_printk: ...from rtla pid 2337 [ 103.232846] rtla-2337 3dN.31 103232039us : bpf_trace_printk: timerlatu/5 pid 2333 received signal 9 [ 103.233010] rtla-2337 3dN.31 103232039us : bpf_trace_printk: ...from rtla pid 2337 [ 103.233171] rtla-2337 3dN.31 103232042us : bpf_trace_printk: timerlatu/6 pid 2334 received signal 9 [ 103.233357] rtla-2337 3dN.31 103232042us : bpf_trace_printk: ...from rtla pid 2337 [ 103.233517] rtla-2337 3dN.31 103232046us : bpf_trace_printk: timerlatu/7 pid 2335 received signal 9 [ 103.233678] rtla-2337 3dN.31 103232046us : bpf_trace_printk: ...from rtla pid 2337 [ 103.233835] timerlat-2331 3...21 103232701us : bpf_trace_printk: timerlatu/3 pid 2331 exiting [ 103.233995] timerlat-2329 1...21 103232704us : bpf_trace_printk: timerlatu/1 pid 2329 exiting [ 103.234155] timerlat-2331 3....1 103232705us : timerlat_fd_release: timerlat_fd_release: cpu=3, tlat_var=000000000fba7a6e, osn_var->pid=2331 [ 103.234357] timerlat-2329 1....1 103232706us : timerlat_fd_release: timerlat_fd_release: cpu=1, tlat_var=0000000002e262d8, osn_var->pid=2329 [ 103.234508] timerlat-2334 6...21 103232821us : bpf_trace_printk: timerlatu/6 pid 2334 exiting [ 103.234657] timerlat-2334 6....1 103232826us : timerlat_fd_release: timerlat_fd_release: cpu=6, tlat_var=000000008e1d5095, osn_var->pid=2334 [ 103.234813] timerlat-2330 2...21 103232850us : bpf_trace_printk: timerlatu/2 pid 2330 exiting [ 103.234966] timerlat-2330 2....1 103232853us : timerlat_fd_release: timerlat_fd_release: cpu=2, tlat_var=00000000b9d3a280, osn_var->pid=2330 [ 103.235122] timerlat-2333 5...21 103232856us : bpf_trace_printk: timerlatu/5 pid 2333 exiting [ 103.235291] timerlat-2333 5....1 103232858us : timerlat_fd_release: timerlat_fd_release: cpu=5, tlat_var=00000000a02fe267, osn_var->pid=2333 [ 103.235452] timerlat-2332 4...21 103232861us : bpf_trace_printk: timerlatu/4 pid 2332 exiting [ 103.235610] timerlat-2332 4....1 103232864us : timerlat_fd_release: timerlat_fd_release: cpu=4, tlat_var=000000008ed0b87f, osn_var->pid=2332 [ 103.235769] timerlat-2335 7...21 103232867us : bpf_trace_printk: timerlatu/7 pid 2335 exiting [ 103.235932] timerlat-2335 7....1 103232869us : timerlat_fd_release: timerlat_fd_release: cpu=7, tlat_var=000000003113e297, osn_var->pid=2335 [ 103.236103] rtla-2337 3..... 103234109us : osnoise_workload_start: tlat_var_reset: cpu=0, tlat_var=0000000034996efa [ 103.236316] rtla-2337 3..... 103234109us : osnoise_workload_start: tlat_var_reset: cpu=1, tlat_var=0000000062d1f520 [ 103.236473] rtla-2337 3..... 103234109us : osnoise_workload_start: tlat_var_reset: cpu=2, tlat_var=0000000002e262d8 [ 103.236630] rtla-2337 3..... 103234110us : osnoise_workload_start: tlat_var_reset: cpu=3, tlat_var=00000000b9d3a280 [ 103.236797] rtla-2337 3..... 103234110us : osnoise_workload_start: tlat_var_reset: cpu=4, tlat_var=000000000fba7a6e [ 103.236972] rtla-2337 3..... 103234110us : osnoise_workload_start: tlat_var_reset: cpu=5, tlat_var=000000008ed0b87f [ 103.237144] rtla-2337 3..... 103234110us : osnoise_workload_start: tlat_var_reset: cpu=6, tlat_var=00000000a02fe267 [ 103.237345] rtla-2337 3..... 103234110us : osnoise_workload_start: tlat_var_reset: cpu=7, tlat_var=000000008e1d5095 [ 103.237516] timerlat-2328 0...21 103234149us : bpf_trace_printk: timerlatu/0 pid 2328 exiting [ 103.237690] timerlat-2328 0....1 103234152us : timerlat_fd_release: timerlat_fd_release: cpu=0, tlat_var=0000000062d1f520, osn_var->pid=0 [ 103.237871] --------------------------------- [ 103.238053] CR2: 0000000000000010 On the kernel side, I'm using similar trace_printk debug prints as you did; the bpftrace program used is this: uretprobe:libc:fork /strncmp(comm, "rtla", 5) == 0 && retval != 0/ { debugf("rtla pid %d forks into %d", pid, retval) } tracepoint:signal:signal_generate /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, "timerlatu", 9) == 0/ { debugf("%s pid %d received signal %d", args->comm, args->pid, args->sig); debugf("...from %s pid %d", comm, pid); } tracepoint:sched:sched_process_exit /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, "timerlatu", 9) == 0/ { debugf("%s pid %d exiting", args->comm, args->pid) } > Anyway, the start and stop should probably just pause the threads and not > kill them an start them again. That is, the osnoise_workload_start() should > be called by the init callbacks and the osnoise_workload_stop should be > called by reset callback. > > The start and stop callbacks should just pause and restart the the threads. > I'm not sure if that fits into the design of timerlat. At least for timerlat with user workload, each activation of the timerlat tracer implies the creation of a new user workload, not the re-activation of a lingering one (that might be an entirely different program). A timerlat tracer is tightly associated with its threads: if the threads exit, the timer itself ceases to have meaning. Of course also rtla should make sure there are no lingering timerlatu processes. I think waiting on the threads to actually exit in stop_kthread() is the proper solution: /* * stop_kthread - stop a workload thread */ static void stop_kthread(unsigned int cpu){ struct task_struct *kthread; kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread; if (kthread) { if (test_bit(OSN_WORKLOAD, &osnoise_options)) { kthread_stop(kthread); } else { /* * This is a user thread waiting on the timerlat_fd. We need * to close all users, and the best way to guarantee this is * by killing the thread. NOTE: this is a purpose specific file. */ kill_pid(kthread->thread_pid, SIGKILL, 1); /* ^^ here wait until kthread (actually the user workload) exits */ put_task_struct(kthread); } per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL; } There might even be another bug, not sure if calling put_task_struct on a task that is still exiting is sane. Tomas
On Tue, Aug 27, 2024 at 04:34:39PM +0200, Tomas Glozar wrote: > po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > > > Yeah, I think I finally found the real issue. I don't think we need the ref > > counting. The problem is the creating and killing of the threads via the > > start and stop callbacks. That's not their purpose. The purpose of stop > > and start callbacks is when tracing_on is set to off and back on again. I > > think this is what is racing with the close. > > > > I believe I discovered the exact mechanism of the bug. Since rtla is > killed with SIGTERM, it leaves the timerlat tracer active and its > timerlatu processes running. The rtla from the next iteration of the > while loop in the reproducer deactivates the timerlat tracer, > triggering the killing of the timerlatu processes. However, this > killing is asynchronous, and while the processes are gradually dying > one by one and closing their fds, the rtla from the next iteration is > already activating the timerlat tracer again. One of the timerlatu > processes does not die fast enough to release the fd before the > tlat_var is cleared in osnoise_workload_start, triggering the panic. With this in mind I tried a simpler approach. I used a kernel without any extra patches from Steven or ourselves, just used the code below. Not saying it is a definitive solution, but as a proof of concept I have two boxes running the loop used to reproduce the problem and so far not a single problem. The patch: diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a1..331c69bc7b0d4 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2239,8 +2239,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf, */ mutex_lock(&trace_types_lock); running = osnoise_has_registered_instances(); - if (running) + if (running) { stop_per_cpu_kthreads(); + mutex_unlock(&trace_types_lock); + return -EBUSY; + } mutex_lock(&interface_lock); /* @@ -2579,7 +2582,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); - hrtimer_cancel(&tlat_var->timer); + if (tlat_var->kthread) + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); osn_var->sampling = 0; --- In short, do not start new instances if the old one still registered. And only cancel the timer if the timerlat per cpu var has not been zeroed-out. Again, not a final patch but a simpler approach to protect start/stop races. Luis > > You can see the bug in action in this detailed trace I got today, with > the use of in-kernel trace_printk together with bpftrace and > bpf_trace_printk and captured using ftrace_dump_on_oops (running the > same reproducer on a 8 CPU KVM system): > [ 103.227679] rtla-2325 2..... 98242607us : > osnoise_workload_start: tlat_var_reset: cpu=0, > tlat_var=0000000034996efa > [ 103.227844] rtla-2325 2..... 98242607us : > osnoise_workload_start: tlat_var_reset: cpu=1, > tlat_var=0000000062d1f520 > [ 103.228009] rtla-2325 2..... 98242607us : > osnoise_workload_start: tlat_var_reset: cpu=2, > tlat_var=0000000002e262d8 > [ 103.228173] rtla-2325 2..... 98242607us : > osnoise_workload_start: tlat_var_reset: cpu=3, > tlat_var=00000000b9d3a280 > [ 103.228349] rtla-2325 2..... 98242607us : > osnoise_workload_start: tlat_var_reset: cpu=4, > tlat_var=000000000fba7a6e > [ 103.228510] rtla-2325 2..... 98242608us : > osnoise_workload_start: tlat_var_reset: cpu=5, > tlat_var=000000008ed0b87f > [ 103.228670] rtla-2325 2..... 98242608us : > osnoise_workload_start: tlat_var_reset: cpu=6, > tlat_var=00000000a02fe267 > [ 103.228831] rtla-2325 2..... 98242608us : > osnoise_workload_start: tlat_var_reset: cpu=7, > tlat_var=000000008e1d5095 > [ 103.228992] rtla-2327 3...11 98256551us : > bpf_trace_printk: rtla pid 2325 forks into 2328 > [ 103.229156] rtla-2327 3...11 98256873us : > bpf_trace_printk: rtla pid 2325 forks into 2329 > [ 103.229325] rtla-2327 3...11 98257124us : > bpf_trace_printk: rtla pid 2325 forks into 2330 > [ 103.229489] rtla-2327 3...11 98257353us : > bpf_trace_printk: rtla pid 2325 forks into 2331 > [ 103.229651] rtla-2327 3...11 98257574us : > bpf_trace_printk: rtla pid 2325 forks into 2332 > [ 103.229810] rtla-2327 3...11 98257801us : > bpf_trace_printk: rtla pid 2325 forks into 2333 > [ 103.229968] rtla-2327 3...11 98258008us : > bpf_trace_printk: rtla pid 2325 forks into 2334 > [ 103.230124] rtla-2327 3...11 98258216us : > bpf_trace_printk: rtla pid 2325 forks into 2335 > [ 103.230293] bash-2094 0d..31 102987914us : > bpf_trace_printk: rtla pid 2325 received signal 2 > [ 103.230452] bash-2094 0d..31 102987915us : > bpf_trace_printk: ...from bash pid 2094 > [ 103.230609] bash-2094 0d..31 102990000us : > bpf_trace_printk: rtla pid 2325 received signal 15 > [ 103.230766] bash-2094 0d..31 102990001us : > bpf_trace_printk: ...from bash pid 2094 > [ 103.230922] rtla-2325 2...21 102990010us : > bpf_trace_printk: rtla pid 2325 exiting > [ 103.231079] rtla-2327 3...21 102990503us : > bpf_trace_printk: rtla pid 2327 exiting > [ 103.231242] rtla-2337 3d..31 103232021us : > bpf_trace_printk: timerlatu/0 pid 2328 received signal 9 > [ 103.231399] rtla-2337 3d..31 103232022us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.231554] rtla-2337 3d..31 103232024us : > bpf_trace_printk: timerlatu/1 pid 2329 received signal 9 > [ 103.231711] rtla-2337 3d..31 103232025us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.231866] rtla-2337 3d..31 103232029us : > bpf_trace_printk: timerlatu/2 pid 2330 received signal 9 > [ 103.232022] rtla-2337 3d..31 103232030us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.232178] rtla-2337 3dN.31 103232031us : > bpf_trace_printk: timerlatu/3 pid 2331 received signal 9 > [ 103.232347] rtla-2337 3dN.31 103232031us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.232506] rtla-2337 3dN.31 103232034us : > bpf_trace_printk: timerlatu/4 pid 2332 received signal 9 > [ 103.232685] rtla-2337 3dN.31 103232034us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.232846] rtla-2337 3dN.31 103232039us : > bpf_trace_printk: timerlatu/5 pid 2333 received signal 9 > [ 103.233010] rtla-2337 3dN.31 103232039us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.233171] rtla-2337 3dN.31 103232042us : > bpf_trace_printk: timerlatu/6 pid 2334 received signal 9 > [ 103.233357] rtla-2337 3dN.31 103232042us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.233517] rtla-2337 3dN.31 103232046us : > bpf_trace_printk: timerlatu/7 pid 2335 received signal 9 > [ 103.233678] rtla-2337 3dN.31 103232046us : > bpf_trace_printk: ...from rtla pid 2337 > [ 103.233835] timerlat-2331 3...21 103232701us : > bpf_trace_printk: timerlatu/3 pid 2331 exiting > [ 103.233995] timerlat-2329 1...21 103232704us : > bpf_trace_printk: timerlatu/1 pid 2329 exiting > [ 103.234155] timerlat-2331 3....1 103232705us : > timerlat_fd_release: timerlat_fd_release: cpu=3, > tlat_var=000000000fba7a6e, osn_var->pid=2331 > [ 103.234357] timerlat-2329 1....1 103232706us : > timerlat_fd_release: timerlat_fd_release: cpu=1, > tlat_var=0000000002e262d8, osn_var->pid=2329 > [ 103.234508] timerlat-2334 6...21 103232821us : > bpf_trace_printk: timerlatu/6 pid 2334 exiting > [ 103.234657] timerlat-2334 6....1 103232826us : > timerlat_fd_release: timerlat_fd_release: cpu=6, > tlat_var=000000008e1d5095, osn_var->pid=2334 > [ 103.234813] timerlat-2330 2...21 103232850us : > bpf_trace_printk: timerlatu/2 pid 2330 exiting > [ 103.234966] timerlat-2330 2....1 103232853us : > timerlat_fd_release: timerlat_fd_release: cpu=2, > tlat_var=00000000b9d3a280, osn_var->pid=2330 > [ 103.235122] timerlat-2333 5...21 103232856us : > bpf_trace_printk: timerlatu/5 pid 2333 exiting > [ 103.235291] timerlat-2333 5....1 103232858us : > timerlat_fd_release: timerlat_fd_release: cpu=5, > tlat_var=00000000a02fe267, osn_var->pid=2333 > [ 103.235452] timerlat-2332 4...21 103232861us : > bpf_trace_printk: timerlatu/4 pid 2332 exiting > [ 103.235610] timerlat-2332 4....1 103232864us : > timerlat_fd_release: timerlat_fd_release: cpu=4, > tlat_var=000000008ed0b87f, osn_var->pid=2332 > [ 103.235769] timerlat-2335 7...21 103232867us : > bpf_trace_printk: timerlatu/7 pid 2335 exiting > [ 103.235932] timerlat-2335 7....1 103232869us : > timerlat_fd_release: timerlat_fd_release: cpu=7, > tlat_var=000000003113e297, osn_var->pid=2335 > [ 103.236103] rtla-2337 3..... 103234109us : > osnoise_workload_start: tlat_var_reset: cpu=0, > tlat_var=0000000034996efa > [ 103.236316] rtla-2337 3..... 103234109us : > osnoise_workload_start: tlat_var_reset: cpu=1, > tlat_var=0000000062d1f520 > [ 103.236473] rtla-2337 3..... 103234109us : > osnoise_workload_start: tlat_var_reset: cpu=2, > tlat_var=0000000002e262d8 > [ 103.236630] rtla-2337 3..... 103234110us : > osnoise_workload_start: tlat_var_reset: cpu=3, > tlat_var=00000000b9d3a280 > [ 103.236797] rtla-2337 3..... 103234110us : > osnoise_workload_start: tlat_var_reset: cpu=4, > tlat_var=000000000fba7a6e > [ 103.236972] rtla-2337 3..... 103234110us : > osnoise_workload_start: tlat_var_reset: cpu=5, > tlat_var=000000008ed0b87f > [ 103.237144] rtla-2337 3..... 103234110us : > osnoise_workload_start: tlat_var_reset: cpu=6, > tlat_var=00000000a02fe267 > [ 103.237345] rtla-2337 3..... 103234110us : > osnoise_workload_start: tlat_var_reset: cpu=7, > tlat_var=000000008e1d5095 > [ 103.237516] timerlat-2328 0...21 103234149us : > bpf_trace_printk: timerlatu/0 pid 2328 exiting > [ 103.237690] timerlat-2328 0....1 103234152us : > timerlat_fd_release: timerlat_fd_release: cpu=0, > tlat_var=0000000062d1f520, osn_var->pid=0 > [ 103.237871] --------------------------------- > [ 103.238053] CR2: 0000000000000010 > > On the kernel side, I'm using similar trace_printk debug prints as you > did; the bpftrace program used is this: > uretprobe:libc:fork > /strncmp(comm, "rtla", 5) == 0 && retval != 0/ > { > debugf("rtla pid %d forks into %d", pid, retval) > } > tracepoint:signal:signal_generate > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, > "timerlatu", 9) == 0/ { > debugf("%s pid %d received signal %d", args->comm, args->pid, > args->sig); > debugf("...from %s pid %d", comm, pid); > } > tracepoint:sched:sched_process_exit > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, > "timerlatu", 9) == 0/ { > debugf("%s pid %d exiting", args->comm, args->pid) > } > > > Anyway, the start and stop should probably just pause the threads and not > > kill them an start them again. That is, the osnoise_workload_start() should > > be called by the init callbacks and the osnoise_workload_stop should be > > called by reset callback. > > > > The start and stop callbacks should just pause and restart the the threads. > > > > I'm not sure if that fits into the design of timerlat. At least for > timerlat with user workload, each activation of the timerlat tracer > implies the creation of a new user workload, not the re-activation of > a lingering one (that might be an entirely different program). A > timerlat tracer is tightly associated with its threads: if the threads > exit, the timer itself ceases to have meaning. Of course also rtla > should make sure there are no lingering timerlatu processes. > > I think waiting on the threads to actually exit in stop_kthread() is > the proper solution: > > /* > * stop_kthread - stop a workload thread > */ > static void stop_kthread(unsigned int cpu){ > struct task_struct *kthread; > > kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread; > if (kthread) { > if (test_bit(OSN_WORKLOAD, &osnoise_options)) { > kthread_stop(kthread); > } else { > /* > * This is a user thread waiting on the timerlat_fd. We need > * to close all users, and the best way to guarantee this is > * by killing the thread. NOTE: this is a purpose specific file. > */ > kill_pid(kthread->thread_pid, SIGKILL, 1); > /* ^^ here wait until kthread (actually the user workload) exits */ > put_task_struct(kthread); > } > per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL; > } > > There might even be another bug, not sure if calling put_task_struct > on a task that is still exiting is sane. > > Tomas > ---end quoted text---
On Thu, Aug 29, 2024 at 08:40:25PM -0300, Luis Claudio R. Goncalves wrote: > On Tue, Aug 27, 2024 at 04:34:39PM +0200, Tomas Glozar wrote: > > po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > > > > > Yeah, I think I finally found the real issue. I don't think we need the ref > > > counting. The problem is the creating and killing of the threads via the > > > start and stop callbacks. That's not their purpose. The purpose of stop > > > and start callbacks is when tracing_on is set to off and back on again. I > > > think this is what is racing with the close. > > > > > > > I believe I discovered the exact mechanism of the bug. Since rtla is > > killed with SIGTERM, it leaves the timerlat tracer active and its > > timerlatu processes running. The rtla from the next iteration of the > > while loop in the reproducer deactivates the timerlat tracer, > > triggering the killing of the timerlatu processes. However, this > > killing is asynchronous, and while the processes are gradually dying > > one by one and closing their fds, the rtla from the next iteration is > > already activating the timerlat tracer again. One of the timerlatu > > processes does not die fast enough to release the fd before the > > tlat_var is cleared in osnoise_workload_start, triggering the panic. > > With this in mind I tried a simpler approach. I used a kernel without any > extra patches from Steven or ourselves, just used the code below. Not > saying it is a definitive solution, but as a proof of concept I have two > boxes running the loop used to reproduce the problem and so far not a > single problem. The patch: Premature commemoration. One of the boxes rebooted a minute ago. Sounds like I just diminished a bit the window for triggering the problem. Tomorrow I will run tests with all the suggested patches and work of a better idea, more comprehensive, to address what Tomas described above. > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c > index 66a871553d4a1..331c69bc7b0d4 100644 > --- a/kernel/trace/trace_osnoise.c > +++ b/kernel/trace/trace_osnoise.c > @@ -2239,8 +2239,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf, > */ > mutex_lock(&trace_types_lock); > running = osnoise_has_registered_instances(); > - if (running) > + if (running) { > stop_per_cpu_kthreads(); > + mutex_unlock(&trace_types_lock); > + return -EBUSY; > + } > > mutex_lock(&interface_lock); > /* > @@ -2579,7 +2582,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) > osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); > tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); > > - hrtimer_cancel(&tlat_var->timer); > + if (tlat_var->kthread) > + hrtimer_cancel(&tlat_var->timer); > memset(tlat_var, 0, sizeof(*tlat_var)); > > osn_var->sampling = 0; > --- > > In short, do not start new instances if the old one still registered. And > only cancel the timer if the timerlat per cpu var has not been zeroed-out. > > Again, not a final patch but a simpler approach to protect start/stop > races. > > Luis > > > > > You can see the bug in action in this detailed trace I got today, with > > the use of in-kernel trace_printk together with bpftrace and > > bpf_trace_printk and captured using ftrace_dump_on_oops (running the > > same reproducer on a 8 CPU KVM system): > > [ 103.227679] rtla-2325 2..... 98242607us : > > osnoise_workload_start: tlat_var_reset: cpu=0, > > tlat_var=0000000034996efa > > [ 103.227844] rtla-2325 2..... 98242607us : > > osnoise_workload_start: tlat_var_reset: cpu=1, > > tlat_var=0000000062d1f520 > > [ 103.228009] rtla-2325 2..... 98242607us : > > osnoise_workload_start: tlat_var_reset: cpu=2, > > tlat_var=0000000002e262d8 > > [ 103.228173] rtla-2325 2..... 98242607us : > > osnoise_workload_start: tlat_var_reset: cpu=3, > > tlat_var=00000000b9d3a280 > > [ 103.228349] rtla-2325 2..... 98242607us : > > osnoise_workload_start: tlat_var_reset: cpu=4, > > tlat_var=000000000fba7a6e > > [ 103.228510] rtla-2325 2..... 98242608us : > > osnoise_workload_start: tlat_var_reset: cpu=5, > > tlat_var=000000008ed0b87f > > [ 103.228670] rtla-2325 2..... 98242608us : > > osnoise_workload_start: tlat_var_reset: cpu=6, > > tlat_var=00000000a02fe267 > > [ 103.228831] rtla-2325 2..... 98242608us : > > osnoise_workload_start: tlat_var_reset: cpu=7, > > tlat_var=000000008e1d5095 > > [ 103.228992] rtla-2327 3...11 98256551us : > > bpf_trace_printk: rtla pid 2325 forks into 2328 > > [ 103.229156] rtla-2327 3...11 98256873us : > > bpf_trace_printk: rtla pid 2325 forks into 2329 > > [ 103.229325] rtla-2327 3...11 98257124us : > > bpf_trace_printk: rtla pid 2325 forks into 2330 > > [ 103.229489] rtla-2327 3...11 98257353us : > > bpf_trace_printk: rtla pid 2325 forks into 2331 > > [ 103.229651] rtla-2327 3...11 98257574us : > > bpf_trace_printk: rtla pid 2325 forks into 2332 > > [ 103.229810] rtla-2327 3...11 98257801us : > > bpf_trace_printk: rtla pid 2325 forks into 2333 > > [ 103.229968] rtla-2327 3...11 98258008us : > > bpf_trace_printk: rtla pid 2325 forks into 2334 > > [ 103.230124] rtla-2327 3...11 98258216us : > > bpf_trace_printk: rtla pid 2325 forks into 2335 > > [ 103.230293] bash-2094 0d..31 102987914us : > > bpf_trace_printk: rtla pid 2325 received signal 2 > > [ 103.230452] bash-2094 0d..31 102987915us : > > bpf_trace_printk: ...from bash pid 2094 > > [ 103.230609] bash-2094 0d..31 102990000us : > > bpf_trace_printk: rtla pid 2325 received signal 15 > > [ 103.230766] bash-2094 0d..31 102990001us : > > bpf_trace_printk: ...from bash pid 2094 > > [ 103.230922] rtla-2325 2...21 102990010us : > > bpf_trace_printk: rtla pid 2325 exiting > > [ 103.231079] rtla-2327 3...21 102990503us : > > bpf_trace_printk: rtla pid 2327 exiting > > [ 103.231242] rtla-2337 3d..31 103232021us : > > bpf_trace_printk: timerlatu/0 pid 2328 received signal 9 > > [ 103.231399] rtla-2337 3d..31 103232022us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.231554] rtla-2337 3d..31 103232024us : > > bpf_trace_printk: timerlatu/1 pid 2329 received signal 9 > > [ 103.231711] rtla-2337 3d..31 103232025us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.231866] rtla-2337 3d..31 103232029us : > > bpf_trace_printk: timerlatu/2 pid 2330 received signal 9 > > [ 103.232022] rtla-2337 3d..31 103232030us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.232178] rtla-2337 3dN.31 103232031us : > > bpf_trace_printk: timerlatu/3 pid 2331 received signal 9 > > [ 103.232347] rtla-2337 3dN.31 103232031us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.232506] rtla-2337 3dN.31 103232034us : > > bpf_trace_printk: timerlatu/4 pid 2332 received signal 9 > > [ 103.232685] rtla-2337 3dN.31 103232034us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.232846] rtla-2337 3dN.31 103232039us : > > bpf_trace_printk: timerlatu/5 pid 2333 received signal 9 > > [ 103.233010] rtla-2337 3dN.31 103232039us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.233171] rtla-2337 3dN.31 103232042us : > > bpf_trace_printk: timerlatu/6 pid 2334 received signal 9 > > [ 103.233357] rtla-2337 3dN.31 103232042us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.233517] rtla-2337 3dN.31 103232046us : > > bpf_trace_printk: timerlatu/7 pid 2335 received signal 9 > > [ 103.233678] rtla-2337 3dN.31 103232046us : > > bpf_trace_printk: ...from rtla pid 2337 > > [ 103.233835] timerlat-2331 3...21 103232701us : > > bpf_trace_printk: timerlatu/3 pid 2331 exiting > > [ 103.233995] timerlat-2329 1...21 103232704us : > > bpf_trace_printk: timerlatu/1 pid 2329 exiting > > [ 103.234155] timerlat-2331 3....1 103232705us : > > timerlat_fd_release: timerlat_fd_release: cpu=3, > > tlat_var=000000000fba7a6e, osn_var->pid=2331 > > [ 103.234357] timerlat-2329 1....1 103232706us : > > timerlat_fd_release: timerlat_fd_release: cpu=1, > > tlat_var=0000000002e262d8, osn_var->pid=2329 > > [ 103.234508] timerlat-2334 6...21 103232821us : > > bpf_trace_printk: timerlatu/6 pid 2334 exiting > > [ 103.234657] timerlat-2334 6....1 103232826us : > > timerlat_fd_release: timerlat_fd_release: cpu=6, > > tlat_var=000000008e1d5095, osn_var->pid=2334 > > [ 103.234813] timerlat-2330 2...21 103232850us : > > bpf_trace_printk: timerlatu/2 pid 2330 exiting > > [ 103.234966] timerlat-2330 2....1 103232853us : > > timerlat_fd_release: timerlat_fd_release: cpu=2, > > tlat_var=00000000b9d3a280, osn_var->pid=2330 > > [ 103.235122] timerlat-2333 5...21 103232856us : > > bpf_trace_printk: timerlatu/5 pid 2333 exiting > > [ 103.235291] timerlat-2333 5....1 103232858us : > > timerlat_fd_release: timerlat_fd_release: cpu=5, > > tlat_var=00000000a02fe267, osn_var->pid=2333 > > [ 103.235452] timerlat-2332 4...21 103232861us : > > bpf_trace_printk: timerlatu/4 pid 2332 exiting > > [ 103.235610] timerlat-2332 4....1 103232864us : > > timerlat_fd_release: timerlat_fd_release: cpu=4, > > tlat_var=000000008ed0b87f, osn_var->pid=2332 > > [ 103.235769] timerlat-2335 7...21 103232867us : > > bpf_trace_printk: timerlatu/7 pid 2335 exiting > > [ 103.235932] timerlat-2335 7....1 103232869us : > > timerlat_fd_release: timerlat_fd_release: cpu=7, > > tlat_var=000000003113e297, osn_var->pid=2335 > > [ 103.236103] rtla-2337 3..... 103234109us : > > osnoise_workload_start: tlat_var_reset: cpu=0, > > tlat_var=0000000034996efa > > [ 103.236316] rtla-2337 3..... 103234109us : > > osnoise_workload_start: tlat_var_reset: cpu=1, > > tlat_var=0000000062d1f520 > > [ 103.236473] rtla-2337 3..... 103234109us : > > osnoise_workload_start: tlat_var_reset: cpu=2, > > tlat_var=0000000002e262d8 > > [ 103.236630] rtla-2337 3..... 103234110us : > > osnoise_workload_start: tlat_var_reset: cpu=3, > > tlat_var=00000000b9d3a280 > > [ 103.236797] rtla-2337 3..... 103234110us : > > osnoise_workload_start: tlat_var_reset: cpu=4, > > tlat_var=000000000fba7a6e > > [ 103.236972] rtla-2337 3..... 103234110us : > > osnoise_workload_start: tlat_var_reset: cpu=5, > > tlat_var=000000008ed0b87f > > [ 103.237144] rtla-2337 3..... 103234110us : > > osnoise_workload_start: tlat_var_reset: cpu=6, > > tlat_var=00000000a02fe267 > > [ 103.237345] rtla-2337 3..... 103234110us : > > osnoise_workload_start: tlat_var_reset: cpu=7, > > tlat_var=000000008e1d5095 > > [ 103.237516] timerlat-2328 0...21 103234149us : > > bpf_trace_printk: timerlatu/0 pid 2328 exiting > > [ 103.237690] timerlat-2328 0....1 103234152us : > > timerlat_fd_release: timerlat_fd_release: cpu=0, > > tlat_var=0000000062d1f520, osn_var->pid=0 > > [ 103.237871] --------------------------------- > > [ 103.238053] CR2: 0000000000000010 > > > > On the kernel side, I'm using similar trace_printk debug prints as you > > did; the bpftrace program used is this: > > uretprobe:libc:fork > > /strncmp(comm, "rtla", 5) == 0 && retval != 0/ > > { > > debugf("rtla pid %d forks into %d", pid, retval) > > } > > tracepoint:signal:signal_generate > > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, > > "timerlatu", 9) == 0/ { > > debugf("%s pid %d received signal %d", args->comm, args->pid, > > args->sig); > > debugf("...from %s pid %d", comm, pid); > > } > > tracepoint:sched:sched_process_exit > > /strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm, > > "timerlatu", 9) == 0/ { > > debugf("%s pid %d exiting", args->comm, args->pid) > > } > > > > > Anyway, the start and stop should probably just pause the threads and not > > > kill them an start them again. That is, the osnoise_workload_start() should > > > be called by the init callbacks and the osnoise_workload_stop should be > > > called by reset callback. > > > > > > The start and stop callbacks should just pause and restart the the threads. > > > > > > > I'm not sure if that fits into the design of timerlat. At least for > > timerlat with user workload, each activation of the timerlat tracer > > implies the creation of a new user workload, not the re-activation of > > a lingering one (that might be an entirely different program). A > > timerlat tracer is tightly associated with its threads: if the threads > > exit, the timer itself ceases to have meaning. Of course also rtla > > should make sure there are no lingering timerlatu processes. > > > > I think waiting on the threads to actually exit in stop_kthread() is > > the proper solution: > > > > /* > > * stop_kthread - stop a workload thread > > */ > > static void stop_kthread(unsigned int cpu){ > > struct task_struct *kthread; > > > > kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread; > > if (kthread) { > > if (test_bit(OSN_WORKLOAD, &osnoise_options)) { > > kthread_stop(kthread); > > } else { > > /* > > * This is a user thread waiting on the timerlat_fd. We need > > * to close all users, and the best way to guarantee this is > > * by killing the thread. NOTE: this is a purpose specific file. > > */ > > kill_pid(kthread->thread_pid, SIGKILL, 1); > > /* ^^ here wait until kthread (actually the user workload) exits */ > > put_task_struct(kthread); > > } > > per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL; > > } > > > > There might even be another bug, not sure if calling put_task_struct > > on a task that is still exiting is sane. > > > > Tomas > > > ---end quoted text--- ---end quoted text---
út 27. 8. 2024 v 16:34 odesílatel Tomas Glozar <tglozar@redhat.com> napsal: > > I think waiting on the threads to actually exit in stop_kthread() is > the proper solution: > > /* > * stop_kthread - stop a workload thread > */ > static void stop_kthread(unsigned int cpu){ > struct task_struct *kthread; > > kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread; > if (kthread) { > if (test_bit(OSN_WORKLOAD, &osnoise_options)) { > kthread_stop(kthread); > } else { > /* > * This is a user thread waiting on the timerlat_fd. We need > * to close all users, and the best way to guarantee this is > * by killing the thread. NOTE: this is a purpose specific file. > */ > kill_pid(kthread->thread_pid, SIGKILL, 1); > /* ^^ here wait until kthread (actually the user workload) exits */ > put_task_struct(kthread); > } > per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL; > } > I tried a patch that waits on the user workload to exit when doing osnoise_workload_stop, similarly to what we already do for the kernel workload. I added a wait_on_completion in osnoise_workload_stop that is completed in timerlat_fd_release, mimicking what kthread_stop() does. But that deadlocks if the user workload process attempts to free a file descriptor taking trace_types_lock before the timerlat one, since the caller of osnoise_workload_stop() is also holding trace_types_lock. There seems to be no way to prevent this deadlock, since the user workload, unlike the kthread which has a fixed program, is free to open anything it wants. I suggest using the workaround suggested in https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home for the time being. Together with the patch that adds locks around stopping the threads (https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/), this should prevent the kernel panic at least until we have a solution for the race itself. Tomas Tomas
On Tue, 3 Sep 2024 14:47:42 +0200 Tomas Glozar <tglozar@redhat.com> wrote: > I suggest using the workaround suggested in > https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home > for the time being. Together with the patch that adds locks around > stopping the threads > (https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/), > this should prevent the kernel panic at least until we have a solution > for the race itself. I'll make the first one an official patch and start running it through my tests. We need something soon, as this is a standing bug in the kernel where the fix needs to go to stable. Thanks, -- Steve
On Tue, 3 Sep 2024 14:47:42 +0200 Tomas Glozar <tglozar@redhat.com> wrote: > I suggest using the workaround suggested in > https://lore.kernel.org/linux-trace-kernel/20240823125426.404f2705@gandalf.local.home > for the time being. Together with the patch that adds locks around > stopping the threads > (https://patchwork.kernel.org/project/linux-trace-kernel/patch/20240823102816.5e55753b@gandalf.local.home/), > this should prevent the kernel panic at least until we have a solution > for the race itself. When running my tests, the second one would end up deadlocking and triggering lockdep. I found a way to do basically the same thing with a cpumask and no added locking. I'm currently testing it and will be sending out a patch later today (if it passes the tests). -- Steve
st 4. 9. 2024 v 16:23 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > When running my tests, the second one would end up deadlocking and > triggering lockdep. I found a way to do basically the same thing with a > cpumask and no added locking. I'm currently testing it and will be sending > out a patch later today (if it passes the tests). > > -- Steve > Oh that's unfortunate. Your cpumask patch does protect from timerlat trying to stop a user workload thread via kthread_stop, but I don't think it prevents this race in the code from the other patch: static int timerlat_fd_release(struct inode *inode, struct file *file) { ... mutex_lock(&interface_lock); <-- this can now run at the same time as tlat_var_reset(), since the latter is not done under interface_lock in the latest version ... if (tlat_var->kthread) <-- if tlat_var is zeroed here, we still panic on NULL dereference hrtimer_cancel(&tlat_var->timer); ... } Either way, the results are much better: before, the kernel panicked in <10 iterations of the while loop of the reproducer; with the cpumask patch and the tlat_var->kthread check patch, it has been running on my test VM for a few hours already with no panic. Tomas
On Thu, 5 Sep 2024 12:38:37 +0200 Tomas Glozar <tglozar@redhat.com> wrote: > st 4. 9. 2024 v 16:23 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal: > > > > When running my tests, the second one would end up deadlocking and > > triggering lockdep. I found a way to do basically the same thing with a > > cpumask and no added locking. I'm currently testing it and will be sending > > out a patch later today (if it passes the tests). > > > > -- Steve > > > > Oh that's unfortunate. Your cpumask patch does protect from timerlat > trying to stop a user workload thread via kthread_stop, but I don't > think it prevents this race in the code from the other patch: > > static int timerlat_fd_release(struct inode *inode, struct file *file) > { > ... > mutex_lock(&interface_lock); > <-- this can now run at the same time as tlat_var_reset(), since > the latter is not done under interface_lock in the latest version > ... > if (tlat_var->kthread) > <-- if tlat_var is zeroed here, we still panic on NULL dereference > hrtimer_cancel(&tlat_var->timer); > ... > } Right, but my other patch was protecting the osn variable too, which I don't think is necessary. > > Either way, the results are much better: before, the kernel panicked > in <10 iterations of the while loop of the reproducer; with the > cpumask patch and the tlat_var->kthread check patch, it has been > running on my test VM for a few hours already with no panic. I think the hrtimer change is missing this: diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 8543d941b870..8e611dcee68a 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -259,6 +259,9 @@ static inline void tlat_var_reset(void) { struct timerlat_variables *tlat_var; int cpu; + + /* Synchronize with the timerlat interfaces */ + mutex_lock(&interface_lock); /* * So far, all the values are initialized as 0, so * zeroing the structure is perfect. @@ -269,6 +272,7 @@ static inline void tlat_var_reset(void) hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); } + mutex_unlock(&interface_lock); } #else /* CONFIG_TIMERLAT_TRACER */ #define tlat_var_reset() do {} while (0) I'll add it and start testing it. I haven't pushed to Linus yet. -- Steve
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a..6d2b39da4dce 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2572,6 +2572,7 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) struct osnoise_variables *osn_var; struct timerlat_variables *tlat_var; long cpu = (long) file->private_data; + int ret = 0; migrate_disable(); mutex_lock(&interface_lock); @@ -2579,6 +2580,12 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); + if (!tlat_var->kthread) { + /* the fd has been closed already */ + ret = -EBADF; + goto out; + } + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); @@ -2593,9 +2600,10 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var->kthread = NULL; } +out: mutex_unlock(&interface_lock); migrate_enable(); - return 0; + return ret; } #endif