Message ID | 20220830214919.53220-28-surenb@google.com (mailing list archive)
---|---
State | New, archived
Series | Code tagging framework and applications
On Tue, Aug 30, 2022 at 6:53 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> On 8/30/22 14:49, Suren Baghdasaryan wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index b7d03afbc808..b0f86643b8f0 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1728,6 +1728,14 @@ config LATENCYTOP
> >  	  Enable this option if you want to use the LatencyTOP tool
> >  	  to find out which userspace is blocking on what kernel operations.
> >
> > +config CODETAG_TIME_STATS
> > +	bool "Code tagging based latency measuring"
> > +	depends on DEBUG_FS
> > +	select TIME_STATS
> > +	select CODE_TAGGING
> > +	help
> > +	  Enabling this option makes latency statistics available in debugfs
>
> Missing period at the end of the sentence.

Ack.

> --
> ~Randy
On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet <kent.overstreet@linux.dev>
>
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following calls to your code, at the start and end of
> the event you wish to measure:
>
> code_tag_time_stats_start(start_time);
> code_tag_time_stats_finish(start_time);
>
> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
>
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
>
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:

Can't you do this with a simple eBPF script on top of
trace_sched_stat_* and friends?
On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
> >
> > Statistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> >
> > Statistics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> >
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
>
> Can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've
never successfully used them for debugging latency issues, or known anyone
who has. And an eBPF script to do everything this does wouldn't be simple at
all.

Honestly, the time stats stuff looks _far_ simpler to me than anything
involving tracing - and with tracing you have to correlate the start and end
events after the fact.
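The per-callsite statistics being argued about (count, average and max
duration) can be sketched in a few lines of Python. This is purely
illustrative, not the kernel TIME_STATS code, which additionally keeps
weighted frequency averages and quantiles:

```python
class TimeStats:
    """Toy per-callsite latency accumulator (illustrative only)."""

    def __init__(self):
        self.count = 0
        self.total_ns = 0
        self.max_ns = 0

    def finish(self, duration_ns):
        # Called once per completed event with its elapsed time.
        self.count += 1
        self.total_ns += duration_ns
        self.max_ns = max(self.max_ns, duration_ns)

    @property
    def avg_ns(self):
        return self.total_ns // self.count if self.count else 0


stats = TimeStats()
for d in (10_000, 232_000, 4_000):
    stats.finish(d)
print(stats.count, stats.avg_ns, stats.max_ns)  # 3 82000 232000
```

In the patch, one such accumulator would hang off each code-tagged
file:line, which is what makes the debugfs output browsable by callsite.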
On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> From: Kent Overstreet <kent.overstreet@linux.dev>
>
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following calls to your code, at the start and end of
> the event you wish to measure:
>
> code_tag_time_stats_start(start_time);
> code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
>
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
>
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
>
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:         17
> rate:           0/sec
> frequency:      2 sec
> avg duration:  10 us
> max duration: 232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
>
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:          3
> rate:           0/sec
> frequency:      4 sec
> avg duration:   4 sec
> max duration:   4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048
>
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:         10
> rate:           1/sec
> frequency:    859 ms
> avg duration: 472 ms
> max duration:  30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 17217 17217 17217 17217

For function latency you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function                         Hit    Time            Avg             s^2
  --------                         ---    ----            ---             ---
  __skb_wait_for_more_packets        1    0.000 us        0.000 us        0.000 us
  Function                         Hit    Time            Avg             s^2
  --------                         ---    ----            ---             ---
  __skb_wait_for_more_packets        1    74.813 us       74.813 us       0.000 us
  Function                         Hit    Time            Avg             s^2
  --------                         ---    ----            ---             ---
  Function                         Hit    Time            Avg             s^2
  --------                         ---    ----            ---             ---

The above is for a 4 CPU machine (one trace_stat/function* file per CPU).
The s^2 is the square of the standard deviation (leaving it squared saves
having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as the comm field), the pid of
the woken task, and the time delta in microseconds between the two events.

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram, keyed on the name of the woken task, the
priority, and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
 # event histogram
 #
 # trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
 #
 { comm: migration/5, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: migration/2, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: sleep,       prio: 79, delta: ~ 10-19 } hitcount: 1
 { comm: migration/7, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: migration/4, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: migration/6, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: migration/1, prio: 0,  delta: ~ 10-19 } hitcount: 2
 { comm: migration/0, prio: 0,  delta: ~ 10-19 } hitcount: 1
 { comm: migration/2, prio: 0,  delta: ~ 20-29 } hitcount: 1
 { comm: migration/0, prio: 0,  delta: ~ 20-29 } hitcount: 1

 Totals:
     Hits: 11
     Entries: 10
     Dropped: 0

That is a histogram of the wakeup latency of all real-time tasks that woke
up. Oh, and it does not drop events unless the number of entries is bigger
than the number of buckets, which I haven't actually encountered, as there
are 2048 buckets. But you can make it bigger with the "size" attribute in
the creation of the histogram.

-- Steve
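The "delta.buckets=10" modifier above just groups latencies into fixed-width
ranges before counting. A rough Python equivalent of that bucketing
(illustrative sketch only, not the libtracefs implementation):

```python
from collections import Counter

def bucket(delta_us, width=10):
    # Map a latency to its fixed-width range, so 12, 14 and 17
    # all land in the 10-19 bucket when width=10.
    lo = (delta_us // width) * width
    return (lo, lo + width - 1)

# Hypothetical deltas in microseconds.
hist = Counter(bucket(d) for d in (12, 14, 17, 23, 25, 41))
for (lo, hi), n in sorted(hist.items()):
    print(f"delta: ~ {lo}-{hi}  hitcount: {n}")
```

This trades per-event precision for a compact histogram, which is exactly
the point of the "we do not need to see every latency number" remark above.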
On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger
The above could almost be done with sqlhist (but I haven't implemented
"buckets=10" yet, because that's a new feature). For now, let's do log2:
# sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'
("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)
# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#
{ comm: migration/4 , prio: 0, delta: ~ 2^5 } hitcount: 1
{ comm: migration/0 , prio: 0, delta: ~ 2^4 } hitcount: 2
{ comm: rtkit-daemon , prio: 0, delta: ~ 2^7 } hitcount: 2
{ comm: rtkit-daemon , prio: 0, delta: ~ 2^6 } hitcount: 4
{ comm: migration/0 , prio: 0, delta: ~ 2^5 } hitcount: 8
{ comm: migration/4 , prio: 0, delta: ~ 2^4 } hitcount: 9
{ comm: migration/2 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/5 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/7 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/1 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/6 , prio: 0, delta: ~ 2^4 } hitcount: 10
Totals:
Hits: 76
Entries: 11
Dropped: 0
-- Steve
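The `cast(delta as log2)` above reports only the power-of-two bucket a
latency falls into. A minimal sketch of that mapping, assuming integer
microsecond deltas (hypothetical values, not from the trace above):

```python
def log2_bucket(delta_us):
    # 16..31 -> 4 (i.e. the "~ 2^4" bucket), 32..63 -> 5, and so on;
    # non-positive deltas are kept in the 2^0 bucket.
    if delta_us <= 0:
        return 0
    return delta_us.bit_length() - 1

# A 20us wakeup latency lands in the 2^4 bucket, as in the
# "delta: ~ 2^4" lines of the histogram above.
print(log2_bucket(20), log2_bucket(130))  # 4 7
```

Logarithmic buckets keep the histogram small across many orders of
magnitude, at the cost of coarser resolution than fixed-width buckets.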
On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
>
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of
the total effort. Having done so once, why not annotate that in the source
code?

> For function latency you could just do something like this:
>
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function                         Hit    Time            Avg             s^2
>   --------                         ---    ----            ---             ---
>   __skb_wait_for_more_packets        1    0.000 us        0.000 us        0.000 us
>   Function                         Hit    Time            Avg             s^2
>   --------                         ---    ----            ---             ---
>   __skb_wait_for_more_packets        1    74.813 us       74.813 us       0.000 us
>   Function                         Hit    Time            Avg             s^2
>   --------                         ---    ----            ---             ---
>   Function                         Hit    Time            Avg             s^2
>   --------                         ---    ----            ---             ---
>
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (leaving it squared saves having to do divisions while it runs).
>
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
>
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
> if not already installed on your distro.
>
>  # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'
>
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as the comm field), the pid of
> the woken task, and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups
will cause one wait_event() call to be reported as multiple smaller waits, not
one long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real-world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing
performance work, but I expect that to be far too heavy to enable on a live
server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)
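Kent's spurious-wakeup point can be made concrete with a toy timeline
(hypothetical numbers): a single wait_event() that sleeps three seconds in
total but is woken twice along the way appears as three one-second waits if
you measure wakeup-to-wakeup, and as one three-second wait if you measure at
the wait_event() level:

```python
# (sleep_ts, wake_ts) pairs in seconds for one logical wait_event() call
# that sees two spurious wakeups before its condition becomes true.
sleeps = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# What per-wakeup tracing reports: three separate short waits.
per_wakeup = [wake - sleep for sleep, wake in sleeps]

# What a wait_event-level annotation reports: one long wait.
whole_wait = sleeps[-1][1] - sleeps[0][0]

print(per_wakeup, whole_wait)  # [1.0, 1.0, 1.0] 3.0
```

With per-wakeup measurement, the 3-second stall never shows up in the "max
duration" column, which is the failure mode being described here.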
On Thu, 1 Sep 2022 17:54:38 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > > So this looks like it's gotten better since I last looked, but it's still not > there yet. > > Part of the problem is that the tracepoints themselves are in the wrong place: > your end event is when a task is woken up, but that means spurious wakeups will The end event is when a task is scheduled onto the CPU. The start event is the first time it is woken up. > cause one wait_event() call to be reported as multiple smaller waits, not one > long wait - oops, now I can't actually find the thing that's causing my > multi-second delay. > > Also, in your example you don't have it broken out by callsite. That would be > the first thing I'd need for any real world debugging. OK, how about this (currently we can only have 3 keys, but you can create multiple histograms on the same event). # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger (notice the "stacktrace" in the keys) # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist # event histogram # # trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active] # { comm: migration/2 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: migration/5 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: 
migration/1 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: migration/7 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: migration/0 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 start_kernel+0x595/0x5be secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: migration/4 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: rtkit-daemon , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 preempt_schedule_common+0x2d/0x70 preempt_schedule_thunk+0x16/0x18 _raw_spin_unlock_irq+0x2e/0x40 eventfd_write+0xc8/0x290 vfs_write+0xc0/0x2a0 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 
entry_SYSCALL_64_after_hwframe+0x61/0xcb , delta: ~ 10-19} hitcount: 1 { comm: migration/6 , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 10-19} hitcount: 7 { comm: rtkit-daemon , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 20-29} hitcount: 1 { comm: rtkit-daemon , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 preempt_schedule_common+0x2d/0x70 preempt_schedule_thunk+0x16/0x18 _raw_spin_unlock_irq+0x2e/0x40 eventfd_write+0xc8/0x290 vfs_write+0xc0/0x2a0 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb , delta: ~ 30-39} hitcount: 1 { comm: rtkit-daemon , stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 trace_event_raw_event_sched_switch+0x120/0x180 __traceiter_sched_switch+0x39/0x50 __schedule+0x310/0x700 schedule_idle+0x26/0x40 do_idle+0xb4/0xd0 cpu_startup_entry+0x19/0x20 secondary_startup_64_no_verify+0xc2/0xcb , delta: ~ 40-49} hitcount: 1 Totals: Hits: 53 Entries: 11 Dropped: 0 Not the prettiest thing to read. But hey, we got the full stack of where these latencies happened! Yes, it adds some overhead when the events are triggered due to the stacktrace code, but it's extremely useful information. 
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general.

I call BS on that statement. Just because you do not know what has been
added to the kernel in the last 10 years (like you had no idea about
seq_buf, which was added in 2014) means to me that you are totally clueless
on what tracing can and cannot do.

It appears to me that you are too focused on inventing your own wheel that
does exactly what you want before looking to see how things are today. Just
because something didn't fit your needs 10 years ago doesn't mean that it
can't fit your needs today.

> > I will definitely remember function latency tracing the next time I'm
> > doing performance work, but I expect that to be far too heavy to enable
> > on a live server.

I run it on production machines all the time. With the filtering in place it
has very little overhead, mostly in the noise. The best part is that it has
practically zero overhead (but can add some cache pressure) when it's off,
and can be turned on at run time.

The tracing infrastructure is very modular; you can use the parts of it that
you need, without the overhead of the other parts. As you found out this
week, tracepoints are not the same as trace events: tracepoints are just
hooks in the code that anything can attach to (that's what Daniel's RV work
does), while trace events provide the stored data to be recorded.

I will note that the current histogram code overhead has increased due to
retpolines, but I have code to convert them from indirect calls to direct
calls via a switch statement, which drops the overhead by 20%:

  https://lore.kernel.org/all/20220823214606.344269352@goodmis.org/

> > This thing is only a couple hundred lines of code though, so perhaps
> > tracing shouldn't be the only tool in our toolbox :)

I'm already getting complaints from customers/users that there are too many
tools in the toolbox already (do we use ftrace/perf/bpf?). The idea is to
have the tools use mostly the same infrastructure, and not be 100% off on
their own, unless there's a clear reason to invent a new wheel that several
people are asking for, not just one or two.

-- Steve
On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> > So this looks like it's gotten better since I last looked, but it's still not
> > there yet.
> >
> > Part of the problem is that the tracepoints themselves are in the wrong place:
> > your end event is when a task is woken up, but that means spurious wakeups will
>
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random,
potentially spurious wakeup.

> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though! That means I have a
ton more output to sort through, and the data is far more expensive to
collect. I don't know why it's what people go to first - see the page_owner
stuff - but that doesn't get used much either, because the output is _really
hard to sort through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.

> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
>
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general.
>
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
>
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the
both of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability. That means, in a best case
scenario, always-on data collection that I can just look at, and it'll
already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...
On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > > So this looks like it's gotten better since I last looked, but it's still not
> > > there yet.
> > >
> > > Part of the problem is that the tracepoints themselves are in the wrong place:
> > > your end event is when a task is woken up, but that means spurious wakeups will
> >
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.
>
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
>
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random,
> potentially spurious wakeup.

This would be better if we could pass the stack trace from one event to the
next; that wouldn't be too hard to implement, it just needs to be done. But
anyway:

 # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// Created an event on prepare_to_wait_event, as that's usually called just
// before wait_event.

 # sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

 # sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_sched event to the sched_switch where the task gets
// scheduled back in. But we only care if the delta between the
// prepare_to_wait_event and the schedule is less than 50us. This is a
// hack to just care about where a prepare_to_wait_event was done just
// before scheduling out.

 # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we are going to look at the deltas for which the task was sleeping on
// an event. But this just gives pids and deltas.

 # echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule-in, as sched_switch can
// only give the stack trace of when a task schedules out. Again, this is
// somewhat a hack.
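Both sqlhist invocations above are event joins keyed on pid: a "start" event
is held until an "end" event with a matching pid arrives, then the timestamp
delta is emitted. A minimal Python model of that join (hypothetical trace
data; the tuple layout is made up for illustration):

```python
def join_latencies(trace):
    # Match each "start" event to the next "end" event with the same pid
    # and emit (pid, delta), like the synthetic-event join does.
    pending = {}   # pid -> timestamp (us) of the pending start event
    deltas = []
    for name, pid, ts in trace:
        if name == "start":
            pending[pid] = ts
        elif name == "end" and pid in pending:
            deltas.append((pid, ts - pending.pop(pid)))
    return deltas

trace = [
    ("start", 61, 100), ("start", 58, 120),
    ("end",   61, 150), ("end",   58, 500),
]
print(join_latencies(trace))  # [(61, 50), (58, 380)]
```

The kernel does this matching in the histogram trigger code, so only the
joined deltas ever leave the trace buffer, not the raw event pairs.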
# cat /sys/kernel/tracing/events/synthetic/wake_sched/hist # event histogram # # trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active] # { pid: 2114, delta2: ~ 10-19 } hitcount: 1 { pid: 1389, delta2: ~ 160-169 } hitcount: 1 { pid: 1389, delta2: ~ 660-669 } hitcount: 1 { pid: 1389, delta2: ~ 1020-1029 } hitcount: 1 { pid: 1189, delta2: ~ 500020-500029 } hitcount: 1 { pid: 1189, delta2: ~ 500030-500039 } hitcount: 1 { pid: 1195, delta2: ~ 500030-500039 } hitcount: 2 { pid: 1189, delta2: ~ 500040-500049 } hitcount: 10 { pid: 1193, delta2: ~ 500040-500049 } hitcount: 3 { pid: 1197, delta2: ~ 500040-500049 } hitcount: 3 { pid: 1195, delta2: ~ 500040-500049 } hitcount: 9 { pid: 1190, delta2: ~ 500050-500059 } hitcount: 55 { pid: 1197, delta2: ~ 500050-500059 } hitcount: 51 { pid: 1191, delta2: ~ 500050-500059 } hitcount: 61 { pid: 1198, delta2: ~ 500050-500059 } hitcount: 56 { pid: 1195, delta2: ~ 500050-500059 } hitcount: 48 { pid: 1192, delta2: ~ 500050-500059 } hitcount: 54 { pid: 1194, delta2: ~ 500050-500059 } hitcount: 50 { pid: 1196, delta2: ~ 500050-500059 } hitcount: 57 { pid: 1189, delta2: ~ 500050-500059 } hitcount: 48 { pid: 1193, delta2: ~ 500050-500059 } hitcount: 52 { pid: 1194, delta2: ~ 500060-500069 } hitcount: 12 { pid: 1191, delta2: ~ 500060-500069 } hitcount: 2 { pid: 1190, delta2: ~ 500060-500069 } hitcount: 7 { pid: 1198, delta2: ~ 500060-500069 } hitcount: 9 { pid: 1193, delta2: ~ 500060-500069 } hitcount: 6 { pid: 1196, delta2: ~ 500060-500069 } hitcount: 5 { pid: 1192, delta2: ~ 500060-500069 } hitcount: 9 { pid: 1197, delta2: ~ 500060-500069 } hitcount: 9 { pid: 1195, delta2: ~ 500060-500069 } hitcount: 6 { pid: 1189, delta2: ~ 500060-500069 } hitcount: 6 { pid: 1198, delta2: ~ 500070-500079 } hitcount: 1 { pid: 1192, delta2: ~ 500070-500079 } hitcount: 2 { pid: 1193, delta2: ~ 500070-500079 } hitcount: 3 { pid: 1194, delta2: ~ 500070-500079 } hitcount: 2 { pid: 1191, delta2: ~ 500070-500079 } 
hitcount: 3 { pid: 1190, delta2: ~ 500070-500079 } hitcount: 1 { pid: 1196, delta2: ~ 500070-500079 } hitcount: 1 { pid: 1193, delta2: ~ 500080-500089 } hitcount: 1 { pid: 1192, delta2: ~ 500080-500089 } hitcount: 1 { pid: 1196, delta2: ~ 500080-500089 } hitcount: 2 { pid: 1194, delta2: ~ 500090-500099 } hitcount: 1 { pid: 1197, delta2: ~ 500090-500099 } hitcount: 1 { pid: 1193, delta2: ~ 500090-500099 } hitcount: 1 { pid: 61, delta2: ~ 503910-503919 } hitcount: 1 { pid: 61, delta2: ~ 503920-503929 } hitcount: 1 { pid: 61, delta2: ~ 503930-503939 } hitcount: 1 { pid: 61, delta2: ~ 503960-503969 } hitcount: 15 { pid: 61, delta2: ~ 503970-503979 } hitcount: 18 { pid: 61, delta2: ~ 503980-503989 } hitcount: 20 { pid: 61, delta2: ~ 504010-504019 } hitcount: 2 { pid: 61, delta2: ~ 504020-504029 } hitcount: 1 { pid: 61, delta2: ~ 504030-504039 } hitcount: 2 { pid: 58, delta2: ~ 43409960-43409969 } hitcount: 1 Totals: Hits: 718 Entries: 54 Dropped: 0 The above is useless without the following: # cat /sys/kernel/tracing/events/synthetic/wait_sched/hist # event histogram # # trigger info: hist:keys=pid:vals=hitcount:__arg_1618_2=pid,__arg_1618_3=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if delta < 10 [active] # { pid: 612 } hitcount: 1 { pid: 889 } hitcount: 2 { pid: 1389 } hitcount: 3 { pid: 58 } hitcount: 3 { pid: 2096 } hitcount: 5 { pid: 61 } hitcount: 145 { pid: 1196 } hitcount: 151 { pid: 1190 } hitcount: 151 { pid: 1198 } hitcount: 153 { pid: 1197 } hitcount: 153 { pid: 1195 } hitcount: 153 { pid: 1194 } hitcount: 153 { pid: 1191 } hitcount: 153 { pid: 1192 } hitcount: 153 { pid: 1189 } hitcount: 153 { pid: 1193 } hitcount: 153 Totals: Hits: 1685 Entries: 16 Dropped: 0 # event histogram # # trigger info: hist:keys=pid,stacktrace:vals=hitcount:sort=hitcount:size=2048 if delta < 10 [active] # { pid: 1389, stacktrace: event_hist_trigger+0x290/0x2b0 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x240 
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         pipe_read+0x318/0x420
         new_sync_read+0x18b/0x1a0
         vfs_read+0xf5/0x190
         ksys_read+0xab/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 3
{ pid: 1189, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 61, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         schedule_timeout+0x88/0x160
         kcompactd+0x364/0x3f0
         kthread+0x141/0x170
         ret_from_fork+0x22/0x30
} hitcount: 28
{ pid: 1194, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1197, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1198, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1191, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1196, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1192, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1195, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1190, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1193, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28

Totals:
    Hits: 311
    Entries: 12
    Dropped: 0

Now we just need a tool to map the pids of the delta histogram to the pids of the stack traces to figure out where the issues may happen. The above is just to show that a lot of the infrastructure is already there and does much of this work, though it needs improvement.

The theme of this email is: modify what's already there to make it work for you before doing everything from scratch. Otherwise we end up with a bunch of tools that each do only what their author wanted, and aren't flexible enough to do what others may want.

> 
> > Not the prettiest thing to read. But hey, we got the full stack of where
> > these latencies happened!
> 
> Most of the time I _don't_ want full stacktraces, though!

We could easily add a feature to limit how much you want to trace. Perhaps even a skip level. That is, add skip and depth options to the stacktrace field.

> That means I have a ton more output to sort through, and the data is far more
> expensive to collect.

That's what user space tools are for ;-)

> I don't know why it's what people go to first - see the page_owner stuff - but
> that doesn't get used much either because the output is _really hard to sort
> through_.
> 
> Most of the time, just a single file and line number is all you want - and
> tracing has always made it hard to get at that.

Because we would need to store too much DWARF information in the kernel to do so. But user space could do this for you with the function/offset information.

> > Yes, it adds some overhead when the events are triggered due to the
> > stacktrace code, but it's extremely useful information.
> > 
> > > So, it looks like tracing has made some progress over the past 10
> > > years, but for debugging latency issues it's still not there yet in
> > > general. I
> > 
> > I call BS on that statement. Just because you do not know what has been
> > added to the kernel in the last 10 years (like you had no idea about
> > seq_buf and that was added in 2014) means to me that you are totally
> > clueless on what tracing can and can not do.
> > 
> > It appears to me that you are too focused on inventing your own wheel
> > that does exactly what you want before looking to see how things are
> > today. Just because something didn't fit your needs 10 years ago
> > doesn't mean that it can't fit your needs today.
> 
> ...And the ad hominem attacks start.

Look, you keep making comments about the tracing infrastructure that you clearly do not understand, and that is pretty insulting. Sorry, I'm not sure you realize this, but those comments do turn people off, and the responses will start to become stronger.

> Steve, I'm not attacking you, and there's room enough in this world for
> the both of us to be doing our thing creating new and useful tools.

You seem to push back hard when people suggest improving other utilities to suit your needs.

> > I'm already getting complaints from customers/users that are saying
> > there's too many tools in the toolbox already. (Do we use
> > ftrace/perf/bpf?).
> > The idea is to have the tools using mostly the same
> > infrastructure, and not be 100% off on its own, unless there's a clear
> > reason to invent a new wheel that several people are asking for, not
> > just one or two.
> 
> I would like to see more focus on usability.

Then let's make the current tools more usable. For example, the synthetic event kernel interface is horrible. It's an awesome feature that wasn't getting used due to the interface. This is why I created "sqlhist": it's now really easy to create synthetic events with that tool.

I agree, focus on usability, but that doesn't always mean creating yet another tool. This reminds me of: https://xkcd.com/927/

> That means, in a best case scenario, always-on data collection that I can
> just look at, and it'll already be in the format most likely to be useful.
> 
> Surely you can appreciate the usefulness of that..?

I find "runtime turn on and off" better than "always on". We have static branches today (aka jump labels); I would strongly suggest using them. You get them automatically from tracepoints. Even sched_stats uses them.

> Tracing started out as a tool for efficiently getting lots of data out of
> the kernel, and it's great for that. But I think your focus on the cool
> thing you built may be blinding you a bit to alternative approaches...

I actually work hard to have the tracing infrastructure help out other approaches. perf and bpf use the ftrace infrastructure because it is designed to be modular. Nothing is "must be the ftrace way".

I'm not against the new features you are adding; I just want you to make a little more effort in incorporating other infrastructures (and perhaps even improving that infrastructure) to suit your needs. If ftrace, perf, and bpf can't do what you want, take a harder look to see if you can modify them to do so.

-- Steve
On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your histograms - the quantiles algorithm I've had for years is janky, I've been meaning to rip it out, and I'd love to take a look at your code for that. And having an on/off switch is a good idea; I'll try to add that at some point. Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to init_wait() and finish_wait() was really only a starting point. Figuring out the right places to measure is where I'd like to be investing my time in this area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first, but not all code uses that; init_wait() got me better coverage. But I've already seen that that misses things too - there's more work to be done.

Random thought: we might try adding a warning in schedule() any time it's called and codetag_time_stats_start() hasn't been - that would be a starting point...
On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.
> 
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
> 
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work; I just want you to work more with everyone to come up with ideas that can help everyone as a whole. That's how "open source communities" are supposed to work ;-)

The histogram and synthetic events could use some more clean-ups; there are a lot of places in that code that can be improved. But I feel the ideas behind it are sound - it's just a matter of making the implementation a bit more efficient.

> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick and thought that was "close enough" ;-)

-- Steve
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index 16fbf74edc3d..d799f4aced82 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -10,6 +10,7 @@
 
 #define CODETAG_SECTIONS()				\
 	SECTION_WITH_BOUNDARIES(alloc_tags)		\
-	SECTION_WITH_BOUNDARIES(dynamic_fault_tags)
+	SECTION_WITH_BOUNDARIES(dynamic_fault_tags)	\
+	SECTION_WITH_BOUNDARIES(time_stats_tags)
 
 #endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/linux/codetag_time_stats.h b/include/linux/codetag_time_stats.h
new file mode 100644
index 000000000000..7e44c7ee9e9b
--- /dev/null
+++ b/include/linux/codetag_time_stats.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CODETAG_TIMESTATS_H
+#define _LINUX_CODETAG_TIMESTATS_H
+
+/*
+ * Code tagging based latency tracking:
+ * (C) 2022 Kent Overstreet
+ *
+ * This allows you to easily instrument code to track latency, and have the
+ * results show up in debugfs. To use, add the following two calls to your code
+ * at the beginning and end of the event you wish to instrument:
+ *
+ * code_tag_time_stats_start(start_time);
+ * code_tag_time_stats_finish(start_time);
+ *
+ * Statistics will then show up in debugfs under /sys/kernel/debug/time_stats,
+ * listed by file and line number.
+ */
+
+#ifdef CONFIG_CODETAG_TIME_STATS
+
+#include <linux/codetag.h>
+#include <linux/time_stats.h>
+#include <linux/timekeeping.h>
+
+struct codetag_time_stats {
+	struct codetag		tag;
+	struct time_stats	stats;
+};
+
+#define codetag_time_stats_start(_start_time) u64 _start_time = ktime_get_ns()
+
+#define codetag_time_stats_finish(_start_time)			\
+do {								\
+	static struct codetag_time_stats			\
+	__used							\
+	__section("time_stats_tags")				\
+	__aligned(8) s = {					\
+		.tag		= CODE_TAG_INIT,		\
+		.stats.lock	= __SPIN_LOCK_UNLOCKED(_lock)	\
+	};							\
+								\
+	WARN_ONCE(!(_start_time), "codetag_time_stats_start() not called");\
+	time_stats_update(&s.stats, _start_time);		\
+} while (0)
+
+#else
+
+#define codetag_time_stats_finish(_start_time)	do {} while (0)
+#define codetag_time_stats_start(_start_time)	do {} while (0)
+
+#endif /* CONFIG_CODETAG_TIME_STATS */
+
+#endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..3bcef85eacd8 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -488,7 +488,7 @@ struct io_cqe {
 struct io_cmd_data {
 	struct file	*file;
 	/* each command gets 56 bytes of data */
-	__u8		data[56];
+	__u8		data[64];
 };
 
 static inline void io_kiocb_cmd_sz_check(size_t cmd_sz)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 91ced6a118bc..bab11b7ef19a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -4,6 +4,7 @@
 /*
  * Linux wait queue related types and methods
  */
+#include <linux/codetag_time_stats.h>
 #include <linux/list.h>
 #include <linux/stddef.h>
 #include <linux/spinlock.h>
@@ -32,6 +33,9 @@ struct wait_queue_entry {
 	void			*private;
 	wait_queue_func_t	func;
 	struct list_head	entry;
+#ifdef CONFIG_CODETAG_TIME_STATS
+	u64			start_time;
+#endif
 };
 
 struct wait_queue_head {
@@ -79,10 +83,17 @@ extern void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *n
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
 #endif
 
+#ifdef CONFIG_CODETAG_TIME_STATS
+#define WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER	.start_time = ktime_get_ns(),
+#else
+#define WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER
+#endif
+
 #define WAIT_FUNC_INITIALIZER(name, function) {				\
 	.private	= current,					\
 	.func		= function,					\
 	.entry		= LIST_HEAD_INIT((name).entry),			\
+	WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER				\
 }
 
 #define DEFINE_WAIT_FUNC(name, function)				\
@@ -98,6 +109,9 @@ __init_waitqueue_entry(struct wait_queue_entry *wq_entry, unsigned int flags,
 	wq_entry->private	= private;
 	wq_entry->func		= func;
 	INIT_LIST_HEAD(&wq_entry->entry);
+#ifdef CONFIG_CODETAG_TIME_STATS
+	wq_entry->start_time	= ktime_get_ns();
+#endif
 }
 
 #define init_waitqueue_func_entry(_wq_entry, _func)			\
@@ -1180,11 +1194,17 @@ do {									\
 void prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
 bool prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
 long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
-void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+void __finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
 int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
 int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
 
+#define finish_wait(_wq_head, _wq_entry)				\
+do {									\
+	codetag_time_stats_finish((_wq_entry)->start_time);		\
+	__finish_wait(_wq_head, _wq_entry);				\
+} while (0)
+
 typedef int (*task_call_f)(struct task_struct *p, void *arg);
 extern int task_call_func(struct task_struct *p, task_call_f func, void *arg);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index b9922346077d..e88de3f0c3ad 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -367,7 +367,7 @@ int do_wait_intr_irq(wait_queue_head_t *wq, wait_queue_entry_t *wait)
 EXPORT_SYMBOL(do_wait_intr_irq);
 
 /**
- * finish_wait - clean up after waiting in a queue
+ * __finish_wait - clean up after waiting in a queue
  * @wq_head: waitqueue waited on
  * @wq_entry: wait descriptor
  *
@@ -375,7 +375,7 @@ EXPORT_SYMBOL(do_wait_intr_irq);
  * the wait descriptor from the given waitqueue if still
  * queued.
  */
-void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+void __finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
 
@@ -399,7 +399,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
 		spin_unlock_irqrestore(&wq_head->lock, flags);
 	}
 }
-EXPORT_SYMBOL(finish_wait);
+EXPORT_SYMBOL(__finish_wait);
 
 int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
 {
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b7d03afbc808..b0f86643b8f0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1728,6 +1728,14 @@ config LATENCYTOP
 	  Enable this option if you want to use the LatencyTOP tool
 	  to find out which userspace is blocking on what kernel operations.
 
+config CODETAG_TIME_STATS
+	bool "Code tagging based latency measuring"
+	depends on DEBUG_FS
+	select TIME_STATS
+	select CODE_TAGGING
+	help
+	  Enabling this option makes latency statistics available in debugfs.
+
 source "kernel/trace/Kconfig"
 
 config PROVIDE_OHCI1394_DMA_INIT
diff --git a/lib/Makefile b/lib/Makefile
index e54392011f5e..d4067973805b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -233,6 +233,7 @@ obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o
 obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
 obj-$(CONFIG_TIME_STATS) += time_stats.o
+obj-$(CONFIG_CODETAG_TIME_STATS) += codetag_time_stats.o
 
 lib-$(CONFIG_GENERIC_BUG) += bug.o
diff --git a/lib/codetag_time_stats.c b/lib/codetag_time_stats.c
new file mode 100644
index 000000000000..b0e9a08308a2
--- /dev/null
+++ b/lib/codetag_time_stats.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/codetag_time_stats.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+
+static struct codetag_type *cttype;
+
+struct user_buf {
+	char __user	*buf;	/* destination user buffer */
+	size_t		size;	/* size of requested read */
+	ssize_t		ret;	/* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+	if (src->len) {
+		size_t bytes = min_t(size_t, src->len, dst->size);
+
+		/* copy_to_user() returns the number of bytes not copied */
+		if (copy_to_user(dst->buf, src->buffer, bytes))
+			return -EFAULT;
+
+		dst->ret	+= bytes;
+		dst->buf	+= bytes;
+		dst->size	-= bytes;
+		src->len	-= bytes;
+		memmove(src->buffer, src->buffer + bytes, src->len);
+	}
+
+	return 0;
+}
+
+struct time_stats_iter {
+	struct codetag_iterator	ct_iter;
+	struct seq_buf		buf;
+	char			rawbuf[4096];
+	bool			first;
+};
+
+static int time_stats_open(struct inode *inode, struct file *file)
+{
+	struct time_stats_iter *iter;
+
+	pr_debug("called");
+
+	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return -ENOMEM;
+
+	codetag_lock_module_list(cttype, true);
+	codetag_init_iter(&iter->ct_iter, cttype);
+	codetag_lock_module_list(cttype, false);
+
+	file->private_data = iter;
+	seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+	iter->first = true;
+	return 0;
+}
+
+static int time_stats_release(struct inode *inode, struct file *file)
+{
+	struct time_stats_iter *i = file->private_data;
+
+	kfree(i);
+	return 0;
+}
+
+static ssize_t time_stats_read(struct file *file, char __user *ubuf,
+			       size_t size, loff_t *ppos)
+{
+	struct time_stats_iter *iter = file->private_data;
+	struct user_buf	buf = { .buf = ubuf, .size = size };
+	struct codetag_time_stats *s;
+	struct codetag *ct;
+	int err;
+
+	codetag_lock_module_list(iter->ct_iter.cttype, true);
+	while (1) {
+		err = flush_ubuf(&buf, &iter->buf);
+		if (err || !buf.size)
+			break;
+
+		ct = codetag_next_ct(&iter->ct_iter);
+		if (!ct)
+			break;
+
+		s = container_of(ct, struct codetag_time_stats, tag);
+		if (s->stats.count) {
+			/* blank line between entries, but not before the first */
+			if (!iter->first)
+				seq_buf_putc(&iter->buf, '\n');
+			iter->first = false;
+
+			codetag_to_text(&iter->buf, &s->tag);
+			seq_buf_putc(&iter->buf, '\n');
+			time_stats_to_text(&iter->buf, &s->stats);
+		}
+	}
+	codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+	return err ?: buf.ret;
+}
+
+static const struct file_operations time_stats_ops = {
+	.owner		= THIS_MODULE,
+	.open		= time_stats_open,
+	.release	= time_stats_release,
+	.read		= time_stats_read,
+};
+
+static void time_stats_module_unload(struct codetag_type *cttype, struct codetag_module *mod)
+{
+	struct codetag_time_stats *i, *start = (void *) mod->range.start;
+	struct codetag_time_stats *end = (void *) mod->range.stop;
+
+	for (i = start; i != end; i++)
+		time_stats_exit(&i->stats);
+}
+
+static int __init codetag_time_stats_init(void)
+{
+	const struct codetag_type_desc desc = {
+		.section	= "time_stats_tags",
+		.tag_size	= sizeof(struct codetag_time_stats),
+		.module_unload	= time_stats_module_unload,
+	};
+	struct dentry *debugfs_file;
+
+	cttype = codetag_register_type(&desc);
+	if (IS_ERR_OR_NULL(cttype))
+		return PTR_ERR(cttype);
+
+	debugfs_file = debugfs_create_file("time_stats", 0666, NULL, NULL, &time_stats_ops);
+	if (IS_ERR(debugfs_file))
+		return PTR_ERR(debugfs_file);
+
+	return 0;
+}
+module_init(codetag_time_stats_init);