mbox series

[0/5] tracing: Allow trace_printk() to use the persistent ring buffer

Message ID 20240823013902.135036960@goodmis.org (mailing list archive)
Headers show
Series tracing: Allow trace_printk() to use the persistent ring buffer | expand

Message

Steven Rostedt Aug. 23, 2024, 1:39 a.m. UTC
[ Take-two because I had an extra comma at the end of my Cc list
  for sending via quilt ]

While debugging the rtla and timerlat issue[1], I started using
trace_printk() to see what was happening as the task_struct was randomly
being corrupted. But the reproducer to this bug would cause a kernel crash.
I wanted to try the new persistent ring buffer (which has not been merged
yet), to debug this. As I was using trace_printk(), I needed the
trace_printk() to write into this buffer.

trace_printk() currently only writes into the top trace buffer, and I didn't
want to use the trace_array_printk() for this, as it seemed too clunky.
Instead, I wrote up this patch series that allows an instance created on the
kernel command line to be used as the trace_printk destination buffer.

To accomplish this, I needed a way to add an option to the trace_instance
command line. I created "flags" to modify the trace instance after it is
created. If the trace_instance has a "^<flag>" in its name, it is parsed as
a flag.

  reserve_mem=12M:4096:trace  trace_instance=boot_map^traceprintk@trace

The "^traceprintk" would tell the kernel to use the boot_map instance as the
location of the trace_printk(). Now when the kernel crashed, the boot
instance would have the trace_printk() output.

Note, when I first did this, the kernel crashed when reading the boot
buffer, as the trace_printk() would default into using the trace_bprintk()
which saves the pointer to the format string and the raw arguments in the
buffer. On reading it, it would try to read the pointer where the format
string was located at in the previous kernel and crashed. To solve this, I
added a flag to the trace_array to denote that it is a boot trace buffer,
and if it is set, then the binary trace_printks will call the version that
does the formatting before adding it to the buffer. This may be slightly
slower, but it's safer.

I also noticed that using trace_printk() in the boot buffer, it would write
to the buffer on the boot after the crash. This caused events to be listed
in the buffer to be mixed between two kernels, and it was really confusing
to read. I added another boot time flag to have the tracing instance boot
with tracing disabled.

  trace_instance=boot_map^traceoff^traceprintk@trace

With this, and enabling the printk console trace event in the boot buffer, I
got nice traces like this:

 # trace-cmd start -B boot_map -e printk
 # <run test to crash kernel>
[CRASH / REBOOT ]
 # trace-cmd show -B boot_map
[..]
          <idle>-0       [004] dNh1.    45.277163: timerlat_irq: exit timer no restart
           <...>-910     [007] ....1    45.277201: wait_next_period.isra.0: hrtimer start ffff88823c7b5b28
           <...>-907     [004] ....1    45.277206: wait_next_period.isra.0: hrtimer start ffff88823c635b28
           <...>-905     [003] ....1    45.277240: timerlat_fd_release: OSN_VAR KTHREAD [3] timerlatu/3:905 ffff88810d5f4500
           <...>-920     [003] .....    45.277316: stop_kthread: kill timerlatu/3:905 ffff88810d5f4500
           <...>-920     [003] .....    45.277348: stop_kthread: OSN_VAR KTHREAD [4] timerlatu/4:907 ffff8881104ae780
           <...>-920     [003] .....    45.277357: stop_kthread: kill timerlatu/4:907 ffff8881104ae780
           <...>-920     [003] .....    45.277358: stop_kthread: OSN_VAR KTHREAD [5] timerlatu/5:908 ffff8881104aa280
           <...>-920     [003] .....    45.277365: stop_kthread: kill timerlatu/5:908 ffff8881104aa280
           <...>-920     [003] .....    45.277367: stop_kthread: OSN_VAR KTHREAD [6] timerlatu/6:909 ffff8881104ab3c0
           <...>-920     [003] .....    45.277375: stop_kthread: kill timerlatu/6:909 ffff8881104ab3c0
           <...>-920     [003] .....    45.277377: stop_kthread: OSN_VAR KTHREAD [7] timerlatu/7:910 ffff8881104a9140
           <...>-920     [003] .....    45.277386: stop_kthread: kill timerlatu/7:910 ffff8881104a9140
           <...>-909     [006] d.h1.    45.277556: timerlat_irq: enter timer
           <...>-909     [006] d.h1.    45.277561: timerlat_irq: exit timer no restart
           <...>-908     [005] d.h1.    45.277919: timerlat_irq: enter timer
           <...>-908     [005] d.h1.    45.277923: timerlat_irq: exit timer no restart
           <...>-907     [004] d.h1.    45.278080: timerlat_irq: enter timer
           <...>-910     [007] d.h1.    45.278081: timerlat_irq: enter timer
           <...>-907     [004] d.h1.    45.278083: timerlat_irq: exit timer no restart
           <...>-910     [007] d.h1.    45.278084: timerlat_irq: exit timer no restart
           <...>-1       [002] d..1.    45.278665: console: ==================================================================
           <...>-1       [002] d..1.    45.280870: console: BUG: KASAN: slab-use-after-free in proc_pid_lookup+0xbf/0x210
           <...>-1       [002] d..1.    45.282876: console: Write of size 4 at addr ffff88810d5f4528 by task systemd/1
           <...>-1       [002] d..1.    45.284781: console: 
           <...>-1       [002] d..1.    45.285295: console: CPU: 2 UID: 0 PID: 1 Comm: systemd Not tainted 6.11.0-rc3-test-00027-g014f473a3416-dirty #124 e44bf2780799707baf299f82ac321c3be7495d33
           <...>-1       [002] d..1.    45.289265: console: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
           <...>-1       [002] d..1.    45.292001: console: Call Trace:
           <...>-1       [002] d..1.    45.292764: console:  <TASK>
           <...>-1       [002] d..1.    45.293441: console:  dump_stack_lvl+0x53/0x70
           <...>-1       [002] d..1.    45.294575: console:  print_report+0xc6/0x640
           <...>-1       [002] d..1.    45.295672: console:  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
           <...>-1       [002] d..1.    45.297182: console:  ? proc_pid_lookup+0xbf/0x210
           <...>-1       [002] d..1.    45.298407: console:  kasan_report+0xc6/0x100
           <...>-1       [002] d..1.    45.299496: console:  ? proc_pid_lookup+0xbf/0x210
           <...>-1       [002] d..1.    45.300707: console:  kasan_check_range+0xf4/0x1a0
           <...>-1       [002] d..1.    45.301931: console:  proc_pid_lookup+0xbf/0x210
[..]

And I was able to easily debug the situation.

Now that I have an easy way to make the trace_printk() write to any
instance, I decided to add a run time option to the instances to tell them
to be the destination to the trace_printk(). That is, you don't need to add
a kernel command line to make the trace_printk() write to an instance. All
you need to do is set the trace_printk_dest option in the instance (note,
the trace_printk option was already taken at the top level that can be used
to disable trace_printk). Only one instance is allowed to be the
trace_printk destination, so setting this flag in one instance will clear it
in another. Also, an instance must always be the destination of the
trace_printk() function, so clearing the flag in any instance will
automatically cause the top level instance to be set. Trying to clear the
flag of the top level instance will return -EINVAL.

  echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest

With all these fun new ways to debug the kernel, I figured it's about time
to start adding it to the Documentation directory. I created a debugging.rst
file in Documentation/trace, that is the start of adding techniques in using
tracing to debug your kernel.

[1] https://lore.kernel.org/all/20240821160316.02c03c44@gandalf.local.home/

These patches are based on top of:

  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
     for-next


Steven Rostedt (5):
      tracing: Add "traceoff" flag to boot time tracing instances
      tracing: Allow trace_printk() to go to other instance buffers
      tracing: Have trace_printk not use binary prints if boot buffer
      tracing: Add option to set an instance to be the trace_printk destination
      tracing/Documentation: Start a document on how to debug with tracing

----
 Documentation/admin-guide/kernel-parameters.txt |  23 ++++
 Documentation/trace/debugging.rst               | 159 ++++++++++++++++++++++++
 Documentation/trace/ftrace.rst                  |  12 ++
 kernel/trace/trace.c                            | 127 ++++++++++++++++---
 kernel/trace/trace.h                            |   4 +-
 kernel/trace/trace_output.c                     |   5 +-
 6 files changed, 313 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/trace/debugging.rst