
[0/5] rtla/timerlat: Stop on signal properly when overloaded

Message ID: 20250116144931.649593-1-tglozar@redhat.com

Message

Tomas Glozar Jan. 16, 2025, 2:49 p.m. UTC
We have been seeing an issue where if rtla is run on machines with a high
number of CPUs (100+), timerlat can generate more samples than rtla is able
to process via tracefs_iterate_raw_events. This is especially common when
the interval is set to 100us (rteval and cyclictest default) as opposed
to the rtla default of 1000us, but also happens with the rtla default.

Currently, this leads to rtla hanging and having to be terminated with
SIGTERM. SIGINT setting stop_tracing is not enough, since more and more
events are coming and tracefs_iterate_raw_events never exits.

This patchset contains two changes:
- Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more events are
generated when rtla is supposed to exit. This fixes rtla hanging and should
go to stable.
- On receiving SIGINT/SIGALRM twice, abort iteration immediately with
tracefs_iterate_stop, making rtla exit right away instead of waiting for
all events to be processed. This is more of a usability feature: if the user
is in a hurry, they can Ctrl-C twice (or once after the duration has expired)
and exit immediately, discarding any events pending processing.

Note: I am sending these together only because the second change depends on
the first. The same issue should also be fixed in osnoise.

In the future, two more patchsets will be sent: one to display how many
events/samples were dropped (either left in tracefs buffer or by buffer
overflow), one to improve sample processing performance to be on par with
cyclictest (ideally) so that samples are not dropped in the cases mentioned
in the beginning of the email.

Tomas Glozar (5):
  rtla: Add trace_instance_stop
  rtla/timerlat_hist: Stop timerlat tracer on signal
  rtla/timerlat_top: Stop timerlat tracer on signal
  rtla/timerlat_hist: Abort event processing on second signal
  rtla/timerlat_top: Abort event processing on second signal

 tools/tracing/rtla/src/timerlat_hist.c | 19 ++++++++++++++++++-
 tools/tracing/rtla/src/timerlat_top.c  | 20 +++++++++++++++++++-
 tools/tracing/rtla/src/trace.c         |  8 ++++++++
 tools/tracing/rtla/src/trace.h         |  1 +
 4 files changed, 46 insertions(+), 2 deletions(-)

Comments

Steven Rostedt Jan. 17, 2025, 12:46 a.m. UTC | #1
On Thu, 16 Jan 2025 15:49:26 +0100
Tomas Glozar <tglozar@redhat.com> wrote:

> In the future, two more patchsets will be sent: one to display how many
> events/samples were dropped (either left in tracefs buffer or by buffer
> overflow), one to improve sample processing performance to be on par with
> cyclictest (ideally) so that samples are not dropped in the cases mentioned
> in the beginning of the email.

Hmm, I wonder if timerlat can handle per cpu data, then you could kick off
a thread per CPU (or a set of CPUs) where the thread is responsible for
handling the data.


		CPU_ZERO_S(cpu_size, cpusetp);
		CPU_SET_S(cpu, cpu_size, cpusetp);
		retval = tracefs_iterate_raw_events(trace->tep,
						    trace->inst,
						    cpusetp,
						    cpu_size,
						    collect_registered_events,
						    trace);

And then that iteration will only read over a subset of CPUs. Each thread
can do a different subset and then it should be able to keep up.

-- Steve
Tomas Glozar Jan. 17, 2025, 12:04 p.m. UTC | #2
pá 17. 1. 2025 v 1:46 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> Hmm, I wonder if timerlat can handle per cpu data, then you could kick off
> a thread per CPU (or a set of CPUs) where the thread is responsible for
> handling the data.
>
>
>                 CPU_ZERO_S(cpu_size, cpusetp);
>                 CPU_SET_S(cpu, cpu_size, cpusetp);
>                 retval = tracefs_iterate_raw_events(trace->tep,
>                                 trace->inst,
>                                 cpusetp,
>                                 cpu_size,
>                                 collect_registered_events,
>                                                     trace);
>
> And then that iteration will only read over a subset of CPUs. Each thread
> can do a different subset and then it should be able to keep up.
>

That's a good idea; I hadn't thought of that. But it doesn't help much
in a scenario where rtla is pinned to a few housekeeping CPUs with -H,
which is used for testing isolated-CPU-based setups.

I was thinking of turning timerlat_hist_handler/timerlat_top_handler
into a BPF program and having it executed right after the sample is
created, e.g. by using the BPF perf interface to hook it to a
tracepoint event. The histogram/counter would be stored in BPF maps,
which would be merely copied over in the main loop. This is
essentially how cyclictest does it, except in userspace. I expect this
solution to have good performance, but the obvious downside is that it
requires BPF. This is not a problem for us, but might be for other
rtla users and we'd likely have to keep both implementations of sample
processing in the code.

Also, before even starting with that, it would likely be necessary to
remove the duplicate code throughout timerlat/osnoise and test it
properly, so we don't have to make the same code changes twice or four
times.

Tomas
Steven Rostedt Jan. 17, 2025, 3:29 p.m. UTC | #3
On Fri, 17 Jan 2025 13:04:07 +0100
Tomas Glozar <tglozar@redhat.com> wrote:


> I was thinking of turning timerlat_hist_handler/timerlat_top_handler
> into a BPF program and having it executed right after the sample is
> created, e.g. by using the BPF perf interface to hook it to a
> tracepoint event. The histogram/counter would be stored in BPF maps,
> which would be merely copied over in the main loop. This is
> essentially how cyclictest does it, except in userspace. I expect this
> solution to have good performance, but the obvious downside is that it
> requires BPF. This is not a problem for us, but might be for other
> rtla users and we'd likely have to keep both implementations of sample
> processing in the code.
> 
> Also, before even starting with that, it would be likely necessary to
> remove the duplicate code throughout timerlat/osnoise and test it
> properly, so we don't have to do the same code changes twice or four
> times.

We could also add kernel helpers to the code if it would help.

Hmm, the timerlat event could probably get access to a trigger to allow it
to do the work in the kernel like what the 'hist' triggers do. We can
extend on that.

The reason I haven't written a BPF program yet, is because when I feel
there's a useful operation that can be done, I just extend ftrace to do it
;-)

-- Steve
Tomas Glozar Jan. 17, 2025, 3:55 p.m. UTC | #4
pá 17. 1. 2025 v 16:29 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> Hmm, the timerlat event could probably get access to a trigger to allow it
> to do the work in the kernel like what the 'hist' triggers do. We can
> extend on that.
>

If we only need to count the numbers, that would be a perfect
solution: use the hist trigger in place of the BPF program (no need to
reinvent the wheel!). I'll have to look at exactly what we need.

> The reason I haven't written a BPF program yet, is because when I feel
> there's a useful operation that can be done, I just extend ftrace to do it
> ;-)

I used to work with BPF, so I usually think of a BPF solution first,
then I think about whether I can replace the BPF part :D

Tomas