diff mbox series

exit: add trace_task_exit() tracepoint before current->mm is reset

Message ID 20250401184021.2591443-1-andrii@kernel.org (mailing list archive)
State New
Headers show
Series exit: add trace_task_exit() tracepoint before current->mm is reset | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko April 1, 2025, 6:40 p.m. UTC
It is useful to be able to access current->mm to, say, record a bunch of
VMA information right before the task exits (e.g., for stack
symbolization reasons when dealing with short-lived processes that exit
in the middle of profiling session). We currently do have
trace_sched_process_exit() in the exit path, but it is called a bit too
late, after exit_mm() resets current->mm to NULL, which makes it
unsuitable for inspecting and recording task's mm_struct-related data
when tracing process lifetimes.

There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm(). taskstats
performs a similar kind of accounting that some applications do with
BPF, and so co-locating them seems like a good fit.

Moving trace_sched_process_exit() a bit earlier would solve this problem
as well, and I'm open to that. But this might potentially change its
semantics a little, and so instead of risking that, I went for adding
a new trace_task_exit() tracepoint instead. Tracepoints have zero
overhead at runtime, unless actively traced, so this seems acceptable.

Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some
of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/trace/events/task.h | 24 ++++++++++++++++++++++++
 kernel/exit.c               |  2 ++
 2 files changed, 26 insertions(+)

Comments

Steven Rostedt April 1, 2025, 9:32 p.m. UTC | #1
On Tue,  1 Apr 2025 11:40:21 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

Hi Andrii,

> It is useful to be able to access current->mm to, say, record a bunch of
> VMA information right before the task exits (e.g., for stack
> symbolization reasons when dealing with short-lived processes that exit
> in the middle of profiling session). We currently do have
> trace_sched_process_exit() in the exit path, but it is called a bit too
> late, after exit_mm() resets current->mm to NULL, which makes it
> unsuitable for inspecting and recording task's mm_struct-related data
> when tracing process lifetimes.

My fear of adding another task exit trace event is that it will get a
bit confusing as that we now have trace_sched_process_exit() and also
trace_task_exit() with slightly different semantics.

How about adding a trace_exit_mm()? Add that to the exit_mm() code?

static void exit_mm(void)
{
	struct mm_struct *mm = current->mm;

	exit_mm_release(current, mm);
	trace_exit_mm(mm);

??

> 
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm(). taskstats
> performs a similar kind of accounting that some applications do with
> BPF, and so co-locating them seems like a good fit.
> 
> Moving trace_sched_process_exit() a bit earlier would solve this problem
> as well, and I'm open to that. But this might potentially change its
> semantics a little, and so instead of risking that, I went for adding
> a new trace_task_exit() tracepoint instead. Tracepoints have zero
> overhead at runtime, unless actively traced, so this seems acceptable.
> 
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.

There should be no problem adding group_dead to the
trace_sched_process_exit() trace event. Adding fields should never cause
any user API breakage.

-- Steve
Steven Rostedt April 1, 2025, 9:34 p.m. UTC | #2
On Tue, 1 Apr 2025 17:32:49 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> static void exit_mm(void)
> {
> 	struct mm_struct *mm = current->mm;
> 
> 	exit_mm_release(current, mm);
> 	trace_exit_mm(mm);
> 
> ??

That should have been:

static void exit_mm(void)
{
	struct mm_struct *mm = current->mm;

	trace_exit_mm(mm);
	exit_mm_release(current, mm);

-- Steve
Andrii Nakryiko April 1, 2025, 10:04 p.m. UTC | #3
On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue,  1 Apr 2025 11:40:21 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Hi Andrii,
>
> > It is useful to be able to access current->mm to, say, record a bunch of
> > VMA information right before the task exits (e.g., for stack
> > symbolization reasons when dealing with short-lived processes that exit
> > in the middle of profiling session). We currently do have
> > trace_sched_process_exit() in the exit path, but it is called a bit too
> > late, after exit_mm() resets current->mm to NULL, which makes it
> > unsuitable for inspecting and recording task's mm_struct-related data
> > when tracing process lifetimes.
>
> My fear of adding another task exit trace event is that it will get a
> bit confusing as that we now have trace_sched_process_exit() and also
> trace_task_exit() with slightly different semantics.
>
> How about adding a trace_exit_mm()? Add that to the exit_mm() code?

This is kind of the worst of both worlds, no? We still have a new
tracepoint, but this one can't tell if it's a `group_dead` situation
or not... I can pass group_dead into exit_mm(), but it will be just
for the sake of that new tracepoint.

How bad would it be to just move trace_sched_process_exit() then? (and
add group_dead there, as you mentioned)?

>
> static void exit_mm(void)
> {
>         struct mm_struct *mm = current->mm;
>
>         exit_mm_release(current, mm);
>         trace_exit_mm(mm);
>
> ??
>
> >
> > There is a particularly suitable place, though, right after
> > taskstats_exit() is called, but before we do exit_mm(). taskstats
> > performs a similar kind of accounting that some applications do with
> > BPF, and so co-locating them seems like a good fit.
> >
> > Moving trace_sched_process_exit() a bit earlier would solve this problem
> > as well, and I'm open to that. But this might potentially change its
> > semantics a little, and so instead of risking that, I went for adding
> > a new trace_task_exit() tracepoint instead. Tracepoints have zero
> > overhead at runtime, unless actively traced, so this seems acceptable.
> >
> > Also, existing trace_sched_process_exit() tracepoint is notoriously
> > missing `group_dead` flag that is certainly useful in practice and some
> > of our production applications have to work around this. So plumb
> > `group_dead` through while at it, to have a richer and more complete
> > tracepoint.
>
> There should be no problem adding group_dead to the
> trace_sched_process_exit() trace event. Adding fields should never cause
> any user API breakage.
>
> -- Steve
>
Steven Rostedt April 1, 2025, 10:13 p.m. UTC | #4
On Tue, 1 Apr 2025 15:04:11 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> How bad would it be to just move trace_sched_process_exit() then? (and
> add group_dead there, as you mentioned)?

I personally don't have an issue with that. In fact, the one place I used
the sched_process_exit tracepoint, I had to change to use
sched_process_free because it does too much after that.

OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary
location anyway.

-- Steve
Andrii Nakryiko April 1, 2025, 10:17 p.m. UTC | #5
On Tue, Apr 1, 2025 at 3:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 1 Apr 2025 15:04:11 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > How bad would it be to just move trace_sched_process_exit() then? (and
> > add group_dead there, as you mentioned)?
>
> I personally don't have an issue with that. In fact, the one place I used
> the sched_process_exit tracepoint, I had to change to use
> sched_process_free because it does too much after that.

heh, I ran into that as well just recently and also had to use
sched_process_free instead of sched_process_exit, because between exit
and free we still can get sched_switch tracepoint trigger (so it's a
bit too early to clean up whatever per-task state I maintain in BPF
program).

So yeah, I'm up for that as well, will send v2 just moving and
extending the existing tracepoint. Thanks!

>
> OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary
> location anyway.
>
> -- Steve
Michal Hocko April 2, 2025, 7:18 a.m. UTC | #6
On Tue 01-04-25 17:34:16, Steven Rostedt wrote:
> On Tue, 1 Apr 2025 17:32:49 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > static void exit_mm(void)
> > {
> > 	struct mm_struct *mm = current->mm;
> > 
> > 	exit_mm_release(current, mm);
> > 	trace_exit_mm(mm);
> > 
> > ??
> 
> That should have been:
> 
> static void exit_mm(void)
> {
> 	struct mm_struct *mm = current->mm;
> 
> 	trace_exit_mm(mm);
> 	exit_mm_release(current, mm);

If the primary usecase is to get an overview of the mm before exiting
then this is more appropriate.
Michal Hocko April 2, 2025, 7:20 a.m. UTC | #7
On Tue 01-04-25 15:04:11, Andrii Nakryiko wrote:
> On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Tue,  1 Apr 2025 11:40:21 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Hi Andrii,
> >
> > > It is useful to be able to access current->mm to, say, record a bunch of
> > > VMA information right before the task exits (e.g., for stack
> > > symbolization reasons when dealing with short-lived processes that exit
> > > in the middle of profiling session). We currently do have
> > > trace_sched_process_exit() in the exit path, but it is called a bit too
> > > late, after exit_mm() resets current->mm to NULL, which makes it
> > > unsuitable for inspecting and recording task's mm_struct-related data
> > > when tracing process lifetimes.
> >
> > My fear of adding another task exit trace event is that it will get a
> > bit confusing as that we now have trace_sched_process_exit() and also
> > trace_task_exit() with slightly different semantics.
> >
> > How about adding a trace_exit_mm()? Add that to the exit_mm() code?
> 
> This is kind of the worst of both worlds, no? We still have a new
> tracepoint, but this one can't tell if it's a `group_dead` situation
> or not... I can pass group_dead into exit_mm(), but it will be just
> for the sake of that new tracepoint.

Is it important to tell the difference between thread and the
whole process group exiting?

Please keep in mind that even group exit doesn't really imply the mm is
going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could
be shared outside of thread group).
Peter Zijlstra April 2, 2025, 8:27 a.m. UTC | #8
On Tue, Apr 01, 2025 at 11:40:21AM -0700, Andrii Nakryiko wrote:
> It is useful to be able to access current->mm to, say, record a bunch of
> VMA information right before the task exits (e.g., for stack
> symbolization reasons when dealing with short-lived processes that exit
> in the middle of profiling session). We currently do have
> trace_sched_process_exit() in the exit path, but it is called a bit too
> late, after exit_mm() resets current->mm to NULL, which makes it
> unsuitable for inspecting and recording task's mm_struct-related data
> when tracing process lifetimes.
> 
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm(). taskstats
> performs a similar kind of accounting that some applications do with
> BPF, and so co-locating them seems like a good fit.
> 
> Moving trace_sched_process_exit() a bit earlier would solve this problem
> as well, and I'm open to that.

I don't see a problem with moving it.
Jiri Olsa April 2, 2025, 10:27 a.m. UTC | #9
On Tue, Apr 01, 2025 at 03:17:01PM -0700, Andrii Nakryiko wrote:
> On Tue, Apr 1, 2025 at 3:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Tue, 1 Apr 2025 15:04:11 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > How bad would it be to just move trace_sched_process_exit() then? (and
> > > add group_dead there, as you mentioned)?
> >
> > I personally don't have an issue with that. In fact, the one place I used
> > the sched_process_exit tracepoint, I had to change to use
> > sched_process_free because it does too much after that.
> 
> heh, I ran into that as well just recently and also had to use
> sched_process_free instead of sched_process_exit, because between exit
> and free we still can get sched_switch tracepoint trigger (so it's a
> bit too early to clean up whatever per-task state I maintain in BPF
> program).
> 
> So yeah, I'm up for that as well, will send v2 just moving and
> extending the existing tracepoint. Thanks!

+1, it'd be great to have the group_dead info, we also need
to have some workarounds for that

thanks,
jirka

> 
> >
> > OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary
> > location anyway.
> >
> > -- Steve
>
Steven Rostedt April 2, 2025, 1:58 p.m. UTC | #10
On Wed, 2 Apr 2025 09:20:37 +0200
Michal Hocko <mhocko@suse.com> wrote:

> Is it important to tell the difference between thread and the
> whole process group exiting?
> 
> Please keep in mind that even group exit doesn't really imply the mm is
> going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could
> be shared outside of thread group).

The main reason I'm OK with just updating the sched_process_exit()
tracepoint is because it is in an arbitrary location. The process is
exiting, but because the tracepoint is basically in the middle of the
routine, it doesn't really give us any information about the actual exit.

This tracepoint does give us if a task is exiting from an mm. You are
correct, it doesn't tell us if the mm is going away. If that is the
purpose, then here should be a tracepoint in the exit_mm() code or perhaps
even in the mput() function.

-- Steve
Andrii Nakryiko April 2, 2025, 3:56 p.m. UTC | #11
On Wed, Apr 2, 2025 at 12:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 01-04-25 15:04:11, Andrii Nakryiko wrote:
> > On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Tue,  1 Apr 2025 11:40:21 -0700
> > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Hi Andrii,
> > >
> > > > It is useful to be able to access current->mm to, say, record a bunch of
> > > > VMA information right before the task exits (e.g., for stack
> > > > symbolization reasons when dealing with short-lived processes that exit
> > > > in the middle of profiling session). We currently do have
> > > > trace_sched_process_exit() in the exit path, but it is called a bit too
> > > > late, after exit_mm() resets current->mm to NULL, which makes it
> > > > unsuitable for inspecting and recording task's mm_struct-related data
> > > > when tracing process lifetimes.
> > >
> > > My fear of adding another task exit trace event is that it will get a
> > > bit confusing as that we now have trace_sched_process_exit() and also
> > > trace_task_exit() with slightly different semantics.
> > >
> > > How about adding a trace_exit_mm()? Add that to the exit_mm() code?
> >
> > This is kind of the worst of both worlds, no? We still have a new
> > tracepoint, but this one can't tell if it's a `group_dead` situation
> > or not... I can pass group_dead into exit_mm(), but it will be just
> > for the sake of that new tracepoint.
>
> Is it important to tell the difference between thread and the
> whole process group exiting?

Yes, it often is important. In the sense that process group exiting
would trigger extra information gathering/aggregation/sending compared
to just process (thread) existing. Both are important to track, but
it's useful to be able to differentiate.

>
> Please keep in mind that even group exit doesn't really imply the mm is
> going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could
> be shared outside of thread group).

Yep, I realize that, and as Steven said, for cases like that a
dedicated mm-specific tracepoint might be useful. But I'm not really
aware of use cases caring about mm_struct itself, in isolation. It's
all usually in the context of thread/process exit, and mm-related
information is one of a bunch of extra information that's useful at
that point. It's just so that with current sched_process_exit()
tracepoint placement we (e.g., BPF program) gets control a bit too
late.

But it seems like everyone is OK just shifting sched_process_exit() to
before exit_mm(), which should cover the use cases I had in mind. We
can always add mm-specific tracepoint when the need arises.

Thanks everyone for discussion and suggestions!

> --
> Michal Hocko
> SUSE Labs
diff mbox series

Patch

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index af535b053033..98f4ec060073 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -53,6 +53,30 @@  TRACE_EVENT(task_rename,
 		  __entry->oldcomm, __entry->newcomm, __entry->oom_score_adj)
 );
 
+TRACE_EVENT(task_exit,
+
+	TP_PROTO(struct task_struct *task, bool group_dead),
+
+	TP_ARGS(task, group_dead),
+
+	TP_STRUCT__entry(
+		__field(	pid_t,	pid)
+		__array(	char,	comm, TASK_COMM_LEN)
+		__field(	bool,	group_dead)
+	),
+
+	TP_fast_assign(
+		__entry->pid = task->pid;
+		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__entry->group_dead = group_dead;
+	),
+
+	TP_printk("pid=%d comm=%s group_dead=%s",
+		__entry->pid, __entry->comm,
+		__entry->group_dead ? "true" : "false"
+	)
+);
+
 /**
  * task_prctl_unknown - called on unknown prctl() option
  * @option:	option passed
diff --git a/kernel/exit.c b/kernel/exit.c
index c2e6c7b7779f..8496fc07f9c8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -54,6 +54,7 @@ 
 #include <linux/init_task.h>
 #include <linux/perf_event.h>
 #include <trace/events/sched.h>
+#include <trace/events/task.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
 #include <linux/writeback.h>
@@ -937,6 +938,7 @@  void __noreturn do_exit(long code)
 
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	trace_task_exit(tsk, group_dead);
 
 	exit_mm();