
[1/2] tracing: Simplify and fix "buffered event" synchronization

Message ID 20231127151248.7232-2-petr.pavlu@suse.com (mailing list archive)
State Superseded
Series tracing: Simplify and fix "buffered event" synchronization

Commit Message

Petr Pavlu Nov. 27, 2023, 3:12 p.m. UTC
The following warning appears when using buffered events:
[  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
[...]
[  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
[  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
[  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
[  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
[  203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
[  203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
[  203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
[  203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
[  203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
[  203.781846] FS:  00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
[  203.781851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
[  203.781862] Call Trace:
[  203.781870]  <TASK>
[  203.851949]  trace_event_buffer_commit+0x1ea/0x250
[  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
[  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
[  203.851990]  do_syscall_64+0x3a/0xe0
[  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  203.852090] RIP: 0033:0x7f4cd870fa77
[  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
[  203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[  203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
[  203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
[  203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
[  203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
[  204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
[  204.049256]  </TASK>

For instance, it can be triggered by running these two commands in
parallel:
$ while true; do
    echo hist:key=id.syscall:val=hitcount > \
      /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
  done
$ stress-ng --sysinfo $(nproc)

The warning indicates that the current ring_buffer_per_cpu is not in the
committing state. It happens because the active ring_buffer_event
doesn't actually come from the ring_buffer_per_cpu but is allocated from
trace_buffered_event.

The bug is in function trace_buffered_event_disable() where the
following normally happens:
* The code invokes disable_trace_buffered_event() via
  smp_call_function_many() and follows it by synchronize_rcu(). This
  increments the per-CPU variable trace_buffered_event_cnt on each
  target CPU and grants trace_buffered_event_disable() the exclusive
  access to the per-CPU variable trace_buffered_event.
* Maintenance is performed on trace_buffered_event, all per-CPU event
  buffers get freed.
* The code invokes enable_trace_buffered_event() via
  smp_call_function_many(). This decrements trace_buffered_event_cnt and
  releases the access to trace_buffered_event.

A problem is that smp_call_function_many() runs a given function on all
target CPUs except on the current one. The following can then occur:
* Task X executing trace_buffered_event_disable() runs on CPU A.
* The control reaches synchronize_rcu() and the task gets rescheduled on
  another CPU B.
* The RCU synchronization finishes. At this point,
  trace_buffered_event_disable() has the exclusive access to all
  trace_buffered_event variables except trace_buffered_event[A] because
  trace_buffered_event_cnt[A] is never incremented and, if the buffer is
  currently unused, remains set to 0.
* A different task Y is scheduled on CPU A and hits a trace event. The
  code in trace_event_buffer_lock_reserve() sees that
  trace_buffered_event_cnt[A] is set to 0 and decides to use the buffer
  provided by trace_buffered_event[A].
* Task X continues its execution in trace_buffered_event_disable(). The
  code incorrectly frees the event buffer pointed to by
  trace_buffered_event[A] and resets the variable to NULL.
* Task Y writes event data to the now-freed buffer and later detects
  the resulting inconsistency.

The issue is observable since commit dea499781a11 ("tracing: Fix warning
in trace_buffered_event_disable()") which moved the call of
trace_buffered_event_disable() in __ftrace_event_enable_disable()
earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
The underlying problem in trace_buffered_event_disable() is however
present since the original implementation in commit 0fc1b09ff1ff
("tracing: Use temp buffer when filtering events").

The bug can be fixed simply by replacing smp_call_function_many() with
on_each_cpu_mask(), but the code has other issues as well:
* Function trace_event_buffer_lock_reserve() reads trace_buffered_event
  and trace_buffered_event_cnt in the reverse order from how they are
  written in trace_buffered_event_disable(), and without any memory
  barrier. It could happen that trace_event_buffer_lock_reserve() still
  finds a valid pointer in trace_buffered_event which is being freed by
  trace_buffered_event_disable(), but then sees trace_buffered_event_cnt
  already decremented back to 0 and incorrectly decides to use the
  provided buffer.
* Function trace_buffered_event_enable() initializes memory of each
  allocated trace_buffered_event to zero but no memory barrier is
  present to ensure this operation is completed before publishing its
  pointer for use in trace_event_buffer_lock_reserve().
* Calling function trace_buffered_event_enable() normally requires
  pairing it with trace_buffered_event_disable(). However, the function
  has no error return code and, in case of a failure, decrements
  trace_buffered_event_ref back. This results in an underflow of the
  counter when trace_buffered_event_disable() gets called later.

These problems could be addressed individually, but it looks better to
rework and simplify the synchronization mechanism in the first place.

Instead of using the per-CPU variable trace_buffered_event_cnt for the
exclusive access during the disable operation, introduce a new variable
trace_buffered_event_enabled to reflect the current state and
appropriately adjust the code. The variable indicates whether buffered
events are currently enabled and whether trace_buffered_event and
trace_buffered_event_cnt are safe to use.

The updated synchronization mechanism fixes the mentioned problems and
also avoids sending IPIs across the system.

Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
---
 kernel/trace/trace.c | 66 ++++++++++++++++++--------------------------
 1 file changed, 27 insertions(+), 39 deletions(-)

Comments

Steven Rostedt Nov. 27, 2023, 5:41 p.m. UTC | #1
On Mon, 27 Nov 2023 16:12:47 +0100
Petr Pavlu <petr.pavlu@suse.com> wrote:

> The following warning appears when using buffered events:
> [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420

Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
in your version?

> [...]
> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
> [  203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
> [  203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
> [  203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
> [  203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
> [  203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
> [  203.781846] FS:  00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
> [  203.781851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
> [  203.781862] Call Trace:
> [  203.781870]  <TASK>
> [  203.851949]  trace_event_buffer_commit+0x1ea/0x250
> [  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
> [  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
> [  203.851990]  do_syscall_64+0x3a/0xe0
> [  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [  203.852090] RIP: 0033:0x7f4cd870fa77
> [  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
> [  203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
> [  203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
> [  203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
> [  203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
> [  203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
> [  204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
> [  204.049256]  </TASK>
> 
> For instance, it can be triggered by running these two commands in
> parallel:
> $ while true; do
>     echo hist:key=id.syscall:val=hitcount > \
>       /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
>   done
> $ stress-ng --sysinfo $(nproc)
> 
> The warning indicates that the current ring_buffer_per_cpu is not in the
> committing state. It happens because the active ring_buffer_event
> doesn't actually come from the ring_buffer_per_cpu but is allocated from
> trace_buffered_event.
> 
> The bug is in function trace_buffered_event_disable() where the
> following normally happens:
> * The code invokes disable_trace_buffered_event() via
>   smp_call_function_many() and follows it by synchronize_rcu(). This
>   increments the per-CPU variable trace_buffered_event_cnt on each
>   target CPU and grants trace_buffered_event_disable() the exclusive
>   access to the per-CPU variable trace_buffered_event.
> * Maintenance is performed on trace_buffered_event, all per-CPU event
>   buffers get freed.
> * The code invokes enable_trace_buffered_event() via
>   smp_call_function_many(). This decrements trace_buffered_event_cnt and
>   releases the access to trace_buffered_event.
> 
> A problem is that smp_call_function_many() runs a given function on all
> target CPUs except on the current one. The following can then occur:
> * Task X executing trace_buffered_event_disable() runs on CPU A.
> * The control reaches synchronize_rcu() and the task gets rescheduled on
>   another CPU B.
> * The RCU synchronization finishes. At this point,
>   trace_buffered_event_disable() has the exclusive access to all
>   trace_buffered_event variables except trace_buffered_event[A] because
>   trace_buffered_event_cnt[A] is never incremented and if the buffer is
>   currently unused, remains set to 0.
> * A different task Y is scheduled on CPU A and hits a trace event. The
>   code in trace_event_buffer_lock_reserve() sees that
>   trace_buffered_event_cnt[A] is set to 0 and decides to use the buffer
>   provided by trace_buffered_event[A].
> * Task X continues its execution in trace_buffered_event_disable(). The
>   code incorrectly frees the event buffer pointed to by
>   trace_buffered_event[A] and resets the variable to NULL.
> * Task Y writes event data to the now freed buffer and later detects the
>   created inconsistency.
> 
> The issue is observable since commit dea499781a11 ("tracing: Fix warning
> in trace_buffered_event_disable()") which moved the call of
> trace_buffered_event_disable() in __ftrace_event_enable_disable()
> earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
> The underlying problem in trace_buffered_event_disable() is however
> present since the original implementation in commit 0fc1b09ff1ff
> ("tracing: Use temp buffer when filtering events").
> 
> The bug is simply fixable by replacing smp_call_function_many() by
> on_each_cpu_mask(), but the code has other issues as well:

Or by simply calling the update for the local CPU as well as the many:

	preempt_disable();
	/* For each CPU, set the buffer as used. */
	disable_trace_buffered_event(NULL);
	smp_call_function_many(tracing_buffer_mask,
			       disable_trace_buffered_event, NULL, 1);
	preempt_enable();

> * Function trace_event_buffer_lock_reserve() reads trace_buffered_event
>   and trace_buffered_event_cnt in reverse order than they are written in
>   trace_buffered_event_disable() and without any memory barrier. It
>   could happen that trace_event_buffer_lock_reserve() still finds
>   a valid pointer in trace_buffered_event which is being freed by
>   trace_buffered_event_disable() but later already sees the decremented
>   value of trace_buffered_event_cnt back to 0 and incorrectly decides to
>   use the provided buffer.

Not an issue (see below)

> * Function trace_buffered_event_enable() initializes memory of each
>   allocated trace_buffered_event to zero but no memory barrier is
>   present to ensure this operation is completed before publishing its
>   pointer for use in trace_event_buffer_lock_reserve().

Yes, a wmb() may be helpful, but as the reader will see either the
allocated page or NULL, and can handle both cases. I decided not to add it
because it really doesn't matter if we do.

> * Calling function trace_buffered_event_enable() normally requires
>   pairing it with trace_buffered_event_disable(). However, the function
>   has no error return code and in case of a failure decrements back
>   trace_buffered_event_ref. This results in underflow of the counter
>   when trace_buffered_event_disable() gets called later.

Not an issue.

Even on failure, the ref count will be greater than zero, and
trace_buffered_event_disable() handles that properly; the freeing code
can handle no pages being allocated.

That is:

	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));

would be the same as: free_page((unsigned long)NULL);

Which is perfectly fine to do.

> 
> Instead of using the per-CPU variable trace_buffered_event_cnt for the
> exclusive access during the disable operation, introduce a new variable
> trace_buffered_event_enabled to reflect the current state and
> appropriately adjust the code. The variable indicates whether buffered
> events are currently enabled and trace_buffered_event together with
> trace_buffered_event_cnt are ok to use.

The current logic is perfectly fine. The only bug here (which you found)
was the missing update to the counter of the current CPU.

> 
> The updated synchronization mechanism fixes the mentioned problems and
> also avoids sending IPIs across the system.
> 
> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
> Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> ---

Let me explain why it works.

In trace_event_buffer_lock_reserve():

		preempt_disable_notrace();
		if ((entry = __this_cpu_read(trace_buffered_event))) {
			int max_len = PAGE_SIZE - struct_size(entry, array, 1);

			val = this_cpu_inc_return(trace_buffered_event_cnt);

			if (val == 1 && likely(len <= max_len)) {

If val == 1 and len <= max_len, then this is going to use the "buffered event".

Notice that preemption is disabled.

				trace_event_setup(entry, type, trace_ctx);
				entry->array[0] = len;
				/* Return with preemption disabled */
				return entry;

And we return with preemption disabled!

Everything after this does not use the "buffered event" and is not part of
this code.

			}
			this_cpu_dec(trace_buffered_event_cnt);
		}
		/* __trace_buffer_lock_reserve() disables preemption */
		preempt_enable_notrace();


Now, on commit (when we are done with the "buffered event"):

__buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
{
	__this_cpu_write(trace_taskinfo_save, true);

	/* If this is the temp buffer, we need to commit fully */
	if (this_cpu_read(trace_buffered_event) == event) {
		/* Length is in event->array[0] */
		ring_buffer_write(buffer, event->array[0], &event->array[1]);
		/* Release the temp buffer */
		this_cpu_dec(trace_buffered_event_cnt);
		/* ring_buffer_unlock_commit() enables preemption */
		preempt_enable_notrace();

Preemption is finally enabled here. That is, this could not preempt from
the time we found the "buffered event" to the time we released it.

	} else
		ring_buffer_unlock_commit(buffer);
}


Now lets look at the trace_buffered_event_disable():

void trace_buffered_event_disable(void)
{
	int cpu;

	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));

	if (WARN_ON_ONCE(!trace_buffered_event_ref))
		return;

	if (--trace_buffered_event_ref)
		return;

	preempt_disable();
	/* For each CPU, set the buffer as used. */
	smp_call_function_many(tracing_buffer_mask,
			       disable_trace_buffered_event, NULL, 1);
	preempt_enable();

The above will up the counter of all the buffers (after the bug you found
is fixed ;-)

That means, we just need to wait till all the currently running users of
the buffer are done with it.

	/* Wait for all current users to finish */
	synchronize_rcu();

The synchronize_rcu() also synchronizes preempt disabled sections. That is,
after the synchronize_rcu() completes, all users of the "buffered event"
are done with it, and because we upped the count, there will be no more
users.

That is *there is no race here*!

At this point, there are no users of the "buffered event" and we can do
whatever we want without locks or memory barriers. This is RCU 101.

	for_each_tracing_cpu(cpu) {
		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
		per_cpu(trace_buffered_event, cpu) = NULL;
	}
	/*
	 * Make sure trace_buffered_event is NULL before clearing
	 * trace_buffered_event_cnt.
	 */
	smp_wmb();

	preempt_disable();
	/* Do the work on each cpu */
	smp_call_function_many(tracing_buffer_mask,
			       enable_trace_buffered_event, NULL, 1);
	preempt_enable();
}



Now, the reason you found the crash was from the first issue you mentioned.
That is, we didn't disable the current CPU and if we migrated, things would
get really messed up. The only fix here is to make sure all CPUs have their
trace_buffered_event_cnt get incremented.

Thanks,

-- Steve
Petr Pavlu Nov. 28, 2023, 3:05 p.m. UTC | #2
On 11/27/23 18:41, Steven Rostedt wrote:
> On Mon, 27 Nov 2023 16:12:47 +0100
> Petr Pavlu <petr.pavlu@suse.com> wrote:
> 
>> The following warning appears when using buffered events:
>> [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
> 
> Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
> in your version?

The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
code and RB_WARN_ON():
  3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
  3896                                  struct ring_buffer_event *event)
  3897  {
[...]
  3908           * This must only be called if the event has not been
  3909           * committed yet. Thus we can assume that preemption
  3910           * is still disabled.
  3911           */
  3912          RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
  3913  
  3914          rb_decrement_entry(cpu_buffer, event);
  3915          if (rb_try_to_discard(cpu_buffer, event))
  3916                  goto out;

https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912

>> [...]
>> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
>> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
>> [  203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
>> [  203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
>> [  203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
>> [  203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
>> [  203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [  203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
>> [  203.781846] FS:  00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
>> [  203.781851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
>> [  203.781862] Call Trace:
>> [  203.781870]  <TASK>
>> [  203.851949]  trace_event_buffer_commit+0x1ea/0x250
>> [  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
>> [  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
>> [  203.851990]  do_syscall_64+0x3a/0xe0
>> [  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [  203.852090] RIP: 0033:0x7f4cd870fa77
>> [  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
>> [  203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
>> [  203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
>> [  203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
>> [  203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
>> [  203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
>> [  204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
>> [  204.049256]  </TASK>
>>
>> For instance, it can be triggered by running these two commands in
>> parallel:
>> $ while true; do
>>     echo hist:key=id.syscall:val=hitcount > \
>>       /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
>>   done
>> $ stress-ng --sysinfo $(nproc)
>>
>> The warning indicates that the current ring_buffer_per_cpu is not in the
>> committing state. It happens because the active ring_buffer_event
>> doesn't actually come from the ring_buffer_per_cpu but is allocated from
>> trace_buffered_event.
>>
>> The bug is in function trace_buffered_event_disable() where the
>> following normally happens:
>> * The code invokes disable_trace_buffered_event() via
>>   smp_call_function_many() and follows it by synchronize_rcu(). This
>>   increments the per-CPU variable trace_buffered_event_cnt on each
>>   target CPU and grants trace_buffered_event_disable() the exclusive
>>   access to the per-CPU variable trace_buffered_event.
>> * Maintenance is performed on trace_buffered_event, all per-CPU event
>>   buffers get freed.
>> * The code invokes enable_trace_buffered_event() via
>>   smp_call_function_many(). This decrements trace_buffered_event_cnt and
>>   releases the access to trace_buffered_event.
>>
>> A problem is that smp_call_function_many() runs a given function on all
>> target CPUs except on the current one. The following can then occur:
>> * Task X executing trace_buffered_event_disable() runs on CPU A.
>> * The control reaches synchronize_rcu() and the task gets rescheduled on
>>   another CPU B.
>> * The RCU synchronization finishes. At this point,
>>   trace_buffered_event_disable() has the exclusive access to all
>>   trace_buffered_event variables except trace_buffered_event[A] because
>>   trace_buffered_event_cnt[A] is never incremented and if the buffer is
>>   currently unused, remains set to 0.
>> * A different task Y is scheduled on CPU A and hits a trace event. The
>>   code in trace_event_buffer_lock_reserve() sees that
>>   trace_buffered_event_cnt[A] is set to 0 and decides to use the buffer
>>   provided by trace_buffered_event[A].
>> * Task X continues its execution in trace_buffered_event_disable(). The
>>   code incorrectly frees the event buffer pointed to by
>>   trace_buffered_event[A] and resets the variable to NULL.
>> * Task Y writes event data to the now freed buffer and later detects the
>>   created inconsistency.
>>
>> The issue is observable since commit dea499781a11 ("tracing: Fix warning
>> in trace_buffered_event_disable()") which moved the call of
>> trace_buffered_event_disable() in __ftrace_event_enable_disable()
>> earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
>> The underlying problem in trace_buffered_event_disable() is however
>> present since the original implementation in commit 0fc1b09ff1ff
>> ("tracing: Use temp buffer when filtering events").
>>
>> The bug is simply fixable by replacing smp_call_function_many() by
>> on_each_cpu_mask(), but the code has other issues as well:
> 
> Or by simply calling the update for the local CPU as well as the many:
> 
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	disable_trace_buffered_event(NULL);
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> 
>> * Function trace_event_buffer_lock_reserve() reads trace_buffered_event
>>   and trace_buffered_event_cnt in reverse order than they are written in
>>   trace_buffered_event_disable() and without any memory barrier. It
>>   could happen that trace_event_buffer_lock_reserve() still finds
>>   a valid pointer in trace_buffered_event which is being freed by
>>   trace_buffered_event_disable() but later already sees the decremented
>>   value of trace_buffered_event_cnt back to 0 and incorrectly decides to
>>   use the provided buffer.
> 
> Not an issue (see below)
> 
>> * Function trace_buffered_event_enable() initializes memory of each
>>   allocated trace_buffered_event to zero but no memory barrier is
>>   present to ensure this operation is completed before publishing its
>>   pointer for use in trace_event_buffer_lock_reserve().
> 
> Yes, a wmb() may be helpful, but as the reader will see either the
> allocated page or NULL, and can handle both cases. I decided not to add it
> because it really doesn't matter if we do.

My concern is about the earlier call "memset(event, 0, sizeof(*event));"
in trace_buffered_event_enable() which writes the obtained event page
prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
I think another CPU running trace_event_buffer_lock_reserve() could
already see the valid event pointer, write to the event page, and only
then observe the memset() write coming from
trace_buffered_event_enable(). The event data would then become
corrupted.

> 
>> * Calling function trace_buffered_event_enable() normally requires
>>   pairing it with trace_buffered_event_disable(). However, the function
>>   has no error return code and in case of a failure decrements back
>>   trace_buffered_event_ref. This results in underflow of the counter
>>   when trace_buffered_event_disable() gets called later.
> 
> Not an issue.
> 
> Even on failure the ref count will be greater than zero, where 
> trace_buffered_event_disable() handles it properly, and the freeing can
> handle no pages being allocated.
> 
> That is:
> 
> 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 
> would be the same as: free_page((unsigned long)NULL);
> 
> Which is perfectly fine to do.

The situation that I have on my mind is the following:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
  trace_buffered_event_enable() is called. The function increments
  trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
  is called for cleanup.
* Function trace_buffered_event_disable() decrements
  trace_buffered_event_ref back to 0, recognizes that it was the last
  use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
  The caller of trace_buffered_event_enable() has no information that
  the function actually failed.
* Some time later, the soft mode is disabled for the same event.
  Function trace_buffered_event_disable() is called. It warns on
  "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.

The end result is only an ugly warning but I think it is still good to
fix.

>>
>> Instead of using the per-CPU variable trace_buffered_event_cnt for the
>> exclusive access during the disable operation, introduce a new variable
>> trace_buffered_event_enabled to reflect the current state and
>> appropriately adjust the code. The variable indicates whether buffered
>> events are currently enabled and trace_buffered_event together with
>> trace_buffered_event_cnt are ok to use.
> 
> The current logic is perfectly fine. The only bug here (which you found)
> was the missing update to the counter of the current CPU.
> 
>>
>> The updated synchronization mechanism fixes the mentioned problems and
>> also avoids sending IPIs across the system.
>>
>> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
>> Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>> ---
> 
> Let me explain why it works.
> 
> In trace_event_buffer_lock_reserve():
> 
> 		preempt_disable_notrace();
> 		if ((entry = __this_cpu_read(trace_buffered_event))) {
> 			int max_len = PAGE_SIZE - struct_size(entry, array, 1);
> 
> 			val = this_cpu_inc_return(trace_buffered_event_cnt);
> 
> 			if (val == 1 && likely(len <= max_len)) {
> 
> If val == 1 and len <= max_len, then this is going to use the "buffered event".
> 
> Notice that preemption is disabled.
> 
> 				trace_event_setup(entry, type, trace_ctx);
> 				entry->array[0] = len;
> 				/* Return with preemption disabled */
> 				return entry;
> 
> And we return with preemption disabled!
> 
> Everything after this does not use the "buffered event" and is not part of
> this code.
> 
> 			}
> 			this_cpu_dec(trace_buffered_event_cnt);
> 		}
> 		/* __trace_buffer_lock_reserve() disables preemption */
> 		preempt_enable_notrace();
> 
> 
> Now, on commit (when we are done with the "buffered event"):
> 
> __buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
> {
> 	__this_cpu_write(trace_taskinfo_save, true);
> 
> 	/* If this is the temp buffer, we need to commit fully */
> 	if (this_cpu_read(trace_buffered_event) == event) {
> 		/* Length is in event->array[0] */
> 		ring_buffer_write(buffer, event->array[0], &event->array[1]);
> 		/* Release the temp buffer */
> 		this_cpu_dec(trace_buffered_event_cnt);
> 		/* ring_buffer_unlock_commit() enables preemption */
> 		preempt_enable_notrace();
> 
> Preemption is finally enabled here. That is, this could not preempt from
> the time we found the "buffered event" to the time we released it.
> 
> 	} else
> 		ring_buffer_unlock_commit(buffer);
> }
> 
> 
> Now let's look at trace_buffered_event_disable():
> 
> void trace_buffered_event_disable(void)
> {
> 	int cpu;
> 
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
> 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
> 		return;
> 
> 	if (--trace_buffered_event_ref)
> 		return;
> 
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> 
> The above will increment the counter for all the buffers (after the bug
> you found is fixed ;-)
> 
> That means, we just need to wait till all the currently running users of
> the buffer are done with it.
> 
> 	/* Wait for all current users to finish */
> 	synchronize_rcu();
> 
> The synchronize_rcu() also synchronizes preempt disabled sections. That is,
> after the synchronize_rcu() completes, all users of the "buffered event"
> are done with it, and because we upped the count, there will be no more
> users.
> 
> That is *there is no race here*!
> 
> At this point, there are no users of the "buffered event" and we can do
> whatever we want without locks or memory barriers. This is RCU 101.
> 
> 	for_each_tracing_cpu(cpu) {
> 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 		per_cpu(trace_buffered_event, cpu) = NULL;
> 	}
> 	/*
> 	 * Make sure trace_buffered_event is NULL before clearing
> 	 * trace_buffered_event_cnt.
> 	 */
> 	smp_wmb();
> 
> 	preempt_disable();
> 	/* Do the work on each cpu */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       enable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> }
> 
> 
> 
> Now, the reason you found the crash was from the first issue you mentioned.
> That is, we didn't disable the current CPU and if we migrated, things would
> get really messed up. The only fix here is to make sure all CPUs have their
> trace_buffered_event_cnt get incremented.

The potential race that I see is the following:
* Function trace_buffered_event_disable() is called on CPU 0. It
  increments trace_buffered_event_cnt on each CPU and waits via
  synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is already finished, the situation is that all
  counters trace_buffered_event_cnt are at 1 and all pointers
  trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
  trace_event_buffer_lock_reserve(). The function calls
  preempt_disable_notrace() and enters an RCU read-side critical section
  but that is not important because all RCU stuff is already performed
  and completed in trace_buffered_event_disable(). The function proceeds
  and reads a still valid pointer from trace_buffered_event[CPU1] into
  the local variable "entry". However, it doesn't yet read
  trace_buffered_event_cnt[CPU1] which happens later.
* Function trace_buffered_event_disable() continues. It frees
  trace_buffered_event[CPU1] and decrements
  trace_buffered_event_cnt[CPU1] back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
  increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
  believe that it can use the "entry" that it already obtained but the
  pointer is now invalid and any access results in a use-after-free.

Thanks,
Petr
Steven Rostedt Nov. 28, 2023, 3:27 p.m. UTC | #3
On Tue, 28 Nov 2023 16:05:00 +0100
Petr Pavlu <petr.pavlu@suse.com> wrote:

> On 11/27/23 18:41, Steven Rostedt wrote:
> > On Mon, 27 Nov 2023 16:12:47 +0100
> > Petr Pavlu <petr.pavlu@suse.com> wrote:
> >   
> >> The following warning appears when using buffered events:
> >> [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420  
> > 
> > Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
> > in your version?  
> 
> The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
> code and RB_WARN_ON():

Interesting.

>   3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
>   3896                                  struct ring_buffer_event *event)
>   3897  {
> [...]
>   3908           * This must only be called if the event has not been
>   3909           * committed yet. Thus we can assume that preemption
>   3910           * is still disabled.
>   3911           */
>   3912          RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
>   3913  
>   3914          rb_decrement_entry(cpu_buffer, event);
>   3915          if (rb_try_to_discard(cpu_buffer, event))
>   3916                  goto out;
> 
> https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912
> 
> >> [...]
> >> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a

I was looking at the above 6.7.0-rc2-default, which made me somewhat confused.


> >> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
> >> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
> >> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 

> >> * Function trace_buffered_event_enable() initializes memory of each
> >>   allocated trace_buffered_event to zero but no memory barrier is
> >>   present to ensure this operation is completed before publishing its
> >>   pointer for use in trace_event_buffer_lock_reserve().  
> > 
> > Yes, a wmb() may be helpful, but as the reader will see either the
> > allocated page or NULL, and can handle both cases. I decided not to add it
> > because it really doesn't matter if we do.  
> 
> My concern is about the earlier call "memset(event, 0, sizeof(*event));"
> in trace_buffered_event_enable() which writes the obtained event page
> prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
> I think another CPU running trace_event_buffer_lock_reserve() could see
> already the valid event pointer, write to the event page and only then
> the memory could see the memset() write coming from
> trace_buffered_event_enable(). The event data would then become
> corrupted.

OK, I'll buy that. Feel free to send a patch that adds that wmb with a
comment explaining the above.

> 
> >   
> >> * Calling function trace_buffered_event_enable() normally requires
> >>   pairing it with trace_buffered_event_disable(). However, the function
> >>   has no error return code and in case of a failure decrements back
> >>   trace_buffered_event_ref. This results in underflow of the counter
> >>   when trace_buffered_event_disable() gets called later.  
> > 
> > Not an issue.
> > 
> > Even on failure the ref count will be greater than zero, where 
> > trace_buffered_event_disable() handles it properly, and the freeing can
> > handle no pages being allocated.
> > 
> > That is:
> > 
> > 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> > 
> > would be the same as: free_page((unsigned long)NULL);
> > 
> > Which is perfectly fine to do.  
> 
> The situation that I have on my mind is the following:
> * The counter trace_buffered_event_ref is at 0.
> * The soft mode gets enabled for some event and
>   trace_buffered_event_enable() is called. The function increments
>   trace_buffered_event_ref to 1 and starts allocating event pages.
> * The allocation fails for some page and trace_buffered_event_disable()
>   is called for cleanup.
> * Function trace_buffered_event_disable() decrements
>   trace_buffered_event_ref back to 0, recognizes that it was the last
>   use of buffered events and frees all allocated pages.
> * The control goes back to trace_buffered_event_enable() which returns.
>   The caller of trace_buffered_event_enable() has no information that
>   the function actually failed.
> * Some time later, the soft mode is disabled for the same event.
>   Function trace_buffered_event_disable() is called. It warns on
>   "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
> 
> The end result is only an ugly warning but I think it is still good to
> fix.

Ah, I missed the call to trace_buffered_event_disable() in the failure
path. Actually, since the trace_buffered_event is only an optimization, I
think the best thing to do is to just return without the clean up,
expecting that the other caller will do the clean up. We don't even need
to stop the loop!

That is, just have:

void trace_buffered_event_enable(void)
{
	struct ring_buffer_event *event;
	struct page *page;
	int cpu;

	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));

	if (trace_buffered_event_ref++)
		return;

	for_each_tracing_cpu(cpu) {
		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_NORETRY, 0);

		/* This is just an optimization and can handle failures */
		if (!page) {
			pr_err("Failed to create event buffer\n");
			break;
		}

		event = page_address(page);
		memset(event, 0, sizeof(*event));

		per_cpu(trace_buffered_event, cpu) = event;

		preempt_disable();
		if (cpu == smp_processor_id() &&
		    __this_cpu_read(trace_buffered_event) !=
		    per_cpu(trace_buffered_event, cpu))
			WARN_ON_ONCE(1);
		preempt_enable();
	}
}

> > 
> > Now, the reason you found the crash was from the first issue you mentioned.
> > That is, we didn't disable the current CPU and if we migrated, things would
> > get really messed up. The only fix here is to make sure all CPUs have their
> > trace_buffered_event_cnt get incremented.  
> 
> The potential race that I see is the following:
> * Function trace_buffered_event_disable() is called on CPU 0. It
>   increments trace_buffered_event_cnt on each CPU and waits via
>   synchronize_rcu() for each user of trace_buffered_event to complete.

> * After synchronize_rcu() is already finished, the situation is that all
>   counters trace_buffered_event_cnt are at 1 and all pointers
>   trace_buffered_event are still valid.

> * At this point, on a different CPU 1, the execution reaches
>   trace_event_buffer_lock_reserve(). The function calls
>   preempt_disable_notrace() and enters an RCU read-side critical section
>   but that is not important because all RCU stuff is already performed
>   and completed in trace_buffered_event_disable(). The function proceeds
>   and reads a still valid pointer from trace_buffered_event[CPU1] into
>   the local variable "entry". However, it doesn't yet read
>   trace_buffered_event_cnt[CPU1] which happens later.

> * Function trace_buffered_event_disable() continues. It frees
>   trace_buffered_event[CPU1] and decrements
>   trace_buffered_event_cnt[CPU1] back to 0.

> * Function trace_event_buffer_lock_reserve() continues. It reads and
>   increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
>   believe that it can use the "entry" that it already obtained but the
>   pointer is now invalid and any access results in a use-after-free.

OK, so you are saying we need another synchronize_rcu() after we set the
trace_buffered_event to NULL?


void trace_buffered_event_disable(void)
{
	int cpu;

	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));

	if (WARN_ON_ONCE(!trace_buffered_event_ref))
		return;

	if (--trace_buffered_event_ref)
		return;

	preempt_disable();
	/* For each CPU, set the buffer as used. */
	smp_call_function_many(tracing_buffer_mask,
			       disable_trace_buffered_event, NULL, 1);
	preempt_enable();

	/* Wait for all current users to finish */
	synchronize_rcu();

	for_each_tracing_cpu(cpu) {
		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
		per_cpu(trace_buffered_event, cpu) = NULL;
	}
	/*
	 * Make sure trace_buffered_event is NULL before clearing
	 * trace_buffered_event_cnt.
	 */
-	smp_wmb();
+	synchronize_rcu();

	preempt_disable();
	/* Do the work on each cpu */
	smp_call_function_many(tracing_buffer_mask,
			       enable_trace_buffered_event, NULL, 1);
	preempt_enable();
}

I agree with that.

-- Steve
Petr Pavlu Nov. 29, 2023, 9:22 a.m. UTC | #4
On 11/28/23 16:27, Steven Rostedt wrote:
> On Tue, 28 Nov 2023 16:05:00 +0100
> Petr Pavlu <petr.pavlu@suse.com> wrote:
> 
>> On 11/27/23 18:41, Steven Rostedt wrote:
>>> On Mon, 27 Nov 2023 16:12:47 +0100
>>> Petr Pavlu <petr.pavlu@suse.com> wrote:
>>>   
>>>> The following warning appears when using buffered events:
>>>> [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420  
>>>
>>> Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
>>> in your version?  
>>
>> The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
>> code and RB_WARN_ON():
> 
> Interesting.
> 
>>   3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
>>   3896                                  struct ring_buffer_event *event)
>>   3897  {
>> [...]
>>   3908           * This must only be called if the event has not been
>>   3909           * committed yet. Thus we can assume that preemption
>>   3910           * is still disabled.
>>   3911           */
>>   3912          RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
>>   3913  
>>   3914          rb_decrement_entry(cpu_buffer, event);
>>   3915          if (rb_try_to_discard(cpu_buffer, event))
>>   3916                  goto out;
>>
>> https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912
>>
>>>> [...]
>>>> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
> 
> I was looking at the above 6.7.0-rc2-default, which made me somewhat confused.
> 
> 
>>>> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>>>> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>>>> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 
> 
>>>> * Function trace_buffered_event_enable() initializes memory of each
>>>>   allocated trace_buffered_event to zero but no memory barrier is
>>>>   present to ensure this operation is completed before publishing its
>>>>   pointer for use in trace_event_buffer_lock_reserve().  
>>>
>>> Yes, a wmb() may be helpful, but as the reader will see either the
>>> allocated page or NULL, and can handle both cases. I decided not to add it
>>> because it really doesn't matter if we do.  
>>
>> My concern is about the earlier call "memset(event, 0, sizeof(*event));"
>> in trace_buffered_event_enable() which writes the obtained event page
>> prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
>> I think another CPU running trace_event_buffer_lock_reserve() could see
>> already the valid event pointer, write to the event page and only then
>> the memory could see the memset() write coming from
>> trace_buffered_event_enable(). The event data would then become
>> corrupted.
> 
> OK, I'll buy that. Feel free to send a patch that adds that wmb with a
> comment explaining the above.

Sure, I can do that. A nice thing about my proposed patch is that it
requires only one smp_wmb() in this case, while the current code would
require adding this barrier in the allocation loop prior to writing
trace_buffered_event.

>>>   
>>>> * Calling function trace_buffered_event_enable() normally requires
>>>>   pairing it with trace_buffered_event_disable(). However, the function
>>>>   has no error return code and in case of a failure decrements back
>>>>   trace_buffered_event_ref. This results in underflow of the counter
>>>>   when trace_buffered_event_disable() gets called later.  
>>>
>>> Not an issue.
>>>
>>> Even on failure the ref count will be greater than zero, where 
>>> trace_buffered_event_disable() handles it properly, and the freeing can
>>> handle no pages being allocated.
>>>
>>> That is:
>>>
>>> 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>>>
>>> would be the same as: free_page((unsigned long)NULL);
>>>
>>> Which is perfectly fine to do.  
>>
>> The situation that I have on my mind is the following:
>> * The counter trace_buffered_event_ref is at 0.
>> * The soft mode gets enabled for some event and
>>   trace_buffered_event_enable() is called. The function increments
>>   trace_buffered_event_ref to 1 and starts allocating event pages.
>> * The allocation fails for some page and trace_buffered_event_disable()
>>   is called for cleanup.
>> * Function trace_buffered_event_disable() decrements
>>   trace_buffered_event_ref back to 0, recognizes that it was the last
>>   use of buffered events and frees all allocated pages.
>> * The control goes back to trace_buffered_event_enable() which returns.
>>   The caller of trace_buffered_event_enable() has no information that
>>   the function actually failed.
>> * Some time later, the soft mode is disabled for the same event.
>>   Function trace_buffered_event_disable() is called. It warns on
>>   "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
>>
>> The end result is only an ugly warning but I think it is still good to
>> fix.
> 
> Ah, I missed the call to trace_buffered_event_disable() in the failure
> path. Actually, since the trace_buffered_event is only an optimization, I
> think the best thing to do is to just return without the clean up,
> expecting that the other caller will do the clean up. We don't even need
> to stop the loop!
> 
> That is, just have:
> 
> void trace_buffered_event_enable(void)
> {
> 	struct ring_buffer_event *event;
> 	struct page *page;
> 	int cpu;
> 
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
> 	if (trace_buffered_event_ref++)
> 		return;
> 
> 	for_each_tracing_cpu(cpu) {
> 		page = alloc_pages_node(cpu_to_node(cpu),
> 					GFP_KERNEL | __GFP_NORETRY, 0);
> 
> 		/* This is just an optimization and can handle failures */
> 		if (!page) {
> 			pr_err("Failed to create event buffer\n");
> 			break;
> 		}
> 
> 		event = page_address(page);
> 		memset(event, 0, sizeof(*event));
> 
> 		per_cpu(trace_buffered_event, cpu) = event;
> 
> 		preempt_disable();
> 		if (cpu == smp_processor_id() &&
> 		    __this_cpu_read(trace_buffered_event) !=
> 		    per_cpu(trace_buffered_event, cpu))
> 			WARN_ON_ONCE(1);
> 		preempt_enable();
> 	}
> }

The current behavior is that if the allocation in one call to
trace_buffered_event_enable() fails then another call to the same
function results in a new attempt to allocate the event pages.

The code above changes it. As long as trace_buffered_event_ref doesn't go
back to 0 and then 1 again, no new allocation attempt is made. It means
that trace_buffered_event can be left in this "half-initialized" state
for a while.

This is possibly ok, it is an unlikely corner case anyway. My patch
nonetheless preserves the current behavior of retrying the allocation.

> 
>>>
>>> Now, the reason you found the crash was from the first issue you mentioned.
>>> That is, we didn't disable the current CPU and if we migrated, things would
>>> get really messed up. The only fix here is to make sure all CPUs have their
>>> trace_buffered_event_cnt get incremented.  
>>
>> The potential race that I see is the following:
>> * Function trace_buffered_event_disable() is called on CPU 0. It
>>   increments trace_buffered_event_cnt on each CPU and waits via
>>   synchronize_rcu() for each user of trace_buffered_event to complete.
> 
>> * After synchronize_rcu() is already finished, the situation is that all
>>   counters trace_buffered_event_cnt are at 1 and all pointers
>>   trace_buffered_event are still valid.
> 
>> * At this point, on a different CPU 1, the execution reaches
>>   trace_event_buffer_lock_reserve(). The function calls
>>   preempt_disable_notrace() and enters an RCU read-side critical section
>>   but that is not important because all RCU stuff is already performed
>>   and completed in trace_buffered_event_disable(). The function proceeds
>>   and reads a still valid pointer from trace_buffered_event[CPU1] into
>>   the local variable "entry". However, it doesn't yet read
>>   trace_buffered_event_cnt[CPU1] which happens later.
> 
>> * Function trace_buffered_event_disable() continues. It frees
>>   trace_buffered_event[CPU1] and decrements
>>   trace_buffered_event_cnt[CPU1] back to 0.
> 
>> * Function trace_event_buffer_lock_reserve() continues. It reads and
>>   increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
>>   believe that it can use the "entry" that it already obtained but the
>>   pointer is now invalid and any access results in a use-after-free.
> 
> OK, so you are saying we need another synchronize_rcu() after we set the
> trace_buffered_event to NULL?
> 
> 
> void trace_buffered_event_disable(void)
> {
> 	int cpu;
> 
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
> 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
> 		return;
> 
> 	if (--trace_buffered_event_ref)
> 		return;
> 
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> 
> 	/* Wait for all current users to finish */
> 	synchronize_rcu();
> 
> 	for_each_tracing_cpu(cpu) {
> 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 		per_cpu(trace_buffered_event, cpu) = NULL;
> 	}
> 	/*
> 	 * Make sure trace_buffered_event is NULL before clearing
> 	 * trace_buffered_event_cnt.
> 	 */
> -	smp_wmb();
> +	synchronize_rcu();
> 
> 	preempt_disable();
> 	/* Do the work on each cpu */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       enable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> }
> 
> I agree with that.

Yes, I believe this should address this potential race condition.

An alternative would be instead to update
trace_event_buffer_lock_reserve() as follows:

	if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
		smp_rmb();
		if ((entry = __this_cpu_read(trace_buffered_event))) {
			[...]

That saves the synchronize_rcu() call but additionally modifies
trace_buffered_event_cnt even if trace_buffered_event is currently NULL.

Another alternative is the approach taken by my patch which avoids more
RCU work and unnecessary memory modifications.

I'd be interested if you could have a look again at what I'm proposing
in my patch. I think it simplifies the code while addressing these
problems as well. However, if you have reservations about that approach
then it is ok, I can fix the found problems individually as discussed.

Thanks,
Petr
Steven Rostedt Nov. 29, 2023, 2:58 p.m. UTC | #5
On Wed, 29 Nov 2023 10:22:23 +0100
Petr Pavlu <petr.pavlu@suse.com> wrote:

> Yes, I believe this should address this potential race condition.
> 
> An alternative would be instead to update
> trace_event_buffer_lock_reserve() as follows:
> 
> 	if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
> 		smp_rmb();

This is the problem I have with your approach. That smp_rmb() is in the
highly critical path. On some architectures, this has a significant impact
on the overhead of this code. This path is called during code execution and
increases the overhead of the tracing infrastructure.

If I'm given two solutions where one adds a smp_rmb() to the critical path
and the other just slows down the non-critical path more, I'll take the
slow down of non-critical path every time.

> 		if ((entry = __this_cpu_read(trace_buffered_event))) {
> 			[...]
> 
> That saves the synchronize_rcu() call but additionally modifies
> trace_buffered_event_cnt even if trace_buffered_event is currently NULL.
> 
> Another alternative is the approach taken by my patch which avoids more
> RCU work and unnecessary memory modifications.
> 
> I'd be interested if you could have a look again at what I'm proposing
> in my patch. I think it simplifies the code while addressing these
> problems as well. However, if you have reservations about that approach
> then it is ok, I can fix the found problems individually as discussed.

Fix this without adding any memory barriers to the critical path, then I'll
take another look.

FYI, this code was designed in the first place to avoid adding memory
barriers in the critical path.

Thanks!

-- Steve
Petr Pavlu Dec. 1, 2023, 2:17 p.m. UTC | #6
On 11/29/23 15:58, Steven Rostedt wrote:
> On Wed, 29 Nov 2023 10:22:23 +0100
> Petr Pavlu <petr.pavlu@suse.com> wrote:
> 
>> Yes, I believe this should address this potential race condition.
>>
>> An alternative would be instead to update
>> trace_event_buffer_lock_reserve() as follows:
>>
>> 	if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
>> 		smp_rmb();
> 
> This is the problem I have with your approach. That smp_rmb() is in the
> highly critical path. On some architectures, this has a significant impact
> on the overhead of this code. This path is called during code execution and
> increases the overhead of the tracing infrastructure.
> 
> If I'm given two solutions where one adds a smp_rmb() to the critical path
> and the other just slows down the non-critical path more, I'll take the
> slow down of non-critical path every time.
> 
>> 		if ((entry = __this_cpu_read(trace_buffered_event))) {
>> 			[...]
>>
>> That saves the synchronize_rcu() call but additionally modifies
>> trace_buffered_event_cnt even if trace_buffered_event is currently NULL.
>>
>> Another alternative is the approach taken by my patch which avoids more
>> RCU work and unnecessary memory modifications.
>>
>> I'd be interested if you could have a look again at what I'm proposing
>> in my patch. I think it simplifies the code while addressing these
>> problems as well. However, if you have reservations about that approach
>> then it is ok, I can fix the found problems individually as discussed.
> 
> Fix this without adding any memory barriers to the critical path, then I'll
> take another look.
> 
> FYI, this code was designed in the first place to avoid adding memory
> barriers in the critical path.

Thank you for the explanation. I ran the tests you mentioned in the
description of commit 0fc1b09ff1ff ("tracing: Use temp buffer when
filtering events") to understand this aspect a bit more. I confirm that
my proposed patch makes the tracing slower, usually by single digit
percentages. I understand this is not welcome. I also realize that the
slowdown might be even worse in different situations and on other
architectures that I checked (arm64, x86_64).

Ok, keeping the current approach, my plan for v2 is to prepare the
following patches:

* Fix for the missing increment+decrement of trace_buffered_event_cnt
  on the current CPU in trace_buffered_event_disable().

  Replace smp_call_function_many() with on_each_cpu_mask() in
  trace_buffered_event_disable(). The on_each_cpu_mask() function also has
  the advantage that it disables preemption itself, so doing that explicitly
  can then be removed from trace_buffered_event_disable().

* Fix the potential race between trace_buffered_event_enable() and
  trace_event_buffer_lock_reserve() where the latter might already see
  a valid trace_buffered_event pointer but not all initialization yet.

  I think this might be actually best to address by using the same
  maintenance exclusion as is implemented in
  trace_buffered_event_disable(). It would make both maintenance
  operations consistent but for the cost of making the enable operation
  somewhat slower.

* Fix the WARN_ON_ONCE(!trace_buffered_event_ref) issued in
  trace_buffered_event_disable() when trace_buffered_event_enable()
  previously fails.

  Add a variable/flag tracking whether trace_buffered_event is currently
  allocated and use it to decide whether a new allocation is needed when
  trace_buffered_event_enable() is called, or whether the buffers should
  really be freed when trace_buffered_event_disable() is invoked.

  Not sure if the mentioned alternative of leaving trace_buffered_event
  partially initialized on failure is preferred instead.

* Fix the potential race between trace_buffered_event_disable() and
  trace_event_buffer_lock_reserve() where the latter might still grab
  a pointer from trace_buffered_event that is being freed.

  Replace smp_wmb() with synchronize_rcu() in
  trace_buffered_event_disable().

Thanks,
Petr
Steven Rostedt Dec. 1, 2023, 2:46 p.m. UTC | #7
On Fri, 1 Dec 2023 15:17:35 +0100
Petr Pavlu <petr.pavlu@suse.com> wrote:

> Ok, keeping the current approach, my plan for v2 is to prepare the
> following patches:
> 
> * Fix for the missing increment+decrement of trace_buffered_event_cnt
>   on the current CPU in trace_buffered_event_disable().
> 
>   Replace smp_call_function_many() with on_each_cpu_mask() in
>   trace_buffered_event_disable(). The on_each_cpu_mask() function has
>   also an advantage that it itself disables preemption so doing that can
>   be then removed from trace_buffered_event_disable().

OK.

> 
> * Fix the potential race between trace_buffered_event_enable() and
>   trace_event_buffer_lock_reserve() where the latter might already see
>   a valid trace_buffered_event pointer but not all initialization yet.
> 
>   I think this might be actually best to address by using the same
>   maintenance exclusion as is implemented in
>   trace_buffered_event_disable(). It would make both maintenance
>   operations consistent but for the cost of making the enable operation
>   somewhat slower.

I wouldn't do them the same just to make them consistent. I think the
smp_wmb() is sufficient. Don't you think?

> 
> * Fix the WARN_ON_ONCE(!trace_buffered_event_ref) issued in
>   trace_buffered_event_disable() when trace_buffered_event_enable()
>   previously fails.
> 
>   Add a variable/flag tracking whether trace_buffered_event is currently
>   allocated and use that for driving if a new allocation needs to be
>   done when trace_buffered_event_enable() is called, or the buffers
>   should be really freed when trace_buffered_event_disable() is invoked.
> 
>   Not sure if the mentioned alternative of leaving trace_buffered_event
>   partially initialized on failure is preferred instead.

I do not really have a preference for either solution. They both are bad if
it happens ;-)

> 
> * Fix the potential race between trace_buffered_event_disable() and
>   trace_event_buffer_lock_reserve() where the latter might still grab
>   a pointer from trace_buffered_event that is being freed.
> 
>   Replace smp_wmb() with synchronize_rcu() in
>   trace_buffered_event_disable().

Sounds good.

Thanks!

-- Steve
Petr Pavlu Dec. 5, 2023, 4:16 p.m. UTC | #8
On 12/1/23 15:46, Steven Rostedt wrote:
> On Fri, 1 Dec 2023 15:17:35 +0100
> Petr Pavlu <petr.pavlu@suse.com> wrote:
> 
>> Ok, keeping the current approach, my plan for v2 is to prepare the
>> following patches:
>>
>> [...]
>> * Fix the potential race between trace_buffered_event_enable() and
>>   trace_event_buffer_lock_reserve() where the latter might already see
>>   a valid trace_buffered_event pointer but not all initialization yet.
>>
>>   I think this might be actually best to address by using the same
>>   maintenance exclusion as is implemented in
>>   trace_buffered_event_disable(). It would make both maintenance
>>   operations consistent but for the cost of making the enable operation
>>   somewhat slower.
> 
> I wouldn't do them the same just to make them consistent. I think the
> smp_wmb() is sufficient. Don't you think?

Looking at this again, I think it is actually a non-issue. Function
trace_buffered_event_enable() only writes the header part of the
ring_buffer_event, but that part is neither written nor read by the actual
users, which obtain the buffer from trace_event_buffer_lock_reserve().

No change is then needed; this fix is left out in v2 of the series.

-- Petr

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..123acb9105dd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2741,6 +2741,7 @@  trace_buffer_lock_reserve(struct trace_buffer *buffer,
 DEFINE_PER_CPU(struct ring_buffer_event *, trace_buffered_event);
 DEFINE_PER_CPU(int, trace_buffered_event_cnt);
 static int trace_buffered_event_ref;
+static bool trace_buffered_event_enabled;
 
 /**
  * trace_buffered_event_enable - enable buffering events
@@ -2764,7 +2765,9 @@  void trace_buffered_event_enable(void)
 
 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
 
-	if (trace_buffered_event_ref++)
+	trace_buffered_event_ref++;
+
+	if (trace_buffered_event_enabled)
 		return;
 
 	for_each_tracing_cpu(cpu) {
@@ -2777,30 +2780,21 @@  void trace_buffered_event_enable(void)
 		memset(event, 0, sizeof(*event));
 
 		per_cpu(trace_buffered_event, cpu) = event;
-
-		preempt_disable();
-		if (cpu == smp_processor_id() &&
-		    __this_cpu_read(trace_buffered_event) !=
-		    per_cpu(trace_buffered_event, cpu))
-			WARN_ON_ONCE(1);
-		preempt_enable();
 	}
 
+	/*
+	 * Ensure all initialization changes are visible and publish
+	 * availability of trace_buffered_event.
+	 */
+	smp_wmb();
+	WRITE_ONCE(trace_buffered_event_enabled, true);
 	return;
- failed:
-	trace_buffered_event_disable();
-}
 
-static void enable_trace_buffered_event(void *data)
-{
-	/* Probably not needed, but do it anyway */
-	smp_rmb();
-	this_cpu_dec(trace_buffered_event_cnt);
-}
-
-static void disable_trace_buffered_event(void *data)
-{
-	this_cpu_inc(trace_buffered_event_cnt);
+ failed:
+	for_each_tracing_cpu(cpu) {
+		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
+		per_cpu(trace_buffered_event, cpu) = NULL;
+	}
 }
 
 /**
@@ -2820,33 +2814,19 @@  void trace_buffered_event_disable(void)
 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
 		return;
 
-	if (--trace_buffered_event_ref)
+	if (--trace_buffered_event_ref || !trace_buffered_event_enabled)
 		return;
 
-	preempt_disable();
-	/* For each CPU, set the buffer as used. */
-	smp_call_function_many(tracing_buffer_mask,
-			       disable_trace_buffered_event, NULL, 1);
-	preempt_enable();
+	WRITE_ONCE(trace_buffered_event_enabled, false);
 
 	/* Wait for all current users to finish */
 	synchronize_rcu();
 
 	for_each_tracing_cpu(cpu) {
+		WARN_ON_ONCE(per_cpu(trace_buffered_event_cnt, cpu) != 0);
 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
 		per_cpu(trace_buffered_event, cpu) = NULL;
 	}
-	/*
-	 * Make sure trace_buffered_event is NULL before clearing
-	 * trace_buffered_event_cnt.
-	 */
-	smp_wmb();
-
-	preempt_disable();
-	/* Do the work on each cpu */
-	smp_call_function_many(tracing_buffer_mask,
-			       enable_trace_buffered_event, NULL, 1);
-	preempt_enable();
 }
 
 static struct trace_buffer *temp_buffer;
@@ -2883,9 +2863,16 @@  trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
 		 * is still quicker than no copy on match, but having
 		 * to discard out of the ring buffer on a failed match.
 		 */
-		if ((entry = __this_cpu_read(trace_buffered_event))) {
+		if (READ_ONCE(trace_buffered_event_enabled)) {
 			int max_len = PAGE_SIZE - struct_size(entry, array, 1);
 
+			/*
+			 * Ensure per-CPU variables trace_buffered_event(_cnt)
+			 * are read after trace_buffered_event_enabled, pairs
+			 * with smp_wmb() in trace_buffered_event_enable().
+			 */
+			smp_rmb();
+
 			val = this_cpu_inc_return(trace_buffered_event_cnt);
 
 			/*
@@ -2906,6 +2893,7 @@  trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
 			 * handle the failure in that case.
 			 */
 			if (val == 1 && likely(len <= max_len)) {
+				entry = __this_cpu_read(trace_buffered_event);
 				trace_event_setup(entry, type, trace_ctx);
 				entry->array[0] = len;
 				/* Return with preemption disabled */