
[v3] ring-buffer: Remove 32bit timestamp logic

Message ID 20231214125433.03091e5e@gandalf.local.home
State Superseded

Commit Message

Steven Rostedt Dec. 14, 2023, 5:54 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Each event has a 27-bit timestamp delta that holds the delta from the
last event. If the time between events is greater than 2^27, then a
timestamp is added that holds a 59-bit absolute timestamp.
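
In rough pseudo-C, the scheme looks like this (a simplified sketch for
illustration only, not the ring buffer's actual code; the event_emit_*()
helpers are hypothetical):

	#define TS_DELTA_BITS	27

	static u64 last_ts;	/* timestamp of the previous event */

	static void record_event_time(u64 now)
	{
		u64 delta = now - last_ts;

		if (delta >= (1ULL << TS_DELTA_BITS)) {
			/*
			 * The delta does not fit in 27 bits: emit an
			 * extra event carrying the 59-bit absolute
			 * timestamp instead, and zero the delta.
			 */
			event_emit_absolute(now);
			delta = 0;
		}
		event_emit_delta(delta);
		last_ts = now;
	}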

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, the deltas
of all nested events would be set to zero to avoid dealing with the races
that this entails. Commit a389d86f7fd09 changed that to handle the races,
so that all events, even those that preempt other events, have an accurate
timestamp.

Handling those races requires performing a 64-bit cmpxchg on the
timestamps. But doing a 64-bit cmpxchg on 32-bit architectures is
considered very slow. To deal with this, the timestamp logic was broken
into two and then three 32-bit cmpxchgs, on the theory that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to see whether three 32-bit cmpxchgs are indeed faster
than a single 64-bit one. After a few people brushed the dust off their
old 32-bit machines, tests were done, and even though a 32-bit cmpxchg was
faster than a single 64-bit one, it was on the order of 50% faster at
best, not 300%.

This means that this code is not only complex, but the complexity buys at
best a modest speedup over simply using a 64-bit cmpxchg.

Nuke it!

This is basically a revert of 10464b4aa605e ("ring-buffer: Add rb_time_t
64 bit operations for speeding up 32 bit").

Cc: stable@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v2: https://lore.kernel.org/linux-trace-kernel/20231213232957.498cd339@gandalf.local.home

-- Added a check for architectures that can not handle cmpxchg in NMI, or
   x86 machines that do not support a true 64-bit cmpxchg, and have them
   bail out of tracing when in NMI context.

 kernel/trace/ring_buffer.c | 232 ++++---------------------------------
 1 file changed, 22 insertions(+), 210 deletions(-)

Comments

Linus Torvalds Dec. 14, 2023, 7:46 p.m. UTC | #1
On Thu, 14 Dec 2023 at 09:53, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> +       /*
> +        * For architectures that can not do cmpxchg() in NMI, or require
> +        * disabling interrupts to do 64-bit cmpxchg(), do not allow them
> +        * to record in NMI context.
> +        */
> +       if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
> +            (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64))) &&
> +           unlikely(in_nmi())) {
> +               return NULL;
> +       }

Again, this is COMPLETE GARBAGE.

You're using "ARCH_HAVE_NMI_SAFE_CMPXCHG" to test something that just
isn't what it's about.

Having an NMI-safe cmpxchg does *not* mean that you actually have an
NMI-safe 64-bit version.

You can't test it that way.

Stop making random changes that just happen to work on the one machine
you tested it on.

           Linus
Steven Rostedt Dec. 14, 2023, 8:19 p.m. UTC | #2
On Thu, 14 Dec 2023 11:46:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 14 Dec 2023 at 09:53, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > +       /*
> > +        * For architectures that can not do cmpxchg() in NMI, or require
> > +        * disabling interrupts to do 64-bit cmpxchg(), do not allow them
> > +        * to record in NMI context.
> > +        */
> > +       if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
> > +            (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64))) &&
> > +           unlikely(in_nmi())) {
> > +               return NULL;
> > +       }  
> 
> Again, this is COMPLETE GARBAGE.
> 
> You're using "ARCH_HAVE_NMI_SAFE_CMPXCHG" to test something that just
> isn't what it's about.

For the 64-bit cmpxchg, no, it isn't useful, but this code also calls the
normal cmpxchg. I had that part of the if condition as a separate patch,
not for this purpose, but for architectures that do not support cmpxchg
in NMI at all. Those are broken here too, and that check fixes them.

> 
> Having an NMI-safe cmpxchg does *not* mean that you actually have an
> NMI-safe 64-bit version.

Understood, but we also have a normal cmpxchg in this path.

> 
> You can't test it that way.
> 
> Stop making random changes that just happen to work on the one machine
> you tested it on.

They are not random, but they are there for two different reasons. I still
have the standalone patch that adds the ARCH_HAVE_NMI_SAFE_CMPXCHG check
for the purpose of not calling this code on architectures that do not
support cmpxchg in NMI, because if cmpxchg is not supported in NMI, then
this code should bail.

My mistake was to combine that change into this one, which made it look
like they are related, when in actuality they are not. That is why there's
a "||" and not an "&&".

For this issue of the 64-bit cmpxchg, is there any config that works for
any arch that does not have a safe 64-bit cmpxchg? At least for the 486,
is the second half of the if condition reasonable?

	if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64)) {
		if (unlikely(in_nmi()))
			return NULL;
	}

?

-- Steve
Steven Rostedt Dec. 14, 2023, 8:27 p.m. UTC | #3
On Thu, 14 Dec 2023 15:19:11 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> For this issue of the 64-bit cmpxchg, is there any config that works for
> any arch that does not have a safe 64-bit cmpxchg? At least for the 486,
> is the second half of the if condition reasonable?
> 
> 	if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64)) {
> 		if (unlikely(in_nmi()))
> 			return NULL;
> 	}

Ignore this, I'm now reading your other email. I'll have more questions there.

-- Steve
Linus Torvalds Dec. 14, 2023, 8:30 p.m. UTC | #4
On Thu, 14 Dec 2023 at 12:18, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> For this issue of the 64-bit cmpxchg, is there any config that works for
> any arch that does not have a safe 64-bit cmpxchg? At least for the 486,
> is the second half of the if condition reasonable?
>
>         if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64)) {
>                 if (unlikely(in_nmi()))
>                         return NULL;
>         }

No.

Read my email. Don't do random x86-centric things. We have that

  #ifndef system_has_cmpxchg64
      #define system_has_cmpxchg64() false
  #endif

which should work.

NOTE! The above is for 32-bit architectures only! For 64-bit ones, just
use cmpxchg directly. And if you need a 128-bit one, there's
system_has_cmpxchg128...
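
In this code, that would be something like the following untested sketch
(32-bit only, relying on the system_has_cmpxchg64() fallback above):

  #if BITS_PER_LONG == 32
        /* no usable 64-bit cmpxchg: don't record in NMI context */
        if (!system_has_cmpxchg64() && unlikely(in_nmi()))
                return NULL;
  #endif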

                 Linus
Linus Torvalds Dec. 14, 2023, 8:32 p.m. UTC | #5
On Thu, 14 Dec 2023 at 12:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Read my email. Don't do random x86-centric things. We have that
>
>   #ifndef system_has_cmpxchg64
>       #define system_has_cmpxchg64() false
>   #endif
>
> which should work.

And again, by "should work" I mean that it would disable this entirely
on things like arm32 until the arm people decide they care. But at
least it won't use an unsafe non-working 64-bit cmpxchg.

And no, for 6.7, only fix reported bugs. No big reorgs at all,
particularly for something that has likely never been hit by any user
and, by the sound of it, all just came out of a discussion.

              Linus
Steven Rostedt Dec. 14, 2023, 8:47 p.m. UTC | #6
On Thu, 14 Dec 2023 12:32:38 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 14 Dec 2023 at 12:30, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Read my email. Don't do random x86-centric things. We have that
> >
> >   #ifndef system_has_cmpxchg64
> >       #define system_has_cmpxchg64() false
> >   #endif
> >
> > which should work.  
> 
> And again, by "should work" I mean that it would disable this entirely
> on things like arm32 until the arm people decide they care. But at
> least it won't use an unsafe non-working 64-bit cmpxchg.

If archs have no implementation of cmpxchg64, then the code cannot be
removed. If it's just slower and emulated, then it shouldn't be a big
deal. The only thing we may lose is tracing in NMI context, and I'm not
sure that really matters.

> 
> And no, for 6.7, only fix reported bugs. No big reorgs at all,
> particularly for something that likely has never been hit by any user
> and sounds like this all just came out of discussion.

The discussion came out of adding new tests to cover new changes I'm
making in the ring buffer code, which happened to trigger subtle bugs in
the 32-bit version of reading the 64-bit timestamps. That code exists
because of the 64-bit cmpxchg that is required to implement it. If we are
keeping this code, then there are 2 or 3 fixes to it that I need to send
you, and also one outstanding issue that in theory can be a bug, but in
practice is highly unlikely to ever be hit. The fix for that one is a bit
more involved, and will have to come later.

When I was discussing these fixes and other changes with Mathieu, we
started thinking "is this complexity worth it?" and "does anything actually
need this?".

That's where this patch originated from.

Now, I could also make this special code compile only for the
"!system_has_cmpxchg64" case, which would avoid punishing the Atom
processor with 3 cmpxchgs instead of one cmpxchg8b.
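
Something along these lines (untested sketch; rb_time_cmpxchg_32/_64 are
just stand-in names for the split 32-bit version and the local64_t
version):

	static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
	{
	#if BITS_PER_LONG == 32
		/*
		 * Fall back to the split 32-bit code only on machines
		 * without a real 64-bit cmpxchg (e.g. 486):
		 */
		if (!system_has_cmpxchg64())
			return rb_time_cmpxchg_32(t, expect, set);
	#endif
		return rb_time_cmpxchg_64(t, expect, set);
	}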

-- Steve

Patch

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..9fdbd08af72f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -27,6 +27,7 @@ 
 #include <linux/cpu.h>
 #include <linux/oom.h>
 
+#include <asm/local64.h>
 #include <asm/local.h>
 
 /*
@@ -463,27 +464,9 @@  enum {
 	RB_CTX_MAX
 };
 
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
-	local_t		cnt;
-	local_t		top;
-	local_t		bottom;
-	local_t		msb;
-};
-#else
-#include <asm/local64.h>
 struct rb_time_struct {
 	local64_t	time;
 };
-#endif
 typedef struct rb_time_struct rb_time_t;
 
 #define MAX_NEST	5
@@ -573,179 +556,9 @@  struct ring_buffer_iter {
 	int				missed_events;
 };
 
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- *  - Reads may fail if it interrupted a modification of the time stamp.
- *      It will succeed if it did not interrupt another write even if
- *      the read itself is interrupted by a write.
- *      It returns whether it was successful or not.
- *
- *  - Writes always succeed and will overwrite other writes and writes
- *      that were done by events interrupting the current write.
- *
- *  - A write followed by a read of the same time stamp will always succeed,
- *      but may not contain the same value.
- *
- *  - A cmpxchg will fail if it interrupted another write or cmpxchg.
- *      Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- *  (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- *  top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT	30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT	 60
-
-static inline int rb_time_cnt(unsigned long val)
-{
-	return (val >> RB_TIME_SHIFT) & 3;
-}
-
-static inline u64 rb_time_val(unsigned long top, unsigned long bottom)
-{
-	u64 val;
-
-	val = top & RB_TIME_VAL_MASK;
-	val <<= RB_TIME_SHIFT;
-	val |= bottom & RB_TIME_VAL_MASK;
-
-	return val;
-}
-
-static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
-{
-	unsigned long top, bottom, msb;
-	unsigned long c;
-
-	/*
-	 * If the read is interrupted by a write, then the cnt will
-	 * be different. Loop until both top and bottom have been read
-	 * without interruption.
-	 */
-	do {
-		c = local_read(&t->cnt);
-		top = local_read(&t->top);
-		bottom = local_read(&t->bottom);
-		msb = local_read(&t->msb);
-	} while (c != local_read(&t->cnt));
-
-	*cnt = rb_time_cnt(top);
-
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
-		return false;
-
-	/* The shift to msb will lose its cnt bits */
-	*ret = rb_time_val(top, bottom) | ((u64)msb << RB_TIME_MSB_SHIFT);
-	return true;
-}
-
-static bool rb_time_read(rb_time_t *t, u64 *ret)
-{
-	unsigned long cnt;
-
-	return __rb_time_read(t, ret, &cnt);
-}
-
-static inline unsigned long rb_time_val_cnt(unsigned long val, unsigned long cnt)
-{
-	return (val & RB_TIME_VAL_MASK) | ((cnt & 3) << RB_TIME_SHIFT);
-}
-
-static inline void rb_time_split(u64 val, unsigned long *top, unsigned long *bottom,
-				 unsigned long *msb)
-{
-	*top = (unsigned long)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK);
-	*bottom = (unsigned long)(val & RB_TIME_VAL_MASK);
-	*msb = (unsigned long)(val >> RB_TIME_MSB_SHIFT);
-}
-
-static inline void rb_time_val_set(local_t *t, unsigned long val, unsigned long cnt)
-{
-	val = rb_time_val_cnt(val, cnt);
-	local_set(t, val);
-}
-
-static void rb_time_set(rb_time_t *t, u64 val)
-{
-	unsigned long cnt, top, bottom, msb;
-
-	rb_time_split(val, &top, &bottom, &msb);
-
-	/* Writes always succeed with a valid number even if it gets interrupted. */
-	do {
-		cnt = local_inc_return(&t->cnt);
-		rb_time_val_set(&t->top, top, cnt);
-		rb_time_val_set(&t->bottom, bottom, cnt);
-		rb_time_val_set(&t->msb, val >> RB_TIME_MSB_SHIFT, cnt);
-	} while (cnt != local_read(&t->cnt));
-}
-
-static inline bool
-rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
-{
-	return local_try_cmpxchg(l, &expect, set);
-}
-
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
-	unsigned long cnt, top, bottom, msb;
-	unsigned long cnt2, top2, bottom2, msb2;
-	u64 val;
-
-	/* The cmpxchg always fails if it interrupted an update */
-	 if (!__rb_time_read(t, &val, &cnt2))
-		 return false;
-
-	 if (val != expect)
-		 return false;
-
-	 cnt = local_read(&t->cnt);
-	 if ((cnt & 3) != cnt2)
-		 return false;
-
-	 cnt2 = cnt + 1;
-
-	 rb_time_split(val, &top, &bottom, &msb);
-	 top = rb_time_val_cnt(top, cnt);
-	 bottom = rb_time_val_cnt(bottom, cnt);
-
-	 rb_time_split(set, &top2, &bottom2, &msb2);
-	 top2 = rb_time_val_cnt(top2, cnt2);
-	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
-
-	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->top, top, top2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
-		return false;
-	return true;
-}
-
-#else /* 64 bits */
-
-/* local64_t always succeeds */
-
-static inline bool rb_time_read(rb_time_t *t, u64 *ret)
+static inline void rb_time_read(rb_time_t *t, u64 *ret)
 {
 	*ret = local64_read(&t->time);
-	return true;
 }
 static void rb_time_set(rb_time_t *t, u64 val)
 {
@@ -756,7 +569,6 @@  static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
 	return local64_try_cmpxchg(&t->time, &expect, set);
 }
-#endif
 
 /*
  * Enable this to make sure that the event passed to
@@ -863,10 +675,7 @@  u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
 	WARN_ONCE(1, "nest (%d) greater than max", nest);
 
  fail:
-	/* Can only fail on 32 bit */
-	if (!rb_time_read(&cpu_buffer->write_stamp, &ts))
-		/* Screw it, just read the current time */
-		ts = rb_time_stamp(cpu_buffer->buffer);
+	rb_time_read(&cpu_buffer->write_stamp, &ts);
 
 	return ts;
 }
@@ -2863,7 +2672,7 @@  rb_check_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
 		  (unsigned long long)info->ts,
 		  (unsigned long long)info->before,
 		  (unsigned long long)info->after,
-		  (unsigned long long)(rb_time_read(&cpu_buffer->write_stamp, &write_stamp) ? write_stamp : 0),
+		  (unsigned long long)({rb_time_read(&cpu_buffer->write_stamp, &write_stamp); write_stamp;}),
 		  sched_clock_stable() ? "" :
 		  "If you just came from a suspend/resume,\n"
 		  "please switch to the trace global clock:\n"
@@ -3021,8 +2830,7 @@  rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 
 	delta = rb_time_delta(event);
 
-	if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
-		return false;
+	rb_time_read(&cpu_buffer->write_stamp, &write_stamp);
 
 	/* Make sure the write stamp is read before testing the location */
 	barrier();
@@ -3560,16 +3368,14 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	struct ring_buffer_event *event;
 	struct buffer_page *tail_page;
 	unsigned long tail, write, w;
-	bool a_ok;
-	bool b_ok;
 
 	/* Don't let the compiler play games with cpu_buffer->tail_page */
 	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
 
  /*A*/	w = local_read(&tail_page->write) & RB_WRITE_MASK;
 	barrier();
-	b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
-	a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
+	rb_time_read(&cpu_buffer->before_stamp, &info->before);
+	rb_time_read(&cpu_buffer->write_stamp, &info->after);
 	barrier();
 	info->ts = rb_time_stamp(cpu_buffer->buffer);
 
@@ -3584,7 +3390,7 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		if (!w) {
 			/* Use the sub-buffer timestamp */
 			info->delta = 0;
-		} else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
+		} else if (unlikely(info->before != info->after)) {
 			info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
 			info->length += RB_LEN_TIME_EXTEND;
 		} else {
@@ -3613,13 +3419,11 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 
 	if (likely(tail == w)) {
 		u64 save_before;
-		bool s_ok;
 
 		/* Nothing interrupted us between A and C */
  /*D*/		rb_time_set(&cpu_buffer->write_stamp, info->ts);
 		barrier();
- /*E*/		s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
-		RB_WARN_ON(cpu_buffer, !s_ok);
+ /*E*/		rb_time_read(&cpu_buffer->before_stamp, &save_before);
 		if (likely(!(info->add_timestamp &
 			     (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
 			/* This did not interrupt any time update */
@@ -3632,8 +3436,7 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		if (unlikely(info->ts != save_before)) {
 			/* SLOW PATH - Interrupted between C and E */
 
-			a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-			RB_WARN_ON(cpu_buffer, !a_ok);
+			rb_time_read(&cpu_buffer->write_stamp, &info->after);
 
 			/* Write stamp must only go forward */
 			if (save_before > info->after) {
@@ -3648,9 +3451,7 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	} else {
 		u64 ts;
 		/* SLOW PATH - Interrupted between A and C */
-		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-		/* Was interrupted before here, write_stamp must be valid */
-		RB_WARN_ON(cpu_buffer, !a_ok);
+		rb_time_read(&cpu_buffer->write_stamp, &info->after);
 		ts = rb_time_stamp(cpu_buffer->buffer);
 		barrier();
  /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
@@ -3712,6 +3513,17 @@  rb_reserve_next_event(struct trace_buffer *buffer,
 	int nr_loops = 0;
 	int add_ts_default;
 
+	/*
+	 * For architectures that can not do cmpxchg() in NMI, or require
+	 * disabling interrupts to do 64-bit cmpxchg(), do not allow them
+	 * to record in NMI context.
+	 */
+	if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
+	     (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64))) &&
+	    unlikely(in_nmi())) {
+		return NULL;
+	}
+
 	rb_start_commit(cpu_buffer);
 	/* The commit page can not change after this */