Message ID | 20231213211126.24f8c1dd@gandalf.local.home
---|---
State | Superseded
Series | ring-buffer: Remove 32bit timestamp logic
Linus, Looking for some advice on this. tl;dr; The ring-buffer timestamp requires a 64-bit cmpxchg to keep the timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg can be extremely slow on 32-bit architectures. So I created a rb_time_t that on 64-bit was a normal local64_t type, and on 32-bit it's represented by 3 32-bit words and a counter for synchronization. But this now requires three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do. Not having any 32-bit hardware to test on, I simply push through this complex code for the 32-bit case. I tested it on both 32-bit (running on a x86_64 machine) and 64-bit kernels and it seemed rather robust. But now that I'm doing some heavy development on the ring buffer again, I came across a few very subtle bugs in this code (and so has Mathieu Desnoyers). We started discussing how much time this is actually saving to be worth the complexity, and actually found some hardware to test. One Atom processor. For the Atom it was 11.5ns for 32-bit and 16ns for 64-bit. Granted, this isn't being contended on. But a 30% improvement doesn't justify 3 to 1 cmpxchgs. I plan on just nuking the whole thing (the patch below), which is basically a revert of 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit"). Now my question to you. Should I bother with pushing to you the subtle fixes to this code and send you the revert for the next merge window, or do you think I should just say "screw it" and nuke it now? Or do you think it's worth keeping for some other architecture that 3 32-bit cmpxchgs are still faster than a single 64-bit one? Thanks, -- Steve On Wed, 13 Dec 2023 21:11:26 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > Each event has a 27 bit timestamp delta that is used to hold the delta > from the last event. If the time between events is greater than 2^27, then > a timestamp is added that holds a 59 bit absolute timestamp. > > Until a389d86f7fd09 ("ring-buffer: Have nested events still record running > time stamp"), if an interrupt interrupted an event in progress, all the > events delta would be zero to not deal with the races that need to be > handled. The commit a389d86f7fd09 changed that to handle the races giving > all events, even those that preempt other events, still have an accurate > timestamp. > > To handle those races requires performing 64-bit cmpxchg on the > timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered > very slow. To try to deal with this the timestamp logic was broken into > two and then three 32-bit cmpxchgs, with the thought that two (or three) > 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit > architectures. > > Part of the problem with this is that I didn't have any 32-bit > architectures to test on. After hitting several subtle bugs in this code, > an effort was made to try and see if three 32-bit cmpxchgs are indeed > faster than a single 64-bit. After a few people brushed off the dust of > their old 32-bit machines, tests were done, and even though 32-bit cmpxchg > was faster than a single 64-bit, it was in the order of 50% at best, not > 300%. > > This means that this complex code is not only complex but also not even > faster than just using 64-bit cmpxchg. > > Nuke it! > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > --- > [ > Should we just push this now and mark it for stable? 
> That is, just get rid of this logic for all kernels. > ] > kernel/trace/ring_buffer.c | 226 ++----------------------------------- > 1 file changed, 12 insertions(+), 214 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 3fce5e4b4f2b..b05416e14cb6 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -27,6 +27,7 @@ > #include <linux/cpu.h> > #include <linux/oom.h> > > +#include <asm/local64.h> > #include <asm/local.h> > > /* > @@ -463,27 +464,9 @@ enum { > RB_CTX_MAX > }; > > -#if BITS_PER_LONG == 32 > -#define RB_TIME_32 > -#endif > - > -/* To test on 64 bit machines */ > -//#define RB_TIME_32 > - > -#ifdef RB_TIME_32 > - > -struct rb_time_struct { > - local_t cnt; > - local_t top; > - local_t bottom; > - local_t msb; > -}; > -#else > -#include <asm/local64.h> > struct rb_time_struct { > local64_t time; > }; > -#endif > typedef struct rb_time_struct rb_time_t; > > #define MAX_NEST 5 > @@ -573,183 +556,9 @@ struct ring_buffer_iter { > int missed_events; > }; > > -#ifdef RB_TIME_32 > - > -/* > - * On 32 bit machines, local64_t is very expensive. As the ring > - * buffer doesn't need all the features of a true 64 bit atomic, > - * on 32 bit, it uses these functions (64 still uses local64_t). > - * > - * For the ring buffer, 64 bit required operations for the time is > - * the following: > - * > - * - Reads may fail if it interrupted a modification of the time stamp. > - * It will succeed if it did not interrupt another write even if > - * the read itself is interrupted by a write. > - * It returns whether it was successful or not. > - * > - * - Writes always succeed and will overwrite other writes and writes > - * that were done by events interrupting the current write. > - * > - * - A write followed by a read of the same time stamp will always succeed, > - * but may not contain the same value. > - * > - * - A cmpxchg will fail if it interrupted another write or cmpxchg. > - * Other than that, it acts like a normal cmpxchg. > - * > - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half > - * (bottom being the least significant 30 bits of the 60 bit time stamp). > - * > - * The two most significant bits of each half holds a 2 bit counter (0-3). > - * Each update will increment this counter by one. > - * When reading the top and bottom, if the two counter bits match then the > - * top and bottom together make a valid 60 bit number. > - */ > -#define RB_TIME_SHIFT 30 > -#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) > -#define RB_TIME_MSB_SHIFT 60 > - > -static inline int rb_time_cnt(unsigned long val) > -{ > - return (val >> RB_TIME_SHIFT) & 3; > -} > - > -static inline u64 rb_time_val(unsigned long top, unsigned long bottom) > -{ > - u64 val; > - > - val = top & RB_TIME_VAL_MASK; > - val <<= RB_TIME_SHIFT; > - val |= bottom & RB_TIME_VAL_MASK; > - > - return val; > -} > - > -static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) > -{ > - unsigned long top, bottom, msb; > - unsigned long c; > - > - /* > - * If the read is interrupted by a write, then the cnt will > - * be different. Loop until both top and bottom have been read > - * without interruption. 
> - */ > - do { > - c = local_read(&t->cnt); > - top = local_read(&t->top); > - bottom = local_read(&t->bottom); > - msb = local_read(&t->msb); > - } while (c != local_read(&t->cnt)); > - > - *cnt = rb_time_cnt(top); > - > - /* If top, msb or bottom counts don't match, this interrupted a write */ > - if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom)) > - return false; > - > - /* The shift to msb will lose its cnt bits */ > - *ret = rb_time_val(top, bottom) | ((u64)msb << RB_TIME_MSB_SHIFT); > - return true; > -} > - > -static bool rb_time_read(rb_time_t *t, u64 *ret) > -{ > - unsigned long cnt; > - > - return __rb_time_read(t, ret, &cnt); > -} > - > -static inline unsigned long rb_time_val_cnt(unsigned long val, unsigned long cnt) > -{ > - return (val & RB_TIME_VAL_MASK) | ((cnt & 3) << RB_TIME_SHIFT); > -} > - > -static inline void rb_time_split(u64 val, unsigned long *top, unsigned long *bottom, > - unsigned long *msb) > -{ > - *top = (unsigned long)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK); > - *bottom = (unsigned long)(val & RB_TIME_VAL_MASK); > - *msb = (unsigned long)(val >> RB_TIME_MSB_SHIFT); > -} > - > -static inline void rb_time_val_set(local_t *t, unsigned long val, unsigned long cnt) > -{ > - val = rb_time_val_cnt(val, cnt); > - local_set(t, val); > -} > - > -static void rb_time_set(rb_time_t *t, u64 val) > -{ > - unsigned long cnt, top, bottom, msb; > - > - rb_time_split(val, &top, &bottom, &msb); > - > - /* Writes always succeed with a valid number even if it gets interrupted. */ > - do { > - cnt = local_inc_return(&t->cnt); > - rb_time_val_set(&t->top, top, cnt); > - rb_time_val_set(&t->bottom, bottom, cnt); > - rb_time_val_set(&t->msb, val >> RB_TIME_MSB_SHIFT, cnt); > - } while (cnt != local_read(&t->cnt)); > -} > - > -static inline bool > -rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set) > -{ > - return local_try_cmpxchg(l, &expect, set); > -} > - > -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) > -{ > - unsigned long cnt, top, bottom, msb; > - unsigned long cnt2, top2, bottom2, msb2; > - u64 val; > - > - /* Any interruptions in this function should cause a failure */ > - cnt = local_read(&t->cnt); > - > - /* The cmpxchg always fails if it interrupted an update */ > - if (!__rb_time_read(t, &val, &cnt2)) > - return false; > - > - if (val != expect) > - return false; > - > - if ((cnt & 3) != cnt2) > - return false; > - > - cnt2 = cnt + 1; > - > - rb_time_split(val, &top, &bottom, &msb); > - msb = rb_time_val_cnt(msb, cnt); > - top = rb_time_val_cnt(top, cnt); > - bottom = rb_time_val_cnt(bottom, cnt); > - > - rb_time_split(set, &top2, &bottom2, &msb2); > - msb2 = rb_time_val_cnt(msb2, cnt); > - top2 = rb_time_val_cnt(top2, cnt2); > - bottom2 = rb_time_val_cnt(bottom2, cnt2); > - > - if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->top, top, top2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) > - return false; > - return true; > -} > - > -#else /* 64 bits */ > - > -/* local64_t always succeeds */ > - > -static inline bool rb_time_read(rb_time_t *t, u64 *ret) > +static inline void rb_time_read(rb_time_t *t, u64 *ret) > { > *ret = local64_read(&t->time); > - return true; > } > static void rb_time_set(rb_time_t *t, u64 val) > { > @@ -760,7 +569,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) > { > return local64_try_cmpxchg(&t->time, &expect, set); > } > 
-#endif > > /* > * Enable this to make sure that the event passed to > @@ -867,10 +675,7 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer, > WARN_ONCE(1, "nest (%d) greater than max", nest); > > fail: > - /* Can only fail on 32 bit */ > - if (!rb_time_read(&cpu_buffer->write_stamp, &ts)) > - /* Screw it, just read the current time */ > - ts = rb_time_stamp(cpu_buffer->buffer); > + rb_time_read(&cpu_buffer->write_stamp, &ts); > > return ts; > } > @@ -2867,7 +2672,7 @@ rb_check_timestamp(struct ring_buffer_per_cpu *cpu_buffer, > (unsigned long long)info->ts, > (unsigned long long)info->before, > (unsigned long long)info->after, > - (unsigned long long)(rb_time_read(&cpu_buffer->write_stamp, &write_stamp) ? write_stamp : 0), > + (unsigned long long)({rb_time_read(&cpu_buffer->write_stamp, &write_stamp); write_stamp;}), > sched_clock_stable() ? "" : > "If you just came from a suspend/resume,\n" > "please switch to the trace global clock:\n" > @@ -3025,8 +2830,7 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer, > > delta = rb_time_delta(event); > > - if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp)) > - return false; > + rb_time_read(&cpu_buffer->write_stamp, &write_stamp); > > /* Make sure the write stamp is read before testing the location */ > barrier(); > @@ -3569,16 +3373,14 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > struct ring_buffer_event *event; > struct buffer_page *tail_page; > unsigned long tail, write, w; > - bool a_ok; > - bool b_ok; > > /* Don't let the compiler play games with cpu_buffer->tail_page */ > tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page); > > /*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK; > barrier(); > - b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before); > - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); > + rb_time_read(&cpu_buffer->before_stamp, &info->before); > + rb_time_read(&cpu_buffer->write_stamp, &info->after); > barrier(); > info->ts = rb_time_stamp(cpu_buffer->buffer); > > @@ -3593,7 +3395,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > if (!w) { > /* Use the sub-buffer timestamp */ > info->delta = 0; > - } else if (unlikely(!a_ok || !b_ok || info->before != info->after)) { > + } else if (unlikely(info->before != info->after)) { > info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND; > info->length += RB_LEN_TIME_EXTEND; > } else { > @@ -3622,13 +3424,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > > if (likely(tail == w)) { > u64 save_before; > - bool s_ok; > > /* Nothing interrupted us between A and C */ > /*D*/ rb_time_set(&cpu_buffer->write_stamp, info->ts); > barrier(); > - /*E*/ s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before); > - RB_WARN_ON(cpu_buffer, !s_ok); > + /*E*/ rb_time_read(&cpu_buffer->before_stamp, &save_before); > if (likely(!(info->add_timestamp & > (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)))) > /* This did not interrupt any time update */ > @@ -3641,8 +3441,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > if (unlikely(info->ts != save_before)) { > /* SLOW PATH - Interrupted between C and E */ > > - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); > - RB_WARN_ON(cpu_buffer, !a_ok); > + rb_time_read(&cpu_buffer->write_stamp, &info->after); > > /* Write stamp must only go forward */ > if (save_before > info->after) { > @@ -3657,9 +3456,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > } else { > u64 ts; > /* SLOW PATH - 
Interrupted between A and C */ > - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); > - /* Was interrupted before here, write_stamp must be valid */ > - RB_WARN_ON(cpu_buffer, !a_ok); > + rb_time_read(&cpu_buffer->write_stamp, &info->after); > ts = rb_time_stamp(cpu_buffer->buffer); > barrier(); > /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && > @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer, > struct rb_event_info info; > int nr_loops = 0; > int add_ts_default; > + static int once; > > /* ring buffer does cmpxchg, make sure it is safe in NMI context */ > if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
On Wed, 13 Dec 2023 21:11:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
>  	struct rb_event_info info;
>  	int nr_loops = 0;
>  	int add_ts_default;
> +	static int once;
>
>  	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
>  	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&

Oops, some debug code got committed!

-- Steve
On 2023-12-13 21:11, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> Each event has a 27 bit timestamp delta that is used to hold the delta
> from the last event. If the time between events is greater than 2^27, then
> a timestamp is added that holds a 59 bit absolute timestamp.
>
> Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
> time stamp"), if an interrupt interrupted an event in progress, all the
> events delta would be zero to not deal with the races that need to be
> handled. The commit a389d86f7fd09 changed that to handle the races giving
> all events, even those that preempt other events, still have an accurate
> timestamp.
>
> To handle those races requires performing 64-bit cmpxchg on the
> timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
> very slow. To try to deal with this the timestamp logic was broken into
> two and then three 32-bit cmpxchgs, with the thought that two (or three)
> 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
> architectures.
>
> Part of the problem with this is that I didn't have any 32-bit
> architectures to test on. After hitting several subtle bugs in this code,
> an effort was made to try and see if three 32-bit cmpxchgs are indeed
> faster than a single 64-bit. After a few people brushed off the dust of
> their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
> was faster than a single 64-bit, it was in the order of 50% at best, not
> 300%.

I literally had to dust off my old Eee PC for this :)

>
> This means that this complex code is not only complex but also not even
> faster than just using 64-bit cmpxchg.
>
> Nuke it!
>

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
>  	struct rb_event_info info;
>  	int nr_loops = 0;
>  	int add_ts_default;
> +	static int once;

(as you spotted, should be removed)

Thanks,

Mathieu

>
>  	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
>  	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
On Wed, 13 Dec 2023 at 18:45, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> tl;dr; The ring-buffer timestamp requires a 64-bit cmpxchg to keep the
> timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg
> can be extremely slow on 32-bit architectures. So I created a rb_time_t
> that on 64-bit was a normal local64_t type, and on 32-bit it's represented
> by 3 32-bit words and a counter for synchronization. But this now requires
> three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do.

It's not that a 64-bit cmpxchg is even slow. It doesn't EXIST AT ALL
on older 32-bit x86 machines.

Which is why we have

  arch/x86/lib/cmpxchg8b_emu.S

which emulates it on machines that don't have the CX8 capability
("CX8" being the x86 capability flag name for the cmpxchg8b
instruction, aka 64-bit cmpxchg).

Which only works because those older 32-bit cpu's also don't do SMP,
so there are no SMP cache coherency issues, only interrupt atomicity
issues.

IOW, the way to do an atomic 64-bit cmpxchg on the affected hardware
is to simply disable interrupts.

In other words - it's not just slow. It's *really* slow. As in 10x
slower, not "slightly slower".

> We started discussing how much time this is actually saving to be worth the
> complexity, and actually found some hardware to test. One Atom processor.

That Atom processor won't actually show the issue. It's much too
recent. So your "test" is actually worthless.

And you probably did this all with a kernel config that had
CONFIG_X86_CMPXCHG64 set anyway, which wouldn't even boot on an i486
machine.

So in fact your test was probably doubly broken, in that not only
didn't you test the slow case, you tested something that wouldn't even
have worked in the environment where the slow case happened.

Now, the real question is if anybody cares about CPUs that don't have
cmpxchg8b support.

Because in practice, it's really just old 486-class machines (and a
couple of clone manufacturers who _claimed_ to be Pentium class, but
weren't - there was also some odd thing with Windows breaking if you
had CPUID claiming to support CX8).

We dropped support for the original 80386 some time ago. I'd actually
be willing to drop support for all pre-cmpxchg8b machines, and get rid
of the emulation.

I also suspect that from a perf angle, none of this matters. The
emulation being slow probably is a non-issue, simply because even if
you run on an old i486 machine, you probably won't be doing perf or
tracing on it.

           Linus
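For readers following the thread, the emulation Linus describes boils down to a plain read-compare-write with interrupts disabled. The real code is the assembly in arch/x86/lib/cmpxchg8b_emu.S; the C below is only an illustrative sketch with an invented function name, not the kernel's implementation, but it shows both why the fallback is slow and why it cannot be NMI safe.

	#include <linux/irqflags.h>
	#include <linux/types.h>

	/*
	 * Sketch only: on a uniprocessor 32-bit CPU with no cmpxchg8b
	 * instruction, a 64-bit cmpxchg can be emulated by briefly
	 * disabling interrupts around a plain load/compare/store.  This
	 * is safe only because such CPUs cannot be SMP, so an interrupt
	 * is the only thing that could race with the sequence.  Note
	 * that NMIs are not blocked by local_irq_save(), so this is not
	 * NMI safe -- which is exactly the concern raised later in the
	 * thread.
	 */
	static u64 sketch_cmpxchg64_emulated(u64 *ptr, u64 old, u64 new)
	{
		unsigned long flags;
		u64 prev;

		local_irq_save(flags);	/* the expensive part */
		prev = *ptr;
		if (prev == old)
			*ptr = new;
		local_irq_restore(flags);

		return prev;
	}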
On Wed, 13 Dec 2023 22:53:19 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 13 Dec 2023 at 18:45, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > tl;dr; The ring-buffer timestamp requires a 64-bit cmpxchg to keep the
> > timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg
> > can be extremely slow on 32-bit architectures. So I created a rb_time_t
> > that on 64-bit was a normal local64_t type, and on 32-bit it's represented
> > by 3 32-bit words and a counter for synchronization. But this now requires
> > three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do.
>
> It's not that a 64-bit cmpxchg is even slow. It doesn't EXIST AT ALL
> on older 32-bit x86 machines.
>
> Which is why we have
>
>   arch/x86/lib/cmpxchg8b_emu.S
>
> which emulates it on machines that don't have the CX8 capability
> ("CX8" being the x86 capability flag name for the cmpxchg8b
> instruction, aka 64-bit cmpxchg).
>
> Which only works because those older 32-bit cpu's also don't do SMP,
> so there are no SMP cache coherency issues, only interrupt atomicity
> issues.
>
> IOW, the way to do an atomic 64-bit cmpxchg on the affected hardware
> is to simply disable interrupts.
>
> In other words - it's not just slow. It's *really* slow. As in 10x
> slower, not "slightly slower".

Ah, I'm starting to remember the rationale for doing it. I should have
read up on the LWN article I even wrote about it!

  https://lwn.net/Articles/831892/

"I mentioned that I used the local64 variants of operations like
local_read/cmpxchg/etc. operations. Desnoyers went on to argue that the
local64 operations on 32-bit machines were horrible in performance, and
worse, some require that interrupts be disabled, meaning that they could
not be used in NMI context."

And yes, this does get called in NMI context.

> > We started discussing how much time this is actually saving to be worth the
> > complexity, and actually found some hardware to test. One Atom processor.
>
> That Atom processor won't actually show the issue. It's much too
> recent. So your "test" is actually worthless.
>
> And you probably did this all with a kernel config that had
> CONFIG_X86_CMPXCHG64 set anyway, which wouldn't even boot on an i486
> machine.
>
> So in fact your test was probably doubly broken, in that not only
> didn't you test the slow case, you tested something that wouldn't even
> have worked in the environment where the slow case happened.
>
> Now, the real question is if anybody cares about CPUs that don't have
> cmpxchg8b support.
>
> Because in practice, it's really just old 486-class machines (and a
> couple of clone manufacturers who _claimed_ to be Pentium class, but
> weren't - there was also some odd thing with Windows breaking if you
> had CPUID claiming to support CX8).
>
> We dropped support for the original 80386 some time ago. I'd actually
> be willing to drop support for all pre-cmpxchg8b machines, and get rid
> of the emulation.
>
> I also suspect that from a perf angle, none of this matters. The
> emulation being slow probably is a non-issue, simply because even if
> you run on an old i486 machine, you probably won't be doing perf or
> tracing on it.

Thanks for the background.
I had a patch that added:

+	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
+	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
+	    (unlikely(in_nmi()))) {
+		return NULL;
+	}

But for ripping out this code, I should probably change that to:

	if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
	     (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64))) &&
	    unlikely(in_nmi())) {
		return NULL;
	}

Not sure if there are other architectures that are affected by this (hence
why I Cc'd linux-arch).

I don't think anyone actually cares about the performance overhead of 486
doing 64-bit cmpxchg by disabling interrupts. Especially since this only
happens in the slow path (if an event interrupts the processing of another
event). If someone complains, we can always add back this code.

Now back to my original question. Are you OK with me sending this to you
now, or should I send you just the subtle fixes to the 32-bit rb_time_*
code and keep this patch for the merge window? My preference is to get it
in now and label it as stable, but I'm fine either way.

-- Steve
On Thu, 14 Dec 2023 at 08:55, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> And yes, this does get called in NMI context.

Not on an i486-class machine they won't. You don't have a local apic
on those, and they won't have any NMI sources under our control (ie
NMI does exist, but we're talking purely legacy NMI for "motherboard
problems" like RAM parity errors etc)

> I had a patch that added:
>
> +	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
> +	    (unlikely(in_nmi()))) {
> +		return NULL;
> +	}

CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG doesn't work on x86 in this context,
because the issue is that yes, there's a safe 'cmpxchg', but that's
not what you want.

You want the safe cmpxchg64, which is an entirely different beast.

And honestly, I don't think that NMI_SAFE_CMPXCHG covers the
double-word case anywhere else either, except purely by luck.

In mm/slab.c, we also use a double-wide cmpxchg, and there the rule
has literally been that it's conditional on
 (a) system_has_cmpxchg64() existing as a macro
 (b) using that macro to then gate - at runtime - whether it actually
     works or not
I think - but didn't check - that we essentially only enable the
two-word case on x86 as a result, and fall back to the slow case on
all other architectures - and on the i486 case.

That said, other architectures *do* have a working double-word
cmpxchg, but I wouldn't guarantee it. For example, 32-bit arm does
have one using ldrexd/strexd, but that only exists on arm v6+.

And guess what? You'll silently get a "disable interrupts, do it as a
non-atomic load-store" on arm too for that case. And again, pre-v6 arm
is about as relevant as i486 is, but my point is, that double-word
cmpxchg you rely on simply DOES NOT EXIST on 32-bit platforms except
under special circumstances.

So this isn't a "x86 is the odd man out". This is literally generic.

> Now back to my original question. Are you OK with me sending this to you
> now, or should I send you just the subtle fixes to the 32-bit rb_time_*
> code and keep this patch for the merge window?

I'm absolutely not taking some untested random "let's do 64-bit
cmpxchg that we know is broken on 32-bit using broken conditionals"
shit.

What *would* work is that slab approach, which is essentially

    #ifndef system_has_cmpxchg64
    #define system_has_cmpxchg64() false
    #endif

    ...
    if (!system_has_cmpxchg64())
            return error or slow case

    do_64bit_cmpxchg_case();

(although the slub case is much more indirect, and uses a
__CMPXCHG_DOUBLE flag that only gets set when that define exists etc).

But that would literally cut off support for all non-x86 32-bit architectures.

So no. You need to forget about the whole "do a 64-bit cmpxchg on
32-bit architectures" as being some kind of solution in the short
term.

           Linus
On Thu, 14 Dec 2023 11:44:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 14 Dec 2023 at 08:55, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > And yes, this does get called in NMI context.
>
> Not on an i486-class machine they won't. You don't have a local apic
> on those, and they won't have any NMI sources under our control (ie
> NMI does exist, but we're talking purely legacy NMI for "motherboard
> problems" like RAM parity errors etc)

Ah, so we should not worry about being in NMI context without a 64bit cmpxchg?

>
> > I had a patch that added:
> >
> > +	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
> > +	    (unlikely(in_nmi()))) {
> > +		return NULL;
> > +	}
>
> CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG doesn't work on x86 in this context,
> because the issue is that yes, there's a safe 'cmpxchg', but that's
> not what you want.

Sorry, that's from another patch that I combined into this one that I
added in case there are architectures that have NMIs but need to avoid
cmpxchg, as this code does normal long word cmpxchg too. And that change
*should* go to you and stable. It's either just luck that things didn't
crash on those systems today, or it happens so infrequently that nobody
reported it.

> You want the safe cmpxchg64, which is an entirely different beast.

Understood, but this code has both. It's just that the "special" code I'm
removing does the 64-bit cmpxchg.

> And honestly, I don't think that NMI_SAFE_CMPXCHG covers the
> double-word case anywhere else either, except purely by luck.

I still need to cover the normal cmpxchg. I'll keep that a separate patch.

> In mm/slab.c, we also use a double-wide cmpxchg, and there the rule
> has literally been that it's conditional on
>  (a) system_has_cmpxchg64() existing as a macro
>  (b) using that macro to then gate - at runtime - whether it actually
>      works or not
> I think - but didn't check - that we essentially only enable the
> two-word case on x86 as a result, and fall back to the slow case on
> all other architectures - and on the i486 case.
>
> That said, other architectures *do* have a working double-word
> cmpxchg, but I wouldn't guarantee it. For example, 32-bit arm does
> have one using ldrexd/strexd, but that only exists on arm v6+.
>
> And guess what? You'll silently get a "disable interrupts, do it as a
> non-atomic load-store" on arm too for that case. And again, pre-v6 arm
> is about as relevant as i486 is, but my point is, that double-word
> cmpxchg you rely on simply DOES NOT EXIST on 32-bit platforms except
> under special circumstances.
>
> So this isn't a "x86 is the odd man out". This is literally generic.
>
> > Now back to my original question. Are you OK with me sending this to you
> > now, or should I send you just the subtle fixes to the 32-bit rb_time_*
> > code and keep this patch for the merge window?
>
> I'm absolutely not taking some untested random "let's do 64-bit
> cmpxchg that we know is broken on 32-bit using broken conditionals"
> shit.
>
> What *would* work is that slab approach, which is essentially
>
>     #ifndef system_has_cmpxchg64
>     #define system_has_cmpxchg64() false
>     #endif
>
>     ...
>     if (!system_has_cmpxchg64())
>             return error or slow case
>
>     do_64bit_cmpxchg_case();
>
> (although the slub case is much more indirect, and uses a
> __CMPXCHG_DOUBLE flag that only gets set when that define exists etc).
>
> But that would literally cut off support for all non-x86 32-bit architectures.
>
> So no. You need to forget about the whole "do a 64-bit cmpxchg on
> 32-bit architectures" as being some kind of solution in the short
> term.

But do all archs have an implementation of cmpxchg64, even if it requires
disabling interrupts? If not, then I definitely cannot remove this code.

If they all have an implementation, where some is just very slow, then
that is also perfectly fine. The only time cmpxchg64 gets called is on the
slow path, which is if an event is interrupted by an interrupt, and that
interrupt writes an event to the same buffer. This doesn't happen often,
and if it requires disabling interrupts, then it shouldn't be very
noticeable.

I just want to avoid the case that it will simply break, which is the NMI
case. In which case, would:

	if (!system_has_cmpxchg64() && in_nmi())
		return NULL;

Be OK?

-- Steve
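For context, Steve's one-line guard and the slab-style fallback define Linus sketched above would combine roughly as shown below. system_has_cmpxchg64() is the macro Linus refers to; the helper name and its placement are purely illustrative, and whether the ring buffer should do this at all is exactly the question being asked here.

	#include <linux/compiler.h>	/* unlikely() */
	#include <linux/hardirq.h>	/* in_nmi() */
	#include <linux/types.h>

	/* Treat architectures without the macro as having no 64-bit cmpxchg. */
	#ifndef system_has_cmpxchg64
	# define system_has_cmpxchg64()	false
	#endif

	/*
	 * Sketch: refuse to reserve a ring-buffer event from NMI context
	 * when the architecture cannot do an NMI-safe 64-bit cmpxchg,
	 * since the fallback would have to disable interrupts, which
	 * does not protect against an NMI.
	 */
	static inline bool rb_cmpxchg64_nmi_unsafe(void)
	{
		return !system_has_cmpxchg64() && unlikely(in_nmi());
	}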
On Thu, 14 Dec 2023 at 12:35, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 14 Dec 2023 11:44:55 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Thu, 14 Dec 2023 at 08:55, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > And yes, this does get called in NMI context.
> >
> > Not on an i486-class machine they won't. You don't have a local apic
> > on those, and they won't have any NMI sources under our control (ie
> > NMI does exist, but we're talking purely legacy NMI for "motherboard
> > problems" like RAM parity errors etc)
>
> Ah, so we should not worry about being in NMI context without a 64bit cmpxchg?

.. on x86. Elsewhere, who knows?

It is *probably* true in most situations. '32-bit' => 'legacy' =>
'less likely to have fancy profiling / irq setups'.

But I really don't know.

> > So no. You need to forget about the whole "do a 64-bit cmpxchg on
> > 32-bit architectures" as being some kind of solution in the short
> > term.
>
> But do all archs have an implementation of cmpxchg64, even if it requires
> disabling interrupts? If not, then I definitely cannot remove this code.

We have a generic header file, so anybody who uses that would get the
fallback version, ie

    arch_cmpxchg64 -> generic_cmpxchg64_local -> __generic_cmpxchg64_local

which does that irq disabling thing.

But no, not everybody is guaranteed to use that fallback. From a quick
look, ARC, hexagon and CSky don't do this, for example.

And then I got bored and stopped looking.

My guess is that *most* 32-bit architectures do not have a 64-bit
cmpxchg - not even the irq-safe one.

For the UP case you can do your own, of course.

           Linus
On Thu, 14 Dec 2023 12:50:29 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > But do all archs have an implementation of cmpxchg64, even if it requires
> > disabling interrupts? If not, then I definitely cannot remove this code.
>
> We have a generic header file, so anybody who uses that would get the
> fallback version, ie
>
>     arch_cmpxchg64 -> generic_cmpxchg64_local -> __generic_cmpxchg64_local
>
> which does that irq disabling thing.
>
> But no, not everybody is guaranteed to use that fallback. From a quick
> look, ARC, hexagon and CSky don't do this, for example.
>
> And then I got bored and stopped looking.
>
> My guess is that *most* 32-bit architectures do not have a 64-bit
> cmpxchg - not even the irq-safe one.

OK, that means I have to completely abandon this change, even for the next
merge window.

I may add a check for whether cmpxchg64 exists before falling back to the
32-bit cmpxchg version. But even that will have to wait till the next
merge window, with a Fixes tag (but not Cc'd to stable) in case anyone
would like to backport it.

Thanks for the advice!

-- Steve
...
> My guess is that *most* 32-bit architectures do not have a 64-bit
> cmpxchg - not even the irq-safe one.

Does any sparc32 even have a 32-bit cmpxchg? The original versions (which
were definitely SMP capable) only had a byte-sized atomic exchange that
always wrote 0xff.

Sparc32 does have 'load/store double' (two 32-bit registers), but 32-bit
CPUs like nios2 and (I think) RISC-V (and probably anything else loosely
based on MIPS) only have single-register load/store instructions. They'll
mainly be UP only; I've not looked at RISC-V enough to see what it has
when supporting SMP.

> For the UP case you can do your own, of course.

A generic version of the soft interrupt disable code would help. Then it
would just be an inc/dec of memory rather than having to save the current
interrupt enable state. Especially for code that only disables interrupts
for a few instructions - so the costs of deferring the interrupt don't
happen often.

	David
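For reference, the "soft" interrupt disabling David mentions is the scheme where a critical section only bumps a counter and any real interrupt arriving in the meantime is recorded and replayed once the counter drops back to zero (powerpc does something in this spirit). The sketch below is illustrative only; every name in it is made up and none of it is existing generic kernel code.

	#include <linux/compiler.h>	/* barrier() */
	#include <linux/types.h>

	/*
	 * Illustrative only: "soft" interrupt disable.  Critical sections
	 * just increment a counter instead of touching the real interrupt
	 * enable state (a real implementation would keep this per CPU).
	 * An interrupt that fires while the counter is non-zero is marked
	 * pending and replayed when the counter returns to zero, so the
	 * common case is a single memory inc/dec.
	 */
	static unsigned int soft_irq_count;	/* nesting depth of "disables" */
	static bool soft_irq_pending;		/* an irq arrived while "disabled" */

	static void replay_pending_irq(void)
	{
		/* arch hook: re-raise or directly handle the deferred interrupt */
	}

	static inline void soft_irq_disable(void)
	{
		soft_irq_count++;
		barrier();		/* keep the critical section inside the count */
	}

	static inline void soft_irq_enable(void)
	{
		barrier();
		if (--soft_irq_count == 0 && soft_irq_pending) {
			soft_irq_pending = false;
			replay_pending_irq();
		}
	}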
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 3fce5e4b4f2b..b05416e14cb6 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -27,6 +27,7 @@ #include <linux/cpu.h> #include <linux/oom.h> +#include <asm/local64.h> #include <asm/local.h> /* @@ -463,27 +464,9 @@ enum { RB_CTX_MAX }; -#if BITS_PER_LONG == 32 -#define RB_TIME_32 -#endif - -/* To test on 64 bit machines */ -//#define RB_TIME_32 - -#ifdef RB_TIME_32 - -struct rb_time_struct { - local_t cnt; - local_t top; - local_t bottom; - local_t msb; -}; -#else -#include <asm/local64.h> struct rb_time_struct { local64_t time; }; -#endif typedef struct rb_time_struct rb_time_t; #define MAX_NEST 5 @@ -573,183 +556,9 @@ struct ring_buffer_iter { int missed_events; }; -#ifdef RB_TIME_32 - -/* - * On 32 bit machines, local64_t is very expensive. As the ring - * buffer doesn't need all the features of a true 64 bit atomic, - * on 32 bit, it uses these functions (64 still uses local64_t). - * - * For the ring buffer, 64 bit required operations for the time is - * the following: - * - * - Reads may fail if it interrupted a modification of the time stamp. - * It will succeed if it did not interrupt another write even if - * the read itself is interrupted by a write. - * It returns whether it was successful or not. - * - * - Writes always succeed and will overwrite other writes and writes - * that were done by events interrupting the current write. - * - * - A write followed by a read of the same time stamp will always succeed, - * but may not contain the same value. - * - * - A cmpxchg will fail if it interrupted another write or cmpxchg. - * Other than that, it acts like a normal cmpxchg. - * - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half - * (bottom being the least significant 30 bits of the 60 bit time stamp). - * - * The two most significant bits of each half holds a 2 bit counter (0-3). - * Each update will increment this counter by one. - * When reading the top and bottom, if the two counter bits match then the - * top and bottom together make a valid 60 bit number. - */ -#define RB_TIME_SHIFT 30 -#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) -#define RB_TIME_MSB_SHIFT 60 - -static inline int rb_time_cnt(unsigned long val) -{ - return (val >> RB_TIME_SHIFT) & 3; -} - -static inline u64 rb_time_val(unsigned long top, unsigned long bottom) -{ - u64 val; - - val = top & RB_TIME_VAL_MASK; - val <<= RB_TIME_SHIFT; - val |= bottom & RB_TIME_VAL_MASK; - - return val; -} - -static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) -{ - unsigned long top, bottom, msb; - unsigned long c; - - /* - * If the read is interrupted by a write, then the cnt will - * be different. Loop until both top and bottom have been read - * without interruption. 
- */ - do { - c = local_read(&t->cnt); - top = local_read(&t->top); - bottom = local_read(&t->bottom); - msb = local_read(&t->msb); - } while (c != local_read(&t->cnt)); - - *cnt = rb_time_cnt(top); - - /* If top, msb or bottom counts don't match, this interrupted a write */ - if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom)) - return false; - - /* The shift to msb will lose its cnt bits */ - *ret = rb_time_val(top, bottom) | ((u64)msb << RB_TIME_MSB_SHIFT); - return true; -} - -static bool rb_time_read(rb_time_t *t, u64 *ret) -{ - unsigned long cnt; - - return __rb_time_read(t, ret, &cnt); -} - -static inline unsigned long rb_time_val_cnt(unsigned long val, unsigned long cnt) -{ - return (val & RB_TIME_VAL_MASK) | ((cnt & 3) << RB_TIME_SHIFT); -} - -static inline void rb_time_split(u64 val, unsigned long *top, unsigned long *bottom, - unsigned long *msb) -{ - *top = (unsigned long)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK); - *bottom = (unsigned long)(val & RB_TIME_VAL_MASK); - *msb = (unsigned long)(val >> RB_TIME_MSB_SHIFT); -} - -static inline void rb_time_val_set(local_t *t, unsigned long val, unsigned long cnt) -{ - val = rb_time_val_cnt(val, cnt); - local_set(t, val); -} - -static void rb_time_set(rb_time_t *t, u64 val) -{ - unsigned long cnt, top, bottom, msb; - - rb_time_split(val, &top, &bottom, &msb); - - /* Writes always succeed with a valid number even if it gets interrupted. */ - do { - cnt = local_inc_return(&t->cnt); - rb_time_val_set(&t->top, top, cnt); - rb_time_val_set(&t->bottom, bottom, cnt); - rb_time_val_set(&t->msb, val >> RB_TIME_MSB_SHIFT, cnt); - } while (cnt != local_read(&t->cnt)); -} - -static inline bool -rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set) -{ - return local_try_cmpxchg(l, &expect, set); -} - -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) -{ - unsigned long cnt, top, bottom, msb; - unsigned long cnt2, top2, bottom2, msb2; - u64 val; - - /* Any interruptions in this function should cause a failure */ - cnt = local_read(&t->cnt); - - /* The cmpxchg always fails if it interrupted an update */ - if (!__rb_time_read(t, &val, &cnt2)) - return false; - - if (val != expect) - return false; - - if ((cnt & 3) != cnt2) - return false; - - cnt2 = cnt + 1; - - rb_time_split(val, &top, &bottom, &msb); - msb = rb_time_val_cnt(msb, cnt); - top = rb_time_val_cnt(top, cnt); - bottom = rb_time_val_cnt(bottom, cnt); - - rb_time_split(set, &top2, &bottom2, &msb2); - msb2 = rb_time_val_cnt(msb2, cnt); - top2 = rb_time_val_cnt(top2, cnt2); - bottom2 = rb_time_val_cnt(bottom2, cnt2); - - if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) - return false; - if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) - return false; - if (!rb_time_read_cmpxchg(&t->top, top, top2)) - return false; - if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) - return false; - return true; -} - -#else /* 64 bits */ - -/* local64_t always succeeds */ - -static inline bool rb_time_read(rb_time_t *t, u64 *ret) +static inline void rb_time_read(rb_time_t *t, u64 *ret) { *ret = local64_read(&t->time); - return true; } static void rb_time_set(rb_time_t *t, u64 val) { @@ -760,7 +569,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) { return local64_try_cmpxchg(&t->time, &expect, set); } -#endif /* * Enable this to make sure that the event passed to @@ -867,10 +675,7 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer, WARN_ONCE(1, "nest (%d) greater than max", nest); fail: - /* Can only fail on 32 bit */ - if 
(!rb_time_read(&cpu_buffer->write_stamp, &ts)) - /* Screw it, just read the current time */ - ts = rb_time_stamp(cpu_buffer->buffer); + rb_time_read(&cpu_buffer->write_stamp, &ts); return ts; } @@ -2867,7 +2672,7 @@ rb_check_timestamp(struct ring_buffer_per_cpu *cpu_buffer, (unsigned long long)info->ts, (unsigned long long)info->before, (unsigned long long)info->after, - (unsigned long long)(rb_time_read(&cpu_buffer->write_stamp, &write_stamp) ? write_stamp : 0), + (unsigned long long)({rb_time_read(&cpu_buffer->write_stamp, &write_stamp); write_stamp;}), sched_clock_stable() ? "" : "If you just came from a suspend/resume,\n" "please switch to the trace global clock:\n" @@ -3025,8 +2830,7 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer, delta = rb_time_delta(event); - if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp)) - return false; + rb_time_read(&cpu_buffer->write_stamp, &write_stamp); /* Make sure the write stamp is read before testing the location */ barrier(); @@ -3569,16 +3373,14 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, struct ring_buffer_event *event; struct buffer_page *tail_page; unsigned long tail, write, w; - bool a_ok; - bool b_ok; /* Don't let the compiler play games with cpu_buffer->tail_page */ tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page); /*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK; barrier(); - b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before); - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); + rb_time_read(&cpu_buffer->before_stamp, &info->before); + rb_time_read(&cpu_buffer->write_stamp, &info->after); barrier(); info->ts = rb_time_stamp(cpu_buffer->buffer); @@ -3593,7 +3395,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, if (!w) { /* Use the sub-buffer timestamp */ info->delta = 0; - } else if (unlikely(!a_ok || !b_ok || info->before != info->after)) { + } else if (unlikely(info->before != info->after)) { info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND; info->length += RB_LEN_TIME_EXTEND; } else { @@ -3622,13 +3424,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, if (likely(tail == w)) { u64 save_before; - bool s_ok; /* Nothing interrupted us between A and C */ /*D*/ rb_time_set(&cpu_buffer->write_stamp, info->ts); barrier(); - /*E*/ s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before); - RB_WARN_ON(cpu_buffer, !s_ok); + /*E*/ rb_time_read(&cpu_buffer->before_stamp, &save_before); if (likely(!(info->add_timestamp & (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)))) /* This did not interrupt any time update */ @@ -3641,8 +3441,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, if (unlikely(info->ts != save_before)) { /* SLOW PATH - Interrupted between C and E */ - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); - RB_WARN_ON(cpu_buffer, !a_ok); + rb_time_read(&cpu_buffer->write_stamp, &info->after); /* Write stamp must only go forward */ if (save_before > info->after) { @@ -3657,9 +3456,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, } else { u64 ts; /* SLOW PATH - Interrupted between A and C */ - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); - /* Was interrupted before here, write_stamp must be valid */ - RB_WARN_ON(cpu_buffer, !a_ok); + rb_time_read(&cpu_buffer->write_stamp, &info->after); ts = rb_time_stamp(cpu_buffer->buffer); barrier(); /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer 
*buffer, struct rb_event_info info; int nr_loops = 0; int add_ts_default; + static int once; /* ring buffer does cmpxchg, make sure it is safe in NMI context */ if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&