Message ID | 20231215165628.096822746@goodmis.org (mailing list archive) |
---|---|
State | Superseded |
Commit | 64aecfaef9a7049e1e6eecf7c8d9f2471acb74f7 |
Headers | show |
Series | tracing: Replace final 64-bit cmpxchg with compare and update if available | expand |
On Fri, 15 Dec 2023 11:55:13 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > There's only one place that performs a 64-bit cmpxchg for the timestamp > processing. The cmpxchg is only to set the write_stamp equal to the > before_stamp, and if it doesn't get set, then the next event will simply > be forced to add an absolute timestamp. > > Given that 64-bit cmpxchg is expensive on 32-bit, and the current > workaround uses 3 consecutive 32-bit cmpxchg doesn't make it any faster. > It's best to just not do the cmpxchg as a simple compare works for the > accuracy of the timestamp. The only thing that will happen without the > cmpxchg is the prepended absolute timestamp on the next event which is not > that big of a deal as the path where this happens is seldom hit because it > requires an interrupt to happen between a few lines of code that also > writes an event into the same buffer. > > With this change, the 32-bit rb_time_t workaround can be removed. > Hmm, but this patch itself is just moving rb_time_cmpxchg() in the new rb_time_cmp_and_update() function. The actual change has been done in the next patch. I think there is no reason to split this from the second one... Isn't this part actual change? > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) > { > - return rb_time_cmpxchg(t, expect, set); > +#ifdef RB_TIME_32 > + return expect == READ_ONCE(t->time); > +#else > + return local64_try_cmpxchg(&t->time, &expect, set); > +#endif > } Thank you, > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > --- > kernel/trace/ring_buffer.c | 23 ++++++++++++++++++++--- > 1 file changed, 20 insertions(+), 3 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 1a26b3a1901f..c6842a4331a9 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -762,6 +762,15 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) > } > #endif > > +/* > + * Returns true if t == expect, and if possible will update with set, > + * but t is not guaranteed to be updated even if this returns true > + */ > +static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) > +{ > + return rb_time_cmpxchg(t, expect, set); > +} > + > /* > * Enable this to make sure that the event passed to > * ring_buffer_event_time_stamp() is not committed and also > @@ -3622,9 +3631,17 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > barrier(); > /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && > info->after < ts && > - rb_time_cmpxchg(&cpu_buffer->write_stamp, > - info->after, ts)) { > - /* Nothing came after this event between C and E */ > + rb_time_cmp_and_update(&cpu_buffer->write_stamp, > + info->after, ts)) { > + /* > + * Nothing came after this event between C and E it is > + * safe to use info->after for the delta. > + * The above rb_time_cmp_and_update() may or may not > + * have updated the write_stamp. If it did not then > + * the next event will simply add an absolute timestamp > + * as the write_stamp will be different than the > + * before_stamp. > + */ > info->delta = ts - info->after; > } else { > /* > -- > 2.42.0 > >
On Mon, 18 Dec 2023 23:24:55 +0900 Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote: > On Fri, 15 Dec 2023 11:55:13 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > > > There's only one place that performs a 64-bit cmpxchg for the timestamp > > processing. The cmpxchg is only to set the write_stamp equal to the > > before_stamp, and if it doesn't get set, then the next event will simply > > be forced to add an absolute timestamp. > > > > Given that 64-bit cmpxchg is expensive on 32-bit, and the current > > workaround uses 3 consecutive 32-bit cmpxchg doesn't make it any faster. > > It's best to just not do the cmpxchg as a simple compare works for the > > accuracy of the timestamp. The only thing that will happen without the > > cmpxchg is the prepended absolute timestamp on the next event which is not > > that big of a deal as the path where this happens is seldom hit because it > > requires an interrupt to happen between a few lines of code that also > > writes an event into the same buffer. > > > > With this change, the 32-bit rb_time_t workaround can be removed. > > > > Hmm, but this patch itself is just moving rb_time_cmpxchg() in the new > rb_time_cmp_and_update() function. The actual change has been done > in the next patch. Exactly. Which is why I said above "with this change, the 32-bit rb_time_t workaround can be removed". It can't be removed without this change. > I think there is no reason to split this from the > second one... I originally had it as one patch, but I disliked the removal of the workaround touching the main logic code (which this patch does). Basically I broke it into: 1. Remove workaround exposure from the main logic. (this patch) 2. Remove the workaround. (next patch). > > Isn't this part actual change? This part is abstracted out from the main logic. Which is why I made this patch. > > > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) > > { > > - return rb_time_cmpxchg(t, expect, set); > > +#ifdef RB_TIME_32 > > + return expect == READ_ONCE(t->time); And I need to make a v2 as the above is wrong. It should have been: return expect == local64_read(&t->time); -- Steve > > +#else > > + return local64_try_cmpxchg(&t->time, &expect, set); > > +#endif > > }
On Mon, 18 Dec 2023 10:15:31 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Basically I broke it into: > > 1. Remove workaround exposure from the main logic. (this patch) > 2. Remove the workaround. (next patch). > > > > > Isn't this part actual change? > > This part is abstracted out from the main logic. Which is why I made this > patch. > > > > > > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) > > > { > > > - return rb_time_cmpxchg(t, expect, set); > > > +#ifdef RB_TIME_32 > > > + return expect == READ_ONCE(t->time); > > And I need to make a v2 as the above is wrong. It should have been: > > return expect == local64_read(&t->time); My v2 version will also make 64 bit not guaranteed to update on return of true. Which adds even more reason to separate out the two. -- Steve
On Mon, 18 Dec 2023 13:42:40 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) > > > > { > > > > - return rb_time_cmpxchg(t, expect, set); > > > > +#ifdef RB_TIME_32 > > > > + return expect == READ_ONCE(t->time); > > > > And I need to make a v2 as the above is wrong. It should have been: > > > > return expect == local64_read(&t->time); > > > My v2 version will also make 64 bit not guaranteed to update on return of > true. Which adds even more reason to separate out the two. This code was failing my tests, and after figuring out why, I realized I can remove this 64-bit cmpxchg for both 64-bit and 32-bit architectures! I was thinking that the 64-bit cmpxchg() was to keep from adding an absolute timestamp to after the slowpath. But that was not the case. It was actually going to *add* a absolute timestamp if it succeeded. Well, not in every case, but in some cases. First let me explain the purpose of this last 64-bit cmpxchg: Before reserving the data on the ring buffer, we do: /*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK; barrier(); b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before); a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); barrier(); info->ts = rb_time_stamp(cpu_buffer->buffer); The 'w' is the location we expect to have our data on. Then we read the two timestamps "before_stamp" and "write_stamp" and save them locally on the stack (info is a stack item). Then we read the current timestamp and place it into info->ts. /*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts); All events written into the ring buffer will write into the "before_stamp" before reserving the ring buffer. I cut out the logic above that makes sure that "before_stamp" matches "write_stamp", as if they do not match, the "write_stamp" can not be trusted. As soon as the "before_stamp" is written to, then any interrupting events will not trust the "write_stamp" and add its own absolute timestamp, but also update the write_stamp to its before_stamp so that later events can be trusted. Now reserve the data for this event on the ring buffer: /*C*/ write = local_add_return(info->length, &tail_page->write); tail = write - info->length; if (likely(tail == w)) { The above reserves the data for the event on the ring buffer, and then checks to see if where it was reserved matches where it expected to be reserved at the start of the operations above. [...] } else { Now we go into the slow path of where some thing snucked in between /*A*/ and /*C*/ a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); Read the write_stamp again. ts = rb_time_stamp(cpu_buffer->buffer); Get the current timestamp barrier(); /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && info->after < ts && If "write" from /*C*/ still matches the current location on the ring buffer, we know that nothing came in between /*C*/ and /*E*/ rb_time_cmpxchg(&cpu_buffer->write_stamp, info->after, ts)) { info->delta = ts - info->after; Now the above cmpxchg() is needed in this code because we need to update the write_stamp to the current timestamp so that we can safely calculate the delta. But by writing to the write_stamp via the cmpxchg() we are actually making it different than the before_stamp, as the interruption could have been between /*B*/ and /*C*/, and this would also force the next event to use an absolute timestamp. Of course, if it happened between /*A*/ and /*B*/ it is actually validating the write_stamp again by making it match the before_stamp. The real reason we need to do the cmpxchg() is because of a possible interruption after /*E*/ above. The write_stamp needs to be updated atomically, otherwise the interrupting event that comes after /*E*/, will be using the previous interruption write_stamp to calculate its delta from. For example: /*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts); ---> interrupt, updates before_stamp and write_stamp to its own ts /*C*/ write = local_add_return(info->length, &tail_page->write); /* Allocated memory, any interruptions here, will be using the previous interrupt ts and not this event */ /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && info->after < ts && If write == tail_page->write than nothing came in, so it is still safe to add a delta, but only if we atomically update the write_stamp, forcing interrupting events to add an absolute timestamp. If it is not done atomically, then there's a race that another interrupt could come in, and use the previous interrupt event to calculate its delta, even though, this event is in between the two. BUT! we can also do this instead without the cmpcxchg! Especially as success of cmpxchg could possibly force a absolute timestamp. If we always force the absolute timestamp, then there's no need for the cmpxchg at all. By doing: /* * Read a new timestamp and update before_stamp with it. * Any new event coming in now will use an absolute timestamp */ ts = rb_time_stamp(cpu_buffer->buffer); rb_time_set(&cpu_buffer->before_stamp, ts); /* Next event will force an absolute timestamp */ barrier(); a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); barrier(); /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && info->after < ts) { /* * If nothing came in between C and E, it is safe * to use the write_stamp as the delta. */ info->delta = ts - info->after; } else { /* * Interrupted twice and the second interruption is possibly * using the first interruption to calculate its delta. Just * set our delta to zero to not mess the event that came in * after up. */ info->delta = 0; } With this logic, we do not need any 64-bit cmpxchg() at all! -- Steve
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 1a26b3a1901f..c6842a4331a9 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -762,6 +762,15 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) } #endif +/* + * Returns true if t == expect, and if possible will update with set, + * but t is not guaranteed to be updated even if this returns true + */ +static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set) +{ + return rb_time_cmpxchg(t, expect, set); +} + /* * Enable this to make sure that the event passed to * ring_buffer_event_time_stamp() is not committed and also @@ -3622,9 +3631,17 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, barrier(); /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && info->after < ts && - rb_time_cmpxchg(&cpu_buffer->write_stamp, - info->after, ts)) { - /* Nothing came after this event between C and E */ + rb_time_cmp_and_update(&cpu_buffer->write_stamp, + info->after, ts)) { + /* + * Nothing came after this event between C and E it is + * safe to use info->after for the delta. + * The above rb_time_cmp_and_update() may or may not + * have updated the write_stamp. If it did not then + * the next event will simply add an absolute timestamp + * as the write_stamp will be different than the + * before_stamp. + */ info->delta = ts - info->after; } else { /*