Message ID | 20231218230712.3a76b081@gandalf.local.home (mailing list archive) |
---|---|
State | Accepted |
Commit | b803d7c664d55705831729d2f2e29c874bcd62ea |
Headers | show |
Series | ring-buffer: Fix slowpath of interrupted event | expand |
On Mon, 18 Dec 2023 23:07:12 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > To synchronize the timestamps with the ring buffer reservation, there are > two timestamps that are saved in the buffer meta data. > > 1. before_stamp > 2. write_stamp > > When the two are equal, the write_stamp is considered valid, as in, it may > be used to calculate the delta of the next event as the write_stamp is the > timestamp of the previous reserved event on the buffer. > > This is done by the following: > > /*A*/ w = current position on the ring buffer > before = before_stamp > after = write_stamp > ts = read current timestamp > > if (before != after) { > write_stamp is not valid, force adding an absolute > timestamp. (additional comment) This happens if this caller interrupts another reservation process's B to E. (thus the original one only updates "before_stamp", but "write_stamp" is old.) > } > > /*B*/ before_stamp = ts > > /*C*/ write = local_add_return(event length, position on ring buffer) > > if (w == write - event length) { > /* Nothing interrupted between A and C */ > /*E*/ write_stamp = ts; > delta = ts - after > /* > * If nothing interrupted again, > * before_stamp == write_stamp and write_stamp > * can be used to calculate the delta for > * events that come in after this one. > */ (additional comment) If something interrupts between C and E, the write_stamp goes backward because interrupted one updates the before_stamp and write_stamp with new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts). Thus anyway the next entry will use absolute time stamp forcibly. > } else { > > /* > * The slow path! > * Was interrupted between A and C. > */ > > This is the place that there's a bug. We currently have: > > after = write_stamp > ts = read current timestamp > > /*F*/ if (write == current position on the ring buffer && > after < ts && cmpxchg(write_stamp, after, ts)) { > > delta = ts - after; > > } else { > delta = 0; > } > > The assumption is that if the current position on the ring buffer hasn't > moved between C and F, then it also was not interrupted, and that the last > event written has a timestamp that matches the write_stamp. That is the > write_stamp is valid. > > But this may not be the case: > > If a task context event was interrupted by softirq between B and C. > > And the softirq wrote an event that got interrupted by a hard irq between > C and E. > > and the hard irq wrote an event (does not need to be interrupted) > > We have: > > /*B*/ before_stamp = ts of normal context > > ---> interrupted by softirq > > /*B*/ before_stamp = ts of softirq context > > ---> interrupted by hardirq > > /*B*/ before_stamp = ts of hard irq context > /*E*/ write_stamp = ts of hard irq context > > /* matches and write_stamp valid */ > <---- > > /*E*/ write_stamp = ts of softirq context > > /* No longer matches before_stamp, write_stamp is not valid! */ > > <--- > > w != write - length, go to slow path > > // Right now the order of events in the ring buffer is: > // > // |-- softirq event --|-- hard irq event --|-- normal context event --| > // > > after = write_stamp (this is the ts of softirq) > ts = read current timestamp > > if (write == current position on the ring buffer [true] && > after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) { > > delta = ts - after [Wrong!] > > The delta is to be between the hard irq event and the normal context > event, but the above logic made the delta between the softirq event and > the normal context event, where the hard irq event is between the two. This > will shift all the remaining event timestamps on the sub-buffer > incorrectly. > > The write_stamp is only valid if it matches the before_stamp. The cmpxchg > does nothing to help this. That's right. Even if someone interrupts between A and C, we can write the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new) so the next entry will forcibly record the absolute timestamp. In this case, what we need to do here is using the absolute timestamp instead of delta. > > Instead, the following logic can be done to fix this: > > before = before_stamp > ts = read current timestamp > before_stamp = ts > > after = write_stamp > > if (write == current position on the ring buffer && > after == before && after < ts) { > > delta = ts - after > > } else { > delta = 0; > } Ah, this is a good idea. Instead of using the old timestamp, use delta = 0, thus it will reuse the interrupted timestamp if someone interrupts between A and C. Yeah the above works, but my question is, do we really need this really slow path? I mean; > if (w == write - event length) { > /* Nothing interrupted between A and C */ > /*E*/ write_stamp = ts; > delta = ts - after } else { /* Something interrupted between A and C, which should record a new entry before this reserved entry with newer timestamp. we reuse it. */ ts = after = write_stamp; delta = 0; } Isn't this enough? Thank you, > > The above will only use the write_stamp if it still matches before_stamp > and was tested to not have changed since C. > > As a bonus, with this logic we do not need any 64-bit cmpxchg() at all! > > This mean the 32-bit rb_time_t workaround can finally be removed. But > that's for a later time. > > Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/ > > Cc: stable@vger.kernel.org > Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp") > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > --- > kernel/trace/ring_buffer.c | 79 ++++++++++++-------------------------- > 1 file changed, 24 insertions(+), 55 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 5a114e752f11..e7055f52afe0 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set) > return local_try_cmpxchg(l, &expect, set); > } > > -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) > -{ > - unsigned long cnt, top, bottom, msb; > - unsigned long cnt2, top2, bottom2, msb2; > - u64 val; > - > - /* Any interruptions in this function should cause a failure */ > - cnt = local_read(&t->cnt); > - > - /* The cmpxchg always fails if it interrupted an update */ > - if (!__rb_time_read(t, &val, &cnt2)) > - return false; > - > - if (val != expect) > - return false; > - > - if ((cnt & 3) != cnt2) > - return false; > - > - cnt2 = cnt + 1; > - > - rb_time_split(val, &top, &bottom, &msb); > - msb = rb_time_val_cnt(msb, cnt); > - top = rb_time_val_cnt(top, cnt); > - bottom = rb_time_val_cnt(bottom, cnt); > - > - rb_time_split(set, &top2, &bottom2, &msb2); > - msb2 = rb_time_val_cnt(msb2, cnt); > - top2 = rb_time_val_cnt(top2, cnt2); > - bottom2 = rb_time_val_cnt(bottom2, cnt2); > - > - if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->top, top, top2)) > - return false; > - if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) > - return false; > - return true; > -} > - > #else /* 64 bits */ > > /* local64_t always succeeds */ > @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val) > { > local64_set(&t->time, val); > } > - > -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) > -{ > - return local64_try_cmpxchg(&t->time, &expect, set); > -} > #endif > > /* > @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, > } else { > u64 ts; > /* SLOW PATH - Interrupted between A and C */ > - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); > - /* Was interrupted before here, write_stamp must be valid */ > + > + /* Save the old before_stamp */ > + a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before); > RB_WARN_ON(cpu_buffer, !a_ok); > + > + /* > + * Read a new timestamp and update the before_stamp to make > + * the next event after this one force using an absolute > + * timestamp. This is in case an interrupt were to come in > + * between E and F. > + */ > ts = rb_time_stamp(cpu_buffer->buffer); > + rb_time_set(&cpu_buffer->before_stamp, ts); > + > + barrier(); > + /*E*/ a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); > + /* Was interrupted before here, write_stamp must be valid */ > + RB_WARN_ON(cpu_buffer, !a_ok); > barrier(); > - /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && > - info->after < ts && > - rb_time_cmpxchg(&cpu_buffer->write_stamp, > - info->after, ts)) { > - /* Nothing came after this event between C and E */ > + /*F*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && > + info->after == info->before && info->after < ts) { > + /* > + * Nothing came after this event between C and F, it is > + * safe to use info->after for the delta as it > + * matched info->before and is still valid. > + */ > info->delta = ts - info->after; > } else { > /* > - * Interrupted between C and E: > + * Interrupted between C and F: > * Lost the previous events time stamp. Just set the > * delta to zero, and this will be the same time as > * the event this event interrupted. And the events that > -- > 2.42.0 >
On Tue, 19 Dec 2023 23:37:10 +0900 Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote: > Yeah the above works, but my question is, do we really need this > really slow path? I mean; > > > if (w == write - event length) { > > /* Nothing interrupted between A and C */ > > /*E*/ write_stamp = ts; > > delta = ts - after > > } else { > /* > Something interrupted between A and C, which should record > a new entry before this reserved entry with newer timestamp. > we reuse it. > */ > ts = after = write_stamp; > delta = 0; > } > > Isn't this enough? I really like to avoid: delta = 0 when possible. It's basically what I do when I have no other options. Why? Because let's just say you are looking at the time of interrupt events. If you just trace the entry of the interrupt, and that interrupt interrupted an event being written to, we have this: Time starts at ts 1000, and we are able to calculate the delta of the interrupted event. And the trace will have: 1000 - interrupt event 2000 - normal context event 2100 - next normal context event Where we see the delta between the interrupt event and the normal context event was 1000. But if we just had it be delta = 0, it would be: 1000 - interrupt event 1000 - normal context event 2100 - next normal context event It will look like the time between the interrupt event and the normal context event was instant, and the time between the normal context event and the next normal context event was 1100 when in reality it was just 100. The above scenario is rather common. Perhaps it happens 1% of the time. The case where we currently have delta = 0 only happens when the same event gets interrupted twice. That is, two separate interrupts came in, one before it allocated its space on the buffer, and one after it allocated. That's a much more race occurrence (0.01% possibly, or less). In fact my traces seldom even show it. Most of the time, even when doing function tracing, I have a hard time seeing this rare race. So, if we can have delta=0 only 0.01% or less of the time, I rather do that then have it be delta=0 1% of the time. Thanks for the review! -- Steve
On Tue, 19 Dec 2023 10:10:27 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > 1000 - interrupt event > 2000 - normal context event > 2100 - next normal context event > > Where we see the delta between the interrupt event and the normal context > event was 1000. But if we just had it be delta = 0, it would be: > > 1000 - interrupt event > 1000 - normal context event > 2100 - next normal context event > Also, let me show the rare case of where we do have delta=0. That would be an interrupt event coming in before *and* after the normal context event was allocated. That is, we have: |-- interrupt event --|-- normal context event --|-- interrupt event --| And that could look like: 1000 - interrupt event 1000 - normal context event 2100 - next interrupt event It may look like the normal context ran for 1000 before the next interrupt event, but in reality it happened sometime between the two. Hmm, in this case, what we could possibly do is to read the absolute timestamp *after* the interrupted event! That is, we detected we are here: |-- interrupt event --|-- normal context event --|-- interrupt event --| ^ | | And we know that the write pointer is past this event. We can look at the next event and see what its timestamp is. It is either a delta or a absolute timestamp. It is a delta if the previous interrupt came in between B and C, as that event would have made before_stamp == write_stamp. Or it is an absolute delta if this event updated the before_stamp making it no longer match the write_stamp of the previous event. If it is an absolute timestamp, it means the interrupt came in before our update of the before stamp, and we could make our delta: absolute value of next timestamp - this event before_stamp As we have: |-- interrupt event --|-- normal context event --|-- interrupt event --| ^ ^ ^ | | | ts is before before_stamp | | our before_stamp | absolute value We just need to make our delta not go beyond the absolute value. So: ts of first event + (absolute value - our before_stamp) Should not be greater than the absolute value. If the next event has a delta, we could make this event delta equal to it, and then update the next event to have a delta of zero, which would give us: 1000 - interrupt event 2100 - normal context event 2100 - next interrupt event Which is much more realistic to what happened. But all this is for another time ;-) -- Steve
On Tue, 19 Dec 2023 10:36:13 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > |-- interrupt event --|-- normal context event --|-- interrupt event --| > > ^ ^ ^ > | | | > ts is before before_stamp | | > our before_stamp | > absolute value > > We just need to make our delta not go beyond the absolute value. So: > > ts of first event + (absolute value - our before_stamp) > > Should not be greater than the absolute value. Hmm, this isn't good enough either. Because we could have had the interrupt happen *after* the before_stamp update, and leave the write_stamp not equal to the before_stamp (like the original patch is fixing). That is, in the case of finding an absolute value, there's still no way to know what delta to use without walking the sub-buffer. Hmm, since this is so rare, we could do that :-/ Anyway, for now, it's just going to be delta = 0, and we could do the sub-buffer walk if it becomes an issue (which it hasn't for the last decade or more). -- Steve
On Tue, 19 Dec 2023 10:10:27 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Tue, 19 Dec 2023 23:37:10 +0900 > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote: > > > Yeah the above works, but my question is, do we really need this > > really slow path? I mean; > > > > > if (w == write - event length) { > > > /* Nothing interrupted between A and C */ > > > /*E*/ write_stamp = ts; > > > delta = ts - after > > > > } else { > > /* > > Something interrupted between A and C, which should record > > a new entry before this reserved entry with newer timestamp. > > we reuse it. > > */ > > ts = after = write_stamp; > > delta = 0; > > } > > > > Isn't this enough? > > I really like to avoid: delta = 0 when possible. It's basically what I do > when I have no other options. Why? > > Because let's just say you are looking at the time of interrupt events. If > you just trace the entry of the interrupt, and that interrupt interrupted > an event being written to, we have this: > > Time starts at ts 1000, and we are able to calculate the delta of the > interrupted event. And the trace will have: > > 1000 - interrupt event > 2000 - normal context event > 2100 - next normal context event > > Where we see the delta between the interrupt event and the normal context > event was 1000. But if we just had it be delta = 0, it would be: > > 1000 - interrupt event > 1000 - normal context event > 2100 - next normal context event > > It will look like the time between the interrupt event and the normal > context event was instant, and the time between the normal context event > and the next normal context event was 1100 when in reality it was just 100. Ah, OK. interrupt event can be the beginning of the interrupt handling and normal context event seems not interrupted from the traced log. OK, then we have to adjust the ts of normal context event. (interrupted after reserving the event is OK because user can observe the interrupt event on the log between normal context event and next one.) Reveiewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Thank you! > > The above scenario is rather common. Perhaps it happens 1% of the time. The > case where we currently have delta = 0 only happens when the same event > gets interrupted twice. That is, two separate interrupts came in, one > before it allocated its space on the buffer, and one after it allocated. > That's a much more race occurrence (0.01% possibly, or less). In fact my > traces seldom even show it. Most of the time, even when doing function > tracing, I have a hard time seeing this rare race. > > So, if we can have delta=0 only 0.01% or less of the time, I rather do that > then have it be delta=0 1% of the time. > > Thanks for the review! > > -- Steve
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 5a114e752f11..e7055f52afe0 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set) return local_try_cmpxchg(l, &expect, set); } -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) -{ - unsigned long cnt, top, bottom, msb; - unsigned long cnt2, top2, bottom2, msb2; - u64 val; - - /* Any interruptions in this function should cause a failure */ - cnt = local_read(&t->cnt); - - /* The cmpxchg always fails if it interrupted an update */ - if (!__rb_time_read(t, &val, &cnt2)) - return false; - - if (val != expect) - return false; - - if ((cnt & 3) != cnt2) - return false; - - cnt2 = cnt + 1; - - rb_time_split(val, &top, &bottom, &msb); - msb = rb_time_val_cnt(msb, cnt); - top = rb_time_val_cnt(top, cnt); - bottom = rb_time_val_cnt(bottom, cnt); - - rb_time_split(set, &top2, &bottom2, &msb2); - msb2 = rb_time_val_cnt(msb2, cnt); - top2 = rb_time_val_cnt(top2, cnt2); - bottom2 = rb_time_val_cnt(bottom2, cnt2); - - if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) - return false; - if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) - return false; - if (!rb_time_read_cmpxchg(&t->top, top, top2)) - return false; - if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) - return false; - return true; -} - #else /* 64 bits */ /* local64_t always succeeds */ @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val) { local64_set(&t->time, val); } - -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set) -{ - return local64_try_cmpxchg(&t->time, &expect, set); -} #endif /* @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, } else { u64 ts; /* SLOW PATH - Interrupted between A and C */ - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); - /* Was interrupted before here, write_stamp must be valid */ + + /* Save the old before_stamp */ + a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before); RB_WARN_ON(cpu_buffer, !a_ok); + + /* + * Read a new timestamp and update the before_stamp to make + * the next event after this one force using an absolute + * timestamp. This is in case an interrupt were to come in + * between E and F. + */ ts = rb_time_stamp(cpu_buffer->buffer); + rb_time_set(&cpu_buffer->before_stamp, ts); + + barrier(); + /*E*/ a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after); + /* Was interrupted before here, write_stamp must be valid */ + RB_WARN_ON(cpu_buffer, !a_ok); barrier(); - /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && - info->after < ts && - rb_time_cmpxchg(&cpu_buffer->write_stamp, - info->after, ts)) { - /* Nothing came after this event between C and E */ + /*F*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && + info->after == info->before && info->after < ts) { + /* + * Nothing came after this event between C and F, it is + * safe to use info->after for the delta as it + * matched info->before and is still valid. + */ info->delta = ts - info->after; } else { /* - * Interrupted between C and E: + * Interrupted between C and F: * Lost the previous events time stamp. Just set the * delta to zero, and this will be the same time as * the event this event interrupted. And the events that