diff mbox series

ring-buffer: Fix slowpath of interrupted event

Message ID 20231218230712.3a76b081@gandalf.local.home (mailing list archive)
State Accepted
Commit b803d7c664d55705831729d2f2e29c874bcd62ea
Headers show
Series ring-buffer: Fix slowpath of interrupted event | expand

Commit Message

Steven Rostedt Dec. 19, 2023, 4:07 a.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.

1. before_stamp
2. write_stamp

When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.

This is done by the following:

 /*A*/	w = current position on the ring buffer
	before = before_stamp
	after = write_stamp
	ts = read current timestamp

	if (before != after) {
		write_stamp is not valid, force adding an absolute
		timestamp.
	}

 /*B*/	before_stamp = ts

 /*C*/	write = local_add_return(event length, position on ring buffer)

	if (w == write - event length) {
		/* Nothing interrupted between A and C */
 /*E*/		write_stamp = ts;
		delta = ts - after
		/*
		 * If nothing interrupted again,
		 * before_stamp == write_stamp and write_stamp
		 * can be used to calculate the delta for
		 * events that come in after this one.
		 */
	} else {

		/*
		 * The slow path!
		 * Was interrupted between A and C.
		 */

This is the place that there's a bug. We currently have:

		after = write_stamp
		ts = read current timestamp

 /*F*/		if (write == current position on the ring buffer &&
		    after < ts && cmpxchg(write_stamp, after, ts)) {

			delta = ts - after;

		} else {
			delta = 0;
		}

The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.

But this may not be the case:

If a task context event was interrupted by softirq between B and C.

And the softirq wrote an event that got interrupted by a hard irq between
C and E.

and the hard irq wrote an event (does not need to be interrupted)

We have:

 /*B*/ before_stamp = ts of normal context

   ---> interrupted by softirq

	/*B*/ before_stamp = ts of softirq context

	  ---> interrupted by hardirq

		/*B*/ before_stamp = ts of hard irq context
		/*E*/ write_stamp = ts of hard irq context

		/* matches and write_stamp valid */
	  <----

	/*E*/ write_stamp = ts of softirq context

	/* No longer matches before_stamp, write_stamp is not valid! */

   <---

 w != write - length, go to slow path

// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//

 after = write_stamp (this is the ts of softirq)
 ts = read current timestamp

 if (write == current position on the ring buffer [true] &&
     after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {

	delta = ts - after  [Wrong!]

The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two. This
will shift all the remaining event timestamps on the sub-buffer
incorrectly.

The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.

Instead, the following logic can be done to fix this:

	before = before_stamp
	ts = read current timestamp
	before_stamp = ts

	after = write_stamp

	if (write == current position on the ring buffer &&
	    after == before && after < ts) {

		delta = ts - after

	} else {
		delta = 0;
	}

The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.

As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!

This mean the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.

Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/

Cc: stable@vger.kernel.org
Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 79 ++++++++++++--------------------------
 1 file changed, 24 insertions(+), 55 deletions(-)

Comments

Masami Hiramatsu (Google) Dec. 19, 2023, 2:37 p.m. UTC | #1
On Mon, 18 Dec 2023 23:07:12 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> To synchronize the timestamps with the ring buffer reservation, there are
> two timestamps that are saved in the buffer meta data.
> 
> 1. before_stamp
> 2. write_stamp
> 
> When the two are equal, the write_stamp is considered valid, as in, it may
> be used to calculate the delta of the next event as the write_stamp is the
> timestamp of the previous reserved event on the buffer.
> 
> This is done by the following:
> 
>  /*A*/	w = current position on the ring buffer
> 	before = before_stamp
> 	after = write_stamp
> 	ts = read current timestamp
> 
> 	if (before != after) {
> 		write_stamp is not valid, force adding an absolute
> 		timestamp.

(additional comment)
This happens if this caller interrupts another reservation process's B to E.
(thus the original one only updates "before_stamp", but "write_stamp" is old.)

> 	}
> 
>  /*B*/	before_stamp = ts
> 
>  /*C*/	write = local_add_return(event length, position on ring buffer)
> 
> 	if (w == write - event length) {
> 		/* Nothing interrupted between A and C */
>  /*E*/		write_stamp = ts;
> 		delta = ts - after
> 		/*
> 		 * If nothing interrupted again,
> 		 * before_stamp == write_stamp and write_stamp
> 		 * can be used to calculate the delta for
> 		 * events that come in after this one.
> 		 */

(additional comment)
If something interrupts between C and E, the write_stamp goes backward
because interrupted one updates the before_stamp and write_stamp with
new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts).
Thus anyway the next entry will use absolute time stamp forcibly.

> 	} else {
> 
> 		/*
> 		 * The slow path!
> 		 * Was interrupted between A and C.
> 		 */
> 
> This is the place that there's a bug. We currently have:
> 
> 		after = write_stamp
> 		ts = read current timestamp
> 
>  /*F*/		if (write == current position on the ring buffer &&
> 		    after < ts && cmpxchg(write_stamp, after, ts)) {
> 
> 			delta = ts - after;
> 
> 		} else {
> 			delta = 0;
> 		}
> 
> The assumption is that if the current position on the ring buffer hasn't
> moved between C and F, then it also was not interrupted, and that the last
> event written has a timestamp that matches the write_stamp. That is the
> write_stamp is valid.
> 
> But this may not be the case:
> 
> If a task context event was interrupted by softirq between B and C.
> 
> And the softirq wrote an event that got interrupted by a hard irq between
> C and E.
> 
> and the hard irq wrote an event (does not need to be interrupted)
> 
> We have:
> 
>  /*B*/ before_stamp = ts of normal context
> 
>    ---> interrupted by softirq
> 
> 	/*B*/ before_stamp = ts of softirq context
> 
> 	  ---> interrupted by hardirq
> 
> 		/*B*/ before_stamp = ts of hard irq context
> 		/*E*/ write_stamp = ts of hard irq context
> 
> 		/* matches and write_stamp valid */
> 	  <----
> 
> 	/*E*/ write_stamp = ts of softirq context
> 
> 	/* No longer matches before_stamp, write_stamp is not valid! */
> 
>    <---
> 
>  w != write - length, go to slow path
> 
> // Right now the order of events in the ring buffer is:
> //
> // |-- softirq event --|-- hard irq event --|-- normal context event --|
> //
> 
>  after = write_stamp (this is the ts of softirq)
>  ts = read current timestamp
> 
>  if (write == current position on the ring buffer [true] &&
>      after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
> 
> 	delta = ts - after  [Wrong!]
> 
> The delta is to be between the hard irq event and the normal context
> event, but the above logic made the delta between the softirq event and
> the normal context event, where the hard irq event is between the two. This
> will shift all the remaining event timestamps on the sub-buffer
> incorrectly.
> 
> The write_stamp is only valid if it matches the before_stamp. The cmpxchg
> does nothing to help this.

That's right. Even if someone interrupts between A and C, we can write
the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new)
so the next entry will forcibly record the absolute timestamp.
In this case, what we need to do here is using the absolute timestamp instead
of delta.

> 
> Instead, the following logic can be done to fix this:
> 
> 	before = before_stamp
> 	ts = read current timestamp
> 	before_stamp = ts
> 
> 	after = write_stamp
> 
> 	if (write == current position on the ring buffer &&
> 	    after == before && after < ts) {
> 
> 		delta = ts - after
> 
> 	} else {
> 		delta = 0;
> 	}

Ah, this is a good idea. Instead of using the old timestamp, use
delta = 0, thus it will reuse the interrupted timestamp if someone
interrupts between A and C.

Yeah the above works, but my question is, do we really need this
really slow path? I mean;

> 	if (w == write - event length) {
> 		/* Nothing interrupted between A and C */
>  /*E*/		write_stamp = ts;
> 		delta = ts - after

	} else {
		/*
		  Something interrupted between A and C, which should record
		  a new entry before this reserved entry with newer timestamp.
		  we reuse it.
		 */
	 	ts = after = write_stamp;
		delta = 0;
	}

Isn't this enough?

Thank you,

> 
> The above will only use the write_stamp if it still matches before_stamp
> and was tested to not have changed since C.
> 
> As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
> 
> This mean the 32-bit rb_time_t workaround can finally be removed. But
> that's for a later time.
> 
> Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
> 
> Cc: stable@vger.kernel.org
> Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 79 ++++++++++++--------------------------
>  1 file changed, 24 insertions(+), 55 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5a114e752f11..e7055f52afe0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
>  	return local_try_cmpxchg(l, &expect, set);
>  }
>  
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -	unsigned long cnt, top, bottom, msb;
> -	unsigned long cnt2, top2, bottom2, msb2;
> -	u64 val;
> -
> -	/* Any interruptions in this function should cause a failure */
> -	cnt = local_read(&t->cnt);
> -
> -	/* The cmpxchg always fails if it interrupted an update */
> -	 if (!__rb_time_read(t, &val, &cnt2))
> -		 return false;
> -
> -	 if (val != expect)
> -		 return false;
> -
> -	 if ((cnt & 3) != cnt2)
> -		 return false;
> -
> -	 cnt2 = cnt + 1;
> -
> -	 rb_time_split(val, &top, &bottom, &msb);
> -	 msb = rb_time_val_cnt(msb, cnt);
> -	 top = rb_time_val_cnt(top, cnt);
> -	 bottom = rb_time_val_cnt(bottom, cnt);
> -
> -	 rb_time_split(set, &top2, &bottom2, &msb2);
> -	 msb2 = rb_time_val_cnt(msb2, cnt);
> -	 top2 = rb_time_val_cnt(top2, cnt2);
> -	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
> -
> -	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->top, top, top2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
> -		return false;
> -	return true;
> -}
> -
>  #else /* 64 bits */
>  
>  /* local64_t always succeeds */
> @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val)
>  {
>  	local64_set(&t->time, val);
>  }
> -
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -	return local64_try_cmpxchg(&t->time, &expect, set);
> -}
>  #endif
>  
>  /*
> @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
>  	} else {
>  		u64 ts;
>  		/* SLOW PATH - Interrupted between A and C */
> -		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> -		/* Was interrupted before here, write_stamp must be valid */
> +
> +		/* Save the old before_stamp */
> +		a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
>  		RB_WARN_ON(cpu_buffer, !a_ok);
> +
> +		/*
> +		 * Read a new timestamp and update the before_stamp to make
> +		 * the next event after this one force using an absolute
> +		 * timestamp. This is in case an interrupt were to come in
> +		 * between E and F.
> +		 */
>  		ts = rb_time_stamp(cpu_buffer->buffer);
> +		rb_time_set(&cpu_buffer->before_stamp, ts);
> +
> +		barrier();
> + /*E*/		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> +		/* Was interrupted before here, write_stamp must be valid */
> +		RB_WARN_ON(cpu_buffer, !a_ok);
>  		barrier();
> - /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> -		    info->after < ts &&
> -		    rb_time_cmpxchg(&cpu_buffer->write_stamp,
> -				    info->after, ts)) {
> -			/* Nothing came after this event between C and E */
> + /*F*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> +		    info->after == info->before && info->after < ts) {
> +			/*
> +			 * Nothing came after this event between C and F, it is
> +			 * safe to use info->after for the delta as it
> +			 * matched info->before and is still valid.
> +			 */
>  			info->delta = ts - info->after;
>  		} else {
>  			/*
> -			 * Interrupted between C and E:
> +			 * Interrupted between C and F:
>  			 * Lost the previous events time stamp. Just set the
>  			 * delta to zero, and this will be the same time as
>  			 * the event this event interrupted. And the events that
> -- 
> 2.42.0
>
Steven Rostedt Dec. 19, 2023, 3:10 p.m. UTC | #2
On Tue, 19 Dec 2023 23:37:10 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Yeah the above works, but my question is, do we really need this
> really slow path? I mean;
> 
> > 	if (w == write - event length) {
> > 		/* Nothing interrupted between A and C */
> >  /*E*/		write_stamp = ts;
> > 		delta = ts - after  
> 
> 	} else {
> 		/*
> 		  Something interrupted between A and C, which should record
> 		  a new entry before this reserved entry with newer timestamp.
> 		  we reuse it.
> 		 */
> 	 	ts = after = write_stamp;
> 		delta = 0;
> 	}
> 
> Isn't this enough?

I really like to avoid: delta = 0 when possible. It's basically what I do
when I have no other options. Why?

Because let's just say you are looking at the time of interrupt events. If
you just trace the entry of the interrupt, and that interrupt interrupted
an event being written to, we have this:

Time starts at ts 1000, and we are able to calculate the delta of the
interrupted event. And the trace will have:

 1000 - interrupt event
 2000 - normal context event
 2100 - next normal context event

Where we see the delta between the interrupt event and the normal context
event was 1000. But if we just had it be delta = 0, it would be:

 1000 - interrupt event
 1000 - normal context event
 2100 - next normal context event

It will look like the time between the interrupt event and the normal
context event was instant, and the time between the normal context event
and the next normal context event was 1100 when in reality it was just 100.

The above scenario is rather common. Perhaps it happens 1% of the time. The
case where we currently have delta = 0 only happens when the same event
gets interrupted twice. That is, two separate interrupts came in, one
before it allocated its space on the buffer, and one after it allocated.
That's a much more race occurrence (0.01% possibly, or less). In fact my
traces seldom even show it. Most of the time, even when doing function
tracing, I have a hard time seeing this rare race.

So, if we can have delta=0 only 0.01% or less of the time, I rather do that
then have it be delta=0 1% of the time.

Thanks for the review!

-- Steve
Steven Rostedt Dec. 19, 2023, 3:36 p.m. UTC | #3
On Tue, 19 Dec 2023 10:10:27 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

>  1000 - interrupt event
>  2000 - normal context event
>  2100 - next normal context event
> 
> Where we see the delta between the interrupt event and the normal context
> event was 1000. But if we just had it be delta = 0, it would be:
> 
>  1000 - interrupt event
>  1000 - normal context event
>  2100 - next normal context event
> 

Also, let me show the rare case of where we do have delta=0. That would be
an interrupt event coming in before *and* after the normal context event
was allocated. That is, we have:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

And that could look like:

  1000 - interrupt event
  1000 - normal context event
  2100 - next interrupt event

It may look like the normal context ran for 1000 before the next interrupt
event, but in reality it happened sometime between the two.

Hmm, in this case, what we could possibly do is to read the absolute
timestamp *after* the interrupted event!

That is, we detected we are here:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

                       ^
                       |
                       |


And we know that the write pointer is past this event. We can look at the
next event and see what its timestamp is. It is either a delta or a
absolute timestamp. It is a delta if the previous interrupt came in between
B and C, as that event would have made before_stamp == write_stamp. Or it
is an absolute delta if this event updated the before_stamp making it no
longer match the write_stamp of the previous event.

If it is an absolute timestamp, it means the interrupt came in before our
update of the before stamp, and we could make our delta:

	absolute value of next timestamp - this event before_stamp

As we have:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

         ^                            ^                     ^
         |                            |                     |
    ts is before before_stamp         |                     |
                               our before_stamp             |
                                                    absolute value

We just need to make our delta not go beyond the absolute value. So:

  ts of first event + (absolute value - our before_stamp)

Should not be greater than the absolute value.

If the next event has a delta, we could make this event delta equal to it,
and then update the next event to have a delta of zero, which would give us:

  1000 - interrupt event
  2100 - normal context event
  2100 - next interrupt event

Which is much more realistic to what happened.

But all this is for another time ;-)

-- Steve
Steven Rostedt Dec. 19, 2023, 3:53 p.m. UTC | #4
On Tue, 19 Dec 2023 10:36:13 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

>  |-- interrupt event --|-- normal context event --|-- interrupt event --|
> 
>          ^                            ^                     ^
>          |                            |                     |
>     ts is before before_stamp         |                     |
>                                our before_stamp             |
>                                                     absolute value
> 
> We just need to make our delta not go beyond the absolute value. So:
> 
>   ts of first event + (absolute value - our before_stamp)
> 
> Should not be greater than the absolute value.

Hmm, this isn't good enough either. Because we could have had the interrupt
happen *after* the before_stamp update, and leave the write_stamp not equal
to the before_stamp (like the original patch is fixing).

That is, in the case of finding an absolute value, there's still no way to
know what delta to use without walking the sub-buffer. Hmm, since this is
so rare, we could do that :-/

Anyway, for now, it's just going to be delta = 0, and we could do the
sub-buffer walk if it becomes an issue (which it hasn't for the last decade
or more).

-- Steve
Masami Hiramatsu (Google) Dec. 19, 2023, 11:52 p.m. UTC | #5
On Tue, 19 Dec 2023 10:10:27 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 19 Dec 2023 23:37:10 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > Yeah the above works, but my question is, do we really need this
> > really slow path? I mean;
> > 
> > > 	if (w == write - event length) {
> > > 		/* Nothing interrupted between A and C */
> > >  /*E*/		write_stamp = ts;
> > > 		delta = ts - after  
> > 
> > 	} else {
> > 		/*
> > 		  Something interrupted between A and C, which should record
> > 		  a new entry before this reserved entry with newer timestamp.
> > 		  we reuse it.
> > 		 */
> > 	 	ts = after = write_stamp;
> > 		delta = 0;
> > 	}
> > 
> > Isn't this enough?
> 
> I really like to avoid: delta = 0 when possible. It's basically what I do
> when I have no other options. Why?
> 
> Because let's just say you are looking at the time of interrupt events. If
> you just trace the entry of the interrupt, and that interrupt interrupted
> an event being written to, we have this:
> 
> Time starts at ts 1000, and we are able to calculate the delta of the
> interrupted event. And the trace will have:
> 
>  1000 - interrupt event
>  2000 - normal context event
>  2100 - next normal context event
> 
> Where we see the delta between the interrupt event and the normal context
> event was 1000. But if we just had it be delta = 0, it would be:
> 
>  1000 - interrupt event
>  1000 - normal context event
>  2100 - next normal context event
> 
> It will look like the time between the interrupt event and the normal
> context event was instant, and the time between the normal context event
> and the next normal context event was 1100 when in reality it was just 100.

Ah, OK. interrupt event can be the beginning of the interrupt handling
and normal context event seems not interrupted from the traced log.
OK, then we have to adjust the ts of normal context event. (interrupted
after reserving the event is OK because user can observe the interrupt
event on the log between normal context event and next one.)

Reveiewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thank you!

> 
> The above scenario is rather common. Perhaps it happens 1% of the time. The
> case where we currently have delta = 0 only happens when the same event
> gets interrupted twice. That is, two separate interrupts came in, one
> before it allocated its space on the buffer, and one after it allocated.
> That's a much more race occurrence (0.01% possibly, or less). In fact my
> traces seldom even show it. Most of the time, even when doing function
> tracing, I have a hard time seeing this rare race.
> 
> So, if we can have delta=0 only 0.01% or less of the time, I rather do that
> then have it be delta=0 1% of the time.
> 
> Thanks for the review!
> 
> -- Steve
diff mbox series

Patch

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5a114e752f11..e7055f52afe0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -700,48 +700,6 @@  rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
 	return local_try_cmpxchg(l, &expect, set);
 }
 
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
-	unsigned long cnt, top, bottom, msb;
-	unsigned long cnt2, top2, bottom2, msb2;
-	u64 val;
-
-	/* Any interruptions in this function should cause a failure */
-	cnt = local_read(&t->cnt);
-
-	/* The cmpxchg always fails if it interrupted an update */
-	 if (!__rb_time_read(t, &val, &cnt2))
-		 return false;
-
-	 if (val != expect)
-		 return false;
-
-	 if ((cnt & 3) != cnt2)
-		 return false;
-
-	 cnt2 = cnt + 1;
-
-	 rb_time_split(val, &top, &bottom, &msb);
-	 msb = rb_time_val_cnt(msb, cnt);
-	 top = rb_time_val_cnt(top, cnt);
-	 bottom = rb_time_val_cnt(bottom, cnt);
-
-	 rb_time_split(set, &top2, &bottom2, &msb2);
-	 msb2 = rb_time_val_cnt(msb2, cnt);
-	 top2 = rb_time_val_cnt(top2, cnt2);
-	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
-
-	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->top, top, top2))
-		return false;
-	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
-		return false;
-	return true;
-}
-
 #else /* 64 bits */
 
 /* local64_t always succeeds */
@@ -755,11 +713,6 @@  static void rb_time_set(rb_time_t *t, u64 val)
 {
 	local64_set(&t->time, val);
 }
-
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
-	return local64_try_cmpxchg(&t->time, &expect, set);
-}
 #endif
 
 /*
@@ -3610,20 +3563,36 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	} else {
 		u64 ts;
 		/* SLOW PATH - Interrupted between A and C */
-		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-		/* Was interrupted before here, write_stamp must be valid */
+
+		/* Save the old before_stamp */
+		a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
 		RB_WARN_ON(cpu_buffer, !a_ok);
+
+		/*
+		 * Read a new timestamp and update the before_stamp to make
+		 * the next event after this one force using an absolute
+		 * timestamp. This is in case an interrupt were to come in
+		 * between E and F.
+		 */
 		ts = rb_time_stamp(cpu_buffer->buffer);
+		rb_time_set(&cpu_buffer->before_stamp, ts);
+
+		barrier();
+ /*E*/		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
+		/* Was interrupted before here, write_stamp must be valid */
+		RB_WARN_ON(cpu_buffer, !a_ok);
 		barrier();
- /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
-		    info->after < ts &&
-		    rb_time_cmpxchg(&cpu_buffer->write_stamp,
-				    info->after, ts)) {
-			/* Nothing came after this event between C and E */
+ /*F*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
+		    info->after == info->before && info->after < ts) {
+			/*
+			 * Nothing came after this event between C and F, it is
+			 * safe to use info->after for the delta as it
+			 * matched info->before and is still valid.
+			 */
 			info->delta = ts - info->after;
 		} else {
 			/*
-			 * Interrupted between C and E:
+			 * Interrupted between C and F:
 			 * Lost the previous events time stamp. Just set the
 			 * delta to zero, and this will be the same time as
 			 * the event this event interrupted. And the events that