diff mbox series

ring-buffer: Simplify reservation with try_cmpxchg() loop

Message ID 20240118181206.4977da2f@gandalf.local.home (mailing list archive)
State Superseded
Headers show
Series ring-buffer: Simplify reservation with try_cmpxchg() loop | expand

Commit Message

Steven Rostedt Jan. 18, 2024, 11:12 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Instead of using local_add_return() to reserve the ring buffer data,
Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
reservation with the time keeping code.

Although, it does not get rid of the double time stamps (before_stamp and
write_stamp), using cmpxchg() does get rid of the more complex case when
an interrupting event occurs between getting the timestamps and reserving
the data, as when that happens, it just tries again instead of dealing
with it.

Before we had:

	w = local_read(&tail_page->write);
	/* get time stamps */
	write = local_add_return(length, &tail_page->write);
	if (write - length == w) {
		/* do simple case */
	} else {
		/* do complex case */
	}

By switching the local_add_return() to a local_try_cmpxchg() it can now be:

	 w = local_read(&tail_page->write);
 again:
	/* get time stamps */
	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
		goto again;

	 /* do simple case */

The benchmarks between the two showed no regressions when trying this:

 Enable: CONFIG_TRACEPOINT_BENCHMARK

 # trace-cmd record -m 800000 -e benchmark sleep 60

Before the patch:

 # trace-cmd report | tail
 event_benchmark-944   [003]  1998.910191: benchmark_event:      last=150 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=150
 event_benchmark-944   [003]  1998.910192: benchmark_event:      last=149 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=149
 event_benchmark-944   [003]  1998.910193: benchmark_event:      last=150 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=150
 event_benchmark-944   [003]  1998.910193: benchmark_event:      last=150 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=150
 event_benchmark-944   [003]  1998.910194: benchmark_event:      last=136 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=136
 event_benchmark-944   [003]  1998.910194: benchmark_event:      last=138 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=138
 event_benchmark-944   [003]  1998.910195: benchmark_event:      last=150 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=150
 event_benchmark-944   [003]  1998.910196: benchmark_event:      last=151 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=151
 event_benchmark-944   [003]  1998.910196: benchmark_event:      last=150 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=150
 event_benchmark-944   [003]  1998.910197: benchmark_event:      last=152 first=3488 max=1199686 min=124 avg=208 std=39 std^2=1579 delta=152

After the patch:

 # trace-cmd report | tail
 event_benchmark-848   [004]   171.414716: benchmark_event:      last=143 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=143
 event_benchmark-848   [004]   171.414717: benchmark_event:      last=142 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=142
 event_benchmark-848   [004]   171.414718: benchmark_event:      last=142 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=142
 event_benchmark-848   [004]   171.414718: benchmark_event:      last=141 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=141
 event_benchmark-848   [004]   171.414719: benchmark_event:      last=141 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=141
 event_benchmark-848   [004]   171.414719: benchmark_event:      last=141 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=141
 event_benchmark-848   [004]   171.414720: benchmark_event:      last=140 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=140
 event_benchmark-848   [004]   171.414721: benchmark_event:      last=142 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=142
 event_benchmark-848   [004]   171.414721: benchmark_event:      last=145 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=145
 event_benchmark-848   [004]   171.414722: benchmark_event:      last=144 first=14483 max=1155491 min=125 avg=189 std=16 std^2=264 delta=144

It may have even improved!

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 100 ++++++++++++-------------------------
 1 file changed, 31 insertions(+), 69 deletions(-)

Comments

Mathieu Desnoyers Jan. 19, 2024, 2:40 p.m. UTC | #1
On 2024-01-18 18:12, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> Instead of using local_add_return() to reserve the ring buffer data,
> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> reservation with the time keeping code.

I admire the effort of trying to simplify the Ftrace ring buffer by bringing
over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem,
I certainly welcome the simplifications.

> Although, it does not get rid of the double time stamps (before_stamp and
> write_stamp), using cmpxchg() does get rid of the more complex case when
> an interrupting event occurs between getting the timestamps and reserving
> the data, as when that happens, it just tries again instead of dealing
> with it.

I understand that the reason why you need the before/after stamps and their
associated complexity is because the Ftrace ring buffer ABI encodes event
timestamps as delta from the previous event within the buffer as a mean of
compressing the timestamp fields. If the delta cannot be represented in a
given number of bits, then it inserts a 64-bit timestamp (not sure if that
one is absolute or a delta from previous event).

This timestamp encoding as delta between events introduce a strong
inter-dependency between consecutive (nested) events, and is the reason
why you are stuck with all this timestamp before/after complexity.

The Common Trace Format specifies (and LTTng implements) a different way
to achieve the same ring buffer space-savings achieved with timestamp deltas
while keeping the timestamps semantically absolute from a given reference,
hence without all the before/after timestamp complexity. You can see the
clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic
idea on the producer side is to record the low-order bits of the current
timestamp in the event header (truncating the higher order bits), and
fall back on a full 64-bit value if the number of low-order bits overflows
from the previous timestamp is more than 1, or if it is impossible to figure
out precisely the timestamp of the previous event due to a race. This achieves
the same space savings as delta timestamp encoding without introducing the
strong event inter-dependency.

The fact that Ftrace exposes this ring buffer binary layout as a user-space
ABI makes it tricky to move to the Common Trace Format timestamp encoding.
There are clearly huge simplifications that could be made by moving to this
scheme though. Is there any way to introduce a different timestamp encoding
scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
introduce this simpler scheme and gradually phase out the more complex delta
encoding when no users are left.

Thoughts ?

Thanks,

Mathieu

[1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update
Steven Rostedt Jan. 19, 2024, 3:37 p.m. UTC | #2
On Fri, 19 Jan 2024 09:40:27 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2024-01-18 18:12, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> > 
> > Instead of using local_add_return() to reserve the ring buffer data,
> > Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> > reservation with the time keeping code.  
> 
> I admire the effort of trying to simplify the Ftrace ring buffer by bringing
> over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem,
> I certainly welcome the simplifications.
> 

The idea itself wasn't new. It was you convincing me that
local_add_return() is no faster than local_cmpxchg(). As I would have done
it this way from the start if I wasn't dead set against adding any new
cmpxchg() in the fast path.

Testing showed that local_add_return() is pretty much just as bad, so the
added complexity to avoid just slapping in a cmpxchg() was useless.

> > Although, it does not get rid of the double time stamps (before_stamp and
> > write_stamp), using cmpxchg() does get rid of the more complex case when
> > an interrupting event occurs between getting the timestamps and reserving
> > the data, as when that happens, it just tries again instead of dealing
> > with it.  
> 
> I understand that the reason why you need the before/after stamps and their
> associated complexity is because the Ftrace ring buffer ABI encodes event
> timestamps as delta from the previous event within the buffer as a mean of
> compressing the timestamp fields. If the delta cannot be represented in a
> given number of bits, then it inserts a 64-bit timestamp (not sure if that
> one is absolute or a delta from previous event).

There's both. An extended timestamp, which is added when the delta is too
big, and that too is just a delta from the previous event. And there is the
absolute timestamp as well. I could always just use the absolute one. That
event came much later.

> 
> This timestamp encoding as delta between events introduce a strong
> inter-dependency between consecutive (nested) events, and is the reason
> why you are stuck with all this timestamp before/after complexity.
> 
> The Common Trace Format specifies (and LTTng implements) a different way
> to achieve the same ring buffer space-savings achieved with timestamp deltas
> while keeping the timestamps semantically absolute from a given reference,
> hence without all the before/after timestamp complexity. You can see the
> clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic

That points to this:

---------------------8<-------------------------
6.3. Clock value update procedure
To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

Let L be an unsigned integer initialized to, depending on the type property of C:

"fixed-length-unsigned-integer"
The value of the length property of C.

"variable-length-unsigned-integer"
S ×7, where S is the number of bytes which F occupies with the data stream.

Let MASK be an unsigned integer initialized to 2L − 1.

Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where “&” is the bitwise AND operator and “~” is the bitwise NOT operator.

Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where “&” is the bitwise AND operator.

Set DEF_CLK_VAL to:

If V ≥ CUR
H + V

Else
H + MASK + 1 + V
--------------------->8-------------------------

There's a lot of missing context there, so I don't see how it relates.


> idea on the producer side is to record the low-order bits of the current
> timestamp in the event header (truncating the higher order bits), and
> fall back on a full 64-bit value if the number of low-order bits overflows
> from the previous timestamp is more than 1, or if it is impossible to figure
> out precisely the timestamp of the previous event due to a race. This achieves
> the same space savings as delta timestamp encoding without introducing the
> strong event inter-dependency.

So when an overflow happens, you just insert a timestamp, and then events
after that is based on that?

> 
> The fact that Ftrace exposes this ring buffer binary layout as a user-space
> ABI makes it tricky to move to the Common Trace Format timestamp encoding.
> There are clearly huge simplifications that could be made by moving to this
> scheme though. Is there any way to introduce a different timestamp encoding
> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
> introduce this simpler scheme and gradually phase out the more complex delta
> encoding when no users are left.

I'm not sure if there's a path forward. The infrastructure can easily swap
in and out a new implementation. That is, there's not much dependency on
the way the ring buffer works outside the ring buffer itself.

If we were to change the layout, it would likely require a new interface
file to read. The trace_pipe_raw is the only file that exposes the current
ring buffer. We could create a trace_out_raw or some other named file that
has a completely different API and it wouldn't break any existing API.

Although, if we want to change the "default" way, it may need some other
knobs or something, which wouldn't be hard.

Now I have to ask, what's the motivation for this. The code isn't that
complex anymore. Yes it still has the before/after timestamps, but the
most complexity in that code was due to what happens in the race of
updating the reserved data. But that's no longer the case with the
cmpxchg(). If you look at this patch, that entire else statement was
deleted. And that deleted code was what made me sick to my stomach ;-)
Good riddance!

-- Steve
Mathieu Desnoyers Jan. 19, 2024, 8:56 p.m. UTC | #3
On 2024-01-19 10:37, Steven Rostedt wrote:
> On Fri, 19 Jan 2024 09:40:27 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2024-01-18 18:12, Steven Rostedt wrote:
>>> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>>>

[...]

>>> Although, it does not get rid of the double time stamps (before_stamp and
>>> write_stamp), using cmpxchg() does get rid of the more complex case when
>>> an interrupting event occurs between getting the timestamps and reserving
>>> the data, as when that happens, it just tries again instead of dealing
>>> with it.
>>
>> I understand that the reason why you need the before/after stamps and their
>> associated complexity is because the Ftrace ring buffer ABI encodes event
>> timestamps as delta from the previous event within the buffer as a mean of
>> compressing the timestamp fields. If the delta cannot be represented in a
>> given number of bits, then it inserts a 64-bit timestamp (not sure if that
>> one is absolute or a delta from previous event).
> 
> There's both. An extended timestamp, which is added when the delta is too
> big, and that too is just a delta from the previous event. And there is the
> absolute timestamp as well. I could always just use the absolute one. That
> event came much later.

OK

> 
>>
>> This timestamp encoding as delta between events introduce a strong
>> inter-dependency between consecutive (nested) events, and is the reason
>> why you are stuck with all this timestamp before/after complexity.
>>
>> The Common Trace Format specifies (and LTTng implements) a different way
>> to achieve the same ring buffer space-savings achieved with timestamp deltas
>> while keeping the timestamps semantically absolute from a given reference,
>> hence without all the before/after timestamp complexity. You can see the
>> clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic
> 
> That points to this:
> 
> ---------------------8<-------------------------
> 6.3. Clock value update procedure
> To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:
> 
> Let L be an unsigned integer initialized to, depending on the type property of C:
> 
> "fixed-length-unsigned-integer"
> The value of the length property of C.
> 
> "variable-length-unsigned-integer"
> S ×7, where S is the number of bytes which F occupies with the data stream.
> 
> Let MASK be an unsigned integer initialized to 2L − 1.
> 
> Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where “&” is the bitwise AND operator and “~” is the bitwise NOT operator.
> 
> Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where “&” is the bitwise AND operator.
> 
> Set DEF_CLK_VAL to:
> 
> If V ≥ CUR
> H + V
> 
> Else
> H + MASK + 1 + V
> --------------------->8-------------------------
> 
> There's a lot of missing context there, so I don't see how it relates.

This explains how the "current time" is reconstructed by a trace reader
when loading an event header timestamp field. But for the sake of this
discussion we can focus on the less formal explanation of how the tracer
generates this timestamp encoding provided below.

>> idea on the producer side is to record the low-order bits of the current
>> timestamp in the event header (truncating the higher order bits), and
>> fall back on a full 64-bit value if the number of low-order bits overflows
>> from the previous timestamp is more than 1, or if it is impossible to figure
>> out precisely the timestamp of the previous event due to a race. This achieves
>> the same space savings as delta timestamp encoding without introducing the
>> strong event inter-dependency.
> 
> So when an overflow happens, you just insert a timestamp, and then events
> after that is based on that?

No. Let's use an example to show how it works.

For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps
in the compact event header representation. But for the sake of making this
discussion easier, let's assume a tracer would use 16-bit for timestamps in the
compact representation.

Let's say we have the following ktime_get() values (monotonic timestamp value) for
a sequence of events:

                                   Timestamp (Hex)    Encoding in the trace

Packet header timestamp begin     0x12345678         64-bit: 0x12345678

Event 1                           0x12345678         16-bit: 0x5678
    (When decoded, same value as previous timestamp, no overflow)
Event 2                           0x12347777         16-bit: 0x7777
    (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
Event 3                           0x12350000         16-bit: 0x0000
    (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
     which allows the trace reader to reconstruct timestamp 0x12350000 from the
     previous timestamp and the 16-bit timestamp encoding.)
Event 4                           0x12370000         64-bit: 0x12370000
    (Encoding over 16-bit not possible because going from 0x12350000 to
     0x12370000 would overflow 16-bit twice, which cannot be detected
     by a trace reader. Therefore use the full 64-bit timestamp in the
     "large" event header representation.)

> 
>>
>> The fact that Ftrace exposes this ring buffer binary layout as a user-space
>> ABI makes it tricky to move to the Common Trace Format timestamp encoding.
>> There are clearly huge simplifications that could be made by moving to this
>> scheme though. Is there any way to introduce a different timestamp encoding
>> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
>> introduce this simpler scheme and gradually phase out the more complex delta
>> encoding when no users are left.
> 
> I'm not sure if there's a path forward. The infrastructure can easily swap
> in and out a new implementation. That is, there's not much dependency on
> the way the ring buffer works outside the ring buffer itself.
> 
> If we were to change the layout, it would likely require a new interface
> file to read. The trace_pipe_raw is the only file that exposes the current
> ring buffer. We could create a trace_out_raw or some other named file that
> has a completely different API and it wouldn't break any existing API.

Or introduce "trace_pipe_raw2" or some kind of versioned file names as new
ABIs.

> Although, if we want to change the "default" way, it may need some other
> knobs or something, which wouldn't be hard.

The delta-timestamp-encoding would have to stay there for a while as long
as users have not switched over to trace_pipe_raw2. Then when it's really
gone, the trace_pipe_raw could either go away or return an error when
opened.

> Now I have to ask, what's the motivation for this. The code isn't that
> complex anymore. Yes it still has the before/after timestamps, but the
> most complexity in that code was due to what happens in the race of
> updating the reserved data. But that's no longer the case with the
> cmpxchg(). If you look at this patch, that entire else statement was
> deleted. And that deleted code was what made me sick to my stomach ;-)
> Good riddance!

The motivation for this would be to further simplify the implementation
of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and
remove a few useless loads, stores, and conditional branches on the fast-path
of __rb_reserve_next(). This would turn the before/after timestamp
stores/loads/comparisons into a simple "last timestamp" field and a comparison
of the current timestamp against the "last timestamp" value to figure out
whether the compact representation can be used.

I don't know whether it's worth the trouble or not, it's really up to you. :)

Thanks,

Mathieu
Steven Rostedt Jan. 19, 2024, 9:42 p.m. UTC | #4
On Fri, 19 Jan 2024 15:56:21 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > So when an overflow happens, you just insert a timestamp, and then events
> > after that is based on that?  
> 
> No. Let's use an example to show how it works.
> 
> For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps
> in the compact event header representation. But for the sake of making this
> discussion easier, let's assume a tracer would use 16-bit for timestamps in the
> compact representation.
> 
> Let's say we have the following ktime_get() values (monotonic timestamp value) for
> a sequence of events:
> 
>                                    Timestamp (Hex)    Encoding in the trace
> 
> Packet header timestamp begin     0x12345678         64-bit: 0x12345678
> 
> Event 1                           0x12345678         16-bit: 0x5678
>     (When decoded, same value as previous timestamp, no overflow)
> Event 2                           0x12347777         16-bit: 0x7777
>     (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
> Event 3                           0x12350000         16-bit: 0x0000
>     (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
>      which allows the trace reader to reconstruct timestamp 0x12350000 from the
>      previous timestamp and the 16-bit timestamp encoding.)
> Event 4                           0x12370000         64-bit: 0x12370000
>     (Encoding over 16-bit not possible because going from 0x12350000 to
>      0x12370000 would overflow 16-bit twice, which cannot be detected
>      by a trace reader. Therefore use the full 64-bit timestamp in the
>      "large" event header representation.)

I think that's basically what I said, but you are just looking at it
differently ;-) Or should I say, you are using bits for optimization.

The events are based off of the last injected timestamp. The above example,
starts with an timestamp injection into the packet header: 0x12345678, with
the lsb 16bits ignore. So in actuality, it inserts 0x12340000, plus adds a
5678 that represents the creation of the header (like we discussed about in
the last tracing meeting).

The first event has: 0x5678 which is based on the previous injected
timestamp of 0x12340000.

the events go on just using a delta from that until you see it overflow (or
too big of a delta to fit in the 16 bits), and you insert a new full
timestamps that everything will be based on:

  0x12370000

Now events following that are just a delta from that timestamp. But you
calculate the delta simply by masking out the lower bits.

But how do you detect the overflow? That last timestamps to know if the tsc
overflowed or not needs to be saved and compared. I would assume you have a
similar race that I have.

> 
> >   
> >>
> >> The fact that Ftrace exposes this ring buffer binary layout as a user-space
> >> ABI makes it tricky to move to the Common Trace Format timestamp encoding.
> >> There are clearly huge simplifications that could be made by moving to this
> >> scheme though. Is there any way to introduce a different timestamp encoding
> >> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
> >> introduce this simpler scheme and gradually phase out the more complex delta
> >> encoding when no users are left.  
> > 
> > I'm not sure if there's a path forward. The infrastructure can easily swap
> > in and out a new implementation. That is, there's not much dependency on
> > the way the ring buffer works outside the ring buffer itself.
> > 
> > If we were to change the layout, it would likely require a new interface
> > file to read. The trace_pipe_raw is the only file that exposes the current
> > ring buffer. We could create a trace_out_raw or some other named file that
> > has a completely different API and it wouldn't break any existing API.  
> 
> Or introduce "trace_pipe_raw2" or some kind of versioned file names as new
> ABIs.

I hate "version numbers" ;-)  I would rather have a meaning name instead.

"trace_pipe_raw_ts_delta" or something that gives a hint to why it's
different.

> 
> > Although, if we want to change the "default" way, it may need some other
> > knobs or something, which wouldn't be hard.  
> 
> The delta-timestamp-encoding would have to stay there for a while as long
> as users have not switched over to trace_pipe_raw2. Then when it's really
> gone, the trace_pipe_raw could either go away or return an error when
> opened.
> 
> > Now I have to ask, what's the motivation for this. The code isn't that
> > complex anymore. Yes it still has the before/after timestamps, but the
> > most complexity in that code was due to what happens in the race of
> > updating the reserved data. But that's no longer the case with the
> > cmpxchg(). If you look at this patch, that entire else statement was
> > deleted. And that deleted code was what made me sick to my stomach ;-)
> > Good riddance!  
> 
> The motivation for this would be to further simplify the implementation
> of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and
> remove a few useless loads, stores, and conditional branches on the fast-path
> of __rb_reserve_next(). This would turn the before/after timestamp
> stores/loads/comparisons into a simple "last timestamp" field and a comparison
> of the current timestamp against the "last timestamp" value to figure out
> whether the compact representation can be used.
> 
> I don't know whether it's worth the trouble or not, it's really up to you. :)

Yeah, making such a change would require an analysis to know if it would
speed things up enough to make a significant difference to do that effort.

-- Steve
Mathieu Desnoyers Jan. 20, 2024, 1:49 a.m. UTC | #5
On 2024-01-19 16:42, Steven Rostedt wrote:
> On Fri, 19 Jan 2024 15:56:21 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>> So when an overflow happens, you just insert a timestamp, and then events
>>> after that is based on that?
>>
>> No. Let's use an example to show how it works.
>>
>> For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps
>> in the compact event header representation. But for the sake of making this
>> discussion easier, let's assume a tracer would use 16-bit for timestamps in the
>> compact representation.
>>
>> Let's say we have the following ktime_get() values (monotonic timestamp value) for
>> a sequence of events:
>>
>>                                     Timestamp (Hex)    Encoding in the trace
>>
>> Packet header timestamp begin     0x12345678         64-bit: 0x12345678
>>
>> Event 1                           0x12345678         16-bit: 0x5678
>>      (When decoded, same value as previous timestamp, no overflow)
>> Event 2                           0x12347777         16-bit: 0x7777
>>      (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
>> Event 3                           0x12350000         16-bit: 0x0000
>>      (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
>>       which allows the trace reader to reconstruct timestamp 0x12350000 from the
>>       previous timestamp and the 16-bit timestamp encoding.)
>> Event 4                           0x12370000         64-bit: 0x12370000
>>      (Encoding over 16-bit not possible because going from 0x12350000 to
>>       0x12370000 would overflow 16-bit twice, which cannot be detected
>>       by a trace reader. Therefore use the full 64-bit timestamp in the
>>       "large" event header representation.)
> 
> I think that's basically what I said, but you are just looking at it
> differently ;-) Or should I say, you are using bits for optimization.

Based on your explanation below, we are really talking about different things
here. Let me try to reply to your explanation to try to show where what I am
doing completely differs from what you have in mind. This will help explain
how I handle 16-bit overflow as well.

> The events are based off of the last injected timestamp.

Incorrect. There is no "injected timestamp". There is only a concept
of the "current timestamp" as we either write to or read from the
event stream. I will take the point of view of the trace reader for
the rest of the discussion.

> The above example,
> starts with an timestamp injection into the packet header: 0x12345678, with
> the lsb 16bits ignore.

Wrong again. The 16 least significant bits are not ignored. The "current timestamp"
is really 0x12345678 when the packet header is read.

> So in actuality, it inserts 0x12340000, plus adds a
> 5678 that represents the creation of the header (like we discussed about in
> the last tracing meeting).

There is no "0x12340000" reference time. There is only the actual 0x12345678
current time at packet begin, including those 16 low order bits.

> 
> The first event has: 0x5678 which is based on the previous injected
> timestamp of 0x12340000.

It is not "based" on the previous injected timestamp. It just represents
the low-order 16-bits of the timestamp at event 1. The trace readers takes
two informations to compute the complete current timestamp for event 1:

1) The current timestamp just before event 1 is read (0x12345678),
2) The low-order 16 bits of event 1 (0x5678).

Given that there is no 16-bit overflow when comparing:

0x12345678 & 0xFFFF and 0x5678

We know that the resulting current timestamp for event 1 is:

(0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678

Or basically that time did not move between the packet header and event 1.

> 
> the events go on just using a delta from that until you see it overflow (or
> too big of a delta to fit in the 16 bits), and you insert a new full
> timestamps that everything will be based on:
> 
>    0x12370000

No. The following events use the same algorithm I just described: They use
this notion of "current timestamp" and the information provided by the new
timestamp field in the event header to figure out the updated current timestamp.

It is _never_ based on some kind of arbitrary reference, it is always absolute,
and is always computed based on the current timestamp and the timestamp field
encountered.

> 
> Now events following that are just a delta from that timestamp. But you
> calculate the delta simply by masking out the lower bits.

No, again, there is no delta from arbitrary injected time. It's always
computed from this virtual "current time", which applies to an event stream.

> 
> But how do you detect the overflow? That last timestamps to know if the tsc
> overflowed or not needs to be saved and compared. I would assume you have a
> similar race that I have.

Yes, I save a "last timestamp" per buffer, but the race does not matter because
of the order in which it is saved wrt local cmpxchg updating the reserved position.

The algorithm looks like:

do {
   - read current reserved position (old pos)
   - read time
   - compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);

[A]

save_last_tsc()

If interrupted at [A] by another trace producer, it will compare with an
older "last tsc" than the tsc of the event physically located just before
the nested event. This stale "last tsc" has a value which is necessarily
lower than the one we would be saving with the save_last_tsc immediately
afterwards, which means in the worse case we end up using a full 64-bit
timestamp when in fact we could use a more compact representation. But
this race is rare and therefore it does not matter for size.

The fact that I only need this last_tsc value for the sake of optimization,
and not for computation of a time delta from a previous injected timestamp,
makes it possible to handle the race gracefully without requiring anything
more than a single last_tsc value per buffer and a single comparison for
16-bit overflow.

> 
>>
>>>    
>>>>
>>>> The fact that Ftrace exposes this ring buffer binary layout as a user-space
>>>> ABI makes it tricky to move to the Common Trace Format timestamp encoding.
>>>> There are clearly huge simplifications that could be made by moving to this
>>>> scheme though. Is there any way to introduce a different timestamp encoding
>>>> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
>>>> introduce this simpler scheme and gradually phase out the more complex delta
>>>> encoding when no users are left.
>>>
>>> I'm not sure if there's a path forward. The infrastructure can easily swap
>>> in and out a new implementation. That is, there's not much dependency on
>>> the way the ring buffer works outside the ring buffer itself.
>>>
>>> If we were to change the layout, it would likely require a new interface
>>> file to read. The trace_pipe_raw is the only file that exposes the current
>>> ring buffer. We could create a trace_out_raw or some other named file that
>>> has a completely different API and it wouldn't break any existing API.
>>
>> Or introduce "trace_pipe_raw2" or some kind of versioned file names as new
>> ABIs.
> 
> I hate "version numbers" ;-)  I would rather have a meaning name instead.
> 
> "trace_pipe_raw_ts_delta" or something that gives a hint to why it's
> different.

It's just that as it evolves and more changes are integrated, you may end
up with "trace_pipe_raw_featureA_featureB_featureC", which is also inconvenient.

> 
>>
>>> Although, if we want to change the "default" way, it may need some other
>>> knobs or something, which wouldn't be hard.
>>
>> The delta-timestamp-encoding would have to stay there for a while as long
>> as users have not switched over to trace_pipe_raw2. Then when it's really
>> gone, the trace_pipe_raw could either go away or return an error when
>> opened.
>>
>>> Now I have to ask, what's the motivation for this. The code isn't that
>>> complex anymore. Yes it still has the before/after timestamps, but the
>>> most complexity in that code was due to what happens in the race of
>>> updating the reserved data. But that's no longer the case with the
>>> cmpxchg(). If you look at this patch, that entire else statement was
>>> deleted. And that deleted code was what made me sick to my stomach ;-)
>>> Good riddance!
>>
>> The motivation for this would be to further simplify the implementation
>> of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and
>> remove a few useless loads, stores, and conditional branches on the fast-path
>> of __rb_reserve_next(). This would turn the before/after timestamp
>> stores/loads/comparisons into a simple "last timestamp" field and a comparison
>> of the current timestamp against the "last timestamp" value to figure out
>> whether the compact representation can be used.
>>
>> I don't know whether it's worth the trouble or not, it's really up to you. :)
> 
> Yeah, making such a change would require an analysis to know if it would
> speed things up enough to make a significant difference to do that effort.

I would also consider reducing code complexity as a worthwhile metric in
addition to speed. It makes it easier to extend in the future, easier to
understand for reviewers, and subtle bugs are less likely to creep in.

Thanks,

Mathieu
Steven Rostedt Jan. 20, 2024, 1:47 p.m. UTC | #6
On Fri, 19 Jan 2024 20:49:36 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >> Let's say we have the following ktime_get() values (monotonic timestamp value) for
> >> a sequence of events:
> >>
> >>                                     Timestamp (Hex)    Encoding in the trace
> >>
> >> Packet header timestamp begin     0x12345678         64-bit: 0x12345678
> >>
> >> Event 1                           0x12345678         16-bit: 0x5678
> >>      (When decoded, same value as previous timestamp, no overflow)
> >> Event 2                           0x12347777         16-bit: 0x7777
> >>      (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
> >> Event 3                           0x12350000         16-bit: 0x0000
> >>      (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
> >>       which allows the trace reader to reconstruct timestamp 0x12350000 from the
> >>       previous timestamp and the 16-bit timestamp encoding.)
> >> Event 4                           0x12370000         64-bit: 0x12370000
> >>      (Encoding over 16-bit not possible because going from 0x12350000 to
> >>       0x12370000 would overflow 16-bit twice, which cannot be detected
> >>       by a trace reader. Therefore use the full 64-bit timestamp in the
> >>       "large" event header representation.)  
> > 
> > I think that's basically what I said, but you are just looking at it
> > differently ;-) Or should I say, you are using bits for optimization.  
> 
> Based on your explanation below, we are really talking about different things
> here. Let me try to reply to your explanation to try to show where what I am
> doing completely differs from what you have in mind. This will help explain
> how I handle 16-bit overflow as well.
> 
> > The events are based off of the last injected timestamp.  
> 
> Incorrect. There is no "injected timestamp". There is only a concept
> of the "current timestamp" as we either write to or read from the
> event stream. I will take the point of view of the trace reader for
> the rest of the discussion.
> 
> > The above example,
> > starts with an timestamp injection into the packet header: 0x12345678, with
> > the lsb 16bits ignore.  
> 
> Wrong again. The 16 least significant bits are not ignored. The "current timestamp"
> is really 0x12345678 when the packet header is read.

In the packet header you have 0x12345678 in the first event you have
0x5678 how does that get you the timestamp? If that event had 0x8888,
when the reader reads this packet, it would take the header 0x12345678
chop off (ignore) the 5678, and add the 8888, right?

> 
> > So in actuality, it inserts 0x12340000, plus adds a
> > 5678 that represents the creation of the header (like we discussed about in
> > the last tracing meeting).  
> 
> There is no "0x12340000" reference time. There is only the actual 0x12345678
> current time at packet begin, including those 16 low order bits.

It could be 0x12341111 for all I care. I know you use that as the start
time for when tracing starts. But for the events in the packet (which
is what I care about) that bottom 16 bits in the packet header isn't
necessary.

> 
> > 
> > The first event has: 0x5678 which is based on the previous injected
> > timestamp of 0x12340000.  
> 
> It is not "based" on the previous injected timestamp. It just represents
> the low-order 16-bits of the timestamp at event 1. The trace readers takes
> two informations to compute the complete current timestamp for event 1:
> 
> 1) The current timestamp just before event 1 is read (0x12345678),

And where does that reader get that information from? Either the packet
header, or you need to inject a timestamp somewhere, don't you?

> 2) The low-order 16 bits of event 1 (0x5678).
> 
> Given that there is no 16-bit overflow when comparing:
> 
> 0x12345678 & 0xFFFF and 0x5678
> 
> We know that the resulting current timestamp for event 1 is:
> 
> (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678

And above you ignore the least 16 bits ;-)

> 
> Or basically that time did not move between the packet header and event 1.
> 
> > 
> > the events go on just using a delta from that until you see it overflow (or
> > too big of a delta to fit in the 16 bits), and you insert a new full
> > timestamps that everything will be based on:
> > 
> >    0x12370000  
> 
> No. The following events use the same algorithm I just described: They use

Right, you mask off the bits of the timestamp.

> this notion of "current timestamp" and the information provided by the new
> timestamp field in the event header to figure out the updated current timestamp.
> 
> It is _never_ based on some kind of arbitrary reference, it is always absolute,
> and is always computed based on the current timestamp and the timestamp field
> encountered.

And that current timestamp was created by some injected timestamp in
the writer. The reader has to get it from somewhere!

> 
> > 
> > Now events following that are just a delta from that timestamp. But
> > you calculate the delta simply by masking out the lower bits.  
> 
> No, again, there is no delta from arbitrary injected time. It's always
> computed from this virtual "current time", which applies to an event
> stream.

> (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678

Delta from arbitrary injected time of (0x12345678 & ~0xFFFF) = 0x12340000

to event "delta" 0x5678 = 0x12345678

 ;-)

You are just using different terminology for the same thing I'm talking
about.

> 
> > 
> > But how do you detect the overflow? That last timestamps to know if
> > the tsc overflowed or not needs to be saved and compared. I would
> > assume you have a similar race that I have.  
> 
> Yes, I save a "last timestamp" per buffer, but the race does not
> matter because of the order in which it is saved wrt local cmpxchg
> updating the reserved position.
> 
> The algorithm looks like:
> 
> do {
>    - read current reserved position (old pos)
>    - read time
>    - compute new reserved position (new pos)
> } while (cmpxchg(reserved pos, old pos, new pos) != old pos);
> 
> [A]
> 
> save_last_tsc()

So the last_tsc that is saved is from the timestamp read before the
cmpxchg.

> 
> If interrupted at [A] by another trace producer, it will compare with
> an older "last tsc" than the tsc of the event physically located just
> before the nested event. This stale "last tsc" has a value which is
> necessarily lower than the one we would be saving with the
> save_last_tsc immediately afterwards, which means in the worse case
> we end up using a full 64-bit timestamp when in fact we could use a
> more compact representation. But this race is rare and therefore it
> does not matter for size.

That's equivalent to me "injecting" an absolute value for the same race.

> 
> The fact that I only need this last_tsc value for the sake of
> optimization, and not for computation of a time delta from a previous
> injected timestamp, makes it possible to handle the race gracefully
> without requiring anything more than a single last_tsc value per
> buffer and a single comparison for 16-bit overflow.

If you have:

> do {
>    - read current reserved position (old pos)
>    - read time
>    - compute new reserved position (new pos)
> } while (cmpxchg(reserved pos, old pos, new pos) != old pos);
> 
> [A]

And here you have an interrupt that runs for several minutes (yeah bad
code ;-) And it does the same thing

do {
   - read current reserved position (old pos)
   - read time
   - compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);

save_last_tsc()

But that last_tsc from before the cmpxchg is much later than the one it
interrupted.


> 
> save_last_tsc()

Ah, so if you put the clock back here, it will just cause the next
event to inject a full timestamp again because the delta is too big.

This is how you are able to avoid the "before/after" logic I have, as
the race is automatically detected. The least significant bits of the
timestamp is ignored for the event delta calculation. And if a race
happens where the interrupting event saves a later timestamp and comes
back here, if the interrupted event writes the older timestamp, it just
causes that delta calculation to overflow again and you inject another
64bit timestamp into the buffer.

> >>
> >> Or introduce "trace_pipe_raw2" or some kind of versioned file
> >> names as new ABIs.  
> > 
> > I hate "version numbers" ;-)  I would rather have a meaning name
> > instead.
> > 
> > "trace_pipe_raw_ts_delta" or something that gives a hint to why it's
> > different.  
> 
> It's just that as it evolves and more changes are integrated, you may
> end up with "trace_pipe_raw_featureA_featureB_featureC", which is
> also inconvenient.

Yes, but at least you know what those features are ;-)

And it's been 15 years since the code was written, and this would be
the "first" new feature added.

> 
> >   
> >>  
> >>> Although, if we want to change the "default" way, it may need
> >>> some other knobs or something, which wouldn't be hard.  
> >>
> >> The delta-timestamp-encoding would have to stay there for a while
> >> as long as users have not switched over to trace_pipe_raw2. Then
> >> when it's really gone, the trace_pipe_raw could either go away or
> >> return an error when opened.
> >>  
> >>> Now I have to ask, what's the motivation for this. The code isn't
> >>> that complex anymore. Yes it still has the before/after
> >>> timestamps, but the most complexity in that code was due to what
> >>> happens in the race of updating the reserved data. But that's no
> >>> longer the case with the cmpxchg(). If you look at this patch,
> >>> that entire else statement was deleted. And that deleted code was
> >>> what made me sick to my stomach ;-) Good riddance!  
> >>
> >> The motivation for this would be to further simplify the
> >> implementation of __rb_reserve_next(), rb_add_timestamp(), and
> >> rb_try_to_discard(), and remove a few useless loads, stores, and
> >> conditional branches on the fast-path of __rb_reserve_next(). This
> >> would turn the before/after timestamp stores/loads/comparisons
> >> into a simple "last timestamp" field and a comparison of the
> >> current timestamp against the "last timestamp" value to figure out
> >> whether the compact representation can be used.
> >>
> >> I don't know whether it's worth the trouble or not, it's really up
> >> to you. :)  
> > 
> > Yeah, making such a change would require an analysis to know if it
> > would speed things up enough to make a significant difference to do
> > that effort.  
> 
> I would also consider reducing code complexity as a worthwhile metric
> in addition to speed. It makes it easier to extend in the future,
> easier to understand for reviewers, and subtle bugs are less likely
> to creep in.

Really, it wouldn't make it that much simpler. The addition of the
cmpxchg() of this patch removed the nasty part of the code.

Going through a transition of changing it could end up being just as
complex. I'm not sure the complexity in that transition is better than
the complexity of the current code, as this code has been there for 15
years, and I know of at least 2 other projects that depend on this
format as is.

-- Steve
Steven Rostedt Jan. 20, 2024, 2:34 p.m. UTC | #7
On Sat, 20 Jan 2024 08:47:13 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > I would also consider reducing code complexity as a worthwhile metric
> > in addition to speed. It makes it easier to extend in the future,
> > easier to understand for reviewers, and subtle bugs are less likely
> > to creep in.  
> 
> Really, it wouldn't make it that much simpler. The addition of the
> cmpxchg() of this patch removed the nasty part of the code.

Now let's look at the difference of the two. You still need to save the
current timestamp in one variable. I have to do it in two, so your
algorithm does have that advantage. I currently have (eliminating the
"always add absolute timestamp" switch):


  w = write;
  before = before_stamp;
again:
  after = write_stamp; (equivalent to your last_tsc)
  ts = rdtsc();
  if (!w)
	delta = 0; // event has same ts as subbuf
  else if (before == after)
	delta = ts - after;
  else {
	delta = 0;
	inject_absolute = true;
  }

  before_stamp = ts;

  if (!try_cmpxchg(&write, w, w + length))
	goto again;

  write_stamp = ts;


Now if I were to updated it to use a delta from the last injected
timestamp, where injecting a timestamp only happens on overflow.

#define TS_BITS 27
#define MAX_DELTA ((1 << TS_BITS) - 1)
#define BITMASK (~MAX_DELTA)

  w = write;
again:
  ts = rdtsc();

  delta = ts & MAX_DELTA;

  if (ts - (write_stamp & BITMASK) > MAX_DELTA)
	inject_absolute = true;

  if (!try_cmpxchg(&write, w, w + length))
	goto again;

  write_stamp = ts;

I admit that it does simplify the code a little, but does it do it
enough to be worth the process of deprecating an ABI with a new one?

-- Steve
Mathieu Desnoyers Jan. 25, 2024, 9:18 p.m. UTC | #8
On 2024-01-20 08:47, Steven Rostedt wrote:
> On Fri, 19 Jan 2024 20:49:36 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>>> Let's say we have the following ktime_get() values (monotonic timestamp value) for
>>>> a sequence of events:
>>>>
>>>>                                      Timestamp (Hex)    Encoding in the trace
>>>>
>>>> Packet header timestamp begin     0x12345678         64-bit: 0x12345678
>>>>
>>>> Event 1                           0x12345678         16-bit: 0x5678
>>>>       (When decoded, same value as previous timestamp, no overflow)
>>>> Event 2                           0x12347777         16-bit: 0x7777
>>>>       (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
>>>> Event 3                           0x12350000         16-bit: 0x0000
>>>>       (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
>>>>        which allows the trace reader to reconstruct timestamp 0x12350000 from the
>>>>        previous timestamp and the 16-bit timestamp encoding.)
>>>> Event 4                           0x12370000         64-bit: 0x12370000
>>>>       (Encoding over 16-bit not possible because going from 0x12350000 to
>>>>        0x12370000 would overflow 16-bit twice, which cannot be detected
>>>>        by a trace reader. Therefore use the full 64-bit timestamp in the
>>>>        "large" event header representation.)
>>>
>>> I think that's basically what I said, but you are just looking at it
>>> differently ;-) Or should I say, you are using bits for optimization.
>>
>> Based on your explanation below, we are really talking about different things
>> here. Let me try to reply to your explanation to try to show where what I am
>> doing completely differs from what you have in mind. This will help explain
>> how I handle 16-bit overflow as well.
>>
>>> The events are based off of the last injected timestamp.
>>
>> Incorrect. There is no "injected timestamp". There is only a concept
>> of the "current timestamp" as we either write to or read from the
>> event stream. I will take the point of view of the trace reader for
>> the rest of the discussion.
>>
>>> The above example,
>>> starts with an timestamp injection into the packet header: 0x12345678, with
>>> the lsb 16bits ignore.
>>
>> Wrong again. The 16 least significant bits are not ignored. The "current timestamp"
>> is really 0x12345678 when the packet header is read.
> 
> In the packet header you have 0x12345678 in the first event you have
> 0x5678 how does that get you the timestamp? If that event had 0x8888,
> when the reader reads this packet, it would take the header 0x12345678
> chop off (ignore) the 5678, and add the 8888, right?

We need to consider not only what happens when the 16 low bits increase, but
also what happens when they end up with a value smaller than the previous
16 low bits.

As a summary from our video meeting discussion:

There are 3 cases we care about here:

packet header timestamp: 0x12345678

followed by either:

A) first event delta from packet header timestamp is 0: 16-bit value 0x5678

B) first event delta from packet header timestamp is <= 0xFFFF:
    B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value)
    B.2) 16-bit value example 0x2222 (exactly one 16-bit overflow from previous value)

C) first event delta from packet header timestamp is larger than 0xFFFF, which would
    cause the low-order 16 bits to have more than one 16-bit overflow from the previous
    value. The tracer detects this and uses a full 64-bit timestamp instead of the compact
    16 bits.

[...]


> 
>>
>>>
>>> But how do you detect the overflow? That last timestamps to know if
>>> the tsc overflowed or not needs to be saved and compared. I would
>>> assume you have a similar race that I have.
>>
>> Yes, I save a "last timestamp" per buffer, but the race does not
>> matter because of the order in which it is saved wrt local cmpxchg
>> updating the reserved position.
>>
>> The algorithm looks like:
>>
>> do {
>>     - read current reserved position (old pos)
>>     - read time
>>     - compute new reserved position (new pos)
>> } while (cmpxchg(reserved pos, old pos, new pos) != old pos);
>>
>> [A]
>>
>> save_last_tsc()
> 
> So the last_tsc that is saved is from the timestamp read before the
> cmpxchg.

Yes.

> 
>>
>> If interrupted at [A] by another trace producer, it will compare with
>> an older "last tsc" than the tsc of the event physically located just
>> before the nested event. This stale "last tsc" has a value which is
>> necessarily lower than the one we would be saving with the
>> save_last_tsc immediately afterwards, which means in the worse case
>> we end up using a full 64-bit timestamp when in fact we could use a
>> more compact representation. But this race is rare and therefore it
>> does not matter for size.
> 
> That's equivalent to me "injecting" an absolute value for the same race.

Yes.

> 
>>
>> The fact that I only need this last_tsc value for the sake of
>> optimization, and not for computation of a time delta from a previous
>> injected timestamp, makes it possible to handle the race gracefully
>> without requiring anything more than a single last_tsc value per
>> buffer and a single comparison for 16-bit overflow.
> 
> If you have:
> 
>> do {
>>     - read current reserved position (old pos)
>>     - read time
>>     - compute new reserved position (new pos)
>> } while (cmpxchg(reserved pos, old pos, new pos) != old pos);
>>
>> [A]
> 
> And here you have an interrupt that runs for several minutes (yeah bad
> code ;-) And it does the same thing
> 
> do {
>     - read current reserved position (old pos)
>     - read time
>     - compute new reserved position (new pos)
> } while (cmpxchg(reserved pos, old pos, new pos) != old pos);
> 
> save_last_tsc()
> 
> But that last_tsc from before the cmpxchg is much later than the one it
> interrupted.
> 
> 
>>
>> save_last_tsc()
> 
> Ah, so if you put the clock back here, it will just cause the next
> event to inject a full timestamp again because the delta is too big.

Yes.

> 
> This is how you are able to avoid the "before/after" logic I have, as
> the race is automatically detected. The least significant bits of the
> timestamp is ignored for the event delta calculation.

Not quite, as I explained at the beginning of this email. All bits from the
previous timestamp, including its low bits, are useful to know how many
overflows happened since the last tsc.

> And if a race
> happens where the interrupting event saves a later timestamp and comes
> back here, if the interrupted event writes the older timestamp, it just
> causes that delta calculation to overflow again and you inject another
> 64bit timestamp into the buffer.

This part is correct: in the race you describe, we end up with the
possibility of bringing the last_tsc backwards, which can only cause
the tracer to use the full 64-bit timestamp when in fact it could use
the compact representation. But it's rare and should not matter in
practice.

And by the way this algorithm is designed to work with preemption/migration
enabled as well, not just interrupts. So the race can come from a thread
running concurrently on another CPU and it should work as well.

[...]

> 
> Going through a transition of changing it could end up being just as
> complex. I'm not sure the complexity in that transition is better than
> the complexity of the current code, as this code has been there for 15
> years, and I know of at least 2 other projects that depend on this
> format as is.

I agree with you that it's not clear-cut whether introducing this change
would be a benefit at this stage considering the extra complexity of
extending the ABI while keeping backward compatibility.

But it's something we can keep in mind if we ever have to do major ABI
extensions for other reasons.

Thanks,

Mathieu
Steven Rostedt Jan. 25, 2024, 10:14 p.m. UTC | #9
On Thu, 25 Jan 2024 16:18:37 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > This is how you are able to avoid the "before/after" logic I have, as
> > the race is automatically detected. The least significant bits of the
> > timestamp is ignored for the event delta calculation.  
> 
> Not quite, as I explained at the beginning of this email. All bits from the
> previous timestamp, including its low bits, are useful to know how many
> overflows happened since the last tsc.

Yes, but it still means updating that timestamp you will compare to doesn't
have the race I have. If the timestamp's upper bits are the same, or are
off by one and the lower bits are higher than the current timestamp, you
don't need to inject. But if the lower bits are higher than the timestamp
or the higehr bits are off by more than one then you do.

The lower bits are a delta against "0" of the current timestamp lower bits.
There's no race in updating those bits as long as the upper bits remain the
same or are off by one and the current timestamp lower bits are lower than
the saved time stamp.

In my case, because the delta is off of the entire timestamp, what I write
into the saved timestamp, all bits matter. And to handle that I need the
before/after timestamps to know if the currently saved timestamp didn't
have a race.

> 
> > And if a race
> > happens where the interrupting event saves a later timestamp and comes
> > back here, if the interrupted event writes the older timestamp, it just
> > causes that delta calculation to overflow again and you inject another
> > 64bit timestamp into the buffer.  
> 
> This part is correct: in the race you describe, we end up with the
> possibility of bringing the last_tsc backwards, which can only cause
> the tracer to use the full 64-bit timestamp when in fact it could use
> the compact representation. But it's rare and should not matter in
> practice.
> 
> And by the way this algorithm is designed to work with preemption/migration
> enabled as well, not just interrupts. So the race can come from a thread
> running concurrently on another CPU and it should work as well.
> 
> [...]
> 
> > 
> > Going through a transition of changing it could end up being just as
> > complex. I'm not sure the complexity in that transition is better than
> > the complexity of the current code, as this code has been there for 15
> > years, and I know of at least 2 other projects that depend on this
> > format as is.  
> 
> I agree with you that it's not clear-cut whether introducing this change
> would be a benefit at this stage considering the extra complexity of
> extending the ABI while keeping backward compatibility.
> 
> But it's something we can keep in mind if we ever have to do major ABI
> extensions for other reasons.

Yeah, it's something to think about if we want to use a different format
for something else.

Thanks for the review.

-- Steve
diff mbox series

Patch

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7887d61d5b56..74f8895785ca 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3449,9 +3449,11 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	/* Don't let the compiler play games with cpu_buffer->tail_page */
 	tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
 
- /*A*/	w = local_read(&tail_page->write) & RB_WRITE_MASK;
+ /*A*/	w = local_read(&tail_page->write);
 	barrier();
 	rb_time_read(&cpu_buffer->before_stamp, &info->before);
+	/* Read before_stamp only the first time through */
+ again:
 	rb_time_read(&cpu_buffer->write_stamp, &info->after);
 	barrier();
 	info->ts = rb_time_stamp(cpu_buffer->buffer);
@@ -3464,7 +3466,7 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		 * absolute timestamp.
 		 * Don't bother if this is the start of a new page (w == 0).
 		 */
-		if (!w) {
+		if (!(w & RB_WRITE_MASK)) {
 			/* Use the sub-buffer timestamp */
 			info->delta = 0;
 		} else if (unlikely(info->before != info->after)) {
@@ -3481,89 +3483,49 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 
  /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);
 
- /*C*/	write = local_add_return(info->length, &tail_page->write);
+ /*C*/	if (!local_try_cmpxchg(&tail_page->write, &w, w + info->length))
+		goto again;
 
-	/* set write to only the index of the write */
-	write &= RB_WRITE_MASK;
+	/* Set write to the start of this event */
+	write = w & RB_WRITE_MASK;
 
-	tail = write - info->length;
+	/* set tail to the end of the event */
+	tail = write + info->length;
 
 	/* See if we shot pass the end of this buffer page */
-	if (unlikely(write > cpu_buffer->buffer->subbuf_size)) {
+	if (unlikely(tail > cpu_buffer->buffer->subbuf_size)) {
 		check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
-		return rb_move_tail(cpu_buffer, tail, info);
+		return rb_move_tail(cpu_buffer, write, info);
 	}
 
-	if (likely(tail == w)) {
-		/* Nothing interrupted us between A and C */
- /*D*/		rb_time_set(&cpu_buffer->write_stamp, info->ts);
-		/*
-		 * If something came in between C and D, the write stamp
-		 * may now not be in sync. But that's fine as the before_stamp
-		 * will be different and then next event will just be forced
-		 * to use an absolute timestamp.
-		 */
-		if (likely(!(info->add_timestamp &
-			     (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
-			/* This did not interrupt any time update */
-			info->delta = info->ts - info->after;
-		else
-			/* Just use full timestamp for interrupting event */
-			info->delta = info->ts;
-		check_buffer(cpu_buffer, info, tail);
-	} else {
-		u64 ts;
-		/* SLOW PATH - Interrupted between A and C */
-
-		/* Save the old before_stamp */
-		rb_time_read(&cpu_buffer->before_stamp, &info->before);
-
-		/*
-		 * Read a new timestamp and update the before_stamp to make
-		 * the next event after this one force using an absolute
-		 * timestamp. This is in case an interrupt were to come in
-		 * between E and F.
-		 */
-		ts = rb_time_stamp(cpu_buffer->buffer);
-		rb_time_set(&cpu_buffer->before_stamp, ts);
-
-		barrier();
- /*E*/		rb_time_read(&cpu_buffer->write_stamp, &info->after);
-		barrier();
- /*F*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
-		    info->after == info->before && info->after < ts) {
-			/*
-			 * Nothing came after this event between C and F, it is
-			 * safe to use info->after for the delta as it
-			 * matched info->before and is still valid.
-			 */
-			info->delta = ts - info->after;
-		} else {
-			/*
-			 * Interrupted between C and F:
-			 * Lost the previous events time stamp. Just set the
-			 * delta to zero, and this will be the same time as
-			 * the event this event interrupted. And the events that
-			 * came after this will still be correct (as they would
-			 * have built their delta on the previous event.
-			 */
-			info->delta = 0;
-		}
-		info->ts = ts;
-		info->add_timestamp &= ~RB_ADD_STAMP_FORCE;
-	}
+	/* Nothing interrupted us between A and C */
+ /*D*/	rb_time_set(&cpu_buffer->write_stamp, info->ts);
+	/*
+	 * If something came in between C and D, the write stamp
+	 * may now not be in sync. But that's fine as the before_stamp
+	 * will be different and then next event will just be forced
+	 * to use an absolute timestamp.
+	 */
+	if (likely(!(info->add_timestamp &
+		     (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
+		/* This did not interrupt any time update */
+		info->delta = info->ts - info->after;
+	else
+		/* Just use full timestamp for interrupting event */
+		info->delta = info->ts;
+	check_buffer(cpu_buffer, info, write);
 
 	/*
 	 * If this is the first commit on the page, then it has the same
 	 * timestamp as the page itself.
 	 */
-	if (unlikely(!tail && !(info->add_timestamp &
+	if (unlikely(!write && !(info->add_timestamp &
 				(RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
 		info->delta = 0;
 
 	/* We reserved something on the buffer */
 
-	event = __rb_page_index(tail_page, tail);
+	event = __rb_page_index(tail_page, write);
 	rb_update_event(cpu_buffer, event, info);
 
 	local_inc(&tail_page->entries);
@@ -3572,7 +3534,7 @@  __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	 * If this is the first commit on the page, then update
 	 * its timestamp.
 	 */
-	if (unlikely(!tail))
+	if (unlikely(!write))
 		tail_page->page->time_stamp = info->ts;
 
 	/* account for these added bytes */