Message ID | 20240118181206.4977da2f@gandalf.local.home (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | ring-buffer: Simplify reservation with try_cmpxchg() loop | expand |
On 2024-01-18 18:12, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > Instead of using local_add_return() to reserve the ring buffer data, > Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the > reservation with the time keeping code. I admire the effort of trying to simplify the Ftrace ring buffer by bringing over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, I certainly welcome the simplifications. > Although, it does not get rid of the double time stamps (before_stamp and > write_stamp), using cmpxchg() does get rid of the more complex case when > an interrupting event occurs between getting the timestamps and reserving > the data, as when that happens, it just tries again instead of dealing > with it. I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as delta from the previous event within the buffer as a mean of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from previous event). This timestamp encoding as delta between events introduce a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space-savings achieved with timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher order bits), and fall back on a full 64-bit value if the number of low-order bits overflows from the previous timestamp is more than 1, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency. The fact that Ftrace exposes this ring buffer binary layout as a user-space ABI makes it tricky to move to the Common Trace Format timestamp encoding. There are clearly huge simplifications that could be made by moving to this scheme though. Is there any way to introduce a different timestamp encoding scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to introduce this simpler scheme and gradually phase out the more complex delta encoding when no users are left. Thoughts ? Thanks, Mathieu [1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update
On Fri, 19 Jan 2024 09:40:27 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > On 2024-01-18 18:12, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > > > Instead of using local_add_return() to reserve the ring buffer data, > > Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the > > reservation with the time keeping code. > > I admire the effort of trying to simplify the Ftrace ring buffer by bringing > over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, > I certainly welcome the simplifications. > The idea itself wasn't new. It was you convincing me that local_add_return() is no faster than local_cmpxchg(). As I would have done it this way from the start if I wasn't dead set against adding any new cmpxchg() in the fast path. Testing showed that local_add_return() is pretty much just as bad, so the added complexity to avoid just slapping in a cmpxchg() was useless. > > Although, it does not get rid of the double time stamps (before_stamp and > > write_stamp), using cmpxchg() does get rid of the more complex case when > > an interrupting event occurs between getting the timestamps and reserving > > the data, as when that happens, it just tries again instead of dealing > > with it. > > I understand that the reason why you need the before/after stamps and their > associated complexity is because the Ftrace ring buffer ABI encodes event > timestamps as delta from the previous event within the buffer as a mean of > compressing the timestamp fields. If the delta cannot be represented in a > given number of bits, then it inserts a 64-bit timestamp (not sure if that > one is absolute or a delta from previous event). There's both. An extended timestamp, which is added when the delta is too big, and that too is just a delta from the previous event. And there is the absolute timestamp as well. I could always just use the absolute one. That event came much later. > > This timestamp encoding as delta between events introduce a strong > inter-dependency between consecutive (nested) events, and is the reason > why you are stuck with all this timestamp before/after complexity. > > The Common Trace Format specifies (and LTTng implements) a different way > to achieve the same ring buffer space-savings achieved with timestamp deltas > while keeping the timestamps semantically absolute from a given reference, > hence without all the before/after timestamp complexity. You can see the > clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic That points to this: ---------------------8<------------------------- 6.3. Clock value update procedure To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C: Let L be an unsigned integer initialized to, depending on the type property of C: "fixed-length-unsigned-integer" The value of the length property of C. "variable-length-unsigned-integer" S ×7, where S is the number of bytes which F occupies with the data stream. Let MASK be an unsigned integer initialized to 2L − 1. Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where “&” is the bitwise AND operator and “~” is the bitwise NOT operator. Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where “&” is the bitwise AND operator. Set DEF_CLK_VAL to: If V ≥ CUR H + V Else H + MASK + 1 + V --------------------->8------------------------- There's a lot of missing context there, so I don't see how it relates. > idea on the producer side is to record the low-order bits of the current > timestamp in the event header (truncating the higher order bits), and > fall back on a full 64-bit value if the number of low-order bits overflows > from the previous timestamp is more than 1, or if it is impossible to figure > out precisely the timestamp of the previous event due to a race. This achieves > the same space savings as delta timestamp encoding without introducing the > strong event inter-dependency. So when an overflow happens, you just insert a timestamp, and then events after that is based on that? > > The fact that Ftrace exposes this ring buffer binary layout as a user-space > ABI makes it tricky to move to the Common Trace Format timestamp encoding. > There are clearly huge simplifications that could be made by moving to this > scheme though. Is there any way to introduce a different timestamp encoding > scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to > introduce this simpler scheme and gradually phase out the more complex delta > encoding when no users are left. I'm not sure if there's a path forward. The infrastructure can easily swap in and out a new implementation. That is, there's not much dependency on the way the ring buffer works outside the ring buffer itself. If we were to change the layout, it would likely require a new interface file to read. The trace_pipe_raw is the only file that exposes the current ring buffer. We could create a trace_out_raw or some other named file that has a completely different API and it wouldn't break any existing API. Although, if we want to change the "default" way, it may need some other knobs or something, which wouldn't be hard. Now I have to ask, what's the motivation for this. The code isn't that complex anymore. Yes it still has the before/after timestamps, but the most complexity in that code was due to what happens in the race of updating the reserved data. But that's no longer the case with the cmpxchg(). If you look at this patch, that entire else statement was deleted. And that deleted code was what made me sick to my stomach ;-) Good riddance! -- Steve
On 2024-01-19 10:37, Steven Rostedt wrote: > On Fri, 19 Jan 2024 09:40:27 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> On 2024-01-18 18:12, Steven Rostedt wrote: >>> From: "Steven Rostedt (Google)" <rostedt@goodmis.org> >>> [...] >>> Although, it does not get rid of the double time stamps (before_stamp and >>> write_stamp), using cmpxchg() does get rid of the more complex case when >>> an interrupting event occurs between getting the timestamps and reserving >>> the data, as when that happens, it just tries again instead of dealing >>> with it. >> >> I understand that the reason why you need the before/after stamps and their >> associated complexity is because the Ftrace ring buffer ABI encodes event >> timestamps as delta from the previous event within the buffer as a mean of >> compressing the timestamp fields. If the delta cannot be represented in a >> given number of bits, then it inserts a 64-bit timestamp (not sure if that >> one is absolute or a delta from previous event). > > There's both. An extended timestamp, which is added when the delta is too > big, and that too is just a delta from the previous event. And there is the > absolute timestamp as well. I could always just use the absolute one. That > event came much later. OK > >> >> This timestamp encoding as delta between events introduce a strong >> inter-dependency between consecutive (nested) events, and is the reason >> why you are stuck with all this timestamp before/after complexity. >> >> The Common Trace Format specifies (and LTTng implements) a different way >> to achieve the same ring buffer space-savings achieved with timestamp deltas >> while keeping the timestamps semantically absolute from a given reference, >> hence without all the before/after timestamp complexity. You can see the >> clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic > > That points to this: > > ---------------------8<------------------------- > 6.3. Clock value update procedure > To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C: > > Let L be an unsigned integer initialized to, depending on the type property of C: > > "fixed-length-unsigned-integer" > The value of the length property of C. > > "variable-length-unsigned-integer" > S ×7, where S is the number of bytes which F occupies with the data stream. > > Let MASK be an unsigned integer initialized to 2L − 1. > > Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where “&” is the bitwise AND operator and “~” is the bitwise NOT operator. > > Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where “&” is the bitwise AND operator. > > Set DEF_CLK_VAL to: > > If V ≥ CUR > H + V > > Else > H + MASK + 1 + V > --------------------->8------------------------- > > There's a lot of missing context there, so I don't see how it relates. This explains how the "current time" is reconstructed by a trace reader when loading an event header timestamp field. But for the sake of this discussion we can focus on the less formal explanation of how the tracer generates this timestamp encoding provided below. >> idea on the producer side is to record the low-order bits of the current >> timestamp in the event header (truncating the higher order bits), and >> fall back on a full 64-bit value if the number of low-order bits overflows >> from the previous timestamp is more than 1, or if it is impossible to figure >> out precisely the timestamp of the previous event due to a race. This achieves >> the same space savings as delta timestamp encoding without introducing the >> strong event inter-dependency. > > So when an overflow happens, you just insert a timestamp, and then events > after that is based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit for timestamps in the compact representation. Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex) Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x12347777 16-bit: 0x7777 (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit) Event 3 0x12350000 16-bit: 0x0000 (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x12370000 64-bit: 0x12370000 (Encoding over 16-bit not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) > >> >> The fact that Ftrace exposes this ring buffer binary layout as a user-space >> ABI makes it tricky to move to the Common Trace Format timestamp encoding. >> There are clearly huge simplifications that could be made by moving to this >> scheme though. Is there any way to introduce a different timestamp encoding >> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to >> introduce this simpler scheme and gradually phase out the more complex delta >> encoding when no users are left. > > I'm not sure if there's a path forward. The infrastructure can easily swap > in and out a new implementation. That is, there's not much dependency on > the way the ring buffer works outside the ring buffer itself. > > If we were to change the layout, it would likely require a new interface > file to read. The trace_pipe_raw is the only file that exposes the current > ring buffer. We could create a trace_out_raw or some other named file that > has a completely different API and it wouldn't break any existing API. Or introduce "trace_pipe_raw2" or some kind of versioned file names as new ABIs. > Although, if we want to change the "default" way, it may need some other > knobs or something, which wouldn't be hard. The delta-timestamp-encoding would have to stay there for a while as long as users have not switched over to trace_pipe_raw2. Then when it's really gone, the trace_pipe_raw could either go away or return an error when opened. > Now I have to ask, what's the motivation for this. The code isn't that > complex anymore. Yes it still has the before/after timestamps, but the > most complexity in that code was due to what happens in the race of > updating the reserved data. But that's no longer the case with the > cmpxchg(). If you look at this patch, that entire else statement was > deleted. And that deleted code was what made me sick to my stomach ;-) > Good riddance! The motivation for this would be to further simplify the implementation of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and remove a few useless loads, stores, and conditional branches on the fast-path of __rb_reserve_next(). This would turn the before/after timestamp stores/loads/comparisons into a simple "last timestamp" field and a comparison of the current timestamp against the "last timestamp" value to figure out whether the compact representation can be used. I don't know whether it's worth the trouble or not, it's really up to you. :) Thanks, Mathieu
On Fri, 19 Jan 2024 15:56:21 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > So when an overflow happens, you just insert a timestamp, and then events > > after that is based on that? > > No. Let's use an example to show how it works. > > For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps > in the compact event header representation. But for the sake of making this > discussion easier, let's assume a tracer would use 16-bit for timestamps in the > compact representation. > > Let's say we have the following ktime_get() values (monotonic timestamp value) for > a sequence of events: > > Timestamp (Hex) Encoding in the trace > > Packet header timestamp begin 0x12345678 64-bit: 0x12345678 > > Event 1 0x12345678 16-bit: 0x5678 > (When decoded, same value as previous timestamp, no overflow) > Event 2 0x12347777 16-bit: 0x7777 > (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit) > Event 3 0x12350000 16-bit: 0x0000 > (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once > which allows the trace reader to reconstruct timestamp 0x12350000 from the > previous timestamp and the 16-bit timestamp encoding.) > Event 4 0x12370000 64-bit: 0x12370000 > (Encoding over 16-bit not possible because going from 0x12350000 to > 0x12370000 would overflow 16-bit twice, which cannot be detected > by a trace reader. Therefore use the full 64-bit timestamp in the > "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. The events are based off of the last injected timestamp. The above example, starts with an timestamp injection into the packet header: 0x12345678, with the lsb 16bits ignore. So in actuality, it inserts 0x12340000, plus adds a 5678 that represents the creation of the header (like we discussed about in the last tracing meeting). The first event has: 0x5678 which is based on the previous injected timestamp of 0x12340000. the events go on just using a delta from that until you see it overflow (or too big of a delta to fit in the 16 bits), and you insert a new full timestamps that everything will be based on: 0x12370000 Now events following that are just a delta from that timestamp. But you calculate the delta simply by masking out the lower bits. But how do you detect the overflow? That last timestamps to know if the tsc overflowed or not needs to be saved and compared. I would assume you have a similar race that I have. > > > > >> > >> The fact that Ftrace exposes this ring buffer binary layout as a user-space > >> ABI makes it tricky to move to the Common Trace Format timestamp encoding. > >> There are clearly huge simplifications that could be made by moving to this > >> scheme though. Is there any way to introduce a different timestamp encoding > >> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to > >> introduce this simpler scheme and gradually phase out the more complex delta > >> encoding when no users are left. > > > > I'm not sure if there's a path forward. The infrastructure can easily swap > > in and out a new implementation. That is, there's not much dependency on > > the way the ring buffer works outside the ring buffer itself. > > > > If we were to change the layout, it would likely require a new interface > > file to read. The trace_pipe_raw is the only file that exposes the current > > ring buffer. We could create a trace_out_raw or some other named file that > > has a completely different API and it wouldn't break any existing API. > > Or introduce "trace_pipe_raw2" or some kind of versioned file names as new > ABIs. I hate "version numbers" ;-) I would rather have a meaning name instead. "trace_pipe_raw_ts_delta" or something that gives a hint to why it's different. > > > Although, if we want to change the "default" way, it may need some other > > knobs or something, which wouldn't be hard. > > The delta-timestamp-encoding would have to stay there for a while as long > as users have not switched over to trace_pipe_raw2. Then when it's really > gone, the trace_pipe_raw could either go away or return an error when > opened. > > > Now I have to ask, what's the motivation for this. The code isn't that > > complex anymore. Yes it still has the before/after timestamps, but the > > most complexity in that code was due to what happens in the race of > > updating the reserved data. But that's no longer the case with the > > cmpxchg(). If you look at this patch, that entire else statement was > > deleted. And that deleted code was what made me sick to my stomach ;-) > > Good riddance! > > The motivation for this would be to further simplify the implementation > of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and > remove a few useless loads, stores, and conditional branches on the fast-path > of __rb_reserve_next(). This would turn the before/after timestamp > stores/loads/comparisons into a simple "last timestamp" field and a comparison > of the current timestamp against the "last timestamp" value to figure out > whether the compact representation can be used. > > I don't know whether it's worth the trouble or not, it's really up to you. :) Yeah, making such a change would require an analysis to know if it would speed things up enough to make a significant difference to do that effort. -- Steve
On 2024-01-19 16:42, Steven Rostedt wrote: > On Fri, 19 Jan 2024 15:56:21 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >>> So when an overflow happens, you just insert a timestamp, and then events >>> after that is based on that? >> >> No. Let's use an example to show how it works. >> >> For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps >> in the compact event header representation. But for the sake of making this >> discussion easier, let's assume a tracer would use 16-bit for timestamps in the >> compact representation. >> >> Let's say we have the following ktime_get() values (monotonic timestamp value) for >> a sequence of events: >> >> Timestamp (Hex) Encoding in the trace >> >> Packet header timestamp begin 0x12345678 64-bit: 0x12345678 >> >> Event 1 0x12345678 16-bit: 0x5678 >> (When decoded, same value as previous timestamp, no overflow) >> Event 2 0x12347777 16-bit: 0x7777 >> (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit) >> Event 3 0x12350000 16-bit: 0x0000 >> (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once >> which allows the trace reader to reconstruct timestamp 0x12350000 from the >> previous timestamp and the 16-bit timestamp encoding.) >> Event 4 0x12370000 64-bit: 0x12370000 >> (Encoding over 16-bit not possible because going from 0x12350000 to >> 0x12370000 would overflow 16-bit twice, which cannot be detected >> by a trace reader. Therefore use the full 64-bit timestamp in the >> "large" event header representation.) > > I think that's basically what I said, but you are just looking at it > differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. > The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. > The above example, > starts with an timestamp injection into the packet header: 0x12345678, with > the lsb 16bits ignore. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. > So in actuality, it inserts 0x12340000, plus adds a > 5678 that represents the creation of the header (like we discussed about in > the last tracing meeting). There is no "0x12340000" reference time. There is only the actual 0x12345678 current time at packet begin, including those 16 low order bits. > > The first event has: 0x5678 which is based on the previous injected > timestamp of 0x12340000. It is not "based" on the previous injected timestamp. It just represents the low-order 16-bits of the timestamp at event 1. The trace readers takes two informations to compute the complete current timestamp for event 1: 1) The current timestamp just before event 1 is read (0x12345678), 2) The low-order 16 bits of event 1 (0x5678). Given that there is no 16-bit overflow when comparing: 0x12345678 & 0xFFFF and 0x5678 We know that the resulting current timestamp for event 1 is: (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678 Or basically that time did not move between the packet header and event 1. > > the events go on just using a delta from that until you see it overflow (or > too big of a delta to fit in the 16 bits), and you insert a new full > timestamps that everything will be based on: > > 0x12370000 No. The following events use the same algorithm I just described: They use this notion of "current timestamp" and the information provided by the new timestamp field in the event header to figure out the updated current timestamp. It is _never_ based on some kind of arbitrary reference, it is always absolute, and is always computed based on the current timestamp and the timestamp field encountered. > > Now events following that are just a delta from that timestamp. But you > calculate the delta simply by masking out the lower bits. No, again, there is no delta from arbitrary injected time. It's always computed from this virtual "current time", which applies to an event stream. > > But how do you detect the overflow? That last timestamps to know if the tsc > overflowed or not needs to be saved and compared. I would assume you have a > similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg updating the reserved position. The algorithm looks like: do { - read current reserved position (old pos) - read time - compute new reserved position (new pos) } while (cmpxchg(reserved pos, old pos, new pos) != old pos); [A] save_last_tsc() If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event. This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means in the worse case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully without requiring anything more than a single last_tsc value per buffer and a single comparison for 16-bit overflow. > >> >>> >>>> >>>> The fact that Ftrace exposes this ring buffer binary layout as a user-space >>>> ABI makes it tricky to move to the Common Trace Format timestamp encoding. >>>> There are clearly huge simplifications that could be made by moving to this >>>> scheme though. Is there any way to introduce a different timestamp encoding >>>> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to >>>> introduce this simpler scheme and gradually phase out the more complex delta >>>> encoding when no users are left. >>> >>> I'm not sure if there's a path forward. The infrastructure can easily swap >>> in and out a new implementation. That is, there's not much dependency on >>> the way the ring buffer works outside the ring buffer itself. >>> >>> If we were to change the layout, it would likely require a new interface >>> file to read. The trace_pipe_raw is the only file that exposes the current >>> ring buffer. We could create a trace_out_raw or some other named file that >>> has a completely different API and it wouldn't break any existing API. >> >> Or introduce "trace_pipe_raw2" or some kind of versioned file names as new >> ABIs. > > I hate "version numbers" ;-) I would rather have a meaning name instead. > > "trace_pipe_raw_ts_delta" or something that gives a hint to why it's > different. It's just that as it evolves and more changes are integrated, you may end up with "trace_pipe_raw_featureA_featureB_featureC", which is also inconvenient. > >> >>> Although, if we want to change the "default" way, it may need some other >>> knobs or something, which wouldn't be hard. >> >> The delta-timestamp-encoding would have to stay there for a while as long >> as users have not switched over to trace_pipe_raw2. Then when it's really >> gone, the trace_pipe_raw could either go away or return an error when >> opened. >> >>> Now I have to ask, what's the motivation for this. The code isn't that >>> complex anymore. Yes it still has the before/after timestamps, but the >>> most complexity in that code was due to what happens in the race of >>> updating the reserved data. But that's no longer the case with the >>> cmpxchg(). If you look at this patch, that entire else statement was >>> deleted. And that deleted code was what made me sick to my stomach ;-) >>> Good riddance! >> >> The motivation for this would be to further simplify the implementation >> of __rb_reserve_next(), rb_add_timestamp(), and rb_try_to_discard(), and >> remove a few useless loads, stores, and conditional branches on the fast-path >> of __rb_reserve_next(). This would turn the before/after timestamp >> stores/loads/comparisons into a simple "last timestamp" field and a comparison >> of the current timestamp against the "last timestamp" value to figure out >> whether the compact representation can be used. >> >> I don't know whether it's worth the trouble or not, it's really up to you. :) > > Yeah, making such a change would require an analysis to know if it would > speed things up enough to make a significant difference to do that effort. I would also consider reducing code complexity as a worthwhile metric in addition to speed. It makes it easier to extend in the future, easier to understand for reviewers, and subtle bugs are less likely to creep in. Thanks, Mathieu
On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> Let's say we have the following ktime_get() values (monotonic timestamp value) for > >> a sequence of events: > >> > >> Timestamp (Hex) Encoding in the trace > >> > >> Packet header timestamp begin 0x12345678 64-bit: 0x12345678 > >> > >> Event 1 0x12345678 16-bit: 0x5678 > >> (When decoded, same value as previous timestamp, no overflow) > >> Event 2 0x12347777 16-bit: 0x7777 > >> (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit) > >> Event 3 0x12350000 16-bit: 0x0000 > >> (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once > >> which allows the trace reader to reconstruct timestamp 0x12350000 from the > >> previous timestamp and the 16-bit timestamp encoding.) > >> Event 4 0x12370000 64-bit: 0x12370000 > >> (Encoding over 16-bit not possible because going from 0x12350000 to > >> 0x12370000 would overflow 16-bit twice, which cannot be detected > >> by a trace reader. Therefore use the full 64-bit timestamp in the > >> "large" event header representation.) > > > > I think that's basically what I said, but you are just looking at it > > differently ;-) Or should I say, you are using bits for optimization. > > Based on your explanation below, we are really talking about different things > here. Let me try to reply to your explanation to try to show where what I am > doing completely differs from what you have in mind. This will help explain > how I handle 16-bit overflow as well. > > > The events are based off of the last injected timestamp. > > Incorrect. There is no "injected timestamp". There is only a concept > of the "current timestamp" as we either write to or read from the > event stream. I will take the point of view of the trace reader for > the rest of the discussion. > > > The above example, > > starts with an timestamp injection into the packet header: 0x12345678, with > > the lsb 16bits ignore. > > Wrong again. The 16 least significant bits are not ignored. The "current timestamp" > is really 0x12345678 when the packet header is read. In the packet header you have 0x12345678 in the first event you have 0x5678 how does that get you the timestamp? If that event had 0x8888, when the reader reads this packet, it would take the header 0x12345678 chop off (ignore) the 5678, and add the 8888, right? > > > So in actuality, it inserts 0x12340000, plus adds a > > 5678 that represents the creation of the header (like we discussed about in > > the last tracing meeting). > > There is no "0x12340000" reference time. There is only the actual 0x12345678 > current time at packet begin, including those 16 low order bits. It could be 0x12341111 for all I care. I know you use that as the start time for when tracing starts. But for the events in the packet (which is what I care about) that bottom 16 bits in the packet header isn't necessary. > > > > > The first event has: 0x5678 which is based on the previous injected > > timestamp of 0x12340000. > > It is not "based" on the previous injected timestamp. It just represents > the low-order 16-bits of the timestamp at event 1. The trace readers takes > two informations to compute the complete current timestamp for event 1: > > 1) The current timestamp just before event 1 is read (0x12345678), And where does that reader get that information from? Either the packet header, or you need to inject a timestamp somewhere, don't you? > 2) The low-order 16 bits of event 1 (0x5678). > > Given that there is no 16-bit overflow when comparing: > > 0x12345678 & 0xFFFF and 0x5678 > > We know that the resulting current timestamp for event 1 is: > > (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678 And above you ignore the least 16 bits ;-) > > Or basically that time did not move between the packet header and event 1. > > > > > the events go on just using a delta from that until you see it overflow (or > > too big of a delta to fit in the 16 bits), and you insert a new full > > timestamps that everything will be based on: > > > > 0x12370000 > > No. The following events use the same algorithm I just described: They use Right, you mask off the bits of the timestamp. > this notion of "current timestamp" and the information provided by the new > timestamp field in the event header to figure out the updated current timestamp. > > It is _never_ based on some kind of arbitrary reference, it is always absolute, > and is always computed based on the current timestamp and the timestamp field > encountered. And that current timestamp was created by some injected timestamp in the writer. The reader has to get it from somewhere! > > > > > Now events following that are just a delta from that timestamp. But > > you calculate the delta simply by masking out the lower bits. > > No, again, there is no delta from arbitrary injected time. It's always > computed from this virtual "current time", which applies to an event > stream. > (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678 Delta from arbitrary injected time of (0x12345678 & ~0xFFFF) = 0x12340000 to event "delta" 0x5678 = 0x12345678 ;-) You are just using different terminology for the same thing I'm talking about. > > > > > But how do you detect the overflow? That last timestamps to know if > > the tsc overflowed or not needs to be saved and compared. I would > > assume you have a similar race that I have. > > Yes, I save a "last timestamp" per buffer, but the race does not > matter because of the order in which it is saved wrt local cmpxchg > updating the reserved position. > > The algorithm looks like: > > do { > - read current reserved position (old pos) > - read time > - compute new reserved position (new pos) > } while (cmpxchg(reserved pos, old pos, new pos) != old pos); > > [A] > > save_last_tsc() So the last_tsc that is saved is from the timestamp read before the cmpxchg. > > If interrupted at [A] by another trace producer, it will compare with > an older "last tsc" than the tsc of the event physically located just > before the nested event. This stale "last tsc" has a value which is > necessarily lower than the one we would be saving with the > save_last_tsc immediately afterwards, which means in the worse case > we end up using a full 64-bit timestamp when in fact we could use a > more compact representation. But this race is rare and therefore it > does not matter for size. That's equivalent to me "injecting" an absolute value for the same race. > > The fact that I only need this last_tsc value for the sake of > optimization, and not for computation of a time delta from a previous > injected timestamp, makes it possible to handle the race gracefully > without requiring anything more than a single last_tsc value per > buffer and a single comparison for 16-bit overflow. If you have: > do { > - read current reserved position (old pos) > - read time > - compute new reserved position (new pos) > } while (cmpxchg(reserved pos, old pos, new pos) != old pos); > > [A] And here you have an interrupt that runs for several minutes (yeah bad code ;-) And it does the same thing do { - read current reserved position (old pos) - read time - compute new reserved position (new pos) } while (cmpxchg(reserved pos, old pos, new pos) != old pos); save_last_tsc() But that last_tsc from before the cmpxchg is much later than the one it interrupted. > > save_last_tsc() Ah, so if you put the clock back here, it will just cause the next event to inject a full timestamp again because the delta is too big. This is how you are able to avoid the "before/after" logic I have, as the race is automatically detected. The least significant bits of the timestamp is ignored for the event delta calculation. And if a race happens where the interrupting event saves a later timestamp and comes back here, if the interrupted event writes the older timestamp, it just causes that delta calculation to overflow again and you inject another 64bit timestamp into the buffer. > >> > >> Or introduce "trace_pipe_raw2" or some kind of versioned file > >> names as new ABIs. > > > > I hate "version numbers" ;-) I would rather have a meaning name > > instead. > > > > "trace_pipe_raw_ts_delta" or something that gives a hint to why it's > > different. > > It's just that as it evolves and more changes are integrated, you may > end up with "trace_pipe_raw_featureA_featureB_featureC", which is > also inconvenient. Yes, but at least you know what those features are ;-) And it's been 15 years since the code was written, and this would be the "first" new feature added. > > > > >> > >>> Although, if we want to change the "default" way, it may need > >>> some other knobs or something, which wouldn't be hard. > >> > >> The delta-timestamp-encoding would have to stay there for a while > >> as long as users have not switched over to trace_pipe_raw2. Then > >> when it's really gone, the trace_pipe_raw could either go away or > >> return an error when opened. > >> > >>> Now I have to ask, what's the motivation for this. The code isn't > >>> that complex anymore. Yes it still has the before/after > >>> timestamps, but the most complexity in that code was due to what > >>> happens in the race of updating the reserved data. But that's no > >>> longer the case with the cmpxchg(). If you look at this patch, > >>> that entire else statement was deleted. And that deleted code was > >>> what made me sick to my stomach ;-) Good riddance! > >> > >> The motivation for this would be to further simplify the > >> implementation of __rb_reserve_next(), rb_add_timestamp(), and > >> rb_try_to_discard(), and remove a few useless loads, stores, and > >> conditional branches on the fast-path of __rb_reserve_next(). This > >> would turn the before/after timestamp stores/loads/comparisons > >> into a simple "last timestamp" field and a comparison of the > >> current timestamp against the "last timestamp" value to figure out > >> whether the compact representation can be used. > >> > >> I don't know whether it's worth the trouble or not, it's really up > >> to you. :) > > > > Yeah, making such a change would require an analysis to know if it > > would speed things up enough to make a significant difference to do > > that effort. > > I would also consider reducing code complexity as a worthwhile metric > in addition to speed. It makes it easier to extend in the future, > easier to understand for reviewers, and subtle bugs are less likely > to creep in. Really, it wouldn't make it that much simpler. The addition of the cmpxchg() of this patch removed the nasty part of the code. Going through a transition of changing it could end up being just as complex. I'm not sure the complexity in that transition is better than the complexity of the current code, as this code has been there for 15 years, and I know of at least 2 other projects that depend on this format as is. -- Steve
On Sat, 20 Jan 2024 08:47:13 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > I would also consider reducing code complexity as a worthwhile metric > > in addition to speed. It makes it easier to extend in the future, > > easier to understand for reviewers, and subtle bugs are less likely > > to creep in. > > Really, it wouldn't make it that much simpler. The addition of the > cmpxchg() of this patch removed the nasty part of the code. Now let's look at the difference of the two. You still need to save the current timestamp in one variable. I have to do it in two, so your algorithm does have that advantage. I currently have (eliminating the "always add absolute timestamp" switch): w = write; before = before_stamp; again: after = write_stamp; (equivalent to your last_tsc) ts = rdtsc(); if (!w) delta = 0; // event has same ts as subbuf else if (before == after) delta = ts - after; else { delta = 0; inject_absolute = true; } before_stamp = ts; if (!try_cmpxchg(&write, w, w + length)) goto again; write_stamp = ts; Now if I were to updated it to use a delta from the last injected timestamp, where injecting a timestamp only happens on overflow. #define TS_BITS 27 #define MAX_DELTA ((1 << TS_BITS) - 1) #define BITMASK (~MAX_DELTA) w = write; again: ts = rdtsc(); delta = ts & MAX_DELTA; if (ts - (write_stamp & BITMASK) > MAX_DELTA) inject_absolute = true; if (!try_cmpxchg(&write, w, w + length)) goto again; write_stamp = ts; I admit that it does simplify the code a little, but does it do it enough to be worth the process of deprecating an ABI with a new one? -- Steve
On 2024-01-20 08:47, Steven Rostedt wrote: > On Fri, 19 Jan 2024 20:49:36 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >>>> Let's say we have the following ktime_get() values (monotonic timestamp value) for >>>> a sequence of events: >>>> >>>> Timestamp (Hex) Encoding in the trace >>>> >>>> Packet header timestamp begin 0x12345678 64-bit: 0x12345678 >>>> >>>> Event 1 0x12345678 16-bit: 0x5678 >>>> (When decoded, same value as previous timestamp, no overflow) >>>> Event 2 0x12347777 16-bit: 0x7777 >>>> (When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit) >>>> Event 3 0x12350000 16-bit: 0x0000 >>>> (When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once >>>> which allows the trace reader to reconstruct timestamp 0x12350000 from the >>>> previous timestamp and the 16-bit timestamp encoding.) >>>> Event 4 0x12370000 64-bit: 0x12370000 >>>> (Encoding over 16-bit not possible because going from 0x12350000 to >>>> 0x12370000 would overflow 16-bit twice, which cannot be detected >>>> by a trace reader. Therefore use the full 64-bit timestamp in the >>>> "large" event header representation.) >>> >>> I think that's basically what I said, but you are just looking at it >>> differently ;-) Or should I say, you are using bits for optimization. >> >> Based on your explanation below, we are really talking about different things >> here. Let me try to reply to your explanation to try to show where what I am >> doing completely differs from what you have in mind. This will help explain >> how I handle 16-bit overflow as well. >> >>> The events are based off of the last injected timestamp. >> >> Incorrect. There is no "injected timestamp". There is only a concept >> of the "current timestamp" as we either write to or read from the >> event stream. I will take the point of view of the trace reader for >> the rest of the discussion. >> >>> The above example, >>> starts with an timestamp injection into the packet header: 0x12345678, with >>> the lsb 16bits ignore. >> >> Wrong again. The 16 least significant bits are not ignored. The "current timestamp" >> is really 0x12345678 when the packet header is read. > > In the packet header you have 0x12345678 in the first event you have > 0x5678 how does that get you the timestamp? If that event had 0x8888, > when the reader reads this packet, it would take the header 0x12345678 > chop off (ignore) the 5678, and add the 8888, right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion: There are 3 cases we care about here: packet header timestamp: 0x12345678 followed by either: A) first event delta from packet header timestamp is 0: 16-bit value 0x5678 B) first event delta from packet header timestamp is <= 0xFFFF: B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value) B.2) 16-bit value example 0x2222 (exactly one 16-bit overflow from previous value) C) first event delta from packet header timestamp is larger than 0xFFFF, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits. [...] > >> >>> >>> But how do you detect the overflow? That last timestamps to know if >>> the tsc overflowed or not needs to be saved and compared. I would >>> assume you have a similar race that I have. >> >> Yes, I save a "last timestamp" per buffer, but the race does not >> matter because of the order in which it is saved wrt local cmpxchg >> updating the reserved position. >> >> The algorithm looks like: >> >> do { >> - read current reserved position (old pos) >> - read time >> - compute new reserved position (new pos) >> } while (cmpxchg(reserved pos, old pos, new pos) != old pos); >> >> [A] >> >> save_last_tsc() > > So the last_tsc that is saved is from the timestamp read before the > cmpxchg. Yes. > >> >> If interrupted at [A] by another trace producer, it will compare with >> an older "last tsc" than the tsc of the event physically located just >> before the nested event. This stale "last tsc" has a value which is >> necessarily lower than the one we would be saving with the >> save_last_tsc immediately afterwards, which means in the worse case >> we end up using a full 64-bit timestamp when in fact we could use a >> more compact representation. But this race is rare and therefore it >> does not matter for size. > > That's equivalent to me "injecting" an absolute value for the same race. Yes. > >> >> The fact that I only need this last_tsc value for the sake of >> optimization, and not for computation of a time delta from a previous >> injected timestamp, makes it possible to handle the race gracefully >> without requiring anything more than a single last_tsc value per >> buffer and a single comparison for 16-bit overflow. > > If you have: > >> do { >> - read current reserved position (old pos) >> - read time >> - compute new reserved position (new pos) >> } while (cmpxchg(reserved pos, old pos, new pos) != old pos); >> >> [A] > > And here you have an interrupt that runs for several minutes (yeah bad > code ;-) And it does the same thing > > do { > - read current reserved position (old pos) > - read time > - compute new reserved position (new pos) > } while (cmpxchg(reserved pos, old pos, new pos) != old pos); > > save_last_tsc() > > But that last_tsc from before the cmpxchg is much later than the one it > interrupted. > > >> >> save_last_tsc() > > Ah, so if you put the clock back here, it will just cause the next > event to inject a full timestamp again because the delta is too big. Yes. > > This is how you are able to avoid the "before/after" logic I have, as > the race is automatically detected. The least significant bits of the > timestamp is ignored for the event delta calculation. Not quite, as I explained at the beginning of this email. All bits from the previous timestamp, including its low bits, are useful to know how many overflows happened since the last tsc. > And if a race > happens where the interrupting event saves a later timestamp and comes > back here, if the interrupted event writes the older timestamp, it just > causes that delta calculation to overflow again and you inject another > 64bit timestamp into the buffer. This part is correct: in the race you describe, we end up with the possibility of bringing the last_tsc backwards, which can only cause the tracer to use the full 64-bit timestamp when in fact it could use the compact representation. But it's rare and should not matter in practice. And by the way this algorithm is designed to work with preemption/migration enabled as well, not just interrupts. So the race can come from a thread running concurrently on another CPU and it should work as well. [...] > > Going through a transition of changing it could end up being just as > complex. I'm not sure the complexity in that transition is better than > the complexity of the current code, as this code has been there for 15 > years, and I know of at least 2 other projects that depend on this > format as is. I agree with you that it's not clear-cut whether introducing this change would be a benefit at this stage considering the extra complexity of extending the ABI while keeping backward compatibility. But it's something we can keep in mind if we ever have to do major ABI extensions for other reasons. Thanks, Mathieu
On Thu, 25 Jan 2024 16:18:37 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > This is how you are able to avoid the "before/after" logic I have, as > > the race is automatically detected. The least significant bits of the > > timestamp is ignored for the event delta calculation. > > Not quite, as I explained at the beginning of this email. All bits from the > previous timestamp, including its low bits, are useful to know how many > overflows happened since the last tsc. Yes, but it still means updating that timestamp you will compare to doesn't have the race I have. If the timestamp's upper bits are the same, or are off by one and the lower bits are higher than the current timestamp, you don't need to inject. But if the lower bits are higher than the timestamp or the higehr bits are off by more than one then you do. The lower bits are a delta against "0" of the current timestamp lower bits. There's no race in updating those bits as long as the upper bits remain the same or are off by one and the current timestamp lower bits are lower than the saved time stamp. In my case, because the delta is off of the entire timestamp, what I write into the saved timestamp, all bits matter. And to handle that I need the before/after timestamps to know if the currently saved timestamp didn't have a race. > > > And if a race > > happens where the interrupting event saves a later timestamp and comes > > back here, if the interrupted event writes the older timestamp, it just > > causes that delta calculation to overflow again and you inject another > > 64bit timestamp into the buffer. > > This part is correct: in the race you describe, we end up with the > possibility of bringing the last_tsc backwards, which can only cause > the tracer to use the full 64-bit timestamp when in fact it could use > the compact representation. But it's rare and should not matter in > practice. > > And by the way this algorithm is designed to work with preemption/migration > enabled as well, not just interrupts. So the race can come from a thread > running concurrently on another CPU and it should work as well. > > [...] > > > > > Going through a transition of changing it could end up being just as > > complex. I'm not sure the complexity in that transition is better than > > the complexity of the current code, as this code has been there for 15 > > years, and I know of at least 2 other projects that depend on this > > format as is. > > I agree with you that it's not clear-cut whether introducing this change > would be a benefit at this stage considering the extra complexity of > extending the ABI while keeping backward compatibility. > > But it's something we can keep in mind if we ever have to do major ABI > extensions for other reasons. Yeah, it's something to think about if we want to use a different format for something else. Thanks for the review. -- Steve
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 7887d61d5b56..74f8895785ca 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3449,9 +3449,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, /* Don't let the compiler play games with cpu_buffer->tail_page */ tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page); - /*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK; + /*A*/ w = local_read(&tail_page->write); barrier(); rb_time_read(&cpu_buffer->before_stamp, &info->before); + /* Read before_stamp only the first time through */ + again: rb_time_read(&cpu_buffer->write_stamp, &info->after); barrier(); info->ts = rb_time_stamp(cpu_buffer->buffer); @@ -3464,7 +3466,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, * absolute timestamp. * Don't bother if this is the start of a new page (w == 0). */ - if (!w) { + if (!(w & RB_WRITE_MASK)) { /* Use the sub-buffer timestamp */ info->delta = 0; } else if (unlikely(info->before != info->after)) { @@ -3481,89 +3483,49 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, /*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts); - /*C*/ write = local_add_return(info->length, &tail_page->write); + /*C*/ if (!local_try_cmpxchg(&tail_page->write, &w, w + info->length)) + goto again; - /* set write to only the index of the write */ - write &= RB_WRITE_MASK; + /* Set write to the start of this event */ + write = w & RB_WRITE_MASK; - tail = write - info->length; + /* set tail to the end of the event */ + tail = write + info->length; /* See if we shot pass the end of this buffer page */ - if (unlikely(write > cpu_buffer->buffer->subbuf_size)) { + if (unlikely(tail > cpu_buffer->buffer->subbuf_size)) { check_buffer(cpu_buffer, info, CHECK_FULL_PAGE); - return rb_move_tail(cpu_buffer, tail, info); + return rb_move_tail(cpu_buffer, write, info); } - if (likely(tail == w)) { - /* Nothing interrupted us between A and C */ - /*D*/ rb_time_set(&cpu_buffer->write_stamp, info->ts); - /* - * If something came in between C and D, the write stamp - * may now not be in sync. But that's fine as the before_stamp - * will be different and then next event will just be forced - * to use an absolute timestamp. - */ - if (likely(!(info->add_timestamp & - (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)))) - /* This did not interrupt any time update */ - info->delta = info->ts - info->after; - else - /* Just use full timestamp for interrupting event */ - info->delta = info->ts; - check_buffer(cpu_buffer, info, tail); - } else { - u64 ts; - /* SLOW PATH - Interrupted between A and C */ - - /* Save the old before_stamp */ - rb_time_read(&cpu_buffer->before_stamp, &info->before); - - /* - * Read a new timestamp and update the before_stamp to make - * the next event after this one force using an absolute - * timestamp. This is in case an interrupt were to come in - * between E and F. - */ - ts = rb_time_stamp(cpu_buffer->buffer); - rb_time_set(&cpu_buffer->before_stamp, ts); - - barrier(); - /*E*/ rb_time_read(&cpu_buffer->write_stamp, &info->after); - barrier(); - /*F*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) && - info->after == info->before && info->after < ts) { - /* - * Nothing came after this event between C and F, it is - * safe to use info->after for the delta as it - * matched info->before and is still valid. - */ - info->delta = ts - info->after; - } else { - /* - * Interrupted between C and F: - * Lost the previous events time stamp. Just set the - * delta to zero, and this will be the same time as - * the event this event interrupted. And the events that - * came after this will still be correct (as they would - * have built their delta on the previous event. - */ - info->delta = 0; - } - info->ts = ts; - info->add_timestamp &= ~RB_ADD_STAMP_FORCE; - } + /* Nothing interrupted us between A and C */ + /*D*/ rb_time_set(&cpu_buffer->write_stamp, info->ts); + /* + * If something came in between C and D, the write stamp + * may now not be in sync. But that's fine as the before_stamp + * will be different and then next event will just be forced + * to use an absolute timestamp. + */ + if (likely(!(info->add_timestamp & + (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)))) + /* This did not interrupt any time update */ + info->delta = info->ts - info->after; + else + /* Just use full timestamp for interrupting event */ + info->delta = info->ts; + check_buffer(cpu_buffer, info, write); /* * If this is the first commit on the page, then it has the same * timestamp as the page itself. */ - if (unlikely(!tail && !(info->add_timestamp & + if (unlikely(!write && !(info->add_timestamp & (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)))) info->delta = 0; /* We reserved something on the buffer */ - event = __rb_page_index(tail_page, tail); + event = __rb_page_index(tail_page, write); rb_update_event(cpu_buffer, event, info); local_inc(&tail_page->entries); @@ -3572,7 +3534,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, * If this is the first commit on the page, then update * its timestamp. */ - if (unlikely(!tail)) + if (unlikely(!write)) tail_page->page->time_stamp = info->ts; /* account for these added bytes */