Message ID: 20210218222125.46565-1-mjeanson@efficios.com (mailing list archive)
Series: Faultable tracepoints (v2)
On Thu, 18 Feb 2021 17:21:19 -0500
Michael Jeanson <mjeanson@efficios.com> wrote:

> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.

I started taking a quick look at this, and came up with the question: how
do you allow preemption when dealing with per-cpu buffers or storage to
record the data?

That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
this is the reason for the need to disable preemption. What's the solution
that LTTng is using for this? I know it has per-cpu buffers too, but does
it have some kind of "per task" buffer that is being used to extract the
data that can fault?

-- Steve
[ Adding Mathieu Desnoyers in CC ]

On 2021-02-23 21 h 16, Steven Rostedt wrote:
> On Thu, 18 Feb 2021 17:21:19 -0500
> Michael Jeanson <mjeanson@efficios.com> wrote:
>
>> This series only implements the tracepoint infrastructure required to
>> allow tracers to handle page faults. Modifying each tracer to handle
>> those page faults would be a next step after we all agree on this piece
>> of instrumentation infrastructure.
>
> I started taking a quick look at this, and came up with the question: how
> do you allow preemption when dealing with per-cpu buffers or storage to
> record the data?
>
> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> this is the reason for the need to disable preemption. What's the solution
> that LTTng is using for this? I know it has per-cpu buffers too, but does
> it have some kind of "per task" buffer that is being used to extract the
> data that can fault?
>
> -- Steve
----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:

> [ Adding Mathieu Desnoyers in CC ]
>
> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>> On Thu, 18 Feb 2021 17:21:19 -0500
>> Michael Jeanson <mjeanson@efficios.com> wrote:
>>
>>> This series only implements the tracepoint infrastructure required to
>>> allow tracers to handle page faults. Modifying each tracer to handle
>>> those page faults would be a next step after we all agree on this piece
>>> of instrumentation infrastructure.
>>
>> I started taking a quick look at this, and came up with the question: how
>> do you allow preemption when dealing with per-cpu buffers or storage to
>> record the data?
>>
>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>> this is the reason for the need to disable preemption. What's the solution
>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>> it have some kind of "per task" buffer that is being used to extract the
>> data that can fault?

As a prototype solution, what I've done currently is to copy the user-space
data into a kmalloc'd buffer in a preparation step before disabling preemption
and copying data over into the per-cpu buffers. It works, but I think we should
be able to do it without the needless copy.

What I have in mind as an efficient solution (not implemented yet) for the LTTng
kernel tracer goes as follows:

#define COMMIT_LOCAL 0
#define COMMIT_REMOTE 1

- faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
- probe code calculates the length which needs to be reserved to store the event
  (e.g. user strlen),

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
- reserve_cpu = smp_processor_id()
- reserve space in the ring buffer for reserve_cpu
  [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
    from any cpu until we commit. ]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

- copy data from user-space to the ring buffer "slot",

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  commit_cpu = smp_processor_id()
  if (commit_cpu == reserve_cpu)
      use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
  else
      use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
accumulators, the trick here is to use two commit counters rather than a single one for each
sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
LOCAL and REMOTE counters.

This allows dealing with migration between reserve and commit without requiring the overhead
of an atomic operation on the fast path (LOCAL case).

I had to design this kind of dual-counter trick in the context of user-space use of restartable
sequences. It looks like it may have a role to play in the kernel as well. :)

Or am I missing something important that would not survive real life?

Thanks,

Mathieu
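A minimal C sketch of the dual-counter layout and the faultable reserve path
described above. buf_reserve() and the struct layout are hypothetical stand-ins,
not the actual LTTng frontend API; the commit step is sketched separately after
Mathieu's indexing correction in a later message.

	/*
	 * Sketch of the dual commit counter layout and the faultable
	 * reserve path. All names here are illustrative assumptions,
	 * not LTTng's real API.
	 */
	#include <asm/local.h>
	#include <linux/atomic.h>
	#include <linux/preempt.h>
	#include <linux/smp.h>
	#include <linux/uaccess.h>

	struct subbuf_commit {
		local_t       local;	/* COMMIT_LOCAL: owner cpu only, cheap op */
		atomic_long_t remote;	/* COMMIT_REMOTE: cross-cpu, atomic op */
	};

	/* Readers always sum both free-running accumulators. */
	static inline long commit_count_read(struct subbuf_commit *cc)
	{
		return local_read(&cc->local) + atomic_long_read(&cc->remote);
	}

	void *buf_reserve(int cpu, long len);	/* hypothetical frontend helper */

	/* Reserve on the current cpu, then fill the exclusive slot with
	 * preemption (and therefore page faults) allowed. */
	static void *reserve_and_fill(const char __user *ustr, long len,
				      int *reserve_cpu)
	{
		void *slot;

		preempt_disable();
		*reserve_cpu = smp_processor_id();
		slot = buf_reserve(*reserve_cpu, len);
		preempt_enable();
		if (!slot)
			return NULL;

		/* The slot is exclusively ours until commit, from any cpu. */
		if (copy_from_user(slot, ustr, len))
			memset(slot, 0, len);	/* never leave the slot uninitialized */
		return slot;
	}

Summing two accumulators on the read side costs one extra load, which is what
keeps the write fast path (the LOCAL case) free of atomic operations.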
On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
> - probe code calculates the length which needs to be reserved to store the event
>   (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
>   [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>     from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]

So basically the commit position here doesn't move until this task is
scheduled back in and the commit (remote or local) is updated.

To put it in terms of the ftrace ring buffer: there we have both a commit
page and a commit index, and they only get moved by the first writer to start
a commit stack (that is, interrupts that interrupted a write will not
increment the commit).

Now, I'm not sure how LTTng does it, but I could see issues for ftrace to
try to move the commit pointer (the pointer to the new commit page), as the
design is currently dependent on the fact that it can't happen while
commits are taking place.

Are the pages of the LTTng buffer indexed by an array of pages?

-- Steve
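A toy model of the nested-commit rule Steve describes: a per-cpu sketch
assuming writers nest strictly (as interrupts do). This is an illustration of
the idea only, not ftrace's actual code.

	/*
	 * Toy model of "only the outermost writer moves the commit":
	 * nested (interrupting) writers reserve space but leave the
	 * commit index alone; the outermost writer commits everything
	 * reserved meanwhile. Per-cpu state, preemption disabled.
	 */
	#include <asm/local.h>

	struct cpu_buf {
		local_t       head;	/* next reservation offset */
		unsigned long commit;	/* data before this offset is committed */
		int           nesting;	/* writer nesting depth on this cpu */
	};

	static unsigned long reserve(struct cpu_buf *b, unsigned long len)
	{
		b->nesting++;		/* balanced by interrupt handlers
					 * before they return */
		return local_add_return(len, &b->head) - len;
	}

	static void commit(struct cpu_buf *b)
	{
		if (--b->nesting == 0)	/* only the outermost writer */
			b->commit = local_read(&b->head);
	}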
----- Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:
>
> > [ Adding Mathieu Desnoyers in CC ]
> >
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <mjeanson@efficios.com> wrote:
> >>
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >>
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >>
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has per-cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
> - probe code calculates the length which needs to be reserved to store the event
>   (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
>   [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>     from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   commit_cpu = smp_processor_id()
>   if (commit_cpu == reserve_cpu)
>       use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>   else
>       use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]

The line above should increment reserve_cpu's buffer commit count, of course.

Thanks,

Mathieu

> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than a single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counters.
>
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
>
> Or am I missing something important that would not survive real life?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
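In code terms, the corrected commit step (continuing the hypothetical naming
from the earlier sketch) indexes the reserving cpu's counters; the committing
cpu only selects between the LOCAL and REMOTE counter.

	/* Commit step with the indexing fix: the counters belong to the
	 * buffer the space was reserved in (reserve_cpu). */
	static void commit_event(struct subbuf_commit *cc /* per-cpu array */,
				 int reserve_cpu, long len)
	{
		preempt_disable();
		if (smp_processor_id() == reserve_cpu)
			/* fast path: still on the reserving cpu */
			local_add(len, &cc[reserve_cpu].local);
		else
			/* migrated between reserve and commit */
			atomic_long_add(len, &cc[reserve_cpu].remote);
		preempt_enable();
	}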
----- On Feb 24, 2021, at 1:14 PM, rostedt rostedt@goodmis.org wrote:

> On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> As a prototype solution, what I've done currently is to copy the user-space
>> data into a kmalloc'd buffer in a preparation step before disabling preemption
>> and copying data over into the per-cpu buffers. It works, but I think we should
>> be able to do it without the needless copy.
>>
>> What I have in mind as an efficient solution (not implemented yet) for the LTTng
>> kernel tracer goes as follows:
>>
>> #define COMMIT_LOCAL 0
>> #define COMMIT_REMOTE 1
>>
>> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
>> - probe code calculates the length which needs to be reserved to store the event
>>   (e.g. user strlen),
>>
>> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>> - reserve_cpu = smp_processor_id()
>> - reserve space in the ring buffer for reserve_cpu
>>   [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>>     from any cpu until we commit. ]
>> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> So basically the commit position here doesn't move until this task is
> scheduled back in and the commit (remote or local) is updated.

Indeed.

> To put it in terms of the ftrace ring buffer: there we have both a commit
> page and a commit index, and they only get moved by the first writer to start
> a commit stack (that is, interrupts that interrupted a write will not
> increment the commit).

The tricky part for ftrace is its reliance on the fact that the concurrent
users of the per-cpu ring buffer are all nested contexts. LTTng does not
assume that, and has been designed to be used both in kernel and user-space:
lttng-modules and lttng-ust share a lot of ring buffer code. Therefore,
LTTng's ring buffer supports preemption/migration of concurrent contexts.
The fact that LTTng uses local-atomic-ops on its kernel ring buffers is just
an optimization on an overall ring buffer design meant to allow preemption.

> Now, I'm not sure how LTTng does it, but I could see issues for ftrace to
> try to move the commit pointer (the pointer to the new commit page), as the
> design is currently dependent on the fact that it can't happen while
> commits are taking place.

Indeed, what makes this easy for LTTng is that the ring buffer has been
designed to support preemption/migration from the ground up.

> Are the pages of the LTTng buffer indexed by an array of pages?

Yes, they are. Handling the initial page allocation and then the tracer copy
of data to/from the ring buffer pages is the responsibility of the LTTng lib
ring buffer "backend". The LTTng lib ring buffer backend is somewhat similar
to a page table done in software, where the top level of the page table can
be dynamically updated when doing flight recorder tracing. It is however
completely separate from the space reservation/commit scheme, which is
handled by the lib ring buffer "frontend". The algorithm I described in my
prior email is specifically targeted at the frontend layer, leaving the
"backend" unchanged.

For some reason, I suspect the ftrace ring buffer combined those two layers
into a single algorithm, which may have its advantages, but seems to
strengthen its dependency on only having nested contexts sharing a given
per-cpu ring buffer.

Thanks,

Mathieu
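As a rough illustration of the backend's software-page-table idea, here is a
sketch of mapping a linear sub-buffer offset onto backing pages. The structure
and helper names are assumptions for the sketch, not LTTng's actual layout.

	/*
	 * The backend translates a linear offset within a sub-buffer
	 * into a page plus an offset, like a software page table; the
	 * frontend only ever deals in offsets. Illustrative names only.
	 */
	#include <linux/mm.h>

	struct subbuf_backend {
		struct page **pages;	/* array of backing pages */
		size_t nr_pages;
		/* top-level entries can be swapped dynamically for
		 * flight recorder tracing */
	};

	static void *backend_offset_to_addr(struct subbuf_backend *b,
					    size_t offset)
	{
		struct page *p = b->pages[offset >> PAGE_SHIFT];

		return page_address(p) + (offset & ~PAGE_MASK);
	}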
On Thu, Feb 25, 2021 at 9:15 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:
>
> > [ Adding Mathieu Desnoyers in CC ]
> >
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <mjeanson@efficios.com> wrote:
> >>
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >>
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >>
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has per-cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]

label: restart:

> - probe code calculates the length which needs to be reserved to store the event
>   (e.g. user strlen),

Does "user strlen" make the content fault in? Is it possible to make the
sleepable faulting only happen here, between "restart" and the following
"preempt disable"? The code here should do a prefetch operation like
"user strlen". And we can keep preemption disabled when copying the data.
If there is a fault while copying, then we can restart from the label
"restart".

Very immature thought.

Thanks
Lai

> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
>   [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>     from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   commit_cpu = smp_processor_id()
>   if (commit_cpu == reserve_cpu)
>       use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>   else
>       use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than a single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counters.
>
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
>
> Or am I missing something important that would not survive real life?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
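A sketch of the prefault-and-restart idea from Lai's message above.
fault_in_pages_readable(), __copy_from_user_inatomic() and pagefault_disable()
are real kernel primitives of that era; the pre-reserved slot and the
unbounded retry loop are simplifications for illustration.

	/*
	 * Sketch of the restart scheme: fault the user pages in while
	 * sleeping is still allowed, then copy with preemption and page
	 * faults disabled; if the non-faulting copy fails, prefault
	 * again and retry. A real implementation would bound the retries.
	 */
	#include <linux/pagemap.h>
	#include <linux/preempt.h>
	#include <linux/uaccess.h>

	static int record_user_payload(void *slot, const char __user *ustr,
				       unsigned long len)
	{
		unsigned long left;

	restart:
		/* Sleepable prefetch: make the user pages resident. */
		if (fault_in_pages_readable(ustr, len))
			return -EFAULT;	/* genuinely bad user pointer */

		preempt_disable();
		pagefault_disable();
		/* Non-faulting copy: fails instead of sleeping if a page
		 * was reclaimed between the prefetch and the copy. */
		left = __copy_from_user_inatomic(slot, ustr, len);
		pagefault_enable();
		preempt_enable();

		if (left)
			goto restart;	/* page went away under us; retry */
		return 0;
	}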