Message ID | 20210421120413.3110775-1-daniel.kiss@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | coresight: Add ETR-PERF polling. | expand |
Hi Daniel, On Wed, Apr 21, 2021 at 02:04:09PM +0200, Daniel Kiss wrote: > This series adds a feature to ETR-PERF that sync the ETR buffer to perf > periodically. This is really handy when the system wide trace is used > because in this case the perf won't sync during the trace. In a per-thread > setup the traced program might not go to the kernel frequvently enought > to collect trace. Polling helps in both usecases. Can be used with strobing. > Tuning polling period is challanging, I'm working on an additional patch > that adds some metrics to help tune the polling period. > Suzuki and Leo have already commented on a number of problems with this set and as such I will concentrate on the general idea. Over the years we have thought long and hard about fixing the overflow issues created by the lack of interrupt when a sink gets full, installing a timer to empty the sink buffer at regular intervals is one of them. Ultimately we haven't moved forward with the idea because it requires to stop the sink when an event is active, something that introduces more trace data loss. To me this kind of interval snapshot should be achieved using Mike's new strobing feature that came bundled with the complex configuration framework, available on next-ETE-TRBE[1]. I will rebase that branch to 5.13-rc1 when it is released in a couple of weeks from now. Thanks, Mathieu PS: Always run your work through checkpatch.pl before sending a patchset for review. [1]. https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/log/?h=next-ETE-TRBE > Daniel Kiss (4): > coresight: tmc-etr: Advance buffer pointer in sync buffer. > coresight: tmc-etr: Track perf handler. > coresight: etm-perf: Export etm_event_cpu_path. > coresight: Add ETR-PERF polling. > > .../testing/sysfs-bus-coresight-devices-tmc | 8 + > drivers/hwtracing/coresight/Makefile | 2 +- > .../hwtracing/coresight/coresight-etm-perf.c | 10 +- > .../hwtracing/coresight/coresight-etm-perf.h | 1 + > .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ > .../coresight/coresight-etr-perf-polling.h | 42 +++ > .../hwtracing/coresight/coresight-tmc-core.c | 2 + > .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- > drivers/hwtracing/coresight/coresight-tmc.h | 2 + > 9 files changed, 401 insertions(+), 4 deletions(-) > create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c > create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.h > > -- > 2.25.1 >
> Hi Daniel, > > On Wed, Apr 21, 2021 at 02:04:09PM +0200, Daniel Kiss wrote: > > This series adds a feature to ETR-PERF that sync the ETR buffer to > > perf periodically. This is really handy when the system wide trace is > > used because in this case the perf won't sync during the trace. In a > > per-thread setup the traced program might not go to the kernel > > frequvently enought to collect trace. Polling helps in both usecases. Can be > used with strobing. > > Tuning polling period is challanging, I'm working on an additional > > patch that adds some metrics to help tune the polling period. > > > > Suzuki and Leo have already commented on a number of problems with this set > and as such I will concentrate on the general idea. > > Over the years we have thought long and hard about fixing the overflow issues > created by the lack of interrupt when a sink gets full, installing a timer to empty > the sink buffer at regular intervals is one of them. Ultimately we haven't moved > forward with the idea because it requires to stop the sink when an event is > active, something that introduces more trace data loss. > > To me this kind of interval snapshot should be achieved using Mike's new > strobing feature that came bundled with the complex configuration framework, > available on next-ETE-TRBE[1]. I will rebase that branch to 5.13-rc1 when it is > released in a couple of weeks from now. It's important to understand what strobing is. It acts internally to the ETM and switches the ETM on for a time and then off for a time. It is as the name suggests, like a stroboscope (or a lighthouse). There is no synchronization between the on-periods of different ETMs. When you have multiple ETMs funnelling into a common ETR, strobing does not guarantee you a window where you can safely harvest the buffer. It achieves a reduction in the overall bandwidth of trace being dumped into the buffer, and there may be times when no trace is being written at all because all the ETMs are in their off-period. At worst, it may create a false sense of security - tests that consistently fail without strobing, may pass often enough with strobing to create the impression that strobing has solved the problem. But these tests are also likely to fail eventually with strobing. To fix this problem without disabling either ETR or ETMs you would have to guarantee that you can harvest the ETR buffer in less time than it takes to fill it. That would need very careful quntitative arguments to be made about: - the rate of trace generation by each ETM (as modified by strobing) - the number of ETMs writing into the buffer - the time available to the kernel to harvest the buffer So if there are 10 ETMs generating trace at average 1Gb/s into a 1Mb buffer, the buffer will fill in 100us, and that gives the kernel 100us to harvest the buffer before its read pointer is caught up by the ETR's advancing write pointer. If strobing is used to reduce average ETM rate to 100Mb/s the kernel has 1ms to read the buffer, and so on. In short the kernel must *guarantee* a minimum readout rate equal to the maximum aggregate write rate of the ETMs. But can the kernel guarantee any minimum readout rate at all? The alternative would be double-buffering the ETR, which we've also discussed - so while the kernel is harvesting the contents of one buffer, the ETR is writing (and possibly wrapping) the other. Some trace will still be lost but it does mean the kernel will be harvesting monotonically increasing sequences of trace, and won't be seeing artefacts from its reads colliding with the ETR's writes. Al > > Thanks, > Mathieu > > PS: Always run your work through checkpatch.pl before sending a patchset for > review. > > [1]. > https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/log/?h=next- > ETE-TRBE > > > Daniel Kiss (4): > > coresight: tmc-etr: Advance buffer pointer in sync buffer. > > coresight: tmc-etr: Track perf handler. > > coresight: etm-perf: Export etm_event_cpu_path. > > coresight: Add ETR-PERF polling. > > > > .../testing/sysfs-bus-coresight-devices-tmc | 8 + > > drivers/hwtracing/coresight/Makefile | 2 +- > > .../hwtracing/coresight/coresight-etm-perf.c | 10 +- > > .../hwtracing/coresight/coresight-etm-perf.h | 1 + > > .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ > > .../coresight/coresight-etr-perf-polling.h | 42 +++ > > .../hwtracing/coresight/coresight-tmc-core.c | 2 + > > .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- > > drivers/hwtracing/coresight/coresight-tmc.h | 2 + > > 9 files changed, 401 insertions(+), 4 deletions(-) create mode > > 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c > > create mode 100644 > > drivers/hwtracing/coresight/coresight-etr-perf-polling.h > > > > -- > > 2.25.1 > > > _______________________________________________ > CoreSight mailing list > CoreSight@lists.linaro.org > https://lists.linaro.org/mailman/listinfo/coresight
Hi Mathieu, I thought I'd add a little backgound to what has been said so far... On Tue, 27 Apr 2021 at 11:43, Al Grant <Al.Grant@arm.com> wrote: > > > Hi Daniel, > > > > On Wed, Apr 21, 2021 at 02:04:09PM +0200, Daniel Kiss wrote: > > > This series adds a feature to ETR-PERF that sync the ETR buffer to > > > perf periodically. This is really handy when the system wide trace is > > > used because in this case the perf won't sync during the trace. In a > > > per-thread setup the traced program might not go to the kernel > > > frequvently enought to collect trace. Polling helps in both usecases. Can be > > used with strobing. > > > Tuning polling period is challanging, I'm working on an additional > > > patch that adds some metrics to help tune the polling period. > > > > > > > Suzuki and Leo have already commented on a number of problems with this set > > and as such I will concentrate on the general idea. > > > > Over the years we have thought long and hard about fixing the overflow issues > > created by the lack of interrupt when a sink gets full, installing a timer to empty > > the sink buffer at regular intervals is one of them. Ultimately we haven't moved > > forward with the idea because it requires to stop the sink when an event is > > active, something that introduces more trace data loss. > > > > To me this kind of interval snapshot should be achieved using Mike's new > > strobing feature that came bundled with the complex configuration framework, > > available on next-ETE-TRBE[1]. I will rebase that branch to 5.13-rc1 when it is > > released in a couple of weeks from now. > > It's important to understand what strobing is. It acts internally to the ETM > and switches the ETM on for a time and then off for a time. It is as the > name suggests, like a stroboscope (or a lighthouse). > > There is no synchronization between the on-periods of different ETMs. > When you have multiple ETMs funnelling into a common ETR, strobing > does not guarantee you a window where you can safely harvest the buffer. > It achieves a reduction in the overall bandwidth of trace being dumped > into the buffer, and there may be times when no trace is being written > at all because all the ETMs are in their off-period. > > At worst, it may create a false sense of security - tests that consistently > fail without strobing, may pass often enough with strobing to create the > impression that strobing has solved the problem. But these tests are also > likely to fail eventually with strobing. To fix this problem without > disabling either ETR or ETMs you would have to guarantee that you can > harvest the ETR buffer in less time than it takes to fill it. That would need > very careful quntitative arguments to be made about: > > - the rate of trace generation by each ETM (as modified by strobing) > > - the number of ETMs writing into the buffer > > - the time available to the kernel to harvest the buffer > > So if there are 10 ETMs generating trace at average 1Gb/s into a 1Mb > buffer, the buffer will fill in 100us, and that gives the kernel 100us to > harvest the buffer before its read pointer is caught up by the ETR's > advancing write pointer. If strobing is used to reduce average ETM rate > to 100Mb/s the kernel has 1ms to read the buffer, and so on. In short > the kernel must *guarantee* a minimum readout rate equal to the > maximum aggregate write rate of the ETMs. But can the kernel > guarantee any minimum readout rate at all? > > The alternative would be double-buffering the ETR, which we've > also discussed - so while the kernel is harvesting the contents of one > buffer, the ETR is writing (and possibly wrapping) the other. > Some trace will still be lost but it does mean the kernel will be > harvesting monotonically increasing sequences of trace, and won't be > seeing artefacts from its reads colliding with the ETR's writes. > > Al > As Al mentions, ETR polling is designed to solve a different issue than ETM strobing. These two techniques can be used together or separately. It was noticed by users that the amount of trace captured during a given trace run would vary greatly even when tracing the same application for the same length of time. This was also found to be sensitive to process scheduling - frequent re-scheds did seem to result in more frequent ETR updates and more trace data collected. If perf does not wake up during a trace run then the ETR may wrap mulitple times and all the data will be a single buffer biased towards the end of the trace session. ETR polling is designed to ensure that more trace data is collected consistently across the whole of the trace session. There are issues of course, with stopping collection without stopping the sources. - shared to some extent by the ETE / TRBE combination. This can result in incomplete packets and other trace discontinuities. For this reason it is necessary to ensure that the decoder is restarted for each block of trace captured - which is where the patch set from James that does this using AUX records in perf to correctly split the AUXTRACE records into valid blocks is needed. In summary:- 1) ETM strobing samples trace to allow greater coverage of the program being traced for a given buffer. This is useful when building statistical profiles such as for AutoFDO 2) ETR polling ensures that more trace is collected across the entire trace session - seeking to reduce inconsistent capture volumes. 3) Use AUX records to split the AUXTRACE buffer into valid capture blocks and reset the decoder at the start of these blocks. This is essential for ETE+TRBE, the ETR polling, and systems where we are seeing hardware errata around the flush process causing similar spurious packets. (an alternative for the ETR polling / flush errata might be to insert barrier packets to force a decoder reset for every ETR block copied to the perf buffer - but this does not work for ETE/TRBE that uses no CoreSight formatted framing). Regards Mike > > > > > Thanks, > > Mathieu > > > > PS: Always run your work through checkpatch.pl before sending a patchset for > > review. > > > > [1]. > > https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/log/?h=next- > > ETE-TRBE > > > > > Daniel Kiss (4): > > > coresight: tmc-etr: Advance buffer pointer in sync buffer. > > > coresight: tmc-etr: Track perf handler. > > > coresight: etm-perf: Export etm_event_cpu_path. > > > coresight: Add ETR-PERF polling. > > > > > > .../testing/sysfs-bus-coresight-devices-tmc | 8 + > > > drivers/hwtracing/coresight/Makefile | 2 +- > > > .../hwtracing/coresight/coresight-etm-perf.c | 10 +- > > > .../hwtracing/coresight/coresight-etm-perf.h | 1 + > > > .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ > > > .../coresight/coresight-etr-perf-polling.h | 42 +++ > > > .../hwtracing/coresight/coresight-tmc-core.c | 2 + > > > .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- > > > drivers/hwtracing/coresight/coresight-tmc.h | 2 + > > > 9 files changed, 401 insertions(+), 4 deletions(-) create mode > > > 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c > > > create mode 100644 > > > drivers/hwtracing/coresight/coresight-etr-perf-polling.h > > > > > > -- > > > 2.25.1 > > > > > _______________________________________________ > > CoreSight mailing list > > CoreSight@lists.linaro.org > > https://lists.linaro.org/mailman/listinfo/coresight -- Mike Leach Principal Engineer, ARM Ltd. Manchester Design Centre. UK
Good day Mike, On Tue, Apr 27, 2021 at 03:41:01PM +0100, Mike Leach wrote: > Hi Mathieu, > > I thought I'd add a little backgound to what has been said so far... > > On Tue, 27 Apr 2021 at 11:43, Al Grant <Al.Grant@arm.com> wrote: > > > > > Hi Daniel, > > > > > > On Wed, Apr 21, 2021 at 02:04:09PM +0200, Daniel Kiss wrote: > > > > This series adds a feature to ETR-PERF that sync the ETR buffer to > > > > perf periodically. This is really handy when the system wide trace is > > > > used because in this case the perf won't sync during the trace. In a > > > > per-thread setup the traced program might not go to the kernel > > > > frequvently enought to collect trace. Polling helps in both usecases. Can be > > > used with strobing. > > > > Tuning polling period is challanging, I'm working on an additional > > > > patch that adds some metrics to help tune the polling period. > > > > > > > > > > Suzuki and Leo have already commented on a number of problems with this set > > > and as such I will concentrate on the general idea. > > > > > > Over the years we have thought long and hard about fixing the overflow issues > > > created by the lack of interrupt when a sink gets full, installing a timer to empty > > > the sink buffer at regular intervals is one of them. Ultimately we haven't moved > > > forward with the idea because it requires to stop the sink when an event is > > > active, something that introduces more trace data loss. > > > > > > To me this kind of interval snapshot should be achieved using Mike's new > > > strobing feature that came bundled with the complex configuration framework, > > > available on next-ETE-TRBE[1]. I will rebase that branch to 5.13-rc1 when it is > > > released in a couple of weeks from now. > > > > It's important to understand what strobing is. It acts internally to the ETM > > and switches the ETM on for a time and then off for a time. It is as the > > name suggests, like a stroboscope (or a lighthouse). > > > > There is no synchronization between the on-periods of different ETMs. > > When you have multiple ETMs funnelling into a common ETR, strobing > > does not guarantee you a window where you can safely harvest the buffer. > > It achieves a reduction in the overall bandwidth of trace being dumped > > into the buffer, and there may be times when no trace is being written > > at all because all the ETMs are in their off-period. > > > > At worst, it may create a false sense of security - tests that consistently > > fail without strobing, may pass often enough with strobing to create the > > impression that strobing has solved the problem. But these tests are also > > likely to fail eventually with strobing. To fix this problem without > > disabling either ETR or ETMs you would have to guarantee that you can > > harvest the ETR buffer in less time than it takes to fill it. That would need > > very careful quntitative arguments to be made about: > > > > - the rate of trace generation by each ETM (as modified by strobing) > > > > - the number of ETMs writing into the buffer > > > > - the time available to the kernel to harvest the buffer > > > > So if there are 10 ETMs generating trace at average 1Gb/s into a 1Mb > > buffer, the buffer will fill in 100us, and that gives the kernel 100us to > > harvest the buffer before its read pointer is caught up by the ETR's > > advancing write pointer. If strobing is used to reduce average ETM rate > > to 100Mb/s the kernel has 1ms to read the buffer, and so on. In short > > the kernel must *guarantee* a minimum readout rate equal to the > > maximum aggregate write rate of the ETMs. But can the kernel > > guarantee any minimum readout rate at all? > > > > The alternative would be double-buffering the ETR, which we've > > also discussed - so while the kernel is harvesting the contents of one > > buffer, the ETR is writing (and possibly wrapping) the other. > > Some trace will still be lost but it does mean the kernel will be > > harvesting monotonically increasing sequences of trace, and won't be > > seeing artefacts from its reads colliding with the ETR's writes. > > > > Al > > > > As Al mentions, ETR polling is designed to solve a different issue > than ETM strobing. These two techniques can be used together or > separately. > > It was noticed by users that the amount of trace captured during a > given trace run would vary greatly even when tracing the same > application for the same length of time. Indeed, that problem is well known. > This was also found to be sensitive to process scheduling - frequent > re-scheds did seem to result in more frequent ETR updates and more > trace data collected. If perf does not wake up during a trace run then > the ETR may wrap mulitple times and all the data will be a single > buffer biased towards the end of the trace session. > Right. > ETR polling is designed to ensure that more trace data is collected > consistently across the whole of the trace session. There are issues > of course, with stopping collection without stopping the sources. - > shared to some extent by the ETE / TRBE combination. > This can result in incomplete packets and other trace discontinuities. > For this reason it is necessary to ensure that the decoder is > restarted for each block of trace captured - which is where the patch > set from James that does this using AUX records in perf to correctly > split the AUXTRACE records into valid blocks is needed. I am still waiting for a new revision from James. > > In summary:- > 1) ETM strobing samples trace to allow greater coverage of the program > being traced for a given buffer. This is useful when building > statistical profiles such as for AutoFDO > 2) ETR polling ensures that more trace is collected across the entire > trace session - seeking to reduce inconsistent capture volumes. I am not convinced disabling a sink to collect traces while an event is active is the right way to go. To me it will add (more) complexity to the coresight subsystem for very little gains, if any. If I remember correctly Leo brought forward the exact same idea about a year ago and after discussion, we all agreed the benefit would not be important enough to offset the drawbacks. As usual I am open to discussion and my opinion is not set in stone. But as I mentioned I worry the feature will increase complexity in the driver and produce dubious results. And we also have to factor in usability which, as Al pointed, out will be a problem. > 3) Use AUX records to split the AUXTRACE buffer into valid capture > blocks and reset the decoder at the start of these blocks. This is > essential for ETE+TRBE, the ETR polling, and systems where we are > seeing hardware errata around the flush process causing similar > spurious packets. (an alternative for the ETR polling / flush errata > might be to insert barrier packets to force a decoder reset for every > ETR block copied to the perf buffer - but this does not work for > ETE/TRBE that uses no CoreSight formatted framing). > > Regards > > Mike > > > > > > > > > > Thanks, > > > Mathieu > > > > > > PS: Always run your work through checkpatch.pl before sending a patchset for > > > review. > > > > > > [1]. > > > https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/log/?h=next- > > > ETE-TRBE > > > > > > > Daniel Kiss (4): > > > > coresight: tmc-etr: Advance buffer pointer in sync buffer. > > > > coresight: tmc-etr: Track perf handler. > > > > coresight: etm-perf: Export etm_event_cpu_path. > > > > coresight: Add ETR-PERF polling. > > > > > > > > .../testing/sysfs-bus-coresight-devices-tmc | 8 + > > > > drivers/hwtracing/coresight/Makefile | 2 +- > > > > .../hwtracing/coresight/coresight-etm-perf.c | 10 +- > > > > .../hwtracing/coresight/coresight-etm-perf.h | 1 + > > > > .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ > > > > .../coresight/coresight-etr-perf-polling.h | 42 +++ > > > > .../hwtracing/coresight/coresight-tmc-core.c | 2 + > > > > .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- > > > > drivers/hwtracing/coresight/coresight-tmc.h | 2 + > > > > 9 files changed, 401 insertions(+), 4 deletions(-) create mode > > > > 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c > > > > create mode 100644 > > > > drivers/hwtracing/coresight/coresight-etr-perf-polling.h > > > > > > > > -- > > > > 2.25.1 > > > > > > > _______________________________________________ > > > CoreSight mailing list > > > CoreSight@lists.linaro.org > > > https://lists.linaro.org/mailman/listinfo/coresight > > > > -- > Mike Leach > Principal Engineer, ARM Ltd. > Manchester Design Centre. UK
On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote: [...] > > 2) ETR polling ensures that more trace is collected across the entire > > trace session - seeking to reduce inconsistent capture volumes. > > I am not convinced disabling a sink to collect traces while an > event is active is the right way to go. To me it will add (more) complexity to > the coresight subsystem for very little gains, if any. > > If I remember correctly Leo brought forward the exact same idea about a year ago > and after discussion, we all agreed the benefit would not be important enough to > offset the drawbacks. > > As usual I am open to discussion and my opinion is not set in stone. But as I > mentioned I worry the feature will increase complexity in the driver and > produce dubious results. And we also have to factor in usability which, as > Al pointed, out will be a problem. Just want to remind one thing for ETR polling. From one perspective, the ETR polling mode is actually very similar with perf's snapshot mode. E.g. we can use specific interval to send USR2 singal to perf tool to captcure CoreSight trace data, thus it also can record the trace data continuously. I can see a benefit from ETR polling mode is it might introduce less overhead than perf snapshot mode. The kernel's mechanism (workqueue or kernel thread) will be much efficiency than perf's signal handling + SMP call with IPIs. So it's good to firstly understand if perf snapshot mode can meet the requirement or not. Thanks, Leo
On 21/04/2021 15:04, Daniel Kiss wrote: > This series adds a feature to ETR-PERF that sync the ETR buffer to perf > periodically. This is really handy when the system wide trace is used > because in this case the perf won't sync during the trace. In a per-thread > setup the traced program might not go to the kernel frequvently enought > to collect trace. Polling helps in both usecases. Can be used with strobing. > Tuning polling period is challanging, I'm working on an additional patch > that adds some metrics to help tune the polling period. Hi Daniel, Is the expectation that leaving the polling period at 0 should have no affect on the amount of data collected vs not using this patch? I've been running the octane 2 benchmark on Chrome on Juno and get these results: No patch: I consistently get 130MB of trace across dozens of runs. Patch, polling = 0: Run 1 - 6.1MB Run 2 - 4.7MB Patch, polling = 100: 1600MB Thanks James > > Daniel Kiss (4): > coresight: tmc-etr: Advance buffer pointer in sync buffer. > coresight: tmc-etr: Track perf handler. > coresight: etm-perf: Export etm_event_cpu_path. > coresight: Add ETR-PERF polling. > > .../testing/sysfs-bus-coresight-devices-tmc | 8 + > drivers/hwtracing/coresight/Makefile | 2 +- > .../hwtracing/coresight/coresight-etm-perf.c | 10 +- > .../hwtracing/coresight/coresight-etm-perf.h | 1 + > .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ > .../coresight/coresight-etr-perf-polling.h | 42 +++ > .../hwtracing/coresight/coresight-tmc-core.c | 2 + > .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- > drivers/hwtracing/coresight/coresight-tmc.h | 2 + > 9 files changed, 401 insertions(+), 4 deletions(-) > create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c > create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.h >
On 27/04/2021 19:24, James Clark wrote: > > > On 21/04/2021 15:04, Daniel Kiss wrote: >> This series adds a feature to ETR-PERF that sync the ETR buffer to perf >> periodically. This is really handy when the system wide trace is used >> because in this case the perf won't sync during the trace. In a per-thread >> setup the traced program might not go to the kernel frequvently enought >> to collect trace. Polling helps in both usecases. Can be used with strobing. >> Tuning polling period is challanging, I'm working on an additional patch >> that adds some metrics to help tune the polling period. > > Hi Daniel, > > Is the expectation that leaving the polling period at 0 should have no affect > on the amount of data collected vs not using this patch? > > I've been running the octane 2 benchmark on Chrome on Juno and get these results: > > No patch: I consistently get 130MB of trace across dozens of runs. > Patch, polling = 0: Run 1 - 6.1MB > Run 2 - 4.7MB Sorry, these small files are because there is no AUXTRACE event at all, only built in events. It's not small because less trace was collected. > Patch, polling = 100: 1600MB > > > Thanks > James > >> >> Daniel Kiss (4): >> coresight: tmc-etr: Advance buffer pointer in sync buffer. >> coresight: tmc-etr: Track perf handler. >> coresight: etm-perf: Export etm_event_cpu_path. >> coresight: Add ETR-PERF polling. >> >> .../testing/sysfs-bus-coresight-devices-tmc | 8 + >> drivers/hwtracing/coresight/Makefile | 2 +- >> .../hwtracing/coresight/coresight-etm-perf.c | 10 +- >> .../hwtracing/coresight/coresight-etm-perf.h | 1 + >> .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ >> .../coresight/coresight-etr-perf-polling.h | 42 +++ >> .../hwtracing/coresight/coresight-tmc-core.c | 2 + >> .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- >> drivers/hwtracing/coresight/coresight-tmc.h | 2 + >> 9 files changed, 401 insertions(+), 4 deletions(-) >> create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c >> create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.h >>
> On 27 Apr 2021, at 18:24, James Clark <James.Clark@arm.com> wrote: > > > > On 21/04/2021 15:04, Daniel Kiss wrote: >> This series adds a feature to ETR-PERF that sync the ETR buffer to perf >> periodically. This is really handy when the system wide trace is used >> because in this case the perf won't sync during the trace. In a per-thread >> setup the traced program might not go to the kernel frequvently enought >> to collect trace. Polling helps in both usecases. Can be used with strobing. >> Tuning polling period is challanging, I'm working on an additional patch >> that adds some metrics to help tune the polling period. > > Hi Daniel, > > Is the expectation that leaving the polling period at 0 should have no affect > on the amount of data collected vs not using this patch? > > I've been running the octane 2 benchmark on Chrome on Juno and get these results: > > No patch: I consistently get 130MB of trace across dozens of runs. > Patch, polling = 0: Run 1 - 6.1MB > Run 2 - 4.7MB > Patch, polling = 100: 1600MB Polling = 0 should not change the behaviour. We noticed this too, which is a bug. I’m going to drop the "coresight: tmc-etr: Track perf handler.” patch, because it causes it. Seems the csdev->refcnt and drvdata->perf_handle are out of sync Thanks, Daniel > Thanks > James > >> >> Daniel Kiss (4): >> coresight: tmc-etr: Advance buffer pointer in sync buffer. >> coresight: tmc-etr: Track perf handler. >> coresight: etm-perf: Export etm_event_cpu_path. >> coresight: Add ETR-PERF polling. >> >> .../testing/sysfs-bus-coresight-devices-tmc | 8 + >> drivers/hwtracing/coresight/Makefile | 2 +- >> .../hwtracing/coresight/coresight-etm-perf.c | 10 +- >> .../hwtracing/coresight/coresight-etm-perf.h | 1 + >> .../coresight/coresight-etr-perf-polling.c | 316 ++++++++++++++++++ >> .../coresight/coresight-etr-perf-polling.h | 42 +++ >> .../hwtracing/coresight/coresight-tmc-core.c | 2 + >> .../hwtracing/coresight/coresight-tmc-etr.c | 22 +- >> drivers/hwtracing/coresight/coresight-tmc.h | 2 + >> 9 files changed, 401 insertions(+), 4 deletions(-) >> create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.c >> create mode 100644 drivers/hwtracing/coresight/coresight-etr-perf-polling.h >>
On Tue, Apr 27, 2021 at 9:04 AM Leo Yan <leo.yan@linaro.org> wrote: > > On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote: > > [...] > > > > 2) ETR polling ensures that more trace is collected across the entire > > > trace session - seeking to reduce inconsistent capture volumes. > > > > I am not convinced disabling a sink to collect traces while an > > event is active is the right way to go. To me it will add (more) complexity to > > the coresight subsystem for very little gains, if any. > > > > If I remember correctly Leo brought forward the exact same idea about a year ago > > and after discussion, we all agreed the benefit would not be important enough to > > offset the drawbacks. > > > > As usual I am open to discussion and my opinion is not set in stone. But as I > > mentioned I worry the feature will increase complexity in the driver and > > produce dubious results. And we also have to factor in usability which, as > > Al pointed, out will be a problem. > > Just want to remind one thing for ETR polling. From one perspective, > the ETR polling mode is actually very similar with perf's snapshot > mode. E.g. we can use specific interval to send USR2 singal to perf > tool to captcure CoreSight trace data, thus it also can record the > trace data continuously. > > I can see a benefit from ETR polling mode is it might introduce less > overhead than perf snapshot mode. The kernel's mechanism (workqueue > or kernel thread) will be much efficiency than perf's signal handling > + SMP call with IPIs. > > So it's good to firstly understand if perf snapshot mode can meet the > requirement or not. We evaluated the patch on Chrome OS and I can confirm that the quality of AutoFDO profiles greatly improved with the ETR polling. Tested with per-thread and system-wide mode. Without ETR polling the size of the collected ETM data was very inconsistent on the same workload and could vary by a factor of two. This, in turn, affects the quality of the AutoFDO profiles generated from ETM. With ETR polling the data size became pretty stable. Performance evaluation shows a similar consistency in performance gain of AutoFDO optimization. This, I think, supports the idea that data collection right now is sensitive to the process scheduling and can be improved with ETR polling. For the system-wide mode particularly we didn't see any other alternatives to collect data periodically on a long-running workload. We haven't tested snapshot mode though. The idea sounds interesting. But small runtime overhead is crucial for the sampling profiler in the field and if there is a noticeable difference we would incline towards the ETR polling. Thanks, Denis > > Thanks, > Leo
On Tue, May 04, 2021 at 11:46:20PM -0700, Denis Nikitin wrote: > On Tue, Apr 27, 2021 at 9:04 AM Leo Yan <leo.yan@linaro.org> wrote: > > > > On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote: > > > > [...] > > > > > > 2) ETR polling ensures that more trace is collected across the entire > > > > trace session - seeking to reduce inconsistent capture volumes. > > > > > > I am not convinced disabling a sink to collect traces while an > > > event is active is the right way to go. To me it will add (more) complexity to > > > the coresight subsystem for very little gains, if any. > > > > > > If I remember correctly Leo brought forward the exact same idea about a year ago > > > and after discussion, we all agreed the benefit would not be important enough to > > > offset the drawbacks. > > > > > > As usual I am open to discussion and my opinion is not set in stone. But as I > > > mentioned I worry the feature will increase complexity in the driver and > > > produce dubious results. And we also have to factor in usability which, as > > > Al pointed, out will be a problem. > > > > Just want to remind one thing for ETR polling. From one perspective, > > the ETR polling mode is actually very similar with perf's snapshot > > mode. E.g. we can use specific interval to send USR2 singal to perf > > tool to captcure CoreSight trace data, thus it also can record the > > trace data continuously. > > > > I can see a benefit from ETR polling mode is it might introduce less > > overhead than perf snapshot mode. The kernel's mechanism (workqueue > > or kernel thread) will be much efficiency than perf's signal handling > > + SMP call with IPIs. > > > > So it's good to firstly understand if perf snapshot mode can meet the > > requirement or not. > > We evaluated the patch on Chrome OS and I can confirm that the quality > of AutoFDO profiles greatly improved with the ETR polling. > Tested with per-thread and system-wide mode. > > Without ETR polling the size of the collected ETM data was very > inconsistent on the same workload and could vary by a factor of two. > This, in turn, affects the quality of the AutoFDO profiles generated from ETM. > With ETR polling the data size became pretty stable. > Performance evaluation shows a similar consistency in performance gain > of AutoFDO optimization. > This, I think, supports the idea that data collection right now is sensitive > to the process scheduling and can be improved with ETR polling. > > For the system-wide mode particularly we didn't see any other alternatives > to collect data periodically on a long-running workload. > We haven't tested snapshot mode though. The idea sounds interesting. > But small runtime overhead is crucial for the sampling profiler in the field > and if there is a noticeable difference we would incline towards the > ETR polling. Please see if Leo's approach[1], or any kind of extension to the current snapshot feature, would be a viable solution. Reusing or extending code that is already there is always a better option. Thanks, Mathieu [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html > > Thanks, > Denis > > > > > Thanks, > > Leo
On Wed, May 5, 2021 at 8:29 AM Mathieu Poirier <mathieu.poirier@linaro.org> wrote: > > On Tue, May 04, 2021 at 11:46:20PM -0700, Denis Nikitin wrote: > > On Tue, Apr 27, 2021 at 9:04 AM Leo Yan <leo.yan@linaro.org> wrote: > > > > > > On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote: > > > > > > [...] > > > > > > > > 2) ETR polling ensures that more trace is collected across the entire > > > > > trace session - seeking to reduce inconsistent capture volumes. > > > > > > > > I am not convinced disabling a sink to collect traces while an > > > > event is active is the right way to go. To me it will add (more) complexity to > > > > the coresight subsystem for very little gains, if any. > > > > > > > > If I remember correctly Leo brought forward the exact same idea about a year ago > > > > and after discussion, we all agreed the benefit would not be important enough to > > > > offset the drawbacks. > > > > > > > > As usual I am open to discussion and my opinion is not set in stone. But as I > > > > mentioned I worry the feature will increase complexity in the driver and > > > > produce dubious results. And we also have to factor in usability which, as > > > > Al pointed, out will be a problem. > > > > > > Just want to remind one thing for ETR polling. From one perspective, > > > the ETR polling mode is actually very similar with perf's snapshot > > > mode. E.g. we can use specific interval to send USR2 singal to perf > > > tool to captcure CoreSight trace data, thus it also can record the > > > trace data continuously. > > > > > > I can see a benefit from ETR polling mode is it might introduce less > > > overhead than perf snapshot mode. The kernel's mechanism (workqueue > > > or kernel thread) will be much efficiency than perf's signal handling > > > + SMP call with IPIs. > > > > > > So it's good to firstly understand if perf snapshot mode can meet the > > > requirement or not. > > > > We evaluated the patch on Chrome OS and I can confirm that the quality > > of AutoFDO profiles greatly improved with the ETR polling. > > Tested with per-thread and system-wide mode. > > > > Without ETR polling the size of the collected ETM data was very > > inconsistent on the same workload and could vary by a factor of two. > > This, in turn, affects the quality of the AutoFDO profiles generated from ETM. > > With ETR polling the data size became pretty stable. > > Performance evaluation shows a similar consistency in performance gain > > of AutoFDO optimization. > > This, I think, supports the idea that data collection right now is sensitive > > to the process scheduling and can be improved with ETR polling. > > > > For the system-wide mode particularly we didn't see any other alternatives > > to collect data periodically on a long-running workload. > > We haven't tested snapshot mode though. The idea sounds interesting. > > But small runtime overhead is crucial for the sampling profiler in the field > > and if there is a noticeable difference we would incline towards the > > ETR polling. > > Please see if Leo's approach[1], or any kind of extension to the current > snapshot feature, would be a viable solution. Reusing or extending code that is > already there is always a better option. > > Thanks, > Mathieu > > [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html > Hi Mattieu and Leo, I did some evaluation of the snapshot mode. Performance overhead is indeed higher than with ETR polling patch. Here are some numbers for comparison (measured on browser Speedometer2 benchmark): Runtime overhead of ETM tracing with ETR poll period 100ms is less than 0.5%. Snapshot mode gives 2.1%. With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode. We could probably utilize the ETM strobing feature and reduce frequency of data collection but I see a problem when I'm using both. Within a minute of profiling the ETM generates a reasonable profile size (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). But then the size grows unproportionally. With a 4 minute run I got a 6.3GB profile. I don't see such a problem with the ETR polling patch. Leo, could you please take a look at this problem? Thanks, Denis
Hi, On Fri, 14 May 2021 at 10:02, Denis Nikitin <denik@google.com> wrote: > > On Wed, May 5, 2021 at 8:29 AM Mathieu Poirier > <mathieu.poirier@linaro.org> wrote: > > > > On Tue, May 04, 2021 at 11:46:20PM -0700, Denis Nikitin wrote: > > > On Tue, Apr 27, 2021 at 9:04 AM Leo Yan <leo.yan@linaro.org> wrote: > > > > > > > > On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote: > > > > > > > > [...] > > > > > > > > > > 2) ETR polling ensures that more trace is collected across the entire > > > > > > trace session - seeking to reduce inconsistent capture volumes. > > > > > > > > > > I am not convinced disabling a sink to collect traces while an > > > > > event is active is the right way to go. To me it will add (more) complexity to > > > > > the coresight subsystem for very little gains, if any. > > > > > > > > > > If I remember correctly Leo brought forward the exact same idea about a year ago > > > > > and after discussion, we all agreed the benefit would not be important enough to > > > > > offset the drawbacks. > > > > > > > > > > As usual I am open to discussion and my opinion is not set in stone. But as I > > > > > mentioned I worry the feature will increase complexity in the driver and > > > > > produce dubious results. And we also have to factor in usability which, as > > > > > Al pointed, out will be a problem. > > > > > > > > Just want to remind one thing for ETR polling. From one perspective, > > > > the ETR polling mode is actually very similar with perf's snapshot > > > > mode. E.g. we can use specific interval to send USR2 singal to perf > > > > tool to captcure CoreSight trace data, thus it also can record the > > > > trace data continuously. > > > > > > > > I can see a benefit from ETR polling mode is it might introduce less > > > > overhead than perf snapshot mode. The kernel's mechanism (workqueue > > > > or kernel thread) will be much efficiency than perf's signal handling > > > > + SMP call with IPIs. > > > > > > > > So it's good to firstly understand if perf snapshot mode can meet the > > > > requirement or not. > > > > > > We evaluated the patch on Chrome OS and I can confirm that the quality > > > of AutoFDO profiles greatly improved with the ETR polling. > > > Tested with per-thread and system-wide mode. > > > > > > Without ETR polling the size of the collected ETM data was very > > > inconsistent on the same workload and could vary by a factor of two. > > > This, in turn, affects the quality of the AutoFDO profiles generated from ETM. > > > With ETR polling the data size became pretty stable. > > > Performance evaluation shows a similar consistency in performance gain > > > of AutoFDO optimization. > > > This, I think, supports the idea that data collection right now is sensitive > > > to the process scheduling and can be improved with ETR polling. > > > > > > For the system-wide mode particularly we didn't see any other alternatives > > > to collect data periodically on a long-running workload. > > > We haven't tested snapshot mode though. The idea sounds interesting. > > > But small runtime overhead is crucial for the sampling profiler in the field > > > and if there is a noticeable difference we would incline towards the > > > ETR polling. > > > > Please see if Leo's approach[1], or any kind of extension to the current > > snapshot feature, would be a viable solution. Reusing or extending code that is > > already there is always a better option. > > > > Thanks, > > Mathieu > > > > [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html > > > > Hi Mattieu and Leo, > > I did some evaluation of the snapshot mode. > > Performance overhead is indeed higher than with ETR polling patch. > Here are some numbers for comparison (measured on browser > Speedometer2 benchmark): > Runtime overhead of ETM tracing with ETR poll period 100ms is less than > 0.5%. Snapshot mode gives 2.1%. > With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode. > > We could probably utilize the ETM strobing feature and reduce frequency > of data collection but I see a problem when I'm using both. > Within a minute of profiling the ETM generates a reasonable profile size > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > But then the size grows unproportionally. > With a 4 minute run I got a 6.3GB profile. > I don't see such a problem with the ETR polling patch. > > Leo, could you please take a look at this problem? > > Thanks, > Denis I do think this patchset presents a valuable feature to add to the CoreSight support. This closes a gap on systems where we do not have an interrupt to trigger on a full sink, and is of evident value to users, giving them predictable trace sessions, with an even spread of trace across the session, as has been requested a number of times. I note the concern regarding stopping trace while the event is active - but is this not precisely what we do with ETE/TRBE when it hits an interrupt? The hardware is stopped and the IRQ updates the perf aux buffer, and allows the perf core to decide what to do next. In this patchset we are simply replacing the IRQ with a timer. Though ETR does not have a direct interrupt - it does have a full flag, which can be - in a system dependent way - wired through the CTI and into a PE interrupt. This would give the equivalent operation of TRBE. Now I do not suggest pursuing this course - as this is per system architecture dependent - the timer is a better fit for a generic solution. The main issue I see with this solution - which is shared with ETE/TRBE, is that for accurate decode we must know the boundaries of each captured block in the AUXTRACE buffer in order to reset the decoder. As has been discussed elsewhere, this work is in hand. While there are clearly technical implementation details to be fixed in this set, which others have commented on in the subsequent patches. I believe in principle it is sound. Regards Mike
Hi Denis, On Fri, May 14, 2021 at 02:02:25AM -0700, Denis Nikitin wrote: [...] > Hi Mattieu and Leo, > > I did some evaluation of the snapshot mode. Thanks a lot for the evaluation and share back the result. > Performance overhead is indeed higher than with ETR polling patch. > Here are some numbers for comparison (measured on browser > Speedometer2 benchmark): > Runtime overhead of ETM tracing with ETR poll period 100ms is less than > 0.5%. Snapshot mode gives 2.1%. > With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode. It's not expected that the snapshot mode causes so big overload. In my head, two factors might cause the overload: - The perf interaction between the user space and kernel space; - The data copying from the ETR's buffer to the AUX ring buffer. Check one thing: what's the buffer size for ETR polling mode and for snapshot mode in your experiments? If I remember correctly, by default the snapshot mode uses 4MB for ETR buffer, if copying 4MB per 10ms, then it's likely to cause big overload. So at the first glance, the overhead difference might be caused by the by the different buffer size between ETR poll mode and snapshot mode. > We could probably utilize the ETM strobing feature and reduce frequency > of data collection but I see a problem when I'm using both. > Within a minute of profiling the ETM generates a reasonable profile size > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > But then the size grows unproportionally. > With a 4 minute run I got a 6.3GB profile. Just check, as Mathieu has suggested, have you applied the patch [1] on your local code base for fixing the data copying for snapshot mode? After applied this patch, one possibility for unproportional issue is perf tool itself introduces many activities in snapshot mode (especially for 10ms period), so the perf tool contributes much extra trace data. Another potential issue is: after setting strobing mode, the snapshot mode will disable the complete paths for tracers and ETR, so if the strobing configuration is lost after re-enable tracers, then it might cause the huge trace data in the later phase. For this case, we definitely should fix it. > I don't see such a problem with the ETR polling patch. > > Leo, could you please take a look at this problem? Sure. For easier reproducing the issue, could you share me the detailed commands (and source code)? P.s. I saw Mike suggests to continue the ETR polling development, this is not conflict with snapshot mode. At my side, I will investigate the snapshot mode, but don't want to disturb the process for ETR polling mode, so when the ETR polling patch series is get ready, please go ahead for upstreaming the patch series. Thanks, Leo [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html
On Tue, May 18, 2021 at 10:00:40PM +0800, Leo Yan wrote: [...] > > We could probably utilize the ETM strobing feature and reduce frequency > > of data collection but I see a problem when I'm using both. > > Within a minute of profiling the ETM generates a reasonable profile size > > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > > But then the size grows unproportionally. > > With a 4 minute run I got a 6.3GB profile. > > > > I don't see such a problem with the ETR polling patch. > > > > Leo, could you please take a look at this problem? Or in the other way, could you share the perf data files? Base on the perf data files, I think at least we can confirm what's the AUX trace data size for every period, and we might also can quickly confirm if the strobing configurations are lost or not in the middle of tracing (To be honest, I have no knowledge for strobing, so need Mike's help for confirmation). Thanks, Leo
On Tue, May 18, 2021 at 10:00:40PM +0800, Leo Yan wrote: > Hi Denis, > > On Fri, May 14, 2021 at 02:02:25AM -0700, Denis Nikitin wrote: > > [...] > > > Hi Mattieu and Leo, > > > > I did some evaluation of the snapshot mode. > > Thanks a lot for the evaluation and share back the result. > > > Performance overhead is indeed higher than with ETR polling patch. > > Here are some numbers for comparison (measured on browser > > Speedometer2 benchmark): > > Runtime overhead of ETM tracing with ETR poll period 100ms is less than > > 0.5%. Snapshot mode gives 2.1%. > > With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode. > > It's not expected that the snapshot mode causes so big overload. > In my head, two factors might cause the overload: > > - The perf interaction between the user space and kernel space; > - The data copying from the ETR's buffer to the AUX ring buffer. > > Check one thing: what's the buffer size for ETR polling mode and for > snapshot mode in your experiments? > > If I remember correctly, by default the snapshot mode uses 4MB for ETR > buffer, if copying 4MB per 10ms, then it's likely to cause big > overload. So at the first glance, the overhead difference might be > caused by the by the different buffer size between ETR poll mode and > snapshot mode. > > > We could probably utilize the ETM strobing feature and reduce frequency > > of data collection but I see a problem when I'm using both. > > Within a minute of profiling the ETM generates a reasonable profile size > > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > > But then the size grows unproportionally. > > With a 4 minute run I got a 6.3GB profile. > > Just check, as Mathieu has suggested, have you applied the patch [1] > on your local code base for fixing the data copying for snapshot mode? > > After applied this patch, one possibility for unproportional issue is > perf tool itself introduces many activities in snapshot mode (especially > for 10ms period), so the perf tool contributes much extra trace data. > > Another potential issue is: after setting strobing mode, the snapshot > mode will disable the complete paths for tracers and ETR, so if the > strobing configuration is lost after re-enable tracers, then it might > cause the huge trace data in the later phase. For this case, we > definitely should fix it. > > > I don't see such a problem with the ETR polling patch. > > > > Leo, could you please take a look at this problem? > > Sure. For easier reproducing the issue, could you share me the > detailed commands (and source code)? > > P.s. I saw Mike suggests to continue the ETR polling development, this > is not conflict with snapshot mode. At my side, I will investigate the > snapshot mode, but don't want to disturb the process for ETR polling > mode, so when the ETR polling patch series is get ready, please go > ahead for upstreaming the patch series. Just to clarify my position - I would definitely like to see a solution that extends or re-use the current snapshot mode rather than introducing a new mechanism to collect data. > > Thanks, > Leo > > [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html
On Fri, May 14, 2021 at 02:02:25AM -0700, Denis Nikitin wrote: [...] > We could probably utilize the ETM strobing feature and reduce frequency > of data collection but I see a problem when I'm using both. > Within a minute of profiling the ETM generates a reasonable profile size > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > But then the size grows unproportionally. > With a 4 minute run I got a 6.3GB profile. > I don't see such a problem with the ETR polling patch. I found there have a potential bug in the perf tool for calculation the buffer size for snapshot mode. In the function cs_etm_find_snapshot of perf code [1], after the "head" has wrapped around, it always return the "old" and "head" with the difference "mm->len", that means the trace data will be copied with length "mm->len". If the buffer size is 4MB, it always copies trace data with 4MB for every time. This is incorrect for the snapshot with very small interval, even after wrapped around, it still have chance to only generate very small amount trace data (e.g. for 10ms), so I think we should fix this code like: diff --git a/tools/perf/arch/arm/util/cs-etm.c b/tools/perf/arch/arm/util/cs-etm.c index d942f118d32c..8a60e65c6651 100644 --- a/tools/perf/arch/arm/util/cs-etm.c +++ b/tools/perf/arch/arm/util/cs-etm.c @@ -849,6 +849,13 @@ static int cs_etm_find_snapshot(struct auxtrace_record *itr, if (!wrapped) return 0; + /* + * If the difference between the head and old is less than mm->len, + * it means the new trace data is small so doesn't need ajust them. + */ + if (*head - *old < mm->len) + return 0; + /* * *head has wrapped around - adjust *head and *old to pickup the * entire content of the AUX buffer. I will do more testing and send formal patch for this. Thanks, Leo [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm/util/cs-etm.c#n853
Hi Leo, Sorry for the delayed reply. On Tue, May 18, 2021 at 7:00 AM Leo Yan <leo.yan@linaro.org> wrote: > > Hi Denis, > > > Performance overhead is indeed higher than with ETR polling patch. > > Here are some numbers for comparison (measured on browser > > Speedometer2 benchmark): > > Runtime overhead of ETM tracing with ETR poll period 100ms is less than > > 0.5%. Snapshot mode gives 2.1%. > > With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode. > > It's not expected that the snapshot mode causes so big overload. > In my head, two factors might cause the overload: > > - The perf interaction between the user space and kernel space; > - The data copying from the ETR's buffer to the AUX ring buffer. > > Check one thing: what's the buffer size for ETR polling mode and for > snapshot mode in your experiments? AUX buffer size was 64KB in both modes. I used small buffers to keep the overall perf.data size under 200-500MB with a long-running perf record (3-5 min). > > If I remember correctly, by default the snapshot mode uses 4MB for ETR > buffer, if copying 4MB per 10ms, then it's likely to cause big > overload. So at the first glance, the overhead difference might be > caused by the by the different buffer size between ETR poll mode and > snapshot mode. As I said, the buffer was small. But I can go ahead and check the difference with a bigger buffer. I will also double check how strobing affects runtime overhead. It should be lower. > > > We could probably utilize the ETM strobing feature and reduce frequency > > of data collection but I see a problem when I'm using both. > > Within a minute of profiling the ETM generates a reasonable profile size > > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > > But then the size grows unproportionally. > > With a 4 minute run I got a 6.3GB profile. > > Just check, as Mathieu has suggested, have you applied the patch [1] > on your local code base for fixing the data copying for snapshot mode? That was my mistake. When I switched to the Strobing patch series I forgot to apply [1]. When applied I don't see this issue any more. It's not obvious from the description that the patch would fix my issue. So it sounds like your patch fixes multiple problems :) Thanks, Denis [...] > > Thanks, > Leo > > [1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html
Hi Leo, On Sun, May 23, 2021 at 1:45 AM Leo Yan <leo.yan@linaro.org> wrote: > > On Fri, May 14, 2021 at 02:02:25AM -0700, Denis Nikitin wrote: > > [...] > > > We could probably utilize the ETM strobing feature and reduce frequency > > of data collection but I see a problem when I'm using both. > > Within a minute of profiling the ETM generates a reasonable profile size > > (with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB). > > But then the size grows unproportionally. > > With a 4 minute run I got a 6.3GB profile. > > I don't see such a problem with the ETR polling patch. > > I found there have a potential bug in the perf tool for calculation the > buffer size for snapshot mode. > > In the function cs_etm_find_snapshot of perf code [1], after the "head" > has wrapped around, it always return the "old" and "head" with the > difference "mm->len", that means the trace data will be copied with > length "mm->len". If the buffer size is 4MB, it always copies trace > data with 4MB for every time. > > This is incorrect for the snapshot with very small interval, even after > wrapped around, it still have chance to only generate very small amount > trace data (e.g. for 10ms), so I think we should fix this code like: > > diff --git a/tools/perf/arch/arm/util/cs-etm.c b/tools/perf/arch/arm/util/cs-etm.c > index d942f118d32c..8a60e65c6651 100644 > --- a/tools/perf/arch/arm/util/cs-etm.c > +++ b/tools/perf/arch/arm/util/cs-etm.c > @@ -849,6 +849,13 @@ static int cs_etm_find_snapshot(struct auxtrace_record *itr, > if (!wrapped) > return 0; > > + /* > + * If the difference between the head and old is less than mm->len, > + * it means the new trace data is small so doesn't need ajust them. > + */ > + if (*head - *old < mm->len) > + return 0; > + > /* > * *head has wrapped around - adjust *head and *old to pickup the > * entire content of the AUX buffer. > > I will do more testing and send formal patch for this. Thanks for looking into this problem. Initially I said that the issue with the excessive profile size was not reproducible when I applied https://lists.linaro.org/pipermail/coresight/2021-April/006254.html. But after a couple of runs the issue came back. Reproducibility depends on the system workload and it happens more often if we run any workload. I tried the fix you shared and so far I don't see the issue anymore. Thanks, Denis > > Thanks, > Leo > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm/util/cs-etm.c#n853
Hi Denis, On Thu, May 27, 2021 at 12:50:28AM -0700, Denis Nikitin wrote: [...] > Thanks for looking into this problem. > Initially I said that the issue with the excessive profile size was > not reproducible > when I applied https://lists.linaro.org/pipermail/coresight/2021-April/006254.html. > But after a couple of runs the issue came back. > Reproducibility depends on the system workload and it happens more often if > we run any workload. Thanks a lot for the testing. > I tried the fix you shared and so far I don't see the issue anymore. Could you confirm here you mentioend "the fix" is the same one with [1]? If these two fixes (one is for kernel driver and another is in the perf tool) can resolve your issue, this is great! I will resend the patches. Sorry for some delay at my side. Thanks, Leo [1] https://lists.linaro.org/pipermail/coresight/2021-May/006411.html
On Thu, May 27, 2021 at 8:07 AM Leo Yan <leo.yan@linaro.org> wrote: > > Hi Denis, > > On Thu, May 27, 2021 at 12:50:28AM -0700, Denis Nikitin wrote: > > [...] > > > Thanks for looking into this problem. > > Initially I said that the issue with the excessive profile size was > > not reproducible > > when I applied https://lists.linaro.org/pipermail/coresight/2021-April/006254.html. > > But after a couple of runs the issue came back. > > Reproducibility depends on the system workload and it happens more often if > > we run any workload. > > Thanks a lot for the testing. > > > I tried the fix you shared and so far I don't see the issue anymore. > > Could you confirm here you mentioend "the fix" is the same one with [1]? Correct. > > If these two fixes (one is for kernel driver and another is in the > perf tool) can resolve your issue, this is great! I will resend the > patches. Sorry for some delay at my side. Sounds great. Thanks, Leo! I will take a look at runtime overhead meanwhile. - Denis > > Thanks, > Leo > > [1] https://lists.linaro.org/pipermail/coresight/2021-May/006411.html
Hi Denis, On Thu, May 27, 2021 at 09:22:45AM -0700, Denis Nikitin wrote: [...] > > If these two fixes (one is for kernel driver and another is in the > > perf tool) can resolve your issue, this is great! I will resend the > > patches. Sorry for some delay at my side. > > Sounds great. Thanks, Leo! Welcome! > I will take a look at runtime overhead meanwhile. Just info, I have sent out the patch series to address snapshot issues: https://lore.kernel.org/patchwork/cover/1437696/ It would be appreciated if you could verify these patches, thanks! Leo