Message ID: 20190404152051.17996-1-boris.brezillon@collabora.com (mailing list archive)
Series: drm/panfrost: Expose HW counters to userspace
On 04/04/2019 16:20, Boris Brezillon wrote:
> Hello,
>
> This patch adds new ioctls to expose GPU counters to userspace. These will be used by the mesa driver (should be posted soon).
>
> A few words about the implementation: I followed the VC4/Etnaviv model where perf counters are retrieved on a per-job basis. This allows one to get accurate results when there are users using the GPU concurrently. AFAICT, the mali kbase is using a different approach where several users can register a performance monitor but with no way to have fine-grained control over what job/GPU-context to track.

mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can be from different contexts (address spaces), and mali_kbase also fully uses the _NEXT registers. So there can be a job from one context executing on slot 0 and a job from a different context waiting in the _NEXT registers. (And the same for slot 1.) This means that there's no (visible) gap between the first job finishing and the second job starting. Early versions of the driver even had a throttle to avoid interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the IRQ - but thankfully that's gone.

The upshot is that it's basically impossible to measure "per-job" counters when running at full speed, because multiple jobs are running and the driver doesn't actually know when one ends and the next starts.

Since one of the primary use cases is to draw pretty graphs of the system load [1], this "per-job" information isn't all that relevant (and minimal performance overhead is important). And if you want to monitor just one application it is usually easiest to ensure that it is the only thing running.

[1] https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer

> This design choice comes at a cost: every time the perfmon context changes (the perfmon context is the list of currently active perfmons), the driver has to add a fence to prevent new jobs from corrupting counters that will be dumped by previous jobs.
>
> Let me know if that's an issue and if you think we should approach things differently.

It depends what you expect to do with the counters. Per-job counters are certainly useful sometimes. But serialising all jobs can mess up the thing you are trying to measure the performance of.

Steve

> Regards,
>
> Boris
>
> Boris Brezillon (3):
>   drm/panfrost: Move gpu_{write,read}() macros to panfrost_regs.h
>   drm/panfrost: Expose HW counters to userspace
>   panfrost/drm: Define T860 perf counters
>
>  drivers/gpu/drm/panfrost/Makefile           |   3 +-
>  drivers/gpu/drm/panfrost/panfrost_device.c  |   8 +
>  drivers/gpu/drm/panfrost/panfrost_device.h  |  11 +
>  drivers/gpu/drm/panfrost/panfrost_drv.c     |  22 +-
>  drivers/gpu/drm/panfrost/panfrost_gpu.c     |  46 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |  24 +
>  drivers/gpu/drm/panfrost/panfrost_job.h     |   4 +
>  drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 954 ++++++++++++++++++++
>  drivers/gpu/drm/panfrost/panfrost_perfcnt.h |  59 ++
>  drivers/gpu/drm/panfrost/panfrost_regs.h    |  22 +
>  include/uapi/drm/panfrost_drm.h             | 122 +++
>  11 files changed, 1268 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c
>  create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
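For illustration, a minimal sketch of the double-buffered submission Steve describes, modelled on the job-slot registers used by the mainline panfrost driver (JS_HEAD_NEXT_*, JS_COMMAND_NEXT in panfrost_regs.h); the helper below is a simplification for this discussion, not actual kbase code:

    /* Sketch based on drivers/gpu/drm/panfrost/panfrost_job.c.
     * Queue a job chain in a slot's _NEXT registers: the hardware
     * promotes it to the active job the moment the current one
     * completes, so the driver never sees an idle gap it could use
     * as a per-job counter dump point. */
    static void job_hw_submit_next(struct panfrost_device *pfdev,
                                   u64 jc_head, int js)
    {
            job_write(pfdev, JS_HEAD_NEXT_LO(js), lower_32_bits(jc_head));
            job_write(pfdev, JS_HEAD_NEXT_HI(js), upper_32_bits(jc_head));
            /* ... JS_CONFIG_NEXT, JS_AFFINITY_NEXT, etc. elided ... */
            job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
    }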
> Since one of the primary use cases is to draw pretty graphs of the system load [1], this "per-job" information isn't all that relevant (and minimal performance overhead is important). And if you want to monitor just one application it is usually easiest to ensure that it is the only thing running.

Ah-ha, gotcha. I don't know why I didn't put 2 and 2 together, but that definitely makes sense, yeah :)

Boris, thoughts?
On Fri, 5 Apr 2019 09:33:55 -0700 Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> > Since one of the primary use cases is to draw pretty graphs of the system load [1], this "per-job" information isn't all that relevant (and minimal performance overhead is important). And if you want to monitor just one application it is usually easiest to ensure that it is the only thing running.
>
> Ah-ha, gotcha. I don't know why I didn't put 2 and 2 together, but that definitely makes sense, yeah :) Boris, thoughts?

Nothing to add, except that I had a gut feeling I was making the wrong choice here, hence the mention of this design decision in my cover letter :-). I'll rework the implementation to have perfmons apply globally (instead of being attached to jobs) and get rid of the perfcnt fence.

Note that we could support both per-job and global perfmons and avoid the perf penalty when only global perfmons are used, but I don't think it's worth the extra complexity (not to mention that it makes things even more confusing for userspace users).
+1
+Rob, Eric, Mark and more Hi, On Fri, 5 Apr 2019 16:20:45 +0100 Steven Price <steven.price@arm.com> wrote: > On 04/04/2019 16:20, Boris Brezillon wrote: > > Hello, > > > > This patch adds new ioctls to expose GPU counters to userspace. > > These will be used by the mesa driver (should be posted soon). > > > > A few words about the implementation: I followed the VC4/Etnaviv model > > where perf counters are retrieved on a per-job basis. This allows one > > to have get accurate results when there are users using the GPU > > concurrently. > > AFAICT, the mali kbase is using a different approach where several > > users can register a performance monitor but with no way to have fined > > grained control over what job/GPU-context to track. > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > be from different contexts (address spaces), and mali_kbase also fully > uses the _NEXT registers. So there can be a job from one context > executing on slot 0 and a job from a different context waiting in the > _NEXT registers. (And the same for slot 1). This means that there's no > (visible) gap between the first job finishing and the second job > starting. Early versions of the driver even had a throttle to avoid > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > IRQ - but thankfully that's gone. > > The upshot is that it's basically impossible to measure "per-job" > counters when running at full speed. Because multiple jobs are running > and the driver doesn't actually know when one ends and the next starts. > > Since one of the primary use cases is to draw pretty graphs of the > system load [1], this "per-job" information isn't all that relevant (and > minimal performance overhead is important). And if you want to monitor > just one application it is usually easiest to ensure that it is the only > thing running. > > [1] > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > > > This design choice comes at a cost: every time the perfmon context > > changes (the perfmon context is the list of currently active > > perfmons), the driver has to add a fence to prevent new jobs from > > corrupting counters that will be dumped by previous jobs. > > > > Let me know if that's an issue and if you think we should approach > > things differently. > > It depends what you expect to do with the counters. Per-job counters are > certainly useful sometimes. But serialising all jobs can mess up the > thing you are trying to measure the performance of. I finally found some time to work on v2 this morning, and it turns out implementing global perf monitors as done in mali_kbase means rewriting almost everything (apart from the perfcnt layout stuff). I'm not against doing that, but I'd like to be sure this is really what we want. Eric, Rob, any opinion on that? Is it acceptable to expose counters through the pipe_query/AMD_perfmon interface if we don't have this job (or at least draw call) granularity? If not, should we keep the solution I'm proposing here to make sure counters values are accurate, or should we expose perf counters through a non-standard API? 
BTW, I'd like to remind you that serialization (waiting on the perfcnt fence) only happens if we have a perfmon context change between 2 consecutive jobs, which only happens when

* 2 applications are running in parallel and at least one of them is monitored, or
* userspace decides to stop monitoring things and dump counter values.

That means that, for the usual case (all perfmons disabled), there's almost zero overhead (just a few more checks in the submit job code, sketched below). That also means that, if we ever decide to support global perfmons (perf monitors that track things globally) on top of the current approach, and only global perfmons are enabled, things won't be serialized as with the per-job approach, because everyone will share the same perfmon ctx (the same set of perfmons).

I'd appreciate any feedback from people that have used perf counters (or implemented a way to dump them) on their platform.

Thanks,

Boris
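A rough sketch of the kind of submit-time check described above (the structure and field names here are hypothetical, chosen for illustration; this is not the code from the patch series):

    /* Return the fence a job must wait on before it may run. Only a
     * perfmon context change forces a wait; the common case (no
     * perfmon, or an unchanged set of active perfmons) pays nothing
     * beyond this comparison. */
    static struct dma_fence *
    panfrost_job_perfcnt_dep(struct panfrost_job *job)
    {
            struct panfrost_perfcnt *perfcnt = job->pfdev->perfcnt;

            if (job->perfmon_ctx == perfcnt->last_ctx)
                    return NULL;    /* same set of active perfmons */

            /* The perfmon context changed: wait for the dump of the
             * previous context so this job cannot corrupt counters
             * still owned by earlier jobs. */
            return dma_fence_get(perfcnt->dump_fence);
    }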
On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon <boris.brezillon@collabora.com> wrote: > > +Rob, Eric, Mark and more > > Hi, > > On Fri, 5 Apr 2019 16:20:45 +0100 > Steven Price <steven.price@arm.com> wrote: > > > On 04/04/2019 16:20, Boris Brezillon wrote: > > > Hello, > > > > > > This patch adds new ioctls to expose GPU counters to userspace. > > > These will be used by the mesa driver (should be posted soon). > > > > > > A few words about the implementation: I followed the VC4/Etnaviv model > > > where perf counters are retrieved on a per-job basis. This allows one > > > to have get accurate results when there are users using the GPU > > > concurrently. > > > AFAICT, the mali kbase is using a different approach where several > > > users can register a performance monitor but with no way to have fined > > > grained control over what job/GPU-context to track. > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > > be from different contexts (address spaces), and mali_kbase also fully > > uses the _NEXT registers. So there can be a job from one context > > executing on slot 0 and a job from a different context waiting in the > > _NEXT registers. (And the same for slot 1). This means that there's no > > (visible) gap between the first job finishing and the second job > > starting. Early versions of the driver even had a throttle to avoid > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > > IRQ - but thankfully that's gone. > > > > The upshot is that it's basically impossible to measure "per-job" > > counters when running at full speed. Because multiple jobs are running > > and the driver doesn't actually know when one ends and the next starts. > > > > Since one of the primary use cases is to draw pretty graphs of the > > system load [1], this "per-job" information isn't all that relevant (and > > minimal performance overhead is important). And if you want to monitor > > just one application it is usually easiest to ensure that it is the only > > thing running. > > > > [1] > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > > > > > This design choice comes at a cost: every time the perfmon context > > > changes (the perfmon context is the list of currently active > > > perfmons), the driver has to add a fence to prevent new jobs from > > > corrupting counters that will be dumped by previous jobs. > > > > > > Let me know if that's an issue and if you think we should approach > > > things differently. > > > > It depends what you expect to do with the counters. Per-job counters are > > certainly useful sometimes. But serialising all jobs can mess up the > > thing you are trying to measure the performance of. > > I finally found some time to work on v2 this morning, and it turns out > implementing global perf monitors as done in mali_kbase means rewriting > almost everything (apart from the perfcnt layout stuff). I'm not against > doing that, but I'd like to be sure this is really what we want. > > Eric, Rob, any opinion on that? Is it acceptable to expose counters > through the pipe_query/AMD_perfmon interface if we don't have this > job (or at least draw call) granularity? If not, should we keep the > solution I'm proposing here to make sure counters values are accurate, > or should we expose perf counters through a non-standard API? I think if you can't do per-draw level granularity, then you should not try to implement AMD_perfmon.. 
instead the use case is more for a sort of "gpu top" app (as opposed to something like frameretrace which is taking per-draw-call level measurements from within the app). Things that use AMD_perfmon are going to, I think, expect to query values between individual glDraw calls, and you probably don't want to flush tile passes 500 times per frame.

(Although, I suppose if there are multiple GPUs where perfcntrs work this way, it might be an interesting exercise to think about coming up w/ a standardized API (debugfs perhaps?) to monitor the counters.. so you could have a single userspace tool that works across several different drivers.)

BR,
-R

> BTW, I'd like to remind you that serialization (waiting on the perfcnt fence) only happens if we have a perfmon context change between 2 consecutive jobs, which only happens when
>
> * 2 applications are running in parallel and at least one of them is monitored, or
> * userspace decides to stop monitoring things and dump counter values.
>
> That means that, for the usual case (all perfmons disabled), there's almost zero overhead (just a few more checks in the submit job code). That also means that, if we ever decide to support global perfmons (perf monitors that track things globally) on top of the current approach, and only global perfmons are enabled, things won't be serialized as with the per-job approach, because everyone will share the same perfmon ctx (the same set of perfmons).
>
> I'd appreciate any feedback from people that have used perf counters (or implemented a way to dump them) on their platform.
>
> Thanks,
>
> Boris
On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote: > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon > <boris.brezillon@collabora.com> wrote: > > > > +Rob, Eric, Mark and more > > > > Hi, > > > > On Fri, 5 Apr 2019 16:20:45 +0100 > > Steven Price <steven.price@arm.com> wrote: > > > > > On 04/04/2019 16:20, Boris Brezillon wrote: > > > > Hello, > > > > > > > > This patch adds new ioctls to expose GPU counters to userspace. > > > > These will be used by the mesa driver (should be posted soon). > > > > > > > > A few words about the implementation: I followed the VC4/Etnaviv model > > > > where perf counters are retrieved on a per-job basis. This allows one > > > > to have get accurate results when there are users using the GPU > > > > concurrently. > > > > AFAICT, the mali kbase is using a different approach where several > > > > users can register a performance monitor but with no way to have fined > > > > grained control over what job/GPU-context to track. > > > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > > > be from different contexts (address spaces), and mali_kbase also fully > > > uses the _NEXT registers. So there can be a job from one context > > > executing on slot 0 and a job from a different context waiting in the > > > _NEXT registers. (And the same for slot 1). This means that there's no > > > (visible) gap between the first job finishing and the second job > > > starting. Early versions of the driver even had a throttle to avoid > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > > > IRQ - but thankfully that's gone. > > > > > > The upshot is that it's basically impossible to measure "per-job" > > > counters when running at full speed. Because multiple jobs are running > > > and the driver doesn't actually know when one ends and the next starts. > > > > > > Since one of the primary use cases is to draw pretty graphs of the > > > system load [1], this "per-job" information isn't all that relevant (and > > > minimal performance overhead is important). And if you want to monitor > > > just one application it is usually easiest to ensure that it is the only > > > thing running. > > > > > > [1] > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > > > > > > > This design choice comes at a cost: every time the perfmon context > > > > changes (the perfmon context is the list of currently active > > > > perfmons), the driver has to add a fence to prevent new jobs from > > > > corrupting counters that will be dumped by previous jobs. > > > > > > > > Let me know if that's an issue and if you think we should approach > > > > things differently. > > > > > > It depends what you expect to do with the counters. Per-job counters are > > > certainly useful sometimes. But serialising all jobs can mess up the > > > thing you are trying to measure the performance of. > > > > I finally found some time to work on v2 this morning, and it turns out > > implementing global perf monitors as done in mali_kbase means rewriting > > almost everything (apart from the perfcnt layout stuff). I'm not against > > doing that, but I'd like to be sure this is really what we want. > > > > Eric, Rob, any opinion on that? Is it acceptable to expose counters > > through the pipe_query/AMD_perfmon interface if we don't have this > > job (or at least draw call) granularity? 
If not, should we keep the > > solution I'm proposing here to make sure counters values are accurate, > > or should we expose perf counters through a non-standard API? > > I think if you can't do per-draw level granularity, then you should > not try to implement AMD_perfmon.. instead the use case is more for a > sort of "gpu top" app (as opposed to something like frameretrace which > is taking per-draw-call level measurements from within the app. > Things that use AMD_perfmon are going to, I think, expect to query > values between individual glDraw calls, and you probably don't want to > flush tile passes 500 times per frame. > > (Although, I suppose if there are multiple GPUs where perfcntrs work > this way, it might be an interesting exercise to think about coming up > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so > you could have a single userspace tool that works across several > different drivers.) I agree. We've been pondering a lot of the same issues for Adreno. I would be greatly interested in seeing if we could come up with a standard solution we can use. Jordan > > > > > BTW, I'd like to remind you that serialization (waiting on the perfcnt > > fence) only happens if we have a perfmon context change between 2 > > consecutive jobs, which only happens when > > * 2 applications are running in // and at least one of them is > > monitored > > * or when userspace decides to stop monitoring things and dump counter > > values > > > > That means that, for the usual case (all perfmons disabled), there's > > almost zero overhead (just a few more checks in the submit job code). > > That also means that, if we ever decide to support global perfmon (perf > > monitors that track things globably) on top of the current approach, > > and only global perfmons are enabled, things won't be serialized as > > with the per-job approach, because everyone will share the same perfmon > > ctx (the same set of perfmons). > > > > I'd appreciate any feedback from people that have used perf counters > > (or implemented a way to dump them) on their platform. > > > > Thanks, > > > > Boris > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Boris Brezillon <boris.brezillon@collabora.com> writes: > +Rob, Eric, Mark and more > > Hi, > > On Fri, 5 Apr 2019 16:20:45 +0100 > Steven Price <steven.price@arm.com> wrote: > >> On 04/04/2019 16:20, Boris Brezillon wrote: >> > Hello, >> > >> > This patch adds new ioctls to expose GPU counters to userspace. >> > These will be used by the mesa driver (should be posted soon). >> > >> > A few words about the implementation: I followed the VC4/Etnaviv model >> > where perf counters are retrieved on a per-job basis. This allows one >> > to have get accurate results when there are users using the GPU >> > concurrently. >> > AFAICT, the mali kbase is using a different approach where several >> > users can register a performance monitor but with no way to have fined >> > grained control over what job/GPU-context to track. >> >> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can >> be from different contexts (address spaces), and mali_kbase also fully >> uses the _NEXT registers. So there can be a job from one context >> executing on slot 0 and a job from a different context waiting in the >> _NEXT registers. (And the same for slot 1). This means that there's no >> (visible) gap between the first job finishing and the second job >> starting. Early versions of the driver even had a throttle to avoid >> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the >> IRQ - but thankfully that's gone. >> >> The upshot is that it's basically impossible to measure "per-job" >> counters when running at full speed. Because multiple jobs are running >> and the driver doesn't actually know when one ends and the next starts. >> >> Since one of the primary use cases is to draw pretty graphs of the >> system load [1], this "per-job" information isn't all that relevant (and >> minimal performance overhead is important). And if you want to monitor >> just one application it is usually easiest to ensure that it is the only >> thing running. >> >> [1] >> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer >> >> > This design choice comes at a cost: every time the perfmon context >> > changes (the perfmon context is the list of currently active >> > perfmons), the driver has to add a fence to prevent new jobs from >> > corrupting counters that will be dumped by previous jobs. >> > >> > Let me know if that's an issue and if you think we should approach >> > things differently. >> >> It depends what you expect to do with the counters. Per-job counters are >> certainly useful sometimes. But serialising all jobs can mess up the >> thing you are trying to measure the performance of. > > I finally found some time to work on v2 this morning, and it turns out > implementing global perf monitors as done in mali_kbase means rewriting > almost everything (apart from the perfcnt layout stuff). I'm not against > doing that, but I'd like to be sure this is really what we want. > > Eric, Rob, any opinion on that? Is it acceptable to expose counters > through the pipe_query/AMD_perfmon interface if we don't have this > job (or at least draw call) granularity? If not, should we keep the > solution I'm proposing here to make sure counters values are accurate, > or should we expose perf counters through a non-standard API? You should definitely not count perf results from someone else's context against your own! 
People doing perf analysis will expect slight performance changes (like missing bin/render parallelism between contexts) when doing perf queries, but they will be absolutely lost if their non-texturing job starts showing texturing results from some unrelated context.
Hi all,

As Steven Price explained, the "GPU top" kbase approach is often more useful and accurate than per-draw timing.

For a 3D game inside a GPU-accelerated desktop, the game's counters *should* include desktop overhead. This external overhead does affect the game's performance, especially if the contexts are competing for resources like memory bandwidth. An isolated sample is easy to achieve by running only the app of interest; in ideal conditions (zero-copy fullscreen), desktop interference is negligible.

For driver developers, the system-wide measurements are preferable, painting a complete system picture and avoiding disruptions. There is no risk of confusion, as the driver developers understand how the counters are exposed. Further, benchmarks rendering direct to a GBM surface are available (glmark2-es2-drm), eliminating interference even with poor desktop performance.

For app developers, the confusion of multi-context interference is unfortunate. Nevertheless, if enabling counters were to slow down an app, the confusion could be worse. Consider second-order changes in the app's performance characteristics due to slowdown: if techniques like dynamic resolution scaling are employed, the counters' results can be invalid. Likewise, even if the lower-performance counters are appropriate for purely graphical workloads, complex apps with variable CPU overhead (e.g. from an FPS-dependent physics engine) can further confound counters. Low-overhead system-wide measurements mitigate these concerns.

As Rob Clark suggested, system-wide counters could be exposed via a semi-standardized interface, perhaps within debugfs/sysfs. The interface could not be completely standard, as the list of counters exposed varies substantially by vendor and model. Nevertheless, the mechanics of discovering, enabling, reading, and disabling counters can be standardized, as can a small set of universally meaningful counters like total GPU utilization. This would permit a vendor-independent GPU top app as suggested, as I believe is currently possible with vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali).

It looks like this discussion is dormant. Could we try to get this sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable to diagnose without access to the counters, so I'm eager for a mainline solution to be implemented.

Thank you,
Alyssa
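As a concrete strawman for the discover/enable/read/disable mechanics described above, a debugfs layout could look roughly like this (every path and file name below is invented for illustration; no such interface exists yet):

    /sys/kernel/debug/dri/0/perfcnt/counters   # read: one "id group name" line per counter
    /sys/kernel/debug/dri/0/perfcnt/enable     # write a counter id to start collecting it
    /sys/kernel/debug/dri/0/perfcnt/disable    # write a counter id to stop collecting it
    /sys/kernel/debug/dri/0/perfcnt/values     # read: "name value" pairs accumulated so far

A single vendor-independent "gpu top" tool would then only need the counter list to be self-describing, plus a few agreed-upon names (e.g. total GPU utilization) shared across drivers.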
On Wed, 01 May 2019 10:12:42 -0700 Eric Anholt <eric@anholt.net> wrote: > Boris Brezillon <boris.brezillon@collabora.com> writes: > > > +Rob, Eric, Mark and more > > > > Hi, > > > > On Fri, 5 Apr 2019 16:20:45 +0100 > > Steven Price <steven.price@arm.com> wrote: > > > >> On 04/04/2019 16:20, Boris Brezillon wrote: > >> > Hello, > >> > > >> > This patch adds new ioctls to expose GPU counters to userspace. > >> > These will be used by the mesa driver (should be posted soon). > >> > > >> > A few words about the implementation: I followed the VC4/Etnaviv model > >> > where perf counters are retrieved on a per-job basis. This allows one > >> > to have get accurate results when there are users using the GPU > >> > concurrently. > >> > AFAICT, the mali kbase is using a different approach where several > >> > users can register a performance monitor but with no way to have fined > >> > grained control over what job/GPU-context to track. > >> > >> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > >> be from different contexts (address spaces), and mali_kbase also fully > >> uses the _NEXT registers. So there can be a job from one context > >> executing on slot 0 and a job from a different context waiting in the > >> _NEXT registers. (And the same for slot 1). This means that there's no > >> (visible) gap between the first job finishing and the second job > >> starting. Early versions of the driver even had a throttle to avoid > >> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > >> IRQ - but thankfully that's gone. > >> > >> The upshot is that it's basically impossible to measure "per-job" > >> counters when running at full speed. Because multiple jobs are running > >> and the driver doesn't actually know when one ends and the next starts. > >> > >> Since one of the primary use cases is to draw pretty graphs of the > >> system load [1], this "per-job" information isn't all that relevant (and > >> minimal performance overhead is important). And if you want to monitor > >> just one application it is usually easiest to ensure that it is the only > >> thing running. > >> > >> [1] > >> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > >> > >> > This design choice comes at a cost: every time the perfmon context > >> > changes (the perfmon context is the list of currently active > >> > perfmons), the driver has to add a fence to prevent new jobs from > >> > corrupting counters that will be dumped by previous jobs. > >> > > >> > Let me know if that's an issue and if you think we should approach > >> > things differently. > >> > >> It depends what you expect to do with the counters. Per-job counters are > >> certainly useful sometimes. But serialising all jobs can mess up the > >> thing you are trying to measure the performance of. > > > > I finally found some time to work on v2 this morning, and it turns out > > implementing global perf monitors as done in mali_kbase means rewriting > > almost everything (apart from the perfcnt layout stuff). I'm not against > > doing that, but I'd like to be sure this is really what we want. > > > > Eric, Rob, any opinion on that? Is it acceptable to expose counters > > through the pipe_query/AMD_perfmon interface if we don't have this > > job (or at least draw call) granularity? If not, should we keep the > > solution I'm proposing here to make sure counters values are accurate, > > or should we expose perf counters through a non-standard API? 
> You should definitely not count perf results from someone else's context against your own!

I also had the feeling that doing that was a bad idea (which you and Rob confirmed), but I listed it here to clear up my doubts.

> People doing perf analysis will expect slight performance changes (like missing bin/render parallelism between contexts) when doing perf queries, but they will be absolutely lost if their non-texturing job starts showing texturing results from some unrelated context.

Given the feedback I had so far, it looks like defining a new interface for this type of 'global perfmon' is the way to go if we want to make everyone happy.

Also note that the work I've done on VC4 has the same limitation: a perfmon context change between 2 jobs will introduce a serialization point. It seems that you're not as worried as Steven or Rob are about the extra cost of this serialization, but I wanted to point that out.
On Sat, 11 May 2019 15:32:20 -0700 Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> Hi all,
>
> As Steven Price explained, the "GPU top" kbase approach is often more useful and accurate than per-draw timing.
>
> For a 3D game inside a GPU-accelerated desktop, the game's counters *should* include desktop overhead. This external overhead does affect the game's performance, especially if the contexts are competing for resources like memory bandwidth. An isolated sample is easy to achieve by running only the app of interest; in ideal conditions (zero-copy fullscreen), desktop interference is negligible.
>
> For driver developers, the system-wide measurements are preferable, painting a complete system picture and avoiding disruptions. There is no risk of confusion, as the driver developers understand how the counters are exposed. Further, benchmarks rendering direct to a GBM surface are available (glmark2-es2-drm), eliminating interference even with poor desktop performance.
>
> For app developers, the confusion of multi-context interference is unfortunate. Nevertheless, if enabling counters were to slow down an app, the confusion could be worse. Consider second-order changes in the app's performance characteristics due to slowdown: if techniques like dynamic resolution scaling are employed, the counters' results can be invalid. Likewise, even if the lower-performance counters are appropriate for purely graphical workloads, complex apps with variable CPU overhead (e.g. from an FPS-dependent physics engine) can further confound counters. Low-overhead system-wide measurements mitigate these concerns.

I'd just like to point out that dumping counters the way mali_kbase/gator does likely has an impact on performance (at least on some GPUs) because of the clean+invalidate-cache that happens before (or after, I don't remember) each dump. IIUC, and if this cache is actually global and not a per-address-space thing (which would be possible if the cache lines contain a tag attaching them to a specific address space), that means all jobs running when the clean+invalidate happens will have extra cache misses after each dump. Of course, that's not as invasive as the full serialization that happens with my solution, but it's not free either.

> As Rob Clark suggested, system-wide counters could be exposed via a semi-standardized interface, perhaps within debugfs/sysfs. The interface could not be completely standard, as the list of counters exposed varies substantially by vendor and model. Nevertheless, the mechanics of discovering, enabling, reading, and disabling counters can be standardized, as can a small set of universally meaningful counters like total GPU utilization. This would permit a vendor-independent GPU top app as suggested, as I believe is currently possible with vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali).
>
> It looks like this discussion is dormant. Could we try to get this sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable to diagnose without access to the counters, so I'm eager for a mainline solution to be implemented.

I spent a bit of time thinking about it and looking at different solutions.
debugfs/sysfs might not be the best solution, especially if we think about the multi-user case (several instances of a GPU perfmon tool running in parallel). If we want it to work properly we need a way to instantiate several perf monitors and let the driver add values to all active perfmons every time a dump happens (no matter who triggered the dump). That's exactly what mali_kbase/gator does BTW. That's achievable through debugfs if we consider exposing a knob to instantiate such perfmon instances, but that also means risking perfmon leaks if the user does not take care of killing the perfmon it created when it's done with it (or when it crashes). It might also prove hard to expose that to non-root users in a secure way.

I also had a quick look at the perf_event interface to see if we could extend it to support monitoring GPU events. I might be wrong as I didn't spend much time investigating how it works, but it seems that perf counters are saved/dumped/restored at each thread context switch, which is not what we want here (might add extra perfcnt dump points thus impacting GPU perfs more than we expect).

So maybe the best option is a pseudo-generic ioctl-based interface to expose those perf counters.
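To make that last option more tangible, a rough sketch of what a pseudo-generic ioctl-based interface could look like (all struct and field names are hypothetical; this is a strawman, not a proposed uapi):

    #include <linux/types.h>

    /* Counter discovery: userspace iterates 'id' until the driver
     * returns -EINVAL, building a self-describing counter list. */
    struct drm_perfcnt_info {
            __u32 id;            /* in: counter index */
            __u32 group;         /* out: HW block this counter belongs to */
            char  name[64];      /* out: human-readable counter name */
    };

    /* Monitor creation: each user gets its own handle, so several
     * "gpu top" instances can sample concurrently and the driver can
     * credit every dump to all active monitors. Tying the handle to
     * the DRM file descriptor means a crashed client cannot leak it. */
    struct drm_perfcnt_monitor_create {
            __u64 counters_ptr;  /* in: array of counter ids to track */
            __u32 counters_cnt;  /* in: number of entries */
            __u32 handle;        /* out: monitor handle */
    };

    /* Sampling: values accumulate as 64-bit totals from the moment
     * the monitor is created; no destructive reset is exposed. */
    struct drm_perfcnt_monitor_read {
            __u32 handle;
            __u32 pad;
            __u64 values_ptr;    /* out: one __u64 per tracked counter */
    };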
On Tue, 30 Apr 2019 09:49:51 -0600 Jordan Crouse <jcrouse@codeaurora.org> wrote: > On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote: > > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon > > <boris.brezillon@collabora.com> wrote: > > > > > > +Rob, Eric, Mark and more > > > > > > Hi, > > > > > > On Fri, 5 Apr 2019 16:20:45 +0100 > > > Steven Price <steven.price@arm.com> wrote: > > > > > > > On 04/04/2019 16:20, Boris Brezillon wrote: > > > > > Hello, > > > > > > > > > > This patch adds new ioctls to expose GPU counters to userspace. > > > > > These will be used by the mesa driver (should be posted soon). > > > > > > > > > > A few words about the implementation: I followed the VC4/Etnaviv model > > > > > where perf counters are retrieved on a per-job basis. This allows one > > > > > to have get accurate results when there are users using the GPU > > > > > concurrently. > > > > > AFAICT, the mali kbase is using a different approach where several > > > > > users can register a performance monitor but with no way to have fined > > > > > grained control over what job/GPU-context to track. > > > > > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > > > > be from different contexts (address spaces), and mali_kbase also fully > > > > uses the _NEXT registers. So there can be a job from one context > > > > executing on slot 0 and a job from a different context waiting in the > > > > _NEXT registers. (And the same for slot 1). This means that there's no > > > > (visible) gap between the first job finishing and the second job > > > > starting. Early versions of the driver even had a throttle to avoid > > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > > > > IRQ - but thankfully that's gone. > > > > > > > > The upshot is that it's basically impossible to measure "per-job" > > > > counters when running at full speed. Because multiple jobs are running > > > > and the driver doesn't actually know when one ends and the next starts. > > > > > > > > Since one of the primary use cases is to draw pretty graphs of the > > > > system load [1], this "per-job" information isn't all that relevant (and > > > > minimal performance overhead is important). And if you want to monitor > > > > just one application it is usually easiest to ensure that it is the only > > > > thing running. > > > > > > > > [1] > > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > > > > > > > > > This design choice comes at a cost: every time the perfmon context > > > > > changes (the perfmon context is the list of currently active > > > > > perfmons), the driver has to add a fence to prevent new jobs from > > > > > corrupting counters that will be dumped by previous jobs. > > > > > > > > > > Let me know if that's an issue and if you think we should approach > > > > > things differently. > > > > > > > > It depends what you expect to do with the counters. Per-job counters are > > > > certainly useful sometimes. But serialising all jobs can mess up the > > > > thing you are trying to measure the performance of. > > > > > > I finally found some time to work on v2 this morning, and it turns out > > > implementing global perf monitors as done in mali_kbase means rewriting > > > almost everything (apart from the perfcnt layout stuff). I'm not against > > > doing that, but I'd like to be sure this is really what we want. > > > > > > Eric, Rob, any opinion on that? 
Is it acceptable to expose counters > > > through the pipe_query/AMD_perfmon interface if we don't have this > > > job (or at least draw call) granularity? If not, should we keep the > > > solution I'm proposing here to make sure counters values are accurate, > > > or should we expose perf counters through a non-standard API? > > > > I think if you can't do per-draw level granularity, then you should > > not try to implement AMD_perfmon.. instead the use case is more for a > > sort of "gpu top" app (as opposed to something like frameretrace which > > is taking per-draw-call level measurements from within the app. > > Things that use AMD_perfmon are going to, I think, expect to query > > values between individual glDraw calls, and you probably don't want to > > flush tile passes 500 times per frame. > > > > (Although, I suppose if there are multiple GPUs where perfcntrs work > > this way, it might be an interesting exercise to think about coming up > > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so > > you could have a single userspace tool that works across several > > different drivers.) > > I agree. We've been pondering a lot of the same issues for Adreno. So, you also have those global perf counters that can't be retrieved through the cmdstream? After the discussion I had with Rob I had the impression freedreno was supporting the AMD_perfmon interface in a proper way.
On 12/05/2019 14:38, Boris Brezillon wrote:
> On Sat, 11 May 2019 15:32:20 -0700 Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
>
>> Hi all,
>>
>> As Steven Price explained, the "GPU top" kbase approach is often more useful and accurate than per-draw timing.
>>
>> For a 3D game inside a GPU-accelerated desktop, the game's counters *should* include desktop overhead. This external overhead does affect the game's performance, especially if the contexts are competing for resources like memory bandwidth. An isolated sample is easy to achieve by running only the app of interest; in ideal conditions (zero-copy fullscreen), desktop interference is negligible.
>>
>> For driver developers, the system-wide measurements are preferable, painting a complete system picture and avoiding disruptions. There is no risk of confusion, as the driver developers understand how the counters are exposed. Further, benchmarks rendering direct to a GBM surface are available (glmark2-es2-drm), eliminating interference even with poor desktop performance.
>>
>> For app developers, the confusion of multi-context interference is unfortunate. Nevertheless, if enabling counters were to slow down an app, the confusion could be worse. Consider second-order changes in the app's performance characteristics due to slowdown: if techniques like dynamic resolution scaling are employed, the counters' results can be invalid. Likewise, even if the lower-performance counters are appropriate for purely graphical workloads, complex apps with variable CPU overhead (e.g. from an FPS-dependent physics engine) can further confound counters. Low-overhead system-wide measurements mitigate these concerns.
>
> I'd just like to point out that dumping counters the way mali_kbase/gator does likely has an impact on performance (at least on some GPUs) because of the clean+invalidate-cache that happens before (or after, I don't remember) each dump. IIUC, and if this cache is actually

The clean is required after the dump to ensure that the CPU can see the dumped counters (otherwise they could be stuck in the GPU's cache). Note that if you don't immediately need to observe the values the clean can be deferred, e.g. if on each dump you target a different memory region, then you could perform a flush only after several dumps.

> global and not a per-address-space thing (which would be possible if the cache lines contain a tag attaching them to a specific address space), that means all jobs running when the clean+invalidate happens will have extra cache misses after each dump. Of course, that's not as invasive as the full serialization that happens with my solution, but it's not free either.

Cache cleans (at the L2 cache level) are global. There are L1 caches (and TLBs) but they are not relevant to counter dumping.

In practice cache cleans are not that expensive most of the time. As you note - mali_kbase doesn't bother to avoid them when doing counter capture.

It should also be unnecessary to "invalidate" - a clean should be sufficient. The invalidate is only required if e.g. MMU page tables have been modified as well. That would reduce the effect on currently running jobs. But I notice that mali_kbase doesn't appear to use the "clean only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being broken though.

>> As Rob Clark suggested, system-wide counters could be exposed via a semi-standardized interface, perhaps within debugfs/sysfs.
>> The interface could not be completely standard, as the list of counters exposed varies substantially by vendor and model. Nevertheless, the mechanics of discovering, enabling, reading, and disabling counters can be standardized, as can a small set of universally meaningful counters like total GPU utilization. This would permit a vendor-independent GPU top app as suggested, as I believe is currently possible with vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali).
>>
>> It looks like this discussion is dormant. Could we try to get this sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable to diagnose without access to the counters, so I'm eager for a mainline solution to be implemented.
>
> I spent a bit of time thinking about it and looking at different solutions.
>
> debugfs/sysfs might not be the best solution, especially if we think about the multi-user case (several instances of a GPU perfmon tool running in parallel). If we want it to work properly we need a way to instantiate several perf monitors and let the driver add values to all active perfmons every time a dump happens (no matter who triggered the dump). That's exactly what mali_kbase/gator does BTW. That's achievable through debugfs if we consider exposing a knob to instantiate such perfmon instances, but that also means risking perfmon leaks if the user does not take care of killing the perfmon it created when it's done with it (or when it crashes). It might also prove hard to expose that to non-root users in a secure way.

Note that mali_kbase only has to maintain multiple sets of values because it provides backwards compatibility to an old version of the driver. You could maintain one set of counter values and simply ensure they are big enough not to wrap (or handle wrapping, but that is ugly and unlikely to happen). Very early versions of kbase simply exposed the hardware functionality (dump counters and reset) and therefore faced the issue that there could only ever be one user of counters at a time. This was "fixed" by emulating the existence of multiple interfaces (vinstr - "virtual" instrumentation). If I was redesigning it from scratch then I'd definitely remove the "reset" part of the interface and leave it for the clients to latch the first value (of each counter) and subtract that from subsequent counter reads. Sadly the hardware doesn't help here and some (most?) GPU revisions will saturate the counters rather than wrapping. Hence the need to frequently poll and reset the counters, even if you only want the 'start'/'end' values.

Having only one set of counters might simplify the potential for "leaks" - although you still want to ensure that if nobody cares you stop collecting counters...

Steve

> I also had a quick look at the perf_event interface to see if we could extend it to support monitoring GPU events. I might be wrong as I didn't spend much time investigating how it works, but it seems that perf counters are saved/dumped/restored at each thread context switch, which is not what we want here (might add extra perfcnt dump points thus impacting GPU perfs more than we expect).
>
> So maybe the best option is a pseudo-generic ioctl-based interface to expose those perf counters.
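The latch-and-subtract scheme Steve describes would look roughly like this on the client side (a minimal sketch, assuming the kernel exposed free-running 64-bit totals instead of a destructive reset):

    #include <stdint.h>

    /* Latch the counter total at the start of the window of interest;
     * every read is then a subtraction, and no client ever needs the
     * hardware (or other clients' views) to be reset. */
    struct counter_window {
            uint64_t start;
    };

    static void window_begin(struct counter_window *w, uint64_t total)
    {
            w->start = total;
    }

    static uint64_t window_delta(const struct counter_window *w,
                                 uint64_t total)
    {
            /* Safe even across u64 wrap-around thanks to unsigned
             * arithmetic, and 64-bit totals take years to wrap. */
            return total - w->start;
    }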
On Mon, 13 May 2019 13:48:08 +0100 Steven Price <steven.price@arm.com> wrote: > On 12/05/2019 14:38, Boris Brezillon wrote: > > On Sat, 11 May 2019 15:32:20 -0700 > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > >> Hi all, > >> > >> As Steven Price explained, the "GPU top" kbase approach is often more > >> useful and accurate than per-draw timing. > >> > >> For a 3D game inside a GPU-accelerated desktop, the games' counters > >> *should* include desktop overhead. This external overhead does affect > >> the game's performance, especially if the contexts are competing for > >> resources like memory bandwidth. An isolated sample is easy to achieve > >> running only the app of interest; in ideal conditions (zero-copy > >> fullscreen), desktop interference is negligible. > >> > >> For driver developers, the system-wide measurements are preferable, > >> painting a complete system picture and avoiding disruptions. There is no > >> risk of confusion, as the driver developers understand how the counters > >> are exposed. Further, benchmarks rendering direct to a GBM surface are > >> available (glmark2-es2-drm), eliminating interference even with poor > >> desktop performance. > >> > >> For app developers, the confusion of multi-context interference is > >> unfortunate. Nevertheless, if enabling counters were to slow down an > >> app, the confusion could be worse. Consider second-order changes in the > >> app's performance characteristics due to slowdown: if techniques like > >> dynamic resolution scaling are employed, the counters' results can be > >> invalid. Likewise, even if the lower-performance counters are > >> appropriate for purely graphical workloads, complex apps with variable > >> CPU overhead (e.g. from an FPS-dependent physics engine) can further > >> confound counters. Low-overhead system-wide measurements mitigate these > >> concerns. > > > > I'd just like to point out that dumping counters the way > > mali_kbase/gator does likely has an impact on perfs (at least on some > > GPUs) because of the clean+invalidate-cache that happens before (or > > after, I don't remember) each dump. IIUC and this cache is actually > > The clean is required after the dump to ensure that the CPU can see the > dumped counters (otherwise they could be stuck in the GPU's cache). Just to make it clear, I was not questioning the need for a cache clean here, I was simply pointing out that dumping counters is not free even when we don't add those serialization points. Note that dumping counters also has an impact on the cache-miss counter of future dumps, but there's nothing we can do about that no matter the solution we go for. > Note > that if you don't immediately need to observe the values the clean can > be deferred. E.g. if each dump you target a different memory region then > you could perform a flush only after several dumps. Correct. Having a pool of dump buffers might be an option if we need to minimize the impact of dumps at some point. > > > global and not a per address space thing (which would be possible if the > > cache lines contain a tag attaching them to a specific address space), > > that means all jobs running when the clean+invalidate happens will have > > extra cache misses after each dump. Of course, that's not as invasive as > > the full serialization that happens with my solution, but it's not free > > either. > > Cache cleans (at the L2 cache level) are global. There are L1 caches > (and TLBs) but they are not relevant to counter dumping. Okay. 
> > In practice cache cleans are not that expensive most of the time. As you > note - mali_kbase doesn't bother to avoid them when doing counter capture. > > It should also be unnecessary to "invalidate" - a clean should be > sufficient. The invalidate is only required if e.g. MMU page tables have > been modified as well. I already dropped the invalidate in my patch as we only have a global address space right now and the buffer used to store perfcnt values is always the same. We'll have to rework that when support for per-context address space is added. > That would reduce the affect on currently running > jobs. But I notice that mali_kbase doesn't appear to use the "clean > only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being > broken though. 'clean-only' seems to work fine on T860 at least. > > >> > >> As Rob Clark suggested, system-wide counters could be exposed via a > >> semi-standardized interface, perhaps within debugfs/sysfs. The interface > >> could not be completely standard, as the list of counters exposed varies > >> substantially by vendor and model. Nevertheless, the mechanics of > >> discovering, enabling, reading, and disabling counters can be > >> standardized, as can a small set of universally meaningful counters like > >> total GPU utilization. This would permit a vendor-independent GPU top > >> app as suggested, as is I believe currently possible with > >> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali) > >> > >> It looks like this discussion is dormant. Could we try to get this > >> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable > >> to diagnose without access to the counters, so I'm eager for a mainline > >> solution to be implemented. > > > > I spent a bit of time thinking about it and looking at different > > solutions. > > > > debugfs/sysfs might not be the best solution, especially if we think > > about the multi-user case (several instances of GPU perfmon tool > > running in parallel), if we want it to work properly we need a way to > > instantiate several perf monitors and let the driver add values to all > > active perfmons everytime a dump happens (no matter who triggered the > > dump). That's exactly what mali_kbase/gator does BTW. That's achievable > > through debugs if we consider exposing a knob to instantiate such > > perfmon instances, but that also means risking perfmon leaks if the > > user does not take care of killing the perfmon it created when it's done > > with it (or when it crashes). It might also prove hard to expose that to > > non-root users in a secure way. > > Note that mali_kbase only has to maintain multiple sets of values > because it provides backwards compatibility to an old version of the > driver. You could maintain one set of counter values and simply ensure > they are big enough not to wrap (or handle wrapping, but that is ugly > and unlikely to happen). Very early versions of kbase simply exposed the > hardware functionality (dump counters and reset) and therefore faced the > issue that there could only ever be one user of counters at a time. This > was "fixed" by emulating the existence of multiple interfaces (vinstr - > "virtual" instrumentation). If I was redesigning it from scratch then > I'd definitely remove the "reset" part of the interface and leave it for > the clients to latch the first value (of each counter) and subtract that > from subsequent counter reads. Noted. > > Sadly the hardware doesn't help here and some (most?) 
GPU revisions will saturate the counters rather than wrapping. Hence the need to frequently poll and reset the counters, even if you only want the 'start'/'end' values.

Not sure wrapping would be better, as you can't tell how many times you've reached the max val.

BTW, counters are reset every time we do a dump, right? If that's the case, there's no risk of having end_val < start_val, and the only risk is getting inaccurate values when counter_val > U32_MAX, but reaching U32_MAX already means that this event occurred a lot during the monitoring period.

> Having only one set of counters might simplify the potential for "leaks" - although you still want to ensure that if nobody cares you stop collecting counters...

Handling that with a debugfs-based iface implies implementing some heuristic like 'counters haven't been dumped for a long time, let's stop monitoring them', and I'm not a big fan of such things.

This being said, I think I'll go for a simple debugfs-based iface to unblock Alyssa. debugfs is not part of the stable ABI and I guess we can agree on explicitly marking it as "unstable" so that when we settle on a generic interface to expose such counters we can get rid of the old one.
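For what it's worth, accumulating the dumped values into one shared set of totals (as Steve suggested earlier) could be as simple as this sketch (hypothetical helper; assumes, per the discussion above, that the hardware resets counters on each dump and saturates rather than wraps):

    /* Called after every dump. Each dump yields a delta since the
     * previous one, so summing into u64 totals gives monotonic values
     * that every perfmon user can latch and subtract independently.
     * A delta of U32_MAX means the counter saturated during the
     * interval and the total becomes a lower bound, which is why
     * dumps must happen often enough to keep deltas below U32_MAX. */
    static void perfcnt_accumulate(u64 *totals, const u32 *dump,
                                   unsigned int num_counters)
    {
            unsigned int i;

            for (i = 0; i < num_counters; i++)
                    totals[i] += dump[i];
    }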
On 13/05/2019 14:39, Boris Brezillon wrote: > On Mon, 13 May 2019 13:48:08 +0100 > Steven Price <steven.price@arm.com> wrote: > >> On 12/05/2019 14:38, Boris Brezillon wrote: >>> On Sat, 11 May 2019 15:32:20 -0700 >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: >>> >>>> Hi all, >>>> >>>> As Steven Price explained, the "GPU top" kbase approach is often more >>>> useful and accurate than per-draw timing. >>>> >>>> For a 3D game inside a GPU-accelerated desktop, the games' counters >>>> *should* include desktop overhead. This external overhead does affect >>>> the game's performance, especially if the contexts are competing for >>>> resources like memory bandwidth. An isolated sample is easy to achieve >>>> running only the app of interest; in ideal conditions (zero-copy >>>> fullscreen), desktop interference is negligible. >>>> >>>> For driver developers, the system-wide measurements are preferable, >>>> painting a complete system picture and avoiding disruptions. There is no >>>> risk of confusion, as the driver developers understand how the counters >>>> are exposed. Further, benchmarks rendering direct to a GBM surface are >>>> available (glmark2-es2-drm), eliminating interference even with poor >>>> desktop performance. >>>> >>>> For app developers, the confusion of multi-context interference is >>>> unfortunate. Nevertheless, if enabling counters were to slow down an >>>> app, the confusion could be worse. Consider second-order changes in the >>>> app's performance characteristics due to slowdown: if techniques like >>>> dynamic resolution scaling are employed, the counters' results can be >>>> invalid. Likewise, even if the lower-performance counters are >>>> appropriate for purely graphical workloads, complex apps with variable >>>> CPU overhead (e.g. from an FPS-dependent physics engine) can further >>>> confound counters. Low-overhead system-wide measurements mitigate these >>>> concerns. >>> >>> I'd just like to point out that dumping counters the way >>> mali_kbase/gator does likely has an impact on perfs (at least on some >>> GPUs) because of the clean+invalidate-cache that happens before (or >>> after, I don't remember) each dump. IIUC and this cache is actually >> >> The clean is required after the dump to ensure that the CPU can see the >> dumped counters (otherwise they could be stuck in the GPU's cache). > > Just to make it clear, I was not questioning the need for a cache clean > here, I was simply pointing out that dumping counters is not free even > when we don't add those serialization points. Note that dumping > counters also has an impact on the cache-miss counter of future dumps, > but there's nothing we can do about that no matter the solution we go > for. Indeed, counters go through the caches, so clearly will have an impact on the cache performance. But usually the impact is sufficiently "in the noise" that it doesn't matter. But this is a hardware limitation. > [..] >> >> Sadly the hardware doesn't help here and some (most?) GPU revisions will >> saturate the counters rather than wrapping. Hence the need to frequently >> poll and reset the counters, even if you only want the 'start'/'end' values. > > Not sure wrapping would be better, as you can't tell how many times > you've reached the max val. > > BTW, counters are reset every time we do a dump, right? 
If that's the > case, there's no risk of having end_val < start_val; the only risk is > getting inaccurate values when counter_val > U32_MAX, but reaching > U32_MAX already means this event occurred a lot during the > monitoring period. Yes, the h/w resets the counters when dumping. What I intended to say (and, re-reading it, I didn't do a good job) was that it would be quite helpful if the hardware neither reset nor saturated. That way multiple users could simply observe the counter values. The combination of saturating and resetting on read means that there can be only one entity reading the counter values from the hardware and summing them up. I *think* later hardware might have dropped the saturating logic (because it costs gates and is fairly useless for software), but sadly, because of backwards compatibility, the reset-on-read behaviour isn't going to change. >> >> Having only one set of counters might simplify the potential for "leaks" >> - although you still want to ensure that if nobody cares you stop >> collecting counters... > > Handling that with a debugfs-based iface implies implementing some > heuristic like 'counters haven't been dumped for a long time, let's > stop monitoring them', and I'm not a big fan of such things. Agreed, it's fairly ugly. If you expose "native" sized counters (i.e. u32) then there's a reasonable heuristic: if (2^32 / GPU clock frequency) has passed, the counters might have overflowed, so they are unlikely to be useful to anyone still looking at them. > This being said, I think I'll go for a simple debugfs-based iface to > unblock Alyssa. debugfs is not part of the stable-ABI and I guess we > can agree on explicitly marking it as "unstable" so that when we settle > on a generic interface to expose such counters we can get rid of the > old one. A generic interface would be good, but short-term I can definitely see the value of exporting the counters in debugfs. Steve
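[Editor's note: Steve's overflow heuristic is straightforward to express in code. A sketch with assumed field and helper names (this is not from the panfrost driver): once 2^32 GPU cycles may have elapsed since the last dump, a u32 counter could have saturated, so whoever was sampling has most likely gone away and collection can be stopped.]

#include <linux/ktime.h>
#include <linux/math64.h>

/*
 * Sketch of the staleness heuristic (assumed names, not actual
 * panfrost code): if more than 2^32 GPU cycles may have elapsed
 * since the last dump, a u32 counter could have overflowed, so the
 * values are unlikely to be useful to anyone still watching.
 */
static bool perfcnt_probably_abandoned(u64 last_dump_ns, u64 gpu_clk_hz)
{
	/* Time needed for 2^32 GPU cycles, in nanoseconds. */
	u64 overflow_ns = div64_u64(((u64)U32_MAX + 1) * NSEC_PER_SEC,
				    gpu_clk_hz);

	return ktime_get_ns() - last_dump_ns > overflow_ns;
}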
> This being said, I think I'll go for a simple debugfs-based iface to > unblock Alyssa. debugfs is not part of the stable-ABI and I guess we > can agree on explicitly marking it as "unstable" so that when we settle > on a generic interface to expose such counters we can get rid of the > old one. Thank you :)
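[Editor's note: as a rough picture of what such an "unstable" debugfs interface could look like, here is a hedged sketch of a single read-only file dumping the latest counter values. The struct layout and all names are assumptions for illustration, not the interface that was eventually merged.]

#include <linux/debugfs.h>
#include <linux/seq_file.h>

/*
 * Hypothetical minimal debugfs counters file; struct and names are
 * made up for illustration.
 */
struct panfrost_perfcnt_state {
	u32 *counters;		/* last dumped values */
	unsigned int n_counters;
};

static int perfcnt_show(struct seq_file *s, void *unused)
{
	struct panfrost_perfcnt_state *st = s->private;
	unsigned int i;

	for (i = 0; i < st->n_counters; i++)
		seq_printf(s, "%u: %u\n", i, st->counters[i]);

	return 0;
}
DEFINE_SHOW_ATTRIBUTE(perfcnt);

static void perfcnt_debugfs_init(struct panfrost_perfcnt_state *st,
				 struct dentry *root)
{
	/* Read-only; debugfs is explicitly not a stable ABI. */
	debugfs_create_file("perfcnt", 0444, root, st, &perfcnt_fops);
}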
On Sun, May 12, 2019 at 03:40:26PM +0200, Boris Brezillon wrote: > On Tue, 30 Apr 2019 09:49:51 -0600 > Jordan Crouse <jcrouse@codeaurora.org> wrote: > > > On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote: > > > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon > > > <boris.brezillon@collabora.com> wrote: > > > > > > > > +Rob, Eric, Mark and more > > > > > > > > Hi, > > > > > > > > On Fri, 5 Apr 2019 16:20:45 +0100 > > > > Steven Price <steven.price@arm.com> wrote: > > > > > > > > > On 04/04/2019 16:20, Boris Brezillon wrote: > > > > > > Hello, > > > > > > > > > > > > This patch adds new ioctls to expose GPU counters to userspace. > > > > > > These will be used by the mesa driver (should be posted soon). > > > > > > > > > > > > A few words about the implementation: I followed the VC4/Etnaviv model > > > > > > where perf counters are retrieved on a per-job basis. This allows one > > > > > > to get accurate results when there are users using the GPU > > > > > > concurrently. > > > > > > AFAICT, the mali kbase is using a different approach where several > > > > > > users can register a performance monitor but with no way to have fine- > > > > > > grained control over what job/GPU-context to track. > > > > > > > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can > > > > > be from different contexts (address spaces), and mali_kbase also fully > > > > > uses the _NEXT registers. So there can be a job from one context > > > > > executing on slot 0 and a job from a different context waiting in the > > > > > _NEXT registers. (And the same for slot 1). This means that there's no > > > > > (visible) gap between the first job finishing and the second job > > > > > starting. Early versions of the driver even had a throttle to avoid > > > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the > > > > > IRQ - but thankfully that's gone. > > > > > > > > > > The upshot is that it's basically impossible to measure "per-job" > > > > > counters when running at full speed. Because multiple jobs are running > > > > > and the driver doesn't actually know when one ends and the next starts. > > > > > > > > > > Since one of the primary use cases is to draw pretty graphs of the > > > > > system load [1], this "per-job" information isn't all that relevant (and > > > > > minimal performance overhead is important). And if you want to monitor > > > > > just one application it is usually easiest to ensure that it is the only > > > > > thing running. > > > > > > > > > > [1] > > > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer > > > > > > > > > > > This design choice comes at a cost: every time the perfmon context > > > > > > changes (the perfmon context is the list of currently active > > > > > > perfmons), the driver has to add a fence to prevent new jobs from > > > > > > corrupting counters that will be dumped by previous jobs. > > > > > > > > > > > > Let me know if that's an issue and if you think we should approach > > > > > > things differently. > > > > > > > > > > It depends what you expect to do with the counters. Per-job counters are > > > > > certainly useful sometimes. But serialising all jobs can mess up the > > > > > thing you are trying to measure the performance of.
> > > > > > > > I finally found some time to work on v2 this morning, and it turns out > > > > implementing global perf monitors as done in mali_kbase means rewriting > > > > almost everything (apart from the perfcnt layout stuff). I'm not against > > > > doing that, but I'd like to be sure this is really what we want. > > > > > > > > Eric, Rob, any opinion on that? Is it acceptable to expose counters > > > > through the pipe_query/AMD_perfmon interface if we don't have this > > > > job (or at least draw call) granularity? If not, should we keep the > > > > solution I'm proposing here to make sure counter values are accurate, > > > > or should we expose perf counters through a non-standard API? > > > > > > I think if you can't do per-draw level granularity, then you should > > > not try to implement AMD_perfmon; instead the use case is more for a > > > sort of "gpu top" app (as opposed to something like frameretrace, which > > > takes per-draw-call level measurements from within the app). > > > Things that use AMD_perfmon are going to, I think, expect to query > > > values between individual glDraw calls, and you probably don't want to > > > flush tile passes 500 times per frame. > > > > > > (Although, I suppose if there are multiple GPUs where perfcntrs work > > > this way, it might be an interesting exercise to think about coming up > > > w/ a standardized API (debugfs perhaps?) to monitor the counters, so > > > you could have a single userspace tool that works across several > > > different drivers.) > > > > I agree. We've been pondering a lot of the same issues for Adreno. > > So, you also have those global perf counters that can't be retrieved > through the cmdstream? After the discussion I had with Rob, I had the > impression freedreno was supporting the AMD_perfmon interface in a > proper way. It is true that we can read most of the counters from the cmdstream, but we have a limited number of global physical counters that have to be shared across contexts and a metric ton of countables that can be selected. As long as we have just one process using AMD_perfmon we are fine, but it breaks down after that. I am also interested in the 'gpu top' use case, which can be somewhat emulated through the cmdstream assuming we stick to a finite set of counters and countables that won't conflict with extensions. This is fine for basic work, but anybody doing serious performance work would quickly run into limitations. And finally, the kernel uses a few counters for its own purposes, and I hate the "gentlemen's agreement" between the user and the kernel as to which physical counters each gets ownership of. It would be a lot better to turn it into an explicit dynamic reservation, which I feel would be a natural offshoot of a generic interface. Jordan
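[Editor's note: the "explicit dynamic reservation" Jordan describes could be as simple as a shared allocator for the physical counters. A sketch with made-up names; no driver implements this as shown.]

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/mutex.h>

/*
 * Made-up sketch: physical counters are handed out from a bitmap so
 * kernel and userspace ownership is always explicit, instead of a
 * fixed split agreed upon by convention.
 */
#define NR_PHYS_COUNTERS 16

static DEFINE_MUTEX(counter_lock);
static DECLARE_BITMAP(counter_map, NR_PHYS_COUNTERS);

/* Reserve any free physical counter; returns its index or -ENOSPC. */
static int counter_reserve(void)
{
	int idx;

	mutex_lock(&counter_lock);
	idx = find_first_zero_bit(counter_map, NR_PHYS_COUNTERS);
	if (idx < NR_PHYS_COUNTERS)
		set_bit(idx, counter_map);
	else
		idx = -ENOSPC;
	mutex_unlock(&counter_lock);

	return idx;
}

static void counter_release(int idx)
{
	clear_bit(idx, counter_map);
}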