[0/3] drm/panfrost: Expose HW counters to userspace

Message ID 20190404152051.17996-1-boris.brezillon@collabora.com (mailing list archive)

Message

Boris Brezillon April 4, 2019, 3:20 p.m. UTC
Hello,

This patch series adds new ioctls to expose GPU counters to userspace.
These will be used by the mesa driver (patches should be posted soon).

A few words about the implementation: I followed the VC4/Etnaviv model
where perf counters are retrieved on a per-job basis. This allows one
to get accurate results when several users are using the GPU
concurrently.
AFAICT, mali_kbase uses a different approach where several users can
register a performance monitor, but with no way to have fine-grained
control over which job/GPU context to track.

This design choice comes at a cost: every time the perfmon context
changes (the perfmon context is the list of currently active
perfmons), the driver has to add a fence to prevent new jobs from
corrupting counters that will be dumped by previous jobs.
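
To give an idea of what this looks like, the check in the submit path
is roughly the following (simplified sketch only; the helper and field
names below are made up, not the actual patch code):

/*
 * Illustrative sketch: perfcnt_ctx, last_ctx, last_job_fence and
 * panfrost_job_add_dep() are hypothetical names.
 */
static void panfrost_job_sync_perfcnt(struct panfrost_device *pfdev,
                                      struct panfrost_job *job)
{
        /* Only serialize when the set of active perfmons has changed. */
        if (job->perfcnt_ctx == pfdev->perfcnt->last_ctx)
                return;

        /*
         * Make the new job wait for the previous one to finish (and dump
         * its counters) so the new perfmon context can't corrupt them.
         */
        panfrost_job_add_dep(job, dma_fence_get(pfdev->perfcnt->last_job_fence));
        pfdev->perfcnt->last_ctx = job->perfcnt_ctx;
}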

Let me know if that's an issue and if you think we should approach
things differently.

Regards,

Boris

Boris Brezillon (3):
  drm/panfrost: Move gpu_{write,read}() macros to panfrost_regs.h
  drm/panfrost: Expose HW counters to userspace
  panfrost/drm: Define T860 perf counters

 drivers/gpu/drm/panfrost/Makefile           |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.c  |   8 +
 drivers/gpu/drm/panfrost/panfrost_device.h  |  11 +
 drivers/gpu/drm/panfrost/panfrost_drv.c     |  22 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c     |  46 +-
 drivers/gpu/drm/panfrost/panfrost_job.c     |  24 +
 drivers/gpu/drm/panfrost/panfrost_job.h     |   4 +
 drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 954 ++++++++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_perfcnt.h |  59 ++
 drivers/gpu/drm/panfrost/panfrost_regs.h    |  22 +
 include/uapi/drm/panfrost_drm.h             | 122 +++
 11 files changed, 1268 insertions(+), 7 deletions(-)
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h

Comments

Steven Price April 5, 2019, 3:20 p.m. UTC | #1
On 04/04/2019 16:20, Boris Brezillon wrote:
> Hello,
> 
> This patch adds new ioctls to expose GPU counters to userspace.
> These will be used by the mesa driver (should be posted soon).
> 
> A few words about the implementation: I followed the VC4/Etnaviv model
> where perf counters are retrieved on a per-job basis. This allows one
> to have get accurate results when there are users using the GPU
> concurrently.
> AFAICT, the mali kbase is using a different approach where several
> users can register a performance monitor but with no way to have fined
> grained control over what job/GPU-context to track.

mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
be from different contexts (address spaces), and mali_kbase also fully
uses the _NEXT registers. So there can be a job from one context
executing on slot 0 and a job from a different context waiting in the
_NEXT registers. (And the same for slot 1). This means that there's no
(visible) gap between the first job finishing and the second job
starting. Early versions of the driver even had a throttle to avoid
interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
IRQ - but thankfully that's gone.

The upshot is that it's basically impossible to measure "per-job"
counters when running at full speed, because multiple jobs are running
and the driver doesn't actually know when one ends and the next starts.

Since one of the primary use cases is to draw pretty graphs of the
system load [1], this "per-job" information isn't all that relevant (and
minimal performance overhead is important). And if you want to monitor
just one application it is usually easiest to ensure that it is the only
thing running.

[1]
https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer

> This design choice comes at a cost: every time the perfmon context
> changes (the perfmon context is the list of currently active
> perfmons), the driver has to add a fence to prevent new jobs from
> corrupting counters that will be dumped by previous jobs.
> 
> Let me know if that's an issue and if you think we should approach
> things differently.

It depends what you expect to do with the counters. Per-job counters are
certainly useful sometimes. But serialising all jobs can mess up the
thing you are trying to measure the performance of.

Steve

> Regards,
> 
> Boris
> 
> Boris Brezillon (3):
>   drm/panfrost: Move gpu_{write,read}() macros to panfrost_regs.h
>   drm/panfrost: Expose HW counters to userspace
>   panfrost/drm: Define T860 perf counters
> 
>  drivers/gpu/drm/panfrost/Makefile           |   3 +-
>  drivers/gpu/drm/panfrost/panfrost_device.c  |   8 +
>  drivers/gpu/drm/panfrost/panfrost_device.h  |  11 +
>  drivers/gpu/drm/panfrost/panfrost_drv.c     |  22 +-
>  drivers/gpu/drm/panfrost/panfrost_gpu.c     |  46 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |  24 +
>  drivers/gpu/drm/panfrost/panfrost_job.h     |   4 +
>  drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 954 ++++++++++++++++++++
>  drivers/gpu/drm/panfrost/panfrost_perfcnt.h |  59 ++
>  drivers/gpu/drm/panfrost/panfrost_regs.h    |  22 +
>  include/uapi/drm/panfrost_drm.h             | 122 +++
>  11 files changed, 1268 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c
>  create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
>
Alyssa Rosenzweig April 5, 2019, 4:33 p.m. UTC | #2
> Since one of the primary use cases is to draw pretty graphs of the
> system load [1], this "per-job" information isn't all that relevant (and
> minimal performance overhead is important). And if you want to monitor
> just one application it is usually easiest to ensure that it is the only
> thing running.

Ah-ha, gotcha. I don't know why I didn't put 2 and 2 together, but that
definitely makes sense, yeah :) Boris, thoughts?
Boris Brezillon April 5, 2019, 5:40 p.m. UTC | #3
On Fri, 5 Apr 2019 09:33:55 -0700
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> > Since one of the primary use cases is to draw pretty graphs of the
> > system load [1], this "per-job" information isn't all that relevant (and
> > minimal performance overhead is important). And if you want to monitor
> > just one application it is usually easiest to ensure that it is the only
> > thing running.  
> 
> Ah-ha, gotch. I don't know why I didn't put 2 and 2 together, but that
> definitely makes sense, yeah :) Boris, thoughts?

Nothing to add, except that I had a gut feeling I was making the wrong
choice here, hence the mention of this design decision in my cover
letter :-).
I'll rework the implementation to have perfmons apply globally (instead
of being attached to jobs) and get rid of the perfcnt fence.

Note that we could support both per-job and global perfmons and avoid
the perf penalty when only global perfmons are used, but I don't think
it's worth the extra complexity (not to mention that it makes things
even more confusing for userspace).
Alyssa Rosenzweig April 5, 2019, 5:43 p.m. UTC | #4
+1
Boris Brezillon April 30, 2019, 12:42 p.m. UTC | #5
+Rob, Eric, Mark and more

Hi,

On Fri, 5 Apr 2019 16:20:45 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/04/2019 16:20, Boris Brezillon wrote:
> > Hello,
> > 
> > This patch adds new ioctls to expose GPU counters to userspace.
> > These will be used by the mesa driver (should be posted soon).
> > 
> > A few words about the implementation: I followed the VC4/Etnaviv model
> > where perf counters are retrieved on a per-job basis. This allows one
> > to have get accurate results when there are users using the GPU
> > concurrently.
> > AFAICT, the mali kbase is using a different approach where several
> > users can register a performance monitor but with no way to have fined
> > grained control over what job/GPU-context to track.  
> 
> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> be from different contexts (address spaces), and mali_kbase also fully
> uses the _NEXT registers. So there can be a job from one context
> executing on slot 0 and a job from a different context waiting in the
> _NEXT registers. (And the same for slot 1). This means that there's no
> (visible) gap between the first job finishing and the second job
> starting. Early versions of the driver even had a throttle to avoid
> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> IRQ - but thankfully that's gone.
> 
> The upshot is that it's basically impossible to measure "per-job"
> counters when running at full speed. Because multiple jobs are running
> and the driver doesn't actually know when one ends and the next starts.
> 
> Since one of the primary use cases is to draw pretty graphs of the
> system load [1], this "per-job" information isn't all that relevant (and
> minimal performance overhead is important). And if you want to monitor
> just one application it is usually easiest to ensure that it is the only
> thing running.
> 
> [1]
> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> 
> > This design choice comes at a cost: every time the perfmon context
> > changes (the perfmon context is the list of currently active
> > perfmons), the driver has to add a fence to prevent new jobs from
> > corrupting counters that will be dumped by previous jobs.
> > 
> > Let me know if that's an issue and if you think we should approach
> > things differently.  
> 
> It depends what you expect to do with the counters. Per-job counters are
> certainly useful sometimes. But serialising all jobs can mess up the
> thing you are trying to measure the performance of.

I finally found some time to work on v2 this morning, and it turns out
implementing global perf monitors as done in mali_kbase means rewriting
almost everything (apart from the perfcnt layout stuff). I'm not against
doing that, but I'd like to be sure this is really what we want.

Eric, Rob, any opinion on that? Is it acceptable to expose counters
through the pipe_query/AMD_perfmon interface if we don't have this
job (or at least draw call) granularity? If not, should we keep the
solution I'm proposing here to make sure counter values are accurate,
or should we expose perf counters through a non-standard API?

BTW, I'd like to remind you that serialization (waiting on the perfcnt
fence) only happens if we have a perfmon context change between 2
consecutive jobs, which only happens when
* 2 applications are running in parallel and at least one of them is
  monitored
* or when userspace decides to stop monitoring things and dump counter
  values

That means that, for the usual case (all perfmons disabled), there's
almost zero overhead (just a few more checks in the job submission
code). That also means that, if we ever decide to support global
perfmons (perf monitors that track things globally) on top of the
current approach, and only global perfmons are enabled, things won't be
serialized as with the per-job approach, because everyone will share
the same perfmon ctx (the same set of perfmons).

I'd appreciate any feedback from people who have used perf counters
(or implemented a way to dump them) on their platform.

Thanks,

Boris
Rob Clark April 30, 2019, 1:10 p.m. UTC | #6
On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> +Rob, Eric, Mark and more
>
> Hi,
>
> On Fri, 5 Apr 2019 16:20:45 +0100
> Steven Price <steven.price@arm.com> wrote:
>
> > On 04/04/2019 16:20, Boris Brezillon wrote:
> > > Hello,
> > >
> > > This patch adds new ioctls to expose GPU counters to userspace.
> > > These will be used by the mesa driver (should be posted soon).
> > >
> > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > where perf counters are retrieved on a per-job basis. This allows one
> > > to have get accurate results when there are users using the GPU
> > > concurrently.
> > > AFAICT, the mali kbase is using a different approach where several
> > > users can register a performance monitor but with no way to have fined
> > > grained control over what job/GPU-context to track.
> >
> > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > be from different contexts (address spaces), and mali_kbase also fully
> > uses the _NEXT registers. So there can be a job from one context
> > executing on slot 0 and a job from a different context waiting in the
> > _NEXT registers. (And the same for slot 1). This means that there's no
> > (visible) gap between the first job finishing and the second job
> > starting. Early versions of the driver even had a throttle to avoid
> > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > IRQ - but thankfully that's gone.
> >
> > The upshot is that it's basically impossible to measure "per-job"
> > counters when running at full speed. Because multiple jobs are running
> > and the driver doesn't actually know when one ends and the next starts.
> >
> > Since one of the primary use cases is to draw pretty graphs of the
> > system load [1], this "per-job" information isn't all that relevant (and
> > minimal performance overhead is important). And if you want to monitor
> > just one application it is usually easiest to ensure that it is the only
> > thing running.
> >
> > [1]
> > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> >
> > > This design choice comes at a cost: every time the perfmon context
> > > changes (the perfmon context is the list of currently active
> > > perfmons), the driver has to add a fence to prevent new jobs from
> > > corrupting counters that will be dumped by previous jobs.
> > >
> > > Let me know if that's an issue and if you think we should approach
> > > things differently.
> >
> > It depends what you expect to do with the counters. Per-job counters are
> > certainly useful sometimes. But serialising all jobs can mess up the
> > thing you are trying to measure the performance of.
>
> I finally found some time to work on v2 this morning, and it turns out
> implementing global perf monitors as done in mali_kbase means rewriting
> almost everything (apart from the perfcnt layout stuff). I'm not against
> doing that, but I'd like to be sure this is really what we want.
>
> Eric, Rob, any opinion on that? Is it acceptable to expose counters
> through the pipe_query/AMD_perfmon interface if we don't have this
> job (or at least draw call) granularity? If not, should we keep the
> solution I'm proposing here to make sure counters values are accurate,
> or should we expose perf counters through a non-standard API?

I think if you can't do per-draw granularity, then you should not try
to implement AMD_perfmon.. instead the use case is more for a sort of
"gpu top" app (as opposed to something like frameretrace, which takes
per-draw-call level measurements from within the app). Things that use
AMD_perfmon are going to, I think, expect to query values between
individual glDraw calls, and you probably don't want to flush tile
passes 500 times per frame.
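
For reference, the sort of pattern an AMD_performance_monitor user ends
up with is roughly this (sketch only; group/counter 0 stand in for
whatever the driver actually enumerates, vertex_count is just whatever
the app happens to draw, and error handling is dropped):

/* Per-draw query with GL_AMD_performance_monitor (illustrative only). */
GLuint mon, group = 0, counter = 0;
GLuint result[16];
GLint written;

glGenPerfMonitorsAMD(1, &mon);
glSelectPerfMonitorCountersAMD(mon, GL_TRUE, group, 1, &counter);

glBeginPerfMonitorAMD(mon);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);        /* a single draw call */
glEndPerfMonitorAMD(mon);

/* Reading the result back right away is what forces the driver to flush
 * the tile passes for just this draw. */
glGetPerfMonitorCounterDataAMD(mon, GL_PERFMON_RESULT_AMD,
                               sizeof(result), result, &written);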

(Although, I suppose if there are multiple GPUs where perfcntrs work
this way, it might be an interesting exercise to think about coming up
w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
you could have a single userspace tool that works across several
different drivers.)

BR,
-R

>
> BTW, I'd like to remind you that serialization (waiting on the perfcnt
> fence) only happens if we have a perfmon context change between 2
> consecutive jobs, which only happens when
> * 2 applications are running in // and at least one of them is
>   monitored
> * or when userspace decides to stop monitoring things and dump counter
>   values
>
> That means that, for the usual case (all perfmons disabled), there's
> almost zero overhead (just a few more checks in the submit job code).
> That also means that, if we ever decide to support global perfmon (perf
> monitors that track things globably) on top of the current approach,
> and only global perfmons are enabled, things won't be serialized as
> with the per-job approach, because everyone will share the same perfmon
> ctx (the same set of perfmons).
>
> I'd appreciate any feedback from people that have used perf counters
> (or implemented a way to dump them) on their platform.
>
> Thanks,
>
> Boris
Jordan Crouse April 30, 2019, 3:49 p.m. UTC | #7
On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > +Rob, Eric, Mark and more
> >
> > Hi,
> >
> > On Fri, 5 Apr 2019 16:20:45 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >
> > > On 04/04/2019 16:20, Boris Brezillon wrote:
> > > > Hello,
> > > >
> > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > These will be used by the mesa driver (should be posted soon).
> > > >
> > > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > > where perf counters are retrieved on a per-job basis. This allows one
> > > > to have get accurate results when there are users using the GPU
> > > > concurrently.
> > > > AFAICT, the mali kbase is using a different approach where several
> > > > users can register a performance monitor but with no way to have fined
> > > > grained control over what job/GPU-context to track.
> > >
> > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > > be from different contexts (address spaces), and mali_kbase also fully
> > > uses the _NEXT registers. So there can be a job from one context
> > > executing on slot 0 and a job from a different context waiting in the
> > > _NEXT registers. (And the same for slot 1). This means that there's no
> > > (visible) gap between the first job finishing and the second job
> > > starting. Early versions of the driver even had a throttle to avoid
> > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > > IRQ - but thankfully that's gone.
> > >
> > > The upshot is that it's basically impossible to measure "per-job"
> > > counters when running at full speed. Because multiple jobs are running
> > > and the driver doesn't actually know when one ends and the next starts.
> > >
> > > Since one of the primary use cases is to draw pretty graphs of the
> > > system load [1], this "per-job" information isn't all that relevant (and
> > > minimal performance overhead is important). And if you want to monitor
> > > just one application it is usually easiest to ensure that it is the only
> > > thing running.
> > >
> > > [1]
> > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > >
> > > > This design choice comes at a cost: every time the perfmon context
> > > > changes (the perfmon context is the list of currently active
> > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > corrupting counters that will be dumped by previous jobs.
> > > >
> > > > Let me know if that's an issue and if you think we should approach
> > > > things differently.
> > >
> > > It depends what you expect to do with the counters. Per-job counters are
> > > certainly useful sometimes. But serialising all jobs can mess up the
> > > thing you are trying to measure the performance of.
> >
> > I finally found some time to work on v2 this morning, and it turns out
> > implementing global perf monitors as done in mali_kbase means rewriting
> > almost everything (apart from the perfcnt layout stuff). I'm not against
> > doing that, but I'd like to be sure this is really what we want.
> >
> > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > through the pipe_query/AMD_perfmon interface if we don't have this
> > job (or at least draw call) granularity? If not, should we keep the
> > solution I'm proposing here to make sure counters values are accurate,
> > or should we expose perf counters through a non-standard API?
> 
> I think if you can't do per-draw level granularity, then you should
> not try to implement AMD_perfmon..  instead the use case is more for a
> sort of "gpu top" app (as opposed to something like frameretrace which
> is taking per-draw-call level measurements from within the app.
> Things that use AMD_perfmon are going to, I think, expect to query
> values between individual glDraw calls, and you probably don't want to
> flush tile passes 500 times per frame.
> 
> (Although, I suppose if there are multiple GPUs where perfcntrs work
> this way, it might be an interesting exercise to think about coming up
> w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
> you could have a single userspace tool that works across several
> different drivers.)

I agree. We've been pondering a lot of the same issues for Adreno. I would be
greatly interested in seeing if we could come up with a standard solution we
can use.

Jordan
> 
> >
> > BTW, I'd like to remind you that serialization (waiting on the perfcnt
> > fence) only happens if we have a perfmon context change between 2
> > consecutive jobs, which only happens when
> > * 2 applications are running in // and at least one of them is
> >   monitored
> > * or when userspace decides to stop monitoring things and dump counter
> >   values
> >
> > That means that, for the usual case (all perfmons disabled), there's
> > almost zero overhead (just a few more checks in the submit job code).
> > That also means that, if we ever decide to support global perfmon (perf
> > monitors that track things globably) on top of the current approach,
> > and only global perfmons are enabled, things won't be serialized as
> > with the per-job approach, because everyone will share the same perfmon
> > ctx (the same set of perfmons).
> >
> > I'd appreciate any feedback from people that have used perf counters
> > (or implemented a way to dump them) on their platform.
> >
> > Thanks,
> >
> > Boris
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Eric Anholt May 1, 2019, 5:12 p.m. UTC | #8
Boris Brezillon <boris.brezillon@collabora.com> writes:

> +Rob, Eric, Mark and more
>
> Hi,
>
> On Fri, 5 Apr 2019 16:20:45 +0100
> Steven Price <steven.price@arm.com> wrote:
>
>> On 04/04/2019 16:20, Boris Brezillon wrote:
>> > Hello,
>> > 
>> > This patch adds new ioctls to expose GPU counters to userspace.
>> > These will be used by the mesa driver (should be posted soon).
>> > 
>> > A few words about the implementation: I followed the VC4/Etnaviv model
>> > where perf counters are retrieved on a per-job basis. This allows one
>> > to have get accurate results when there are users using the GPU
>> > concurrently.
>> > AFAICT, the mali kbase is using a different approach where several
>> > users can register a performance monitor but with no way to have fined
>> > grained control over what job/GPU-context to track.  
>> 
>> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
>> be from different contexts (address spaces), and mali_kbase also fully
>> uses the _NEXT registers. So there can be a job from one context
>> executing on slot 0 and a job from a different context waiting in the
>> _NEXT registers. (And the same for slot 1). This means that there's no
>> (visible) gap between the first job finishing and the second job
>> starting. Early versions of the driver even had a throttle to avoid
>> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
>> IRQ - but thankfully that's gone.
>> 
>> The upshot is that it's basically impossible to measure "per-job"
>> counters when running at full speed. Because multiple jobs are running
>> and the driver doesn't actually know when one ends and the next starts.
>> 
>> Since one of the primary use cases is to draw pretty graphs of the
>> system load [1], this "per-job" information isn't all that relevant (and
>> minimal performance overhead is important). And if you want to monitor
>> just one application it is usually easiest to ensure that it is the only
>> thing running.
>> 
>> [1]
>> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
>> 
>> > This design choice comes at a cost: every time the perfmon context
>> > changes (the perfmon context is the list of currently active
>> > perfmons), the driver has to add a fence to prevent new jobs from
>> > corrupting counters that will be dumped by previous jobs.
>> > 
>> > Let me know if that's an issue and if you think we should approach
>> > things differently.  
>> 
>> It depends what you expect to do with the counters. Per-job counters are
>> certainly useful sometimes. But serialising all jobs can mess up the
>> thing you are trying to measure the performance of.
>
> I finally found some time to work on v2 this morning, and it turns out
> implementing global perf monitors as done in mali_kbase means rewriting
> almost everything (apart from the perfcnt layout stuff). I'm not against
> doing that, but I'd like to be sure this is really what we want.
>
> Eric, Rob, any opinion on that? Is it acceptable to expose counters
> through the pipe_query/AMD_perfmon interface if we don't have this
> job (or at least draw call) granularity? If not, should we keep the
> solution I'm proposing here to make sure counters values are accurate,
> or should we expose perf counters through a non-standard API?

You should definitely not count perf results from someone else's context
against your own!  People doing perf analysis will expect slight
performance changes (like missing bin/render parallelism between
contexts) when doing perf queries, but they will be absolutely lost if
their non-texturing job starts showing texturing results from some
unrelated context.
Alyssa Rosenzweig May 11, 2019, 10:32 p.m. UTC | #9
Hi all,

As Steven Price explained, the "GPU top" kbase approach is often more
useful and accurate than per-draw timing. 

For a 3D game inside a GPU-accelerated desktop, the game's counters
*should* include desktop overhead. This external overhead does affect
the game's performance, especially if the contexts are competing for
resources like memory bandwidth. An isolated sample is easy to achieve
running only the app of interest; in ideal conditions (zero-copy
fullscreen), desktop interference is negligible. 

For driver developers, the system-wide measurements are preferable,
painting a complete system picture and avoiding disruptions. There is no
risk of confusion, as the driver developers understand how the counters
are exposed. Further, benchmarks rendering directly to a GBM surface are
available (glmark2-es2-drm), eliminating interference even with poor
desktop performance.

For app developers, the confusion of multi-context interference is
unfortunate. Nevertheless, if enabling counters were to slow down an
app, the confusion could be worse. Consider second-order changes in the
app's performance characteristics due to slowdown: if techniques like
dynamic resolution scaling are employed, the counters' results can be
invalid.  Likewise, even if the lower-performance counters are
appropriate for purely graphical workloads, complex apps with variable
CPU overhead (e.g. from an FPS-dependent physics engine) can further
confound counters. Low-overhead system-wide measurements mitigate these
concerns.

As Rob Clark suggested, system-wide counters could be exposed via a
semi-standardized interface, perhaps within debugfs/sysfs. The interface
could not be completely standard, as the list of counters exposed varies
substantially by vendor and model. Nevertheless, the mechanics of
discovering, enabling, reading, and disabling counters can be
standardized, as can a small set of universally meaningful counters like
total GPU utilization. This would permit a vendor-independent GPU top
app as suggested, as I believe is currently possible with
vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali).

It looks like this discussion is dormant. Could we try to get this
sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
to diagnose without access to the counters, so I'm eager for a mainline
solution to be implemented.

Thank you,

Alyssa
Boris Brezillon May 12, 2019, 1:17 p.m. UTC | #10
On Wed, 01 May 2019 10:12:42 -0700
Eric Anholt <eric@anholt.net> wrote:

> Boris Brezillon <boris.brezillon@collabora.com> writes:
> 
> > +Rob, Eric, Mark and more
> >
> > Hi,
> >
> > On Fri, 5 Apr 2019 16:20:45 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >  
> >> On 04/04/2019 16:20, Boris Brezillon wrote:  
> >> > Hello,
> >> > 
> >> > This patch adds new ioctls to expose GPU counters to userspace.
> >> > These will be used by the mesa driver (should be posted soon).
> >> > 
> >> > A few words about the implementation: I followed the VC4/Etnaviv model
> >> > where perf counters are retrieved on a per-job basis. This allows one
> >> > to have get accurate results when there are users using the GPU
> >> > concurrently.
> >> > AFAICT, the mali kbase is using a different approach where several
> >> > users can register a performance monitor but with no way to have fined
> >> > grained control over what job/GPU-context to track.    
> >> 
> >> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> >> be from different contexts (address spaces), and mali_kbase also fully
> >> uses the _NEXT registers. So there can be a job from one context
> >> executing on slot 0 and a job from a different context waiting in the
> >> _NEXT registers. (And the same for slot 1). This means that there's no
> >> (visible) gap between the first job finishing and the second job
> >> starting. Early versions of the driver even had a throttle to avoid
> >> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> >> IRQ - but thankfully that's gone.
> >> 
> >> The upshot is that it's basically impossible to measure "per-job"
> >> counters when running at full speed. Because multiple jobs are running
> >> and the driver doesn't actually know when one ends and the next starts.
> >> 
> >> Since one of the primary use cases is to draw pretty graphs of the
> >> system load [1], this "per-job" information isn't all that relevant (and
> >> minimal performance overhead is important). And if you want to monitor
> >> just one application it is usually easiest to ensure that it is the only
> >> thing running.
> >> 
> >> [1]
> >> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> >>   
> >> > This design choice comes at a cost: every time the perfmon context
> >> > changes (the perfmon context is the list of currently active
> >> > perfmons), the driver has to add a fence to prevent new jobs from
> >> > corrupting counters that will be dumped by previous jobs.
> >> > 
> >> > Let me know if that's an issue and if you think we should approach
> >> > things differently.    
> >> 
> >> It depends what you expect to do with the counters. Per-job counters are
> >> certainly useful sometimes. But serialising all jobs can mess up the
> >> thing you are trying to measure the performance of.  
> >
> > I finally found some time to work on v2 this morning, and it turns out
> > implementing global perf monitors as done in mali_kbase means rewriting
> > almost everything (apart from the perfcnt layout stuff). I'm not against
> > doing that, but I'd like to be sure this is really what we want.
> >
> > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > through the pipe_query/AMD_perfmon interface if we don't have this
> > job (or at least draw call) granularity? If not, should we keep the
> > solution I'm proposing here to make sure counters values are accurate,
> > or should we expose perf counters through a non-standard API?  
> 
> You should definitely not count perf results from someone else's context
> against your own!

I also had the feeling that doing that was a bad idea (which you and
Rob confirmed), but I listed it here to clear up my doubts.

> People doing perf analysis will expect slight
> performance changes (like missing bin/render parallelism between
> contexts) when doing perf queries, but they will be absolutely lost if
> their non-texturing job starts showing texturing results from some
> unrelated context.

Given the feedback I've had so far, it looks like defining a new
interface for this type of 'global perfmon' is the way to go if we want
to make everyone happy. Also note that the work I've done on VC4 has
the same limitation: a perfmon context change between 2 jobs will
introduce a serialization point. It seems that you're not as worried as
Steven or Rob are about the extra cost of this serialization, but I
wanted to point that out.
Boris Brezillon May 12, 2019, 1:38 p.m. UTC | #11
On Sat, 11 May 2019 15:32:20 -0700
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> Hi all,
> 
> As Steven Price explained, the "GPU top" kbase approach is often more
> useful and accurate than per-draw timing. 
> 
> For a 3D game inside a GPU-accelerated desktop, the games' counters
> *should* include desktop overhead. This external overhead does affect
> the game's performance, especially if the contexts are competing for
> resources like memory bandwidth. An isolated sample is easy to achieve
> running only the app of interest; in ideal conditions (zero-copy
> fullscreen), desktop interference is negligible. 
> 
> For driver developers, the system-wide measurements are preferable,
> painting a complete system picture and avoiding disruptions. There is no
> risk of confusion, as the driver developers understand how the counters
> are exposed. Further, benchmarks rendering direct to a GBM surface are
> available (glmark2-es2-drm), eliminating interference even with poor
> desktop performance.
> 
> For app developers, the confusion of multi-context interference is
> unfortunate. Nevertheless, if enabling counters were to slow down an
> app, the confusion could be worse. Consider second-order changes in the
> app's performance characteristics due to slowdown: if techniques like
> dynamic resolution scaling are employed, the counters' results can be
> invalid.  Likewise, even if the lower-performance counters are
> appropriate for purely graphical workloads, complex apps with variable
> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> confound counters. Low-overhead system-wide measurements mitigate these
> concerns.

I'd just like to point out that dumping counters the way
mali_kbase/gator does likely has a performance impact (at least on some
GPUs) because of the cache clean+invalidate that happens before (or
after, I don't remember) each dump. If I understand correctly, and this
cache is actually global and not a per-address-space thing (which would
be possible if the cache lines contained a tag attaching them to a
specific address space), that means all jobs running when the
clean+invalidate happens will see extra cache misses after each dump.
Of course, that's not as invasive as the full serialization that
happens with my solution, but it's not free either.

> 
> As Rob Clark suggested, system-wide counters could be exposed via a
> semi-standardized interface, perhaps within debugfs/sysfs. The interface
> could not be completely standard, as the list of counters exposed varies
> substantially by vendor and model. Nevertheless, the mechanics of
> discovering, enabling, reading, and disabling counters can be
> standardized, as can a small set of universally meaningful counters like
> total GPU utilization. This would permit a vendor-independent GPU top
> app as suggested, as is I believe currently possible with
> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
> 
> It looks like this discussion is dormant. Could we try to get this
> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> to diagnose without access to the counters, so I'm eager for a mainline
> solution to be implemented.

I spent a bit of time thinking about it and looking at different
solutions.

debugfs/sysfs might not be the best solution, especially if we think
about the multi-user case (several instances of a GPU perfmon tool
running in parallel). If we want it to work properly, we need a way to
instantiate several perf monitors and let the driver add values to all
active perfmons every time a dump happens (no matter who triggered the
dump). That's exactly what mali_kbase/gator does BTW. That's achievable
through debugfs if we consider exposing a knob to instantiate such
perfmon instances, but that also means risking perfmon leaks if the
user does not take care of killing the perfmons it created when it's
done with them (or when it crashes). It might also prove hard to expose
that to non-root users in a secure way.

I also had a quick look at the perf_event interface to see if we could
extend it to support monitoring GPU events. I might be wrong, as I
didn't spend much time investigating how it works, but it seems that
perf counters are saved/dumped/restored at each thread context switch,
which is not what we want here (it might add extra perfcnt dump points,
thus impacting GPU performance more than we expect).

So maybe the best option is a pseudo-generic ioctl-based interface to
expose those perf counters.
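
To make the discussion a bit more concrete, here is the kind of thing I
have in mind (purely hypothetical sketch, none of these structs or
ioctls exist):

/* Hypothetical driver-agnostic perfcnt uapi, for discussion only. */
struct drm_perfcnt_info {
        __u32 id;               /* counter index */
        __u32 group;            /* HW block this counter belongs to */
        char name[64];          /* vendor-specific counter name */
};

struct drm_perfcnt_enable {
        __u64 perfmon;          /* perfmon instance the counter is added to */
        __u32 id;
        __u32 pad;
};

struct drm_perfcnt_dump {
        __u64 perfmon;
        __u64 values_ptr;       /* userspace buffer receiving __u64 values */
        __u32 values_count;
        __u32 pad;
};

Counter discovery would be a loop over an INFO ioctl until it returns
-EINVAL, and perfmon instances would be created/destroyed with dedicated
ioctls so the kernel can drop them when the file descriptor is closed
(which also addresses the leak problem mentioned above).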
Boris Brezillon May 12, 2019, 1:40 p.m. UTC | #12
On Tue, 30 Apr 2019 09:49:51 -0600
Jordan Crouse <jcrouse@codeaurora.org> wrote:

> On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> > <boris.brezillon@collabora.com> wrote:  
> > >
> > > +Rob, Eric, Mark and more
> > >
> > > Hi,
> > >
> > > On Fri, 5 Apr 2019 16:20:45 +0100
> > > Steven Price <steven.price@arm.com> wrote:
> > >  
> > > > On 04/04/2019 16:20, Boris Brezillon wrote:  
> > > > > Hello,
> > > > >
> > > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > > These will be used by the mesa driver (should be posted soon).
> > > > >
> > > > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > > > where perf counters are retrieved on a per-job basis. This allows one
> > > > > to have get accurate results when there are users using the GPU
> > > > > concurrently.
> > > > > AFAICT, the mali kbase is using a different approach where several
> > > > > users can register a performance monitor but with no way to have fined
> > > > > grained control over what job/GPU-context to track.  
> > > >
> > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > > > be from different contexts (address spaces), and mali_kbase also fully
> > > > uses the _NEXT registers. So there can be a job from one context
> > > > executing on slot 0 and a job from a different context waiting in the
> > > > _NEXT registers. (And the same for slot 1). This means that there's no
> > > > (visible) gap between the first job finishing and the second job
> > > > starting. Early versions of the driver even had a throttle to avoid
> > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > > > IRQ - but thankfully that's gone.
> > > >
> > > > The upshot is that it's basically impossible to measure "per-job"
> > > > counters when running at full speed. Because multiple jobs are running
> > > > and the driver doesn't actually know when one ends and the next starts.
> > > >
> > > > Since one of the primary use cases is to draw pretty graphs of the
> > > > system load [1], this "per-job" information isn't all that relevant (and
> > > > minimal performance overhead is important). And if you want to monitor
> > > > just one application it is usually easiest to ensure that it is the only
> > > > thing running.
> > > >
> > > > [1]
> > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > > >  
> > > > > This design choice comes at a cost: every time the perfmon context
> > > > > changes (the perfmon context is the list of currently active
> > > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > > corrupting counters that will be dumped by previous jobs.
> > > > >
> > > > > Let me know if that's an issue and if you think we should approach
> > > > > things differently.  
> > > >
> > > > It depends what you expect to do with the counters. Per-job counters are
> > > > certainly useful sometimes. But serialising all jobs can mess up the
> > > > thing you are trying to measure the performance of.  
> > >
> > > I finally found some time to work on v2 this morning, and it turns out
> > > implementing global perf monitors as done in mali_kbase means rewriting
> > > almost everything (apart from the perfcnt layout stuff). I'm not against
> > > doing that, but I'd like to be sure this is really what we want.
> > >
> > > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > > through the pipe_query/AMD_perfmon interface if we don't have this
> > > job (or at least draw call) granularity? If not, should we keep the
> > > solution I'm proposing here to make sure counters values are accurate,
> > > or should we expose perf counters through a non-standard API?  
> > 
> > I think if you can't do per-draw level granularity, then you should
> > not try to implement AMD_perfmon..  instead the use case is more for a
> > sort of "gpu top" app (as opposed to something like frameretrace which
> > is taking per-draw-call level measurements from within the app.
> > Things that use AMD_perfmon are going to, I think, expect to query
> > values between individual glDraw calls, and you probably don't want to
> > flush tile passes 500 times per frame.
> > 
> > (Although, I suppose if there are multiple GPUs where perfcntrs work
> > this way, it might be an interesting exercise to think about coming up
> > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
> > you could have a single userspace tool that works across several
> > different drivers.)  
> 
> I agree. We've been pondering a lot of the same issues for Adreno.

So, you also have those global perf counters that can't be retrieved
through the cmdstream? After the discussion I had with Rob I had the
impression freedreno was supporting the AMD_perfmon interface in a
proper way.
Steven Price May 13, 2019, 12:48 p.m. UTC | #13
On 12/05/2019 14:38, Boris Brezillon wrote:
> On Sat, 11 May 2019 15:32:20 -0700
> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> 
>> Hi all,
>>
>> As Steven Price explained, the "GPU top" kbase approach is often more
>> useful and accurate than per-draw timing. 
>>
>> For a 3D game inside a GPU-accelerated desktop, the games' counters
>> *should* include desktop overhead. This external overhead does affect
>> the game's performance, especially if the contexts are competing for
>> resources like memory bandwidth. An isolated sample is easy to achieve
>> running only the app of interest; in ideal conditions (zero-copy
>> fullscreen), desktop interference is negligible. 
>>
>> For driver developers, the system-wide measurements are preferable,
>> painting a complete system picture and avoiding disruptions. There is no
>> risk of confusion, as the driver developers understand how the counters
>> are exposed. Further, benchmarks rendering direct to a GBM surface are
>> available (glmark2-es2-drm), eliminating interference even with poor
>> desktop performance.
>>
>> For app developers, the confusion of multi-context interference is
>> unfortunate. Nevertheless, if enabling counters were to slow down an
>> app, the confusion could be worse. Consider second-order changes in the
>> app's performance characteristics due to slowdown: if techniques like
>> dynamic resolution scaling are employed, the counters' results can be
>> invalid.  Likewise, even if the lower-performance counters are
>> appropriate for purely graphical workloads, complex apps with variable
>> CPU overhead (e.g. from an FPS-dependent physics engine) can further
>> confound counters. Low-overhead system-wide measurements mitigate these
>> concerns.
> 
> I'd just like to point out that dumping counters the way
> mali_kbase/gator does likely has an impact on perfs (at least on some
> GPUs) because of the clean+invalidate-cache that happens before (or
> after, I don't remember) each dump. IIUC and this cache is actually

The clean is required after the dump to ensure that the CPU can see the
dumped counters (otherwise they could be stuck in the GPU's cache). Note
that if you don't immediately need to observe the values, the clean can
be deferred: e.g. if each dump targets a different memory region, you
could perform a flush only after several dumps.
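
Something along these lines (all the helpers below are invented for the
example, this isn't kbase or panfrost code):

/* Sketch: cycle through N dump buffers and only clean the GPU caches
 * (and read the results back) once the ring is full, amortising the
 * clean over several dumps. */
#define NUM_DUMP_BUFS 8

struct perfcnt_dump_ring {
        dma_addr_t buf[NUM_DUMP_BUFS];
        unsigned int next;
};

static void perfcnt_dump(struct perfcnt_dump_ring *ring)
{
        hw_dump_counters_to(ring->buf[ring->next++]);   /* hypothetical */

        if (ring->next == NUM_DUMP_BUFS) {
                /* A single clean makes all the dumps CPU-visible. */
                hw_clean_l2_caches();                   /* hypothetical */
                process_dumps(ring);                    /* hypothetical */
                ring->next = 0;
        }
}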

> global and not a per address space thing (which would be possible if the
> cache lines contain a tag attaching them to a specific address space),
> that means all jobs running when the clean+invalidate happens will have
> extra cache misses after each dump. Of course, that's not as invasive as
> the full serialization that happens with my solution, but it's not free
> either.

Cache cleans (at the L2 cache level) are global. There are L1 caches
(and TLBs) but they are not relevant to counter dumping.

In practice cache cleans are not that expensive most of the time. As you
note - mali_kbase doesn't bother to avoid them when doing counter capture.

It should also be unnecessary to "invalidate" - a clean should be
sufficient. The invalidate is only required if e.g. MMU page tables have
been modified as well. That would reduce the effect on currently running
jobs. But I notice that mali_kbase doesn't appear to use the "clean
only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being
broken though.

>>
>> As Rob Clark suggested, system-wide counters could be exposed via a
>> semi-standardized interface, perhaps within debugfs/sysfs. The interface
>> could not be completely standard, as the list of counters exposed varies
>> substantially by vendor and model. Nevertheless, the mechanics of
>> discovering, enabling, reading, and disabling counters can be
>> standardized, as can a small set of universally meaningful counters like
>> total GPU utilization. This would permit a vendor-independent GPU top
>> app as suggested, as is I believe currently possible with
>> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
>>
>> It looks like this discussion is dormant. Could we try to get this
>> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
>> to diagnose without access to the counters, so I'm eager for a mainline
>> solution to be implemented.
> 
> I spent a bit of time thinking about it and looking at different
> solutions.
> 
> debugfs/sysfs might not be the best solution, especially if we think
> about the multi-user case (several instances of GPU perfmon tool
> running in parallel), if we want it to work properly we need a way to
> instantiate several perf monitors and let the driver add values to all
> active perfmons everytime a dump happens (no matter who triggered the
> dump). That's exactly what mali_kbase/gator does BTW. That's achievable
> through debugs if we consider exposing a knob to instantiate such
> perfmon instances, but that also means risking perfmon leaks if the
> user does not take care of killing the perfmon it created when it's done
> with it (or when it crashes). It might also prove hard to expose that to
> non-root users in a secure way.

Note that mali_kbase only has to maintain multiple sets of values
because it provides backwards compatibility with an old version of the
driver. You could maintain one set of counter values and simply ensure
they are big enough not to wrap (or handle wrapping, but that is ugly
and unlikely to happen). Very early versions of kbase simply exposed the
hardware functionality (dump counters and reset) and therefore faced the
issue that there could only ever be one user of counters at a time. This
was "fixed" by emulating the existence of multiple interfaces (vinstr -
"virtual" instrumentation). If I was redesigning it from scratch then
I'd definitely remove the "reset" part of the interface and leave it for
the clients to latch the first value (of each counter) and subtract that
from subsequent counter reads.

Sadly the hardware doesn't help here and some (most?) GPU revisions will
saturate the counters rather than wrapping. Hence the need to frequently
poll and reset the counters, even if you only want the 'start'/'end' values.
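
On the client side the scheme would look roughly like this (sketch;
"raw" is whatever value the interface ends up exposing):

/* Latch-and-subtract sketch: snapshot the counter when monitoring
 * starts and subtract it from later reads; a saturated raw value only
 * gives a lower bound. */
struct counter_sample {
        u32 start;                      /* value latched at "begin" time */
};

static u64 counter_delta(const struct counter_sample *s, u32 raw,
                         bool *saturated)
{
        /* Most GPU revisions saturate at 0xffffffff instead of wrapping,
         * so past that point the delta is only "at least this much". */
        *saturated = (raw == U32_MAX);
        return (u64)raw - s->start;
}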

Having only one set of counters might simplify the potential for "leaks"
- although you still want to ensure that if nobody cares you stop
collecting counters...

Steve

> I also had a quick look at the perf_event interface to see if we could
> extend it to support monitoring GPU events. I might be wrong as I
> didn't spend much time investigating how it works, but it seems that
> perf counters are saved/dumped/restored at each thread context switch,
> which is not what we want here (might add extra perfcnt dump points
> thus impacting GPU perfs more than we expect).
> 
> So maybe the best option is a pseudo-generic ioctl-based interface to
> expose those perf counters.
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
Boris Brezillon May 13, 2019, 1:39 p.m. UTC | #14
On Mon, 13 May 2019 13:48:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 12/05/2019 14:38, Boris Brezillon wrote:
> > On Sat, 11 May 2019 15:32:20 -0700
> > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >   
> >> Hi all,
> >>
> >> As Steven Price explained, the "GPU top" kbase approach is often more
> >> useful and accurate than per-draw timing. 
> >>
> >> For a 3D game inside a GPU-accelerated desktop, the games' counters
> >> *should* include desktop overhead. This external overhead does affect
> >> the game's performance, especially if the contexts are competing for
> >> resources like memory bandwidth. An isolated sample is easy to achieve
> >> running only the app of interest; in ideal conditions (zero-copy
> >> fullscreen), desktop interference is negligible. 
> >>
> >> For driver developers, the system-wide measurements are preferable,
> >> painting a complete system picture and avoiding disruptions. There is no
> >> risk of confusion, as the driver developers understand how the counters
> >> are exposed. Further, benchmarks rendering direct to a GBM surface are
> >> available (glmark2-es2-drm), eliminating interference even with poor
> >> desktop performance.
> >>
> >> For app developers, the confusion of multi-context interference is
> >> unfortunate. Nevertheless, if enabling counters were to slow down an
> >> app, the confusion could be worse. Consider second-order changes in the
> >> app's performance characteristics due to slowdown: if techniques like
> >> dynamic resolution scaling are employed, the counters' results can be
> >> invalid.  Likewise, even if the lower-performance counters are
> >> appropriate for purely graphical workloads, complex apps with variable
> >> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> >> confound counters. Low-overhead system-wide measurements mitigate these
> >> concerns.  
> > 
> > I'd just like to point out that dumping counters the way
> > mali_kbase/gator does likely has an impact on perfs (at least on some
> > GPUs) because of the clean+invalidate-cache that happens before (or
> > after, I don't remember) each dump. IIUC and this cache is actually  
> 
> The clean is required after the dump to ensure that the CPU can see the
> dumped counters (otherwise they could be stuck in the GPU's cache).

Just to be clear, I was not questioning the need for a cache clean
here; I was simply pointing out that dumping counters is not free even
when we don't add those serialization points. Note that dumping
counters also has an impact on the cache-miss counter of future dumps,
but there's nothing we can do about that no matter the solution we go
for.

> Note
> that if you don't immediately need to observe the values the clean can
> be deferred. E.g. if each dump you target a different memory region then
> you could perform a flush only after several dumps.

Correct. Having a pool of dump buffers might be an option if we need to
minimize the impact of dumps at some point.

> 
> > global and not a per address space thing (which would be possible if the
> > cache lines contain a tag attaching them to a specific address space),
> > that means all jobs running when the clean+invalidate happens will have
> > extra cache misses after each dump. Of course, that's not as invasive as
> > the full serialization that happens with my solution, but it's not free
> > either.  
> 
> Cache cleans (at the L2 cache level) are global. There are L1 caches
> (and TLBs) but they are not relevant to counter dumping.

Okay.

> 
> In practice cache cleans are not that expensive most of the time. As you
> note - mali_kbase doesn't bother to avoid them when doing counter capture.
> 
> It should also be unnecessary to "invalidate" - a clean should be
> sufficient. The invalidate is only required if e.g. MMU page tables have
> been modified as well.

I already dropped the invalidate in my patch as we only have a global
address space right now and the buffer used to store perfcnt values is
always the same. We'll have to rework that when support for per-context
address spaces is added.

> That would reduce the affect on currently running
> jobs. But I notice that mali_kbase doesn't appear to use the "clean
> only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being
> broken though.

'clean-only' seems to work fine on T860 at least.
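
For reference, a clean-only flush boils down to something like this
(rough sketch using the gpu_write()/gpu_read() helpers from patch 1;
the command/IRQ bit values follow the mali_kbase layout and the
panfrost-side names are just illustrative):

#define GPU_CMD_CLEAN_CACHES            0x07
#define GPU_IRQ_CLEAN_CACHES_COMPLETED  BIT(17)

static void panfrost_perfcnt_clean_caches(struct panfrost_device *pfdev)
{
        gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_CLEAN_CACHES_COMPLETED);
        gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_CACHES);

        /* Polling keeps the sketch short; real code would wait for the IRQ. */
        while (!(gpu_read(pfdev, GPU_INT_RAWSTAT) &
                 GPU_IRQ_CLEAN_CACHES_COMPLETED))
                cpu_relax();
}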

> 
> >>
> >> As Rob Clark suggested, system-wide counters could be exposed via a
> >> semi-standardized interface, perhaps within debugfs/sysfs. The interface
> >> could not be completely standard, as the list of counters exposed varies
> >> substantially by vendor and model. Nevertheless, the mechanics of
> >> discovering, enabling, reading, and disabling counters can be
> >> standardized, as can a small set of universally meaningful counters like
> >> total GPU utilization. This would permit a vendor-independent GPU top
> >> app as suggested, as is I believe currently possible with
> >> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
> >>
> >> It looks like this discussion is dormant. Could we try to get this
> >> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> >> to diagnose without access to the counters, so I'm eager for a mainline
> >> solution to be implemented.  
> > 
> > I spent a bit of time thinking about it and looking at different
> > solutions.
> > 
> > debugfs/sysfs might not be the best solution, especially if we think
> > about the multi-user case (several instances of GPU perfmon tool
> > running in parallel), if we want it to work properly we need a way to
> > instantiate several perf monitors and let the driver add values to all
> > active perfmons everytime a dump happens (no matter who triggered the
> > dump). That's exactly what mali_kbase/gator does BTW. That's achievable
> > through debugs if we consider exposing a knob to instantiate such
> > perfmon instances, but that also means risking perfmon leaks if the
> > user does not take care of killing the perfmon it created when it's done
> > with it (or when it crashes). It might also prove hard to expose that to
> > non-root users in a secure way.  
> 
> Note that mali_kbase only has to maintain multiple sets of values
> because it provides backwards compatibility to an old version of the
> driver. You could maintain one set of counter values and simply ensure
> they are big enough not to wrap (or handle wrapping, but that is ugly
> and unlikely to happen). Very early versions of kbase simply exposed the
> hardware functionality (dump counters and reset) and therefore faced the
> issue that there could only ever be one user of counters at a time. This
> was "fixed" by emulating the existence of multiple interfaces (vinstr -
> "virtual" instrumentation). If I was redesigning it from scratch then
> I'd definitely remove the "reset" part of the interface and leave it for
> the clients to latch the first value (of each counter) and subtract that
> from subsequent counter reads.

Noted.

> 
> Sadly the hardware doesn't help here and some (most?) GPU revisions will
> saturate the counters rather than wrapping. Hence the need to frequently
> poll and reset the counters, even if you only want the 'start'/'end' values.

Not sure wrapping would be better, as you can't tell how many times
you've reached the max val.

BTW, counters are reset every time we do a dump, right? If that's the
case, there's no risk of having end_val < start_val, and the only risk is
getting inaccurate values when counter_val > U32_MAX, but reaching
U32_MAX already means that this event occurred a lot during the
monitoring period.

> 
> Having only one set of counters might simplify the potential for "leaks"
> - although you still want to ensure that if nobody cares you stop
> collecting counters...

Handling that with a debugfs-based iface implies implementing some
heuristic like 'counters haven't been dumped for a long time, let's
stop monitoring them', and I'm not a big fan of such things.

This being said, I think I'll go for a simple debugfs-based iface to
unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
can agree on explicitly marking it as "unstable" so that when we settle
on a generic interface to expose such counters we can get rid of the
old one.
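
To give an idea of how small the debugfs side can be, here is a sketch
(panfrost_perfcnt_dump(), the buf/num_counters fields and the file name
are placeholders, not necessarily what will land):

/*
 * Sketch of an explicitly-unstable debugfs dump. Assumes a
 * panfrost_perfcnt_dump() helper that samples the HW and fills
 * pfdev->perfcnt->buf with one 64-bit value per counter.
 */
static int panfrost_perfcnt_show(struct seq_file *m, void *unused)
{
	struct panfrost_device *pfdev = m->private;
	unsigned int i;
	int ret;

	ret = panfrost_perfcnt_dump(pfdev);
	if (ret)
		return ret;

	for (i = 0; i < pfdev->perfcnt->num_counters; i++)
		seq_printf(m, "%u: %llu\n", i, pfdev->perfcnt->buf[i]);

	return 0;
}
DEFINE_SHOW_ATTRIBUTE(panfrost_perfcnt);

/* Registered at init time with something like:
 * debugfs_create_file("perfcnt", 0444, minor->debugfs_root, pfdev,
 *			&panfrost_perfcnt_fops);
 */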
Steven Price May 13, 2019, 2:13 p.m. UTC | #15
On 13/05/2019 14:39, Boris Brezillon wrote:
> On Mon, 13 May 2019 13:48:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 12/05/2019 14:38, Boris Brezillon wrote:
>>> On Sat, 11 May 2019 15:32:20 -0700
>>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
>>>   
>>>> Hi all,
>>>>
>>>> As Steven Price explained, the "GPU top" kbase approach is often more
>>>> useful and accurate than per-draw timing. 
>>>>
>>>> For a 3D game inside a GPU-accelerated desktop, the games' counters
>>>> *should* include desktop overhead. This external overhead does affect
>>>> the game's performance, especially if the contexts are competing for
>>>> resources like memory bandwidth. An isolated sample is easy to achieve
>>>> running only the app of interest; in ideal conditions (zero-copy
>>>> fullscreen), desktop interference is negligible. 
>>>>
>>>> For driver developers, the system-wide measurements are preferable,
>>>> painting a complete system picture and avoiding disruptions. There is no
>>>> risk of confusion, as the driver developers understand how the counters
>>>> are exposed. Further, benchmarks rendering direct to a GBM surface are
>>>> available (glmark2-es2-drm), eliminating interference even with poor
>>>> desktop performance.
>>>>
>>>> For app developers, the confusion of multi-context interference is
>>>> unfortunate. Nevertheless, if enabling counters were to slow down an
>>>> app, the confusion could be worse. Consider second-order changes in the
>>>> app's performance characteristics due to slowdown: if techniques like
>>>> dynamic resolution scaling are employed, the counters' results can be
>>>> invalid.  Likewise, even if the lower-performance counters are
>>>> appropriate for purely graphical workloads, complex apps with variable
>>>> CPU overhead (e.g. from an FPS-dependent physics engine) can further
>>>> confound counters. Low-overhead system-wide measurements mitigate these
>>>> concerns.  
>>>
>>> I'd just like to point out that dumping counters the way
>>> mali_kbase/gator does likely has an impact on perfs (at least on some
>>> GPUs) because of the clean+invalidate-cache that happens before (or
>>> after, I don't remember) each dump. IIUC and this cache is actually  
>>
>> The clean is required after the dump to ensure that the CPU can see the
>> dumped counters (otherwise they could be stuck in the GPU's cache).
> 
> Just to make it clear, I was not questioning the need for a cache clean
> here, I was simply pointing out that dumping counters is not free even
> when we don't add those serialization points. Note that dumping
> counters also has an impact on the cache-miss counter of future dumps,
> but there's nothing we can do about that no matter the solution we go
> for.

Indeed, counters go through the caches, so they will clearly have an impact
on cache performance. Usually, though, the impact is sufficiently "in the
noise" that it doesn't matter. In any case this is a hardware limitation.

> [..]
>>
>> Sadly the hardware doesn't help here and some (most?) GPU revisions will
>> saturate the counters rather than wrapping. Hence the need to frequently
>> poll and reset the counters, even if you only want the 'start'/'end' values.
> 
> Not sure wrapping would be better, as you can't tell how many times
> you've reached the max val.
> 
> BTW, counters are reset every time we do a dump, right? If that's the
> case, there's no risk of having end_val < start_val, and the only risk is
> getting inaccurate values when counter_val > U32_MAX, but reaching
> U32_MAX already means that this event occurred a lot during the
> monitoring period.

Yes, the h/w resets the counters when dumping. What I intended to say
(and re-reading it, I didn't do a good job) was that it would be quite
helpful if the hardware didn't reset and didn't saturate. That way
multiple users could simply observe the counter values. The combination
of saturating and resetting on read means that there can only be one
entity reading the counter values from the hardware and summing them up.

I *think* later hardware might have dropped the saturating logic
(because it costs gates and is fairly useless for software), but sadly
because of backwards compatibility the resetting on read behaviour isn't
going to change.
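
To illustrate what I mean by "summing up": the single owner just folds
every (saturating, reset-on-read) u32 sample into a wider running total
that any number of clients can then read and diff. A sketch, with
made-up names:

struct perfcnt_totals {
	u64 values[64];		/* running, never-resetting totals */
};

/*
 * Called by the one entity that actually reads the hardware, once per
 * dump. 'sample' is the raw u32 block the GPU wrote out; a saturated
 * entry (0xffffffff) just means precision was lost, the total keeps
 * growing.
 */
static void perfcnt_accumulate(struct perfcnt_totals *t,
			       const u32 *sample, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n && i < ARRAY_SIZE(t->values); i++)
		t->values[i] += sample[i];
}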

>>
>> Having only one set of counters might simplify the potential for "leaks"
>> - although you still want to ensure that if nobody cares you stop
>> collecting counters...
> 
> Handling that with a debugfs-based iface implies implementing some
> heuristic like 'counters haven't been dumped for a long time, let's
> stop monitoring them', and I'm not a big fan of such things.

Agreed, it's fairly ugly. If you expose "native" sized counters (i.e.
u32) then there's a reasonable heuristic: if (2^32 / GPU clock frequency)
has passed then the counters might have overflowed, so they are unlikely
to be useful to anyone still looking at them.
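
In other words, roughly (a sketch; how the last dump time is tracked is
up to the implementation):

/*
 * With u32 counters that are reset on dump, anything older than
 * 2^32 GPU cycles may have saturated, so treat it as stale.
 */
static bool perfcnt_probably_stale(u64 last_dump_ns, u64 now_ns,
				   u64 gpu_clk_hz)
{
	u64 overflow_ns = div64_u64(BIT_ULL(32) * NSEC_PER_SEC, gpu_clk_hz);

	return now_ns - last_dump_ns > overflow_ns;
}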

> This being said, I think I'll go for a simple debugfs-based iface to
> unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
> can agree on explicitly marking it as "unstable" so that when we settle
> on a generic interface to expose such counters we can get rid of the
> old one.

A generic interface would be good, but short-term I can definitely see
the value of exporting the counters in debugfs.

Steve
Alyssa Rosenzweig May 13, 2019, 2:56 p.m. UTC | #16
> This being said, I think I'll go for a simple debugfs-based iface to
> unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
> can agree on explicitly marking it as "unstable" so that when we settle
> on a generic interface to expose such counters we can get rid of the
> old one.

Thank you :)
Jordan Crouse May 13, 2019, 3 p.m. UTC | #17
On Sun, May 12, 2019 at 03:40:26PM +0200, Boris Brezillon wrote:
> On Tue, 30 Apr 2019 09:49:51 -0600
> Jordan Crouse <jcrouse@codeaurora.org> wrote:
> 
> > On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> > > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> > > <boris.brezillon@collabora.com> wrote:  
> > > >
> > > > +Rob, Eric, Mark and more
> > > >
> > > > Hi,
> > > >
> > > > On Fri, 5 Apr 2019 16:20:45 +0100
> > > > Steven Price <steven.price@arm.com> wrote:
> > > >  
> > > > > On 04/04/2019 16:20, Boris Brezillon wrote:  
> > > > > > Hello,
> > > > > >
> > > > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > > > These will be used by the mesa driver (should be posted soon).
> > > > > >
> > > > > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > > > > where perf counters are retrieved on a per-job basis. This allows one
> > > > > > to have get accurate results when there are users using the GPU
> > > > > > concurrently.
> > > > > > AFAICT, the mali kbase is using a different approach where several
> > > > > > users can register a performance monitor but with no way to have fined
> > > > > > grained control over what job/GPU-context to track.  
> > > > >
> > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > > > > be from different contexts (address spaces), and mali_kbase also fully
> > > > > uses the _NEXT registers. So there can be a job from one context
> > > > > executing on slot 0 and a job from a different context waiting in the
> > > > > _NEXT registers. (And the same for slot 1). This means that there's no
> > > > > (visible) gap between the first job finishing and the second job
> > > > > starting. Early versions of the driver even had a throttle to avoid
> > > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > > > > IRQ - but thankfully that's gone.
> > > > >
> > > > > The upshot is that it's basically impossible to measure "per-job"
> > > > > counters when running at full speed. Because multiple jobs are running
> > > > > and the driver doesn't actually know when one ends and the next starts.
> > > > >
> > > > > Since one of the primary use cases is to draw pretty graphs of the
> > > > > system load [1], this "per-job" information isn't all that relevant (and
> > > > > minimal performance overhead is important). And if you want to monitor
> > > > > just one application it is usually easiest to ensure that it is the only
> > > > > thing running.
> > > > >
> > > > > [1]
> > > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > > > >  
> > > > > > This design choice comes at a cost: every time the perfmon context
> > > > > > changes (the perfmon context is the list of currently active
> > > > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > > > corrupting counters that will be dumped by previous jobs.
> > > > > >
> > > > > > Let me know if that's an issue and if you think we should approach
> > > > > > things differently.  
> > > > >
> > > > > It depends what you expect to do with the counters. Per-job counters are
> > > > > certainly useful sometimes. But serialising all jobs can mess up the
> > > > > thing you are trying to measure the performance of.  
> > > >
> > > > I finally found some time to work on v2 this morning, and it turns out
> > > > implementing global perf monitors as done in mali_kbase means rewriting
> > > > almost everything (apart from the perfcnt layout stuff). I'm not against
> > > > doing that, but I'd like to be sure this is really what we want.
> > > >
> > > > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > > > through the pipe_query/AMD_perfmon interface if we don't have this
> > > > job (or at least draw call) granularity? If not, should we keep the
> > > > solution I'm proposing here to make sure counters values are accurate,
> > > > or should we expose perf counters through a non-standard API?  
> > > 
> > > I think if you can't do per-draw level granularity, then you should
> > > not try to implement AMD_perfmon..  instead the use case is more for a
> > > sort of "gpu top" app (as opposed to something like frameretrace which
> > > is taking per-draw-call level measurements from within the app.
> > > Things that use AMD_perfmon are going to, I think, expect to query
> > > values between individual glDraw calls, and you probably don't want to
> > > flush tile passes 500 times per frame.
> > > 
> > > (Although, I suppose if there are multiple GPUs where perfcntrs work
> > > this way, it might be an interesting exercise to think about coming up
> > > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
> > > you could have a single userspace tool that works across several
> > > different drivers.)  
> > 
> > I agree. We've been pondering a lot of the same issues for Adreno.
> 
> So, you also have those global perf counters that can't be retrieved
> through the cmdstream? After the discussion I had with Rob I had the
> impression freedreno was supporting the AMD_perfmon interface in a
> proper way.

It is true that we can read most of the counters from the cmdstream but we have
a limited number of global physical counters that have to be shared across
contexts and a metric ton of countables that can be selected. As long as we
have just the one process running AMD_perfmon then we are fine, but it breaks
down after that.

I am also interested in the 'gpu top' use case, which can be somewhat emulated
through the cmdstream assuming that we stick to a finite set of counters and
countables that won't conflict with extensions. This is fine for basic work, but
anybody doing any serious performance work would run into limitations
quickly.

And finally, the kernel uses a few counters for its own purposes, and I hate the
"gentlemen's agreement" between userspace and the kernel as to which physical
counters each gets ownership of. It would be a lot better to turn that into an
explicit dynamic reservation, which I feel would be a natural offshoot of a
generic interface.

Jordan