[RFC,0/4] drm/panfrost: Expose memory usage stats through fdinfo

Message ID: 20230104130308.3467806-1-boris.brezillon@collabora.com

Message

Boris Brezillon Jan. 4, 2023, 1:03 p.m. UTC
Hello,

Here's an attempt at exposing some memory usage stats through fdinfo,
which recently proved useful in debugging a memory leak. Not entirely
sure the names I chose are accurate, so feel free to propose
alternatives, and let me know if you see any other mem-related stuff
that would be interesting to expose.

Regards,

Boris

Boris Brezillon (4):
  drm/panfrost: Provide a dummy show_fdinfo() implementation
  drm/panfrost: Track BO resident size
  drm/panfrost: Add a helper to retrieve MMU context stats
  drm/panfrost: Expose some memory related stats through fdinfo

 drivers/gpu/drm/panfrost/panfrost_drv.c       | 24 ++++++++++++++++-
 drivers/gpu/drm/panfrost/panfrost_gem.h       |  7 +++++
 .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c       | 27 +++++++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_mmu.h       | 10 +++++++
 5 files changed, 68 insertions(+), 1 deletion(-)
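For context, the general shape of an fdinfo hook is sketched below. This is
only an illustrative sketch, not the contents of the series: the stats
structure, the panfrost_mmu_get_stats() helper and the exact key strings are
hypothetical; the fixed points are the standard file_operations::show_fdinfo
callback and the "key: value" line format read back through
/proc/<pid>/fdinfo/<fd>.

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/seq_file.h>
#include <drm/drm_file.h>
#include <drm/drm_gem.h>
#include <drm/drm_ioctl.h>

/* Hypothetical stats struct; not the one added by the series. */
struct panfrost_mmu_stats {
	u64 total;	/* bytes allocated in the file's MMU context */
	u64 resident;	/* bytes currently backed by system memory */
};

/* Hypothetical helper; the series adds something along these lines. */
void panfrost_mmu_get_stats(struct drm_file *file,
			    struct panfrost_mmu_stats *stats);

static void panfrost_show_fdinfo(struct seq_file *m, struct file *f)
{
	struct drm_file *file = f->private_data;
	struct panfrost_mmu_stats stats;

	panfrost_mmu_get_stats(file, &stats);

	/* One "key: value" pair per line. Key names here are illustrative. */
	seq_printf(m, "drm-driver:\tpanfrost\n");
	seq_printf(m, "drm-total-memory:\t%llu KiB\n", stats.total >> 10);
	seq_printf(m, "drm-resident-memory:\t%llu KiB\n", stats.resident >> 10);
}

/*
 * The callback is wired up through the driver's file_operations, which
 * means open-coding what DEFINE_DRM_GEM_FOPS() would otherwise provide.
 */
static const struct file_operations panfrost_drm_driver_fops = {
	.owner		= THIS_MODULE,
	.open		= drm_open,
	.release	= drm_release,
	.unlocked_ioctl	= drm_ioctl,
	.compat_ioctl	= drm_compat_ioctl,
	.poll		= drm_poll,
	.read		= drm_read,
	.llseek		= noop_llseek,
	.mmap		= drm_gem_mmap,
	.show_fdinfo	= panfrost_show_fdinfo,
};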

Comments

Steven Price Jan. 16, 2023, 10:30 a.m. UTC | #1
On 04/01/2023 13:03, Boris Brezillon wrote:
> Hello,
> 
> Here's an attempt at exposing some memory usage stats through fdinfo,
> which recently proved useful in debugging a memory leak. Not entirely
> sure the names I chose are accurate, so feel free to propose
> alternatives, and let me know if you see any other mem-related stuff
> that would be interesting to expose.

Sorry it's taken me a while to look at this - I'm still working through
the holiday backlog.

The names look reasonable to me, and I gave this a quick spin and it
seemed to work (the numbers reported look reasonable). As Daniel
suggested, it would be good if some of the boilerplate fdinfo code could
be moved to generic code (although to be fair there's not much here).

Of course what we're missing is the 'engine' usage information for
gputop - it's been on my todo list for a while, but I'm more than happy
for you to do it for me ;) It's somewhat more tricky because of the
whole 'queuing' on slots mechanism that Mali has. But we obviously
shouldn't block this memory implementation on that, it can be added
afterwards.

Anyway, for the series as it is:

Reviewed-by: Steven Price <steven.price@arm.com>

Thanks,

Steve

> Regards,
> 
> Boris
> 
> Boris Brezillon (4):
>   drm/panfrost: Provide a dummy show_fdinfo() implementation
>   drm/panfrost: Track BO resident size
>   drm/panfrost: Add a helper to retrieve MMU context stats
>   drm/panfrost: Expose some memory related stats through fdinfo
> 
>  drivers/gpu/drm/panfrost/panfrost_drv.c       | 24 ++++++++++++++++-
>  drivers/gpu/drm/panfrost/panfrost_gem.h       |  7 +++++
>  .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |  1 +
>  drivers/gpu/drm/panfrost/panfrost_mmu.c       | 27 +++++++++++++++++++
>  drivers/gpu/drm/panfrost/panfrost_mmu.h       | 10 +++++++
>  5 files changed, 68 insertions(+), 1 deletion(-)
>
Boris Brezillon Jan. 16, 2023, 11:05 a.m. UTC | #2
Hi Steven,

On Mon, 16 Jan 2023 10:30:21 +0000
Steven Price <steven.price@arm.com> wrote:

> On 04/01/2023 13:03, Boris Brezillon wrote:
> > Hello,
> > 
> > Here's an attempt at exposing some memory usage stats through fdinfo,
> > which recently proved useful in debugging a memory leak. Not entirely
> > sure the names I chose are accurate, so feel free to propose
> > alternatives, and let me know if you see any other mem-related stuff
> > that would be interesting to expose.  
> 
> Sorry it's taken me a while to look at this - I'm still working through
> the holiday backlog.
> 
> The names look reasonable to me, and I gave this a quick spin and it
> seemed to work (the numbers reported look reasonable). As Daniel
> suggested, it would be good if some of the boilerplate fdinfo code could
> be moved to generic code (although to be fair there's not much here).
> 
> Of course what we're missing is the 'engine' usage information for
> gputop - it's been on my todo list for a while, but I'm more than happy
> for you to do it for me ;) It's somewhat more tricky because of the
> whole 'queuing' on slots mechanism that Mali has. But we obviously
> shouldn't block this memory implementation on that, it can be added
> afterwards.

Yeah, we've been discussing this drm-engine-xxx feature with Chris, and
I was telling him there's no easy way to get accurate numbers when
_NEXT queuing is involved. It all depends on whether we're able to
process the first job DONE interrupt before the second one kicks in, and
even then, we can't tell for sure how long the second job has been
running when we get to process the first job interrupt. Inserting
WRITE_JOB(CYCLE_COUNT) before a job chain is doable, but inserting it
after isn't, and I'm not sure we want to add such tricks to the kernel
driver anyway. Don't know if you have any better ideas. If not, I guess
we can live with this inaccuracy and still expose drm-engine-xxx...

Regards,

Boris
Steven Price Jan. 16, 2023, 12:21 p.m. UTC | #3
On 16/01/2023 11:05, Boris Brezillon wrote:
> Hi Steven,
> 
> On Mon, 16 Jan 2023 10:30:21 +0000
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 04/01/2023 13:03, Boris Brezillon wrote:
>>> Hello,
>>>
>>> Here's an attempt at exposing some memory usage stats through fdinfo,
>>> which recently proved useful in debugging a memory leak. Not entirely
>>> sure the names I chose are accurate, so feel free to propose
>>> alternatives, and let me know if you see any other mem-related stuff
>>> that would be interesting to expose.  
>>
>> Sorry it's taken me a while to look at this - I'm still working through
>> the holiday backlog.
>>
>> The names look reasonable to me, and I gave this a quick spin and it
>> seemed to work (the numbers reported look reasonable). As Daniel
>> suggested, it would be good if some of the boilerplate fdinfo code could
>> be moved to generic code (although to be fair there's not much here).
>>
>> Of course what we're missing is the 'engine' usage information for
>> gputop - it's been on my todo list for a while, but I'm more than happy
>> for you to do it for me ;) It's somewhat more tricky because of the
>> whole 'queuing' on slots mechanism that Mali has. But we obviously
>> shouldn't block this memory implementation on that, it can be added
>> afterwards.
> 
> Yeah, we've been discussing this drm-engine-xxx feature with Chris, and
> I was telling him there's no easy way to get accurate numbers when
> _NEXT queuing is involved. It all depends on whether we're able to
> process the first job DONE interrupt before the second one kicks in, and
> even then, we can't tell for sure how long the second job has been
> running when we get to process the first job interrupt. Inserting
> WRITE_JOB(CYCLE_COUNT) before a job chain is doable, but inserting it
> after isn't, and I'm not sure we want to add such tricks to the kernel
> driver anyway. Don't know if you have any better ideas. If not, I guess
> we can live with this inaccuracy and still expose drm-engine-xxx...

It's fun, isn't it ;) I spent many hours in the past puzzling over this!

Realistically it doesn't make sense for the kernel to get involved in
inserting write_jobs. You open up so many cans of worms regarding how to
manage the memory for the GPU to write into. The closed DDK handles this
by having the user space driver add these jobs and the tooling capture
data from both the kernel and user space.

But for just the gputop type tooling I don't think we need that level of
accuracy[1]. If you ignore the impacts of interrupt latency then it's
possible to tell which job the GPU is currently executing and do the
accounting. Obviously interrupt latency is far from zero (that's why we
have _NEXT) but it's usually small enough that it won't skew the results
too far.
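
To make that concrete, here is a rough sketch of such an accounting scheme,
assuming interrupt latency is negligible. The structure and function names
are made up for illustration; this is not the panfrost job IRQ code.

#include <linux/ktime.h>
#include <linux/types.h>

struct slot_busy_stats {
	ktime_t busy_start;	/* when the currently running job "started" */
	u64 busy_ns;		/* accumulated time, to back drm-engine-xxx */
	bool next_queued;	/* a second job was written to _NEXT */
};

/* Called from the submission path when a job is written to HEAD or _NEXT. */
static void slot_account_job_start(struct slot_busy_stats *slot, bool to_next)
{
	if (to_next)
		slot->next_queued = true;	/* start time unknown until prev DONE */
	else
		slot->busy_start = ktime_get();	/* slot was idle, job starts now */
}

/* Called from the job DONE interrupt handler. */
static void slot_account_job_done(struct slot_busy_stats *slot)
{
	ktime_t now = ktime_get();

	/* Credit the job that just completed with the elapsed wall time. */
	slot->busy_ns += ktime_to_ns(ktime_sub(now, slot->busy_start));

	/*
	 * If a job was queued in _NEXT, treat it as having started right
	 * now. If its own DONE interrupt has already fired, it ends up
	 * accounted ~0 ns -- the small error discussed above.
	 */
	if (slot->next_queued) {
		slot->busy_start = now;
		slot->next_queued = false;
	}
}

Attributing the accumulated time to a specific client for the drm-engine-xxx
line would additionally require recording which file's job occupies the slot
and _NEXT, but the timing side reduces to something like the above.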

Obviously in the case you describe (second DONE interrupt before the
first one is handled) you also get the weird reporting that the second
job took no time. Which is 'wrong' but clearly the second job was
'quick' so 0 isn't likely to be too far out. And fdinfo isn't exposing
these 'per job' timings so it's unlikely to be very visible.

Thanks,

Steve

[1] As a side note, there are a bunch of issues even with the data from
write_job(cycle_count): the frequency of the GPU may change, the
frequency of the memory system may change, a job running on the other
slot may be competing for resources on the GPU, etc. When actually
profiling an application reliably it becomes necessary to lock
frequencies and ensure that it is the only thing running. I think the
DDK even includes an option to run each job by itself to avoid
cross-slot impacts (although that is unrealistic in itself).

> Regards,
> 
> Boris