mbox series

[v4,0/3] Improve gpu_scheduler trace events

Message ID 20240610132707.61404-1-pierre-eric.pelloux-prayer@amd.com (mailing list archive)
Headers show
Series Improve gpu_scheduler trace events | expand

Message

Pierre-Eric Pelloux-Prayer June 10, 2024, 1:26 p.m. UTC
v3: https://lists.freedesktop.org/archives/dri-devel/2024-June/456792.html

Changes since v3:
* trace device name instead of drm_device primary index
* no pointer deref in the TP_printk anymore. Instead the fence context/seqno
are saved in TP_fast_assign

Pierre-Eric Pelloux-Prayer (3):
  drm/sched: add device name to the drm_sched_process_job event
  drm/sched: cleanup gpu_scheduler trace events
  drm/sched: trace dependencies for gpu jobs

 .../gpu/drm/scheduler/gpu_scheduler_trace.h   | 97 +++++++++++++++----
 drivers/gpu/drm/scheduler/sched_entity.c      |  8 +-
 drivers/gpu/drm/scheduler/sched_main.c        |  2 +-
 3 files changed, 84 insertions(+), 23 deletions(-)

Comments

Daniel Vetter June 10, 2024, 4:31 p.m. UTC | #1
On Mon, Jun 10, 2024 at 03:26:53PM +0200, Pierre-Eric Pelloux-Prayer wrote:
> v3: https://lists.freedesktop.org/archives/dri-devel/2024-June/456792.html
> 
> Changes since v3:
> * trace device name instead of drm_device primary index
> * no pointer deref in the TP_printk anymore. Instead the fence context/seqno
> are saved in TP_fast_assign

Some high-level comments:

- Quick summary of the what, why and how in the cover letter would be
  great.

- Link to the userspace, once you have that. At least last time we chatted
  that was still wip.

- Maybe most important to make this actually work, work well, and work
  long-term: I think we should clearly commit to these tracepoints being
  stable uapi, and document that by adding a stable tracepoint section in
  the drm uapi book.

  And then get acks from a pile of driver maintainers that they really
  think this is a good idea and has a future. Should also help with
  getting good review on the tracepoints themselves.

  Otherwise I fear we'll miss the mark again and still force userspace to
  hand-roll tracing for every driver, or maybe worse, even specific kernel
  versions.

Cheers, Sima

> 
> Pierre-Eric Pelloux-Prayer (3):
>   drm/sched: add device name to the drm_sched_process_job event
>   drm/sched: cleanup gpu_scheduler trace events
>   drm/sched: trace dependencies for gpu jobs
> 
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   | 97 +++++++++++++++----
>  drivers/gpu/drm/scheduler/sched_entity.c      |  8 +-
>  drivers/gpu/drm/scheduler/sched_main.c        |  2 +-
>  3 files changed, 84 insertions(+), 23 deletions(-)
> 
> -- 
> 2.40.1
>
Pierre-Eric Pelloux-Prayer June 12, 2024, 2:06 p.m. UTC | #2
Hi,

Le 10/06/2024 à 18:31, Daniel Vetter a écrit :
 > On Mon, Jun 10, 2024 at 03:26:53PM +0200, Pierre-Eric Pelloux-Prayer 
wrote:
 >> v3: 
https://lists.freedesktop.org/archives/dri-devel/2024-June/456792.html
 >>
 >> Changes since v3:
 >> * trace device name instead of drm_device primary index
 >> * no pointer deref in the TP_printk anymore. Instead the fence 
context/seqno
 >> are saved in TP_fast_assign
 >
 > Some high-level comments:
 >
 > - Quick summary of the what, why and how in the cover letter would be
 >    great.
Oops, I forgot to copy-paste the cover letter from v3.

Here it is, maybe whe 'Why' is missing? I'll improve it for v5.
----
The main ideas implemented are: trace dependencies between jobs and
identify the GPU running jobs (because 'ring' is not a unique attribute).

Changes from v2:
* dropped all amdgpu changes. The goal here is to make the gpu_scheduler
trace events usable by a tool, without dependencies on driver-specific
events
* dropped the vblank related changes. I'll post a separate series to
cover drm/vblank events later.
* reworked fence printing so it's coherent between all events.
* added a dev_index to let the tools parsing the events which GPU is
running a job. It wasn't needed before the gpu scheduler switch to
workqueues because the queue pid was enough to identify the hardware queue.
* dropped the changes to trace the "why" of a dependency. I implemented
a version based on Sima's suggestion using xa_tag_pointer but I'm not
convinced it's really useful, so I'm dropping it for now.

Open questions:
* assuming the new fence printing format is agreed on,
should we add some code to preserve the old format?
* checkpatch doesn't like the indentation in the last patch, because
everything is vertically aligned to 'TP_fast_assign('. How to best fix it?

    WARNING: Statements should start on a tabstop
    #82: FILE: drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:80:
    +        unsigned long idx;
----



 >
 > - Link to the userspace, once you have that. At least last time we 
chatted
 >    that was still wip.

Userspace is at 
https://gitlab.freedesktop.org/tomstdenis/umr/-/merge_requests/37

Note that most of it also works fine without these patches, but some 
features will be lacking:
* job dependencies
* multi-GPU system won't work as expected on 6.8+ kernels

The tool depends on amdgpu, but the part using these events can work on 
any driver using gpu_scheduler.
I tried to use it on a RPi3 but couldn't figure out how to get the 
system to use v3d :/

I've also opened an issue in gpuvis issue tracker to get more feedback 
about these events.

 >
 > - Maybe most important to make this actually work, work well, and work
 >    long-term: I think we should clearly commit to these tracepoints being
 >    stable uapi, and document that by adding a stable tracepoint 
section in
 >    the drm uapi book.
You mean, Documentation/gpu/drm-uapi.rst?

I can send a v5 with an updated cover letter and a new patch updating 
the documentation.

Thanks,
Pierre-Eric


 >
 >    And then get acks from a pile of driver maintainers that they really
 >    think this is a good idea and has a future. Should also help with
 >    getting good review on the tracepoints themselves.
 >
 >    Otherwise I fear we'll miss the mark again and still force 
userspace to
 >    hand-roll tracing for every driver, or maybe worse, even specific 
kernel
 >    versions.
 >
 > Cheers, Sima
 >
 >>
 >> Pierre-Eric Pelloux-Prayer (3):
 >>    drm/sched: add device name to the drm_sched_process_job event
 >>    drm/sched: cleanup gpu_scheduler trace events
 >>    drm/sched: trace dependencies for gpu jobs
 >>
 >>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   | 97 +++++++++++++++----
 >>   drivers/gpu/drm/scheduler/sched_entity.c      |  8 +-
 >>   drivers/gpu/drm/scheduler/sched_main.c        |  2 +-
 >>   3 files changed, 84 insertions(+), 23 deletions(-)
 >>
 >> --
 >> 2.40.1
 >>
 >