[00/14] Introduce xe_devcoredump.

Message ID 20230426205713.512695-1-rodrigo.vivi@intel.com (mailing list archive)

Message

Rodrigo Vivi April 26, 2023, 8:56 p.m. UTC
Xe needs to align with other drivers on how error states are dumped,
avoiding a Xe-only error_state solution. The goal is to use the devcoredump
infrastructure to report error states, since it provides a standardized way
to do so by exposing a virtual, temporary /sys/class/devcoredump device.
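
For readers unfamiliar with that interface, here is a minimal userspace
sketch of how such a dump could be collected. It assumes the layout used by
the kernel's devcoredump class, where each pending dump appears as a devcdN
directory containing a "data" file. The helper name dump_devcoredumps() is
hypothetical and is not part of this series; the class directory is passed in
as a parameter so the function can also be pointed at a test directory.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the "data" file of every devcd* entry under class_dir to out.
 * Returns the number of dumps copied, or -1 if class_dir cannot be opened.
 * Hypothetical helper for illustration only.
 */
static int dump_devcoredumps(const char *class_dir, FILE *out)
{
	DIR *dir = opendir(class_dir);
	struct dirent *ent;
	int found = 0;

	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		char path[4096];
		char buf[4096];
		size_t n;
		FILE *in;

		/* Only entries named devcd<N> are coredump devices. */
		if (strncmp(ent->d_name, "devcd", 5) != 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s/data",
			 class_dir, ent->d_name);
		in = fopen(path, "r");
		if (!in)
			continue;

		/* Print a separator, then stream the dump contents. */
		fprintf(out, "==== %s ====\n", path);
		while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
			fwrite(buf, 1, n, out);
		fclose(in);
		found++;
	}

	closedir(dir);
	return found;
}
```

Calling dump_devcoredumps("/sys/class/devcoredump", stdout) shortly after an
error would print whatever the driver captured. The device is temporary, so
the dump has to be read before the kernel discards it.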

The initial goal is to have the simple_error_state in the devcoredump
so we start using the infrastructure.

But this is just a starting point for building a useful and
organized crash dump using standard infrastructure. Later this
will be changed to produce output that can be parsed by tools and
used for error replay.

Later, when we are in-tree, the goal is to contribute improvements to the
devcoredump infrastructure itself, such as multiple file support for better
organization of the dumps, snapshot support, extra dmesg prints, and whatever
else makes sense and helps the overall infrastructure.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Rodrigo Vivi (14):
  drm/xe: Fix print of RING_EXECLIST_SQ_CONTENTS_HI
  drm/xe: Introduce the dev_coredump infrastructure.
  drm/xe: Do not take any action if our device was removed.
  drm/xe: Extract non mapped regions out of GuC CTB into its own struct.
  drm/xe: Convert GuC CT print to snapshot capture and print.
  drm/xe: Add GuC CT snapshot to xe_devcoredump.
  drm/xe: Introduce guc_submit_types.h with relevant structs.
  drm/xe: Convert GuC Engine print to snapshot capture and print.
  drm/xe: Add GuC Submit Engine snapshot to xe_devcoredump.
  drm/xe: Convert Xe HW Engine print to snapshot capture and print.
  drm/xe: Add HW Engine snapshot to xe_devcoredump.
  drm/xe: Limit CONFIG_DRM_XE_SIMPLE_ERROR_CAPTURE to itself.
  drm/xe: Convert VM print to snapshot capture and print.
  drm/xe: Add VM snapshot to xe_devcoredump.

 drivers/gpu/drm/xe/Kconfig                |   1 +
 drivers/gpu/drm/xe/Makefile               |   1 +
 drivers/gpu/drm/xe/regs/xe_engine_regs.h  |   3 +-
 drivers/gpu/drm/xe/xe_devcoredump.c       | 227 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_devcoredump.h       |  22 ++
 drivers/gpu/drm/xe/xe_devcoredump_types.h |  60 +++++
 drivers/gpu/drm/xe/xe_device_types.h      |   4 +
 drivers/gpu/drm/xe/xe_execlist.c          |   4 +-
 drivers/gpu/drm/xe/xe_gt_debugfs.c        |   2 +-
 drivers/gpu/drm/xe/xe_guc_ct.c            | 275 +++++++++++++++-------
 drivers/gpu/drm/xe/xe_guc_ct.h            |   7 +-
 drivers/gpu/drm/xe/xe_guc_ct_types.h      |  46 +++-
 drivers/gpu/drm/xe/xe_guc_fwif.h          |  29 ---
 drivers/gpu/drm/xe/xe_guc_submit.c        | 258 ++++++++++++++------
 drivers/gpu/drm/xe/xe_guc_submit.h        |  10 +-
 drivers/gpu/drm/xe/xe_guc_submit_types.h  | 155 ++++++++++++
 drivers/gpu/drm/xe/xe_hw_engine.c         | 210 ++++++++++++-----
 drivers/gpu/drm/xe/xe_hw_engine.h         |   8 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h   |  78 ++++++
 drivers/gpu/drm/xe/xe_pci.c               |   2 +
 drivers/gpu/drm/xe/xe_vm.c                | 140 +++++++++--
 drivers/gpu/drm/xe/xe_vm.h                |   6 +-
 drivers/gpu/drm/xe/xe_vm_types.h          |  18 ++
 23 files changed, 1288 insertions(+), 278 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.c
 create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.h
 create mode 100644 drivers/gpu/drm/xe/xe_devcoredump_types.h
 create mode 100644 drivers/gpu/drm/xe/xe_guc_submit_types.h

--
2.39.2

Comments

Matthew Brost May 2, 2023, 8:11 a.m. UTC | #1
On Wed, Apr 26, 2023 at 04:56:59PM -0400, Rodrigo Vivi wrote:
> Xe needs to align with other drivers on how error states are dumped,
> avoiding a Xe-only error_state solution. The goal is to use the devcoredump
> infrastructure to report error states, since it provides a standardized way
> to do so by exposing a virtual, temporary /sys/class/devcoredump device.
> 
> The initial goal is to have the simple_error_state in the devcoredump
> so we start using the infrastructure.
> 
> But this is just a starting point for building a useful and
> organized crash dump using standard infrastructure. Later this
> will be changed to produce output that can be parsed by tools and
> used for error replay.

We are certainly missing the GuC log. It would also be really nice to
get the ftrace included too. Not sure if the latter is easy; I know I
looked into this on i915 and couldn't figure it out, but that was a
while ago and admittedly I didn't try all that hard.

Matt 
