[v3,00/15] CCS static load balance

Message ID 20240823130855.72436-1-andi.shyti@linux.intel.com

Message

Andi Shyti Aug. 23, 2024, 1:08 p.m. UTC
Hi,

This patch series introduces static load balancing for GPUs with
multiple compute engines. It's a lengthy series, and some
challenging aspects still need to be resolved.

I have tried to split the work as much as possible to facilitate
the review process.

To summarize, in patches 1 to 14, no functional changes occur
except for the addition of the num_cslices interface. The
significant changes happen in patch 15, which is the core part of
the CCS mode setting, utilizing the groundwork laid in the
earlier patches.

In this updated approach, the focus is now on managing the UABI
engine list, which controls the engines exposed to userspace.
Instead of manipulating physical engines and their memory, we now
handle engine exposure through this list.
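
To illustrate the idea, here is a minimal sketch of what hiding
an engine from userspace could look like. The names below (struct
uabi_engine, uabi_engine_hide(), ...) are made up for
illustration, not the exact code from the patches:

#include <linux/kobject.h>
#include <linux/rbtree.h>

struct uabi_engine {
	struct rb_node uabi_node;	/* node in the UABI engine rb_tree */
	struct kobject kobj;		/* sysfs directory of the engine */
};

/* Hiding only detaches the UABI and sysfs exposure; the physical
 * engine and its memory are left untouched.
 */
static void uabi_engine_hide(struct uabi_engine *e, struct rb_root *uabi_root)
{
	rb_erase(&e->uabi_node, uabi_root);
	RB_CLEAR_NODE(&e->uabi_node);	/* lookups now treat it as invalid */
	kobject_del(&e->kobj);		/* drop the sysfs files, keep the object */
}

Re-exposing is the reverse: kobject_add() the same kobject back
and re-insert the node into the rb_tree; nothing is reallocated.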

I would greatly appreciate further input from all reviewers who
have already assisted with the previous work.

IGT tests have also been developed, but I haven't sent them yet.

Thank you, Chris, for the offline reviews.

Thanks,
Andi

Changelog:
==========
PATCHv2 -> PATCHv3
------------------
 - Fix a NULL pointer dereference during module unload.
   In i915_gem_driver_remove() I was accessing the gt after it
   had been removed. Use dev_priv instead (obviously!).
 - Fix a lockdep issue: Some of the uabi_engines_mutex unlocks
   were not correctly placed in the exit paths.
 - Fix a checkpatch error for spaces before and after parentheses
   in the for_each_enabled_engine() definition.

PATCHv1 -> PATCHv2
------------------
 - Use uabi_mutex to protect the uabi_engines, not the engine
   itself. Rename it to uabi_engines_mutex.
 - Use kobject_add/kobject_del for adding and removing
   interfaces; this way we no longer need to destroy and recreate
   the engines. Refactor intel_engine_add_single_sysfs() to
   reflect this scenario.
 - After adding engines to the rb_tree, check that they have
   been added correctly.
 - Fix the rb_find_add() compare function to take into account
   the class as well, not just the instance (see the sketch
   below).
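
For reference, a hedged sketch of such a compare function; the
struct and field names here are illustrative assumptions, not the
actual series code:

#include <linux/rbtree.h>
#include <linux/types.h>

struct uabi_engine {
	struct rb_node uabi_node;
	u16 uabi_class;
	u16 uabi_instance;
};

static int uabi_engine_cmp(struct rb_node *new, const struct rb_node *node)
{
	const struct uabi_engine *a =
		rb_entry(new, struct uabi_engine, uabi_node);
	const struct uabi_engine *b =
		rb_entry(node, struct uabi_engine, uabi_node);

	/* order by class first, then by instance within the class */
	if (a->uabi_class != b->uabi_class)
		return a->uabi_class < b->uabi_class ? -1 : 1;

	return (int)a->uabi_instance - (int)b->uabi_instance;
}

rb_find_add(&e->uabi_node, &tree, uabi_engine_cmp) returns the
colliding node (or NULL on success), which is also how the
"added correctly" check above can be implemented.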

RFCv2 -> PATCHv1
----------------
 - Removed gt->ccs.mutex
 - Rename m -> width, ccs_id -> engine in
   intel_gt_apply_ccs_mode().
 - In the CCS register value calculation
   (intel_gt_apply_ccs_mode()) the engine (ccs_id) needs to move
   along the ccs_mask (set by the user) instead of the
   cslice_mask.
 - Add GEM_BUG_ON after calculating the new ccs_mask
   (update_ccs_mask()) to make sure all engines have been
   evaluated (i.e. ccs_mask must be '0' at the end of the
   algorithm).
 - Move the wakeref lock before evaluating intel_gt_pm_is_awake()
   and fix the exit path accordingly.
 - Use a more compact form in intel_gt_sysfs_ccs_init() and
   add_uabi_ccs_engines() when evaluating sysfs_create_file(): no
   need to store the return value to the err variable which is
   unused. Get rid of err.
 - Print a warning instead of a debug message if we fail to
   create the sysfs files.
 - If engine files creation fails in
   intel_engine_add_single_sysfs(), print a warning, not an
   error.
 - Rename gt->ccs.ccs_mask to gt->ccs.id_mask and add a comment
   to explain its purpose.
 - During uabi engine creation, in
   intel_engines_driver_register(), the uabi_ccs_instance is
   redundant because the CCS instance is already tracked in
   engine->uabi_instance.
 - Mark add_uabi_ccs_engines() and remove_uabi_ccs_engines() as
   __maybe_unused so as not to break bisectability: they wouldn't
   compile in their own commit. They are used in the next patch,
   where the __maybe_unused is removed.
 - Update engine's workaround every time a new mode is set in
   update_ccs_mask().
 - Mark engines as valid or invalid through their rb_node state:
   invalid engines are marked with RB_CLEAR_NODE(). Execbuf will
   check their validity when selecting the engine to be bound to
   a context.
 - Create for_each_enabled_engine(), which skips the invalid
   engines, and use it in the selftests (see the sketch below).
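
A hedged sketch of the validity check and iterator just
mentioned, reusing the illustrative struct uabi_engine from the
sketch above (again, placeholders rather than the series' actual
definitions):

#include <drm/drm_util.h>	/* for_each_if() */
#include <linux/kernel.h>	/* ARRAY_SIZE() */
#include <linux/rbtree.h>

static inline bool uabi_engine_enabled(const struct uabi_engine *e)
{
	/* an engine dropped from the UABI rb_tree was RB_CLEAR_NODE()'d */
	return !RB_EMPTY_NODE(&e->uabi_node);
}

/* walk an engine array, skipping engines hidden from userspace */
#define for_each_enabled_engine_example(e__, engines__, i__) \
	for ((i__) = 0; (i__) < ARRAY_SIZE(engines__); (i__)++) \
		for_each_if(uabi_engine_enabled((e__) = &(engines__)[(i__)]))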

RFCv1 -> RFCv2
--------------
Compared to the first version, I've taken a completely different
approach to adding and removing engines. In v1, physical engines
were directly added and removed, along with the memory allocated
to them, each time the user changed the CCS mode (from the
previous cover letter).

Andi Shyti (15):
  drm/i915/gt: Avoid using masked workaround for CCS_MODE setting
  drm/i915/gt: Move the CCS mode variable to a global position
  drm/i915/gt: Allow the creation of multi-mode CCS masks
  drm/i915/gt: Refactor uabi engine class/instance list creation
  drm/i915/gem: Mark and verify UABI engine validity
  drm/i915/gt: Introduce for_each_enabled_engine() and apply it in
    selftests
  drm/i915/gt: Manage CCS engine creation within UABI exposure
  drm/i915/gt: Remove cslices mask value from the CCS structure
  drm/i915/gt: Expose the number of total CCS slices
  drm/i915/gt: Store engine-related sysfs kobjects
  drm/i915/gt: Store active CCS mask
  drm/i915: Protect access to the UABI engines list with a mutex
  drm/i915/gt: Isolate single sysfs engine file creation
  drm/i915/gt: Implement creation and removal routines for CCS engines
  drm/i915/gt: Allow the user to change the CCS mode through sysfs

 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   3 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  23 --
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_engine_user.c   |  62 ++-
 drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
 drivers/gpu/drm/i915/gt/intel_gt.h            |  12 +
 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c   | 353 +++++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h   |   5 +-
 drivers/gpu/drm/i915/gt/intel_gt_sysfs.c      |   2 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |  19 +-
 drivers/gpu/drm/i915/gt/intel_workarounds.c   |   8 +-
 drivers/gpu/drm/i915/gt/selftest_context.c    |   6 +-
 drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |   4 +-
 .../drm/i915/gt/selftest_engine_heartbeat.c   |   6 +-
 drivers/gpu/drm/i915/gt/selftest_engine_pm.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  52 +--
 drivers/gpu/drm/i915/gt/selftest_gt_pm.c      |   2 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |  22 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  18 +-
 drivers/gpu/drm/i915/gt/selftest_mocs.c       |   6 +-
 drivers/gpu/drm/i915/gt/selftest_rc6.c        |   4 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c      |   8 +-
 .../drm/i915/gt/selftest_ring_submission.c    |   2 +-
 drivers/gpu/drm/i915/gt/selftest_rps.c        |  14 +-
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  14 +-
 drivers/gpu/drm/i915/gt/selftest_tlb.c        |   2 +-
 .../gpu/drm/i915/gt/selftest_workarounds.c    |  14 +-
 drivers/gpu/drm/i915/gt/sysfs_engines.c       |  79 ++--
 drivers/gpu/drm/i915/gt/sysfs_engines.h       |   2 +
 drivers/gpu/drm/i915/i915_cmd_parser.c        |   2 +
 drivers/gpu/drm/i915/i915_debugfs.c           |   4 +
 drivers/gpu/drm/i915/i915_drv.h               |   5 +
 drivers/gpu/drm/i915/i915_gem.c               |   4 +
 drivers/gpu/drm/i915/i915_perf.c              |   8 +-
 drivers/gpu/drm/i915/i915_pmu.c               |  11 +-
 drivers/gpu/drm/i915/i915_query.c             |  21 +-
 37 files changed, 643 insertions(+), 193 deletions(-)

Comments

Andi Shyti Aug. 27, 2024, 9:25 a.m. UTC | #1
Hi Chris,

just a kind ping: any chance you can take a look at this? I would
really appreciate it.

Thanks,
Andi

On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> [...]
Daniel Vetter Aug. 27, 2024, 5:31 p.m. UTC | #2
On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> Hi,
> 
> This patch series introduces static load balancing for GPUs with
> multiple compute engines. It's a lengthy series, and some
> challenging aspects still need to be resolved.

Do we have an actual user for this, where just reloading the entire driver
(or well-rebinding, if you only want to change the value for a specific
device) with a new module option isn't enough?

There's some really gnarly locking and lifetime fun in there, and it needs
a corresponding justification. Which needs to be enormous for this case,
meaning actual customers willing to shout on dri-devel that they really,
absolutely need this, or their machines will go up in flames.

Otherwise this is a nack from me.

Thanks, Sima

> [...]
Andi Shyti Aug. 28, 2024, 8:20 a.m. UTC | #3
Hi Sima,

first of all, thanks for looking into this series.

On Tue, Aug 27, 2024 at 07:31:21PM +0200, Daniel Vetter wrote:
> On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> > [...]
> 
> Do we have an actual user for this, where just reloading the entire driver
> (or well-rebinding, if you only want to change the value for a specific
> device) with a new module option isn't enough?

Yes, we have users for this and this has already been agreed with
architects and maintainers.

Why are you saying that we are reloading/rebinding the driver?
I'm only removing the exposure of user engines, which is
basically a flag in the engines data structure.

> There's some really gnarly locking and lifetime fun in there, and it needs
> a corresponding justification.

What locking are you referring about?

I only added one single mutex that has a comment and a
justification. If you think that's not enough, I can of course
improve it (please note that the changes have a good amount of
comments and I tried to be as descriptive as I could).

When I change the engine configuration only for the compute
engines and only for DG2 platforms, I need to make sure that no
other user is affected by the change. Thus I need to make sure
that access to some of the structures is properly serialized.

> Which needs to be enormous for this case,
> meaning actual customers willing to shout on dri-devel that they really,
> absolutely need this, or their machines will go up in flames.
> Otherwise this is a nack from me.

Would you please tell me why you are nacking the patch, so that
I can address your comments in v4?

Thanks,
Andi
Daniel Vetter Aug. 28, 2024, 1:47 p.m. UTC | #4
On Wed, Aug 28, 2024 at 10:20:15AM +0200, Andi Shyti wrote:
> Hi Sima,
> 
> first of all, thanks for looking into this series.
> 
> On Tue, Aug 27, 2024 at 07:31:21PM +0200, Daniel Vetter wrote:
> > On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> > > [...]
> > 
> > Do we have an actual user for this, where just reloading the entire driver
> > (or well-rebinding, if you only want to change the value for a specific
> > device) with a new module option isn't enough?
> 
> Yes, we have users for this and this has already been agreed with
> architects and maintainers.

So my understanding is that for upstream, this only applies to dg2,
because the other platforms don't have enough CCS engines to make this a
real issue.

Do we really have upstream demand for this feature on dg2 only?

Also how hard would it be to make these users happy with xe-on-dg2 in
upstream instead?

> Why are you saying that we are reloading/rebinding the driver?

That's the other alternate solution.

> I'm only removing the exposure of user engines, which is
> basically a flag in the engines data structure.
> 
> > There's some really gnarly locking and lifetime fun in there, and it needs
> > a corresponding justification.
> 
> What locking are you referring about?
> 
> I only added one single mutex that has a comment and a
> justification. If you think that's not enough, I can of course
> improve it (please note that the changes have a good amount of
> > comments and I tried to be as descriptive as I could).
> 
> > When I change the engine configuration only for the compute
> engines and only for DG2 platforms, I need to make sure that no
> other user is affected by the change. Thus I need to make sure
> > that access to some of the structures is properly serialized.
> 
> > Which needs to be enormous for this case,
> > meaning actual customers willing to shout on dri-devel that they really,
> > absolutely need this, or their machines will go up in flames.
> > Otherwise this is a nack from me.
> 
> Would you please tell me why you are nacking the patch, so that
> I can address your comments in v4?

So for one, this is substantially more flexible than the solution merged
into xe. And the patch set doesn't explain why (the commit messages
actually describe the design xe has).

That does not inspire confidence at all.

Second, I don't think anyone understands the entire engine/ctx locking
design in i915-gem. And the fix for that was to make as much as absolutely
possible immutable. Yes the implementation looks correct, but when I
looked at the much, much simpler xe implementation I'm pretty sure I've
found an issue there too. Here I can't even tell.
-Sima
Andi Shyti Aug. 28, 2024, 3:35 p.m. UTC | #5
Hi Sima,

On Wed, Aug 28, 2024 at 03:47:21PM +0200, Daniel Vetter wrote:
> On Wed, Aug 28, 2024 at 10:20:15AM +0200, Andi Shyti wrote:
> > Hi Sima,
> > 
> > first of all, thanks for looking into this series.
> > 
> > On Tue, Aug 27, 2024 at 07:31:21PM +0200, Daniel Vetter wrote:
> > > On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> > > > > [...]
> > > 
> > > Do we have an actual user for this, where just reloading the entire driver
> > > (or well-rebinding, if you only want to change the value for a specific
> > > device) with a new module option isn't enough?
> > 
> > Yes, we have users for this and this has already been agreed with
> > architects and maintainers.
> 
> So my understanding is that for upstream, this only applies to dg2,
> because the other platforms don't have enough CCS engines to make this a
> real issue.
> 
> Do we really have upstream demand for this feature on dg2 only?

That's my understanding.

> Also how hard would it be to make these users happy with xe-on-dg2 in
> upstream instead?

I don't know; I think the user is already on i915.

> > Why are you saying that we are reloading/rebinding the driver?
> 
> That's the other alternate solution.

That's not how XE does it, though.

The use case is that userspace has an environment variable that
they change on demand to choose the CCS mode. They want to
change the value of that variable on the fly and, as we are only
adding or removing a few engines, this is done without reprobing
the whole driver.
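
As a rough illustration (the names here are my assumptions, not
necessarily the final interface), the on-the-fly switch boils
down to a sysfs store callback:

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* assumed helper: regroups cslices and updates the UABI engine list */
int example_gt_set_ccs_mode(struct kobject *gt_kobj, u32 mode);

static ssize_t ccs_mode_store(struct kobject *kobj,
			      struct kobj_attribute *attr,
			      const char *buf, size_t count)
{
	u32 mode;
	int ret;

	ret = kstrtou32(buf, 0, &mode);
	if (ret)
		return ret;

	/* no driver reprobe: only the engine exposure changes */
	ret = example_gt_set_ccs_mode(kobj, mode);
	return ret ?: count;
}

static struct kobj_attribute ccs_mode_attr = __ATTR_WO(ccs_mode);

The runtime then simply writes the desired number of CCS engines
to that file whenever the environment variable changes.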

In a previous implementation (from which both I and Niranjana
for XE took inspiration) the CCS mode was passed during compute
execbuf.

> > I'm only removing the exposure of user engines, which is
> > basically a flag in the engines data structure.
> > 
> > > There's some really gnarly locking and lifetime fun in there, and it needs
> > > a corresponding justification.
> > 
> > What locking are you referring about?
> > 
> > I only added one single mutex that has a comment and a
> > justification. If you think that's not enough, I can of course
> > improve it (please note that the changes have a good amount of
> > comments and I tried to be as descriptive as I could).
> > 
> > When I change the engine configuration only for the compute
> > engines and only for DG2 platforms, I need to make sure that no
> > other user is affected by the change. Thus I need to make sure
> > that access to some of the structures is properly serialized.
> > 
> > > Which needs to be enormous for this case,
> > > meaning actual customers willing to shout on dri-devel that they really,
> > > absolutely need this, or their machines will go up in flames.
> > > Otherwise this is a nack from me.
> > 
> > > Would you please tell me why you are nacking the patch, so that
> > > I can address your comments in v4?
> 
> So for one, this is substantially more flexible than the solution merged
> into xe. And the patch set doesn't explain why (the commit messages
> actually describe the design xe has).

I think in XE we might have missed a few things and my plan is to
check the XE implementation once I'm done with i915 (I was one of
the XE reviewers). And, many of the things in XE are so different
that the solution can't be taken as it is.

> That does not inspire confidence at all.

Consider that most of the patches are refactoring; only the last
patch does the real job. That's because the first workaround
was already merged a while ago, while XE didn't need the
refactorings I made.

> Second, I don't think anyone understands the entire engine/ctx locking
> design in i915-gem. And the fix for that was to make as much as absolutely
> possible immutable. Yes the implementation looks correct, but when I
> looked at the much, much simpler xe implementation I'm pretty sure I've
> found an issue there too. Here I can't even tell.

The locking is fairly simple: when the user wants to set a
specific CCS mode, I take the wakeref lock and check that no one
is holding it. This way I am sure that I am the only user of the
GPU (if anyone else were using it, the GPU would not be off).

I added one single lock to be used for the for_each_uabi_engine.
It's not really required but I really want to be sure that I am
not changing the CCS mode while someone else is using the uabi
engines.
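
In pseudo-kernel-C, the scheme is roughly the following (the
struct, field and function names are placeholders, not the actual
patch):

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct example_gt {
	struct mutex wakeref_mutex;	 /* serializes wakeref checks */
	unsigned int wakeref_count;	 /* 0 means the GPU is off */
	struct mutex uabi_engines_mutex; /* guards the UABI engine list */
	u32 ccs_mode;
};

static int example_set_ccs_mode(struct example_gt *gt, u32 mode)
{
	int ret = 0;

	mutex_lock(&gt->wakeref_mutex);
	if (gt->wakeref_count) {
		/* someone holds a wakeref: the GPU is in use, refuse */
		ret = -EBUSY;
		goto out;
	}

	mutex_lock(&gt->uabi_engines_mutex);
	gt->ccs_mode = mode;	/* reprogram CCS_MODE, update the UABI list */
	mutex_unlock(&gt->uabi_engines_mutex);
out:
	mutex_unlock(&gt->wakeref_mutex);
	return ret;
}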

I'm also adding Joonas in Cc with whom I discussed many details
of the implementation. I would really appreciate knowing what
exactly is wrong here and what changes are needed to get the
series merged.

For now, thanks a lot for your comments,
Andi
Daniel Vetter Sept. 2, 2024, 11:13 a.m. UTC | #6
On Wed, Aug 28, 2024 at 05:35:45PM +0200, Andi Shyti wrote:
> Hi Sima,
> 
> [...]
> 
> > into xe. And the patch set doesn't explain why (the commit messages
> > actualy describe the design xe has).
> 
> I think in XE we might have missed a few things and my plan is to
> check the XE implementation once I'm done with i915 (I was one of
> the XE reviewers). And, many of the things in XE are so different
> that the solution can't be taken as it is.

Please fix XE first. XE is supposed to lead here with all new platform
support and uapi additions; i915-gem is supposed to be completely in
feature freeze. And exceptions need really good reasons.

> > That does not inspire confidence at all.
> 
> Consider that most of the patches are refactoring; only the last
> patch does the real job. That's because the first workaround
> was already merged a while ago, while XE didn't need the
> refactorings I made.
> 
> > Second, I don't think anyone understands the entire engine/ctx locking
> > design in i915-gem. And the fix for that was to make as much as absolutely
> > possible immutable. Yes the implementation looks correct, but when I
> > looked at the much, much simpler xe implementation I'm pretty sure I've
> > found an issue there too. Here I can't even tell.
> 
> The locking is fairly simple: when the user wants to set a
> specific CCS mode, I take the wakeref lock and check that no one
> is holding it. This way I am sure that I am the only user of the
> GPU (if anyone else were using it, the GPU would not be off).
> 
> I added one single lock to be used for the for_each_uabi_engine.
> It's not really required but I really want to be sure that I am
> not changing the CCS mode while someone else is using the uabi
> engines.
> 
> I'm also adding Joonas in Cc with whom I discussed many details
> of the implementation. I would really appreciate knowing what
> exactly is wrong here and what changes are needed to get the
> series merged.

This is i915-gem. Locking is never simple here, and I think that entire
journey on the mmap confusion shows how little people understand the
complexity, the confusion, and how special i915-gem really is.

You're really not making a good case that you understand this stuff here,
which really doesn't make me enthusiastic about considering an uapi
extension for i915-gem.

Cheers, Sima