[v2,00/11] new cgroup controller for gpu/drm subsystem

Message ID 20200226190152.16131-1-Kenny.Ho@amd.com

Message

Ho, Kenny Feb. 26, 2020, 7:01 p.m. UTC
This is a submission for the introduction of a new cgroup controller for the drm subsystem, following a series of RFCs [v1, v2, v3, v4].

Changes from PR v1:
* changed cgroup controller name from drm to gpu
* removed lgpu
* added compute.weight resources, and clarified that the resources being distributed are partitions of the compute device

PR v1: https://www.spinics.net/lists/cgroups/msg24479.html

Changes from the RFC based on the feedback:
* dropped all drm.memory.*-related implementation and focused only on buffer and lgpu
* added weight resource type for logical gpu (lgpu)
* decoupled drmcg device iteration from drm_minor

I'd also like to highlight that these patches are currently released under the MIT/X11 license, aligning with the norm of the drm subsystem, but I am working to have the cgroup parts released under GPLv2 to align with the norm of the cgroup subsystem.

RFC:
[v1]: https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
[v2]: https://www.spinics.net/lists/cgroups/msg22074.html
[v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html
[v4]: https://patchwork.kernel.org/cover/11120371/

Changes since the start of RFC are as follows:

v4:
Unchanged (no review needed)
* drm.memory.*/ttm resources (Patches 9-13; I am still working on memory bandwidth
and shrinker)
Based on feedback on v3:
* updated nomenclature to drmcg
* embedded per-device drmcg properties into drm_device
* split GEM buffer-related commits into stats and limit
* renamed functions to align with convention
* combined buffer accounting and check into a try_charge function
* supported buffer stats without limit enforcement
* removed GEM buffer sharing limitation
* updated documentation
New features:
* introduced the logical GPU (lgpu) concept
* example implementation with AMD KFD

v3:
Based on feedback on v2:
* removed .help type file from v2
* conformed to cgroup convention for default and max handling
* conformed to cgroup convention for addressing device-specific limits (with major:minor)
New functions:
* adopted memparse for memory-size-related attributes
* added macro to marshal drmcgrp cftype private (DRMCG_CTF_PRIV, etc.)
* added ttm buffer usage stats (per cgroup, for system, tt, vram.)
* added ttm buffer usage limit (per cgroup, for vram.)
* added per cgroup bandwidth stats and limiting (burst and average bandwidth)

v2:
* removed the vendoring concepts
* added a limit to total buffer allocation
* added a limit to the maximum size of a buffer allocation

v1: cover letter

The purpose of this patch series is to start a discussion for a generic cgroup
controller for the drm subsystem.  The design proposed here is a very early 
one.  We are hoping to engage the community as we develop the idea.

Backgrounds
===========
Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with specialized
behaviour, such as accounting/limiting the resources which processes in a
cgroup can access [1].  Weights, limits, protections, and allocations are the main
resource distribution models.  Existing cgroup controllers include cpu,
memory, io, rdma, and more.  cgroup is one of the foundational technologies
that enable the popular container application deployment and management methods.

Direct Rendering Manager/drm contains code intended to support the needs of
complex graphics devices. Graphics drivers in the kernel may make use of DRM
functions to make tasks like memory management, interrupt handling and DMA
easier, and provide a uniform interface to applications.  The DRM has also
developed beyond traditional graphics applications to support compute/GPGPU
applications.

Motivations
===========
As GPUs grow beyond the realm of desktop/workstation graphics into areas like
data center clusters and IoT, there is an increasing need to monitor and
regulate GPUs as a resource, like cpu, memory, and io.

Matt Roper from Intel began working on a similar idea in early 2018 [2] for the
purpose of managing GPU priority using the cgroup hierarchy.  While that
particular use case may not warrant a standalone drm cgroup controller, there
are other use cases where having one can be useful [3].  Monitoring GPU
resources such as VRAM and buffers, CUs (compute units, AMD's nomenclature)/EUs
(execution units, Intel's nomenclature), and GPU job scheduling [4] can help
sysadmins get a better understanding of application usage profiles.
Further regulation of the aforementioned resources can also help sysadmins
optimize workload deployment on limited GPU resources.

With the increased importance of machine learning, data science and other
cloud-based applications, GPUs are already in production use in data centers
today [5,6,7].  Existing GPU resource management is very coarse-grained, however,
as sysadmins are only able to distribute workloads on a per-GPU basis [8].  An
alternative is to use GPU virtualization (with or without SR-IOV), but it
generally acts on the entire GPU instead of the specific resources in a GPU.
With a drm cgroup controller, we can enable alternate, fine-grained, sub-GPU
resource management (in addition to what may be available via GPU
virtualization).

In addition to production use, the DRM cgroup can also help with testing
graphics application robustness by providing a means to artificially limit the DRM
resources available to the applications.


Challenges
==========
While there is common infrastructure in DRM that is shared across many vendors
(the scheduler [4], for example), there are also aspects of DRM that are vendor
specific.  To accommodate this, we borrowed the mechanism cgroup uses to
handle different kinds of cgroup controllers.

Resources for DRM are also often device (GPU) specific instead of system
specific, and a system may contain more than one GPU.  For this, we borrowed
some of the ideas from the RDMA cgroup controller.

Approach
========
To experiment with the idea of a DRM cgroup, we would like to start with basic
accounting and statistics, then continue to iterate and add regulating
mechanisms into the driver.
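
As a rough sketch of the user-facing interface this series is heading
toward, the C snippet below sets a per-device buffer limit and a compute
weight from userspace.  The cgroup path, the file names
(gpu.buffer.total.max, gpu.compute.weight), the size suffix and the 226:0
major:minor pair are illustrative assumptions based on the patch titles and
common cgroup conventions, not the exact ABI defined by the patches.

/*
 * Illustrative sketch only: the cgroup path, the file names and the
 * "major:minor value" line format are assumptions, not the final ABI.
 */
#include <stdio.h>
#include <stdlib.h>

static void cg_write(const char *path, const char *line)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	/* cgroup interface files take one setting per write */
	fprintf(f, "%s\n", line);
	fclose(f);
}

int main(void)
{
	/* cap total GEM buffer allocation on DRM device 226:0 at 256 MiB */
	cg_write("/sys/fs/cgroup/gpu-job/gpu.buffer.total.max", "226:0 256M");

	/* give this cgroup twice the default compute share on the same device */
	cg_write("/sys/fs/cgroup/gpu-job/gpu.compute.weight", "226:0 200");

	return 0;
}

The matching stats files from the accounting patches would presumably use
the same per-device, major:minor-keyed layout when read.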

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
[2] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
[3] https://www.spinics.net/lists/cgroups/msg20720.html
[4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler
[5] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
[6] https://blog.openshift.com/gpu-accelerated-sql-queries-with-postgresql-pg-strom-in-openshift-3-10/
[7] https://github.com/RadeonOpenCompute/k8s-device-plugin
[8] https://github.com/kubernetes/kubernetes/issues/52757

Kenny Ho (11):
  cgroup: Introduce cgroup for drm subsystem
  drm, cgroup: Bind drm and cgroup subsystem
  drm, cgroup: Initialize drmcg properties
  drm, cgroup: Add total GEM buffer allocation stats
  drm, cgroup: Add peak GEM buffer allocation stats
  drm, cgroup: Add GEM buffer allocation count stats
  drm, cgroup: Add total GEM buffer allocation limit
  drm, cgroup: Add peak GEM buffer allocation limit
  drm, cgroup: Add compute as gpu cgroup resource
  drm, cgroup: add update trigger after limit change
  drm/amdgpu: Integrate with DRM cgroup

 Documentation/admin-guide/cgroup-v2.rst       | 138 ++-
 Documentation/cgroup-v1/drm.rst               |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  48 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |   6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |   7 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   3 +
 .../amd/amdkfd/kfd_process_queue_manager.c    | 153 +++
 drivers/gpu/drm/drm_drv.c                     |  12 +
 drivers/gpu/drm/drm_gem.c                     |  16 +-
 include/drm/drm_cgroup.h                      |  81 ++
 include/drm/drm_device.h                      |   7 +
 include/drm/drm_drv.h                         |  19 +
 include/drm/drm_gem.h                         |  12 +-
 include/linux/cgroup_drm.h                    | 138 +++
 include/linux/cgroup_subsys.h                 |   4 +
 init/Kconfig                                  |   5 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/drm.c                           | 913 ++++++++++++++++++
 19 files changed, 1563 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/cgroup-v1/drm.rst
 create mode 100644 include/drm/drm_cgroup.h
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

Comments

Kenny Ho March 17, 2020, 4:03 p.m. UTC | #1
Hi Tejun,

What's your thoughts on this latest series?

Regards,
Kenny

Tejun Heo March 24, 2020, 6:46 p.m. UTC | #2
On Tue, Mar 17, 2020 at 12:03:20PM -0400, Kenny Ho wrote:
> What's your thoughts on this latest series?

My overall impression is that the feedback isn't being incorporated thoroughly
/ sufficiently.

Thanks.
Kenny Ho March 24, 2020, 6:49 p.m. UTC | #3
Hi Tejun,

Can you elaborate more on what are the missing pieces?

Regards,
Kenny

Tejun Heo April 13, 2020, 7:11 p.m. UTC | #4
Hello, Kenny.

On Tue, Mar 24, 2020 at 02:49:27PM -0400, Kenny Ho wrote:
> Can you elaborate more on what are the missing pieces?

Sorry about the long delay, but I think we've been going in circles for quite
a while now. Let's try to make it really simple as the first step. How about
something like the following?

* gpu.weight (should it be gpu.compute.weight? idk) - A single number
  per-device weight similar to io.weight, which distributes computation
  resources in work-conserving way.

* gpu.memory.high - A single number per-device on-device memory limit.

The above two, if they work well, should already be plenty useful. And my guess is
that getting the above working well will be plenty challenging already even
though it's already excluding work-conserving memory distribution. So, let's
please do that as the first step and see what more would be needed from there.
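
To make "work-conserving" concrete, here is a minimal sketch (not taken
from any of the patches; the struct and the numbers are invented) of how
weight-based distribution resolves to actual shares: only groups with
pending work are counted, so an idle group's capacity is redistributed
instead of being reserved.

/* Minimal sketch of weight-based, work-conserving sharing. */
#include <stdbool.h>
#include <stdio.h>

struct gpu_cgroup {
	const char *name;
	unsigned int weight;	/* e.g. 1..10000, default 100, like io.weight */
	bool has_pending_work;
};

/* A group's share is its weight over the sum of the weights of the groups
 * that are actually runnable right now. */
static double share_of(const struct gpu_cgroup *g,
		       const struct gpu_cgroup *all, int n)
{
	unsigned int total = 0;
	int i;

	if (!g->has_pending_work)
		return 0.0;
	for (i = 0; i < n; i++)
		if (all[i].has_pending_work)
			total += all[i].weight;
	return (double)g->weight / total;
}

int main(void)
{
	struct gpu_cgroup groups[] = {
		{ "A", 200, true  },
		{ "B", 100, true  },
		{ "C", 100, false },	/* idle: its capacity flows to A and B */
	};
	int i, n = sizeof(groups) / sizeof(groups[0]);

	for (i = 0; i < n; i++)
		printf("%s gets %.0f%%\n", groups[i].name,
		       100.0 * share_of(&groups[i], groups, n));
	return 0;
}

If group C later submits work, the split simply becomes 50/25/25 with no
configuration change; that is the property being asked for here.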

Thanks.
Kenny Ho April 13, 2020, 8:17 p.m. UTC | #6
(replying again in plain-text)

Hi Tejun,

Thanks for taking the time to reply.

Perhaps we can even narrow things down to just
gpu.weight/gpu.compute.weight as a start?  In this aspect, is the key
objection to the current implementation of gpu.compute.weight the
work-conserving bit?  This work-conserving requirement is probably
what I have missed for the last two years (and hence going in circles).

If this is the case, can you clarify/confirm the following?

1) Is resource scheduling goal of cgroup purely for the purpose of
throughput?  (at the expense of other scheduling goals such as
latency.)
2) If 1) is true, under what circumstances will the "Allocations"
resource distribution model (as defined in the cgroup-v2) be
acceptable?
3) If 1) is true, are things like cpuset from cgroup v1 no longer
acceptable going forward?

To be clear, while some have framed this (time sharing vs spatial
sharing) as a partisan issue, it is in fact a technical one.  I have
implemented the gpu cgroup support this way because we have a class of
users that value low latency/low jitter/predictability/synchronicity.
For example, they would like 4 tasks to share a GPU and they would
like the tasks to start and finish at the same time.

What is the rationale behind picking the Weight model over Allocations
as the first acceptable implementation?  Can't we have both
work-conserving and non-work-conserving ways of distributing GPU
resources?  If we can, why not allow non-work-conserving
implementation first, especially when we have users asking for such
functionality?

Regards,
Kenny

Tejun Heo April 13, 2020, 8:54 p.m. UTC | #7
Hello,

On Mon, Apr 13, 2020 at 04:17:14PM -0400, Kenny Ho wrote:
> Perhaps we can even narrow things down to just
> gpu.weight/gpu.compute.weight as a start?  In this aspect, is the key

That sounds great to me.

> objection to the current implementation of gpu.compute.weight the
> work-conserving bit?  This work-conserving requirement is probably
> what I have missed for the last two years (and hence going in circles).
> 
> If this is the case, can you clarify/confirm the following?
> 
> 1) Is resource scheduling goal of cgroup purely for the purpose of
> throughput?  (at the expense of other scheduling goals such as
> latency.)

It's not; however, work-conserving mechanisms are the easiest to use (cuz you
don't lose anything) while usually challenging to implement. It tends to
clarify how control mechanisms should be structured - even what resources are.

> 2) If 1) is true, under what circumstances will the "Allocations"
> resource distribution model (as defined in the cgroup-v2) be
> acceptable?

Allocations definitely are acceptable and it's not a prerequisite to have
work-conserving control first either. Here, given the lack of consensus in
terms of what even constitutes resource units, I don't think it'd be a good
idea to commit to the proposed interface, and I believe it'd be beneficial to
work on interface-wise simpler work-conserving controls.

> 3) If 1) is true, are things like cpuset from cgroup v1 no longer
> acceptable going forward?

Again, they're acceptable.

> To be clear, while some have framed this (time sharing vs spatial
> sharing) as a partisan issue, it is in fact a technical one.  I have
> implemented the gpu cgroup support this way because we have a class of
> users that value low latency/low jitter/predictability/synchronicity.
> For example, they would like 4 tasks to share a GPU and they would
> like the tasks to start and finish at the same time.
> 
> What is the rationale behind picking the Weight model over Allocations
> as the first acceptable implementation?  Can't we have both
> work-conserving and non-work-conserving ways of distributing GPU
> resources?  If we can, why not allow non-work-conserving
> implementation first, especially when we have users asking for such
> functionality?

I hope the rationales are clear now. What I'm objecting to is the inclusion of a
premature interface, which is a lot easier and more tempting to do for
hardware-specific limits, and the proposals up until now have been showing
ample signs of that. I don't think my position has changed much since the
beginning - do the difficult-to-implement but easy-to-use weights first and
then you and everyone would have a better idea of what hard-limit or
allocation interfaces and mechanisms should look like, or even whether they're
needed.

Thanks.
Kenny Ho April 13, 2020, 9:40 p.m. UTC | #8
Hi,

On Mon, Apr 13, 2020 at 4:54 PM Tejun Heo <tj@kernel.org> wrote:
>
> Allocations definitely are acceptable and it's not a prerequisite to have
> work-conserving control first either. Here, given the lack of consensus in
> terms of what even constitutes resource units, I don't think it'd be a good
> idea to commit to the proposed interface, and I believe it'd be beneficial to
> work on interface-wise simpler work-conserving controls.
>
...
> I hope the rationales are clear now. What I'm objecting to is the inclusion of a
> premature interface, which is a lot easier and more tempting to do for
> hardware-specific limits, and the proposals up until now have been showing
> ample signs of that. I don't think my position has changed much since the
> beginning - do the difficult-to-implement but easy-to-use weights first and
> then you and everyone would have a better idea of what hard-limit or
> allocation interfaces and mechanisms should look like, or even whether they're
> needed.

By lack of consensus, do you mean Intel's assertion that a standard is
not a standard until Intel implements it? (That was in the context of
OpenCL language standard with the concept of SubDevice.)  I thought
the discussion so far has established that the concept of a compute
unit, while named differently (AMD's CUs, ARM's SCs, Intel's EUs,
Nvidia's SMs, Qualcomm's SPs), is cross vendor.  While an AMD CU is
not the same as an Intel EU or Nvidia SM, the same can be said for CPU
cores.  If cpuset is acceptable for a diversity of CPU core designs
and arrangements, I don't understand why an interface derived from GPU
SubDevice is considered premature.

If a decade-old language standard is not considered a consensus, can
you elaborate on what might constitute a consensus?
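
For readers who have not used cpuset, the analogy is between pinning a
group to an explicit set of CPU cores and pinning it to an explicit set of
compute units.  In the sketch below, cpuset.cpus is the existing cgroup
interface, while the gpu.compute.cus name and its "major:minor CU-range"
syntax are purely hypothetical stand-ins for what a SubDevice-style knob
could look like.

#include <stdio.h>

int main(void)
{
	FILE *f;

	/* real: restrict this cgroup to CPU cores 0-3 */
	f = fopen("/sys/fs/cgroup/hpc-task/cpuset.cpus", "w");
	if (f) {
		fprintf(f, "0-3\n");
		fclose(f);
	}

	/* hypothetical: restrict the same cgroup to CUs 0-31 of DRM device
	 * 226:0, a spatial partition analogous to the cpuset line above */
	f = fopen("/sys/fs/cgroup/hpc-task/gpu.compute.cus", "w");
	if (f) {
		fprintf(f, "226:0 0-31\n");
		fclose(f);
	}

	return 0;
}

Unlike a weight, such a partition is not work-conserving: CUs 32-63 stay
reserved for their owner even while idle, which is exactly the
latency/jitter versus utilization trade-off under discussion.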

Regards,
Kenny
Tejun Heo April 13, 2020, 9:53 p.m. UTC | #9
Hello,

On Mon, Apr 13, 2020 at 05:40:32PM -0400, Kenny Ho wrote:
> By lack of consensus, do you mean Intel's assertion that a standard is
> not a standard until Intel implements it? (That was in the context of
> OpenCL language standard with the concept of SubDevice.)  I thought
> the discussion so far has established that the concept of a compute
> unit, while named differently (AMD's CUs, ARM's SCs, Intel's EUs,
> Nvidia's SMs, Qualcomm's SPs), is cross vendor.  While an AMD CU is
> not the same as an Intel EU or Nvidia SM, the same can be said for CPU
> cores.  If cpuset is acceptable for a diversity of CPU core designs
> and arrangements, I don't understand why an interface derived from GPU
> SubDevice is considered premature.

CPUs are a lot more uniform across vendors than GPUs and have way higher user
observability and awareness. And, even then, it's something which has limited
usefulness because the configuration is inherently more complex involving
topology details and the end result is not work-conserving.

cpuset is there partly due to historical reasons and its features can often be
trivially replicated with some scripting around taskset. If that's all you're
trying to add, I don't see why it needs to be in cgroup at all. Just implement
a tool similar to taskset and build sufficient tooling around it. Given how
hardware specific it can become, that is likely the better direction anyway.

Thanks.
Daniel Vetter April 14, 2020, 12:20 p.m. UTC | #10
On Mon, Apr 13, 2020 at 03:11:36PM -0400, Tejun Heo wrote:
> Hello, Kenny.
> 
> On Tue, Mar 24, 2020 at 02:49:27PM -0400, Kenny Ho wrote:
> > Can you elaborate more on what are the missing pieces?
> 
> Sorry about the long delay, but I think we've been going in circles for quite
> a while now. Let's try to make it really simple as the first step. How about
> something like the following?
> 
> * gpu.weight (should it be gpu.compute.weight? idk) - A single number
>   per-device weight similar to io.weight, which distributes computation
>   resources in work-conserving way.
> 
> * gpu.memory.high - A single number per-device on-device memory limit.
> 
> The above two, if they work well, should already be plenty useful. And my guess is
> that getting the above working well will be plenty challenging already even
> though it's already excluding work-conserving memory distribution. So, let's
> please do that as the first step and see what more would be needed from there.

This agrees with my understanding of the consensus here and what's
reasonably possible across different gpus. And in case this isn't clear:
This is very much me talking with my drm co-maintainer hat on, not with a
gpu vendor hat on (since that's implied somewhere further down the
discussion). My understanding from talking with a few other folks is that
the cpumask-style CU-weight thing is not something any other gpu can
reasonably support (and we have about 6+ of those in-tree), whereas some
work-preserving computation resource thing should be doable for anyone
with a scheduler. +/- more or less the same issues as io devices: there
might be quite a bit of latency involved in going from one client to the
other because gpu pipelines are deep and pre-emption for gpus is rather slow.
And ofc not all gpu "requests" use equal amounts of resources (different
engines and stuff just to begin with), the same way not all io requests are
made equal. Plus since we do have a shared scheduler used by at least most
drivers, this shouldn't be too hard to get done somewhat consistently
across drivers.
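
As a purely conceptual sketch of that scheduler-side idea (this is not the
DRM scheduler's actual algorithm; the structures and per-job costs are
invented), a driver could pick the runnable group with the smallest
weighted virtual time and charge it for the job it ran, so higher weights
get proportionally more slots while an idle GPU is never held back:

#include <stdio.h>

struct group {
	const char *name;
	unsigned int weight;
	unsigned int pending;	/* queued jobs */
	double vtime;		/* weighted service received so far */
};

static struct group *pick_next(struct group *g, int n)
{
	struct group *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (g[i].pending && (!best || g[i].vtime < best->vtime))
			best = &g[i];
	return best;	/* NULL means the GPU goes idle: nothing is wasted */
}

int main(void)
{
	struct group groups[] = {
		{ "A", 200, 6, 0.0 },
		{ "B", 100, 6, 0.0 },
	};
	double job_cost = 1.0;	/* real jobs vary wildly in cost */
	struct group *next;
	int slot;

	for (slot = 0; slot < 9; slot++) {
		next = pick_next(groups, 2);
		if (!next)
			break;
		printf("slot %d -> %s\n", slot, next->name);
		next->pending--;
		next->vtime += job_cost / next->weight;
	}
	return 0;
}

With weights 200:100 the nine slots split 6:3 between A and B; the per-job
cost term is where the unequal-request problem above shows up, since a
badly estimated cost skews the split.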

tldr; Acked by me.

Cheers, Daniel
Kenny Ho April 14, 2020, 12:47 p.m. UTC | #11
Hi Daniel,

On Tue, Apr 14, 2020 at 8:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> My understanding from talking with a few other folks is that
> the cpumask-style CU-weight thing is not something any other gpu can
> reasonably support (and we have about 6+ of those in-tree)

How does Intel plan to support the SubDevice API as described in your
own spec here:
https://spec.oneapi.com/versions/0.7/oneL0/core/INTRO.html#subdevice-support

Regards,
Kenny
Daniel Vetter April 14, 2020, 12:52 p.m. UTC | #12
On Tue, Apr 14, 2020 at 2:47 PM Kenny Ho <y2kenny@gmail.com> wrote:
> On Tue, Apr 14, 2020 at 8:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > My understanding from talking with a few other folks is that
> > the cpumask-style CU-weight thing is not something any other gpu can
> > reasonably support (and we have about 6+ of those in-tree)
>
> How does Intel plan to support the SubDevice API as described in your
> own spec here:
> https://spec.oneapi.com/versions/0.7/oneL0/core/INTRO.html#subdevice-support

I can't talk about whether future products might or might not support
stuff and in what form exactly they might support stuff or not support
stuff. Or why exactly that's even in the spec there or not.

Geez
-Daniel
Kenny Ho April 14, 2020, 1:14 p.m. UTC | #13
Ok.  I was hoping you can clarify the contradiction between the
existence of the spec below and your "not something any other gpu can
reasonably support" statement.  I mean, OneAPI is Intel's spec and
doesn't that at least make SubDevice support "reasonable" for one more
vendor?

Partisanship aside, as a drm co-maintainer, do you really not see the
need for non-work-conserving way of distributing GPU as a resource?
You recognized the latencies involved (although that's really just
part of the story... time sharing is never going to be good enough
even if your switching cost is zero.)  As a drm co-maintainer, are you
suggesting GPU has no place in the HPC use case?

Regards,
Kenny

Daniel Vetter April 14, 2020, 1:26 p.m. UTC | #14
On Tue, Apr 14, 2020 at 3:14 PM Kenny Ho <y2kenny@gmail.com> wrote:
>
> Ok.  I was hoping you can clarify the contradiction between the
> existence of the spec below and your "not something any other gpu can
> reasonably support" statement.  I mean, OneAPI is Intel's spec and
> doesn't that at least make SubDevice support "reasonable" for one more
> vendor?
>
> Partisanship aside, as a drm co-maintainer, do you really not see the
> need for non-work-conserving way of distributing GPU as a resource?
> You recognized the latencies involved (although that's really just
> part of the story... time sharing is never going to be good enough
> even if your switching cost is zero.)  As a drm co-maintainer, are you
> suggesting GPU has no place in the HPC use case?

 So I did chat with people and my understanding for how this subdevice
stuff works is roughly, from least to most fine grained support:
- Not possible at all, hw doesn't have any such support
- The hw is actually not a single gpu, but a bunch of chips behind a
magic bridge/interconnect, and there's a scheduler load-balancing
stuff and you can't actually run on all "cores" in parallel with one
compute/3d job. So subdevices just give you some of these cores, but
from client api pov they're exactly as powerful as the full device. So
this kinda works like assigning an entire NUMA node, including all the
cpu cores and memory bandwidth and everything.
- Hw has multiple "engines" which share resources (like compute cores
or whatever) behind the scenes. There's no control over how this
sharing works really, and whether you have guarantees about minimal
execution resources or not. This kinda works like hyperthreading.
- Then finally we have the CU mask thing amdgpu has. Which works like
what you're proposing, works on amd.

So this isn't something that I think we should standardize in a
resource management framework like cgroups. Because it's a complete
mess. Note that _all_ the above things (including the "no subdevices"
one) are valid implementations of "subdevices" in the various specs.

Now on your question on "why was this added to various standards?"
because opencl has that too (and the rocm thing, and everything else
it seems). What I heard is that a few people pushed really hard, and
no one objected hard enough (because not having subdevices is a
standards compliant implementation), so that's why it happened. Just
because it's in various standards doesn't mean that a) it's actually
standardized in a useful fashion and b) something we should just
blindly adopt.

Also like where exactly did you understand that I'm against gpus in
HPC use cases. Approaching this in a slightly less tribal way would
really, really help to get something landed (which I'd like to see
happen, personally). Always spinning this as an Intel vs AMD thing
like you do here with every reply really doesn't help moving this in.

So yeah stricter isolation is something customers want, it's just not
something we can really give out right now at a level below the
device.
-Daniel

Kenny Ho April 14, 2020, 1:50 p.m. UTC | #15
Hi Daniel,

I appreciate many of your reviews so far and I much prefer keeping
things technical but that is very difficult to do when I get Intel
developers calling my implementation "most AMD-specific solution
possible" and objecting to an implementation because their hardware
cannot support it.  Can you help me with a more charitable
interpretation of what has been happening?

Perhaps the following questions can help keep the discussion technical:
1)  Is it possible to implement non-work-conserving distribution of
GPU without spatial sharing?  (If yes, I'd love to hear a suggestion,
if not...question 2.)
2)  If spatial sharing is required to support GPU HPC use cases, what
would you implement if you have the hardware support today?

Regards,
Kenny

Daniel Vetter April 14, 2020, 2:04 p.m. UTC | #16
On Tue, Apr 14, 2020 at 3:50 PM Kenny Ho <y2kenny@gmail.com> wrote:
>
> Hi Daniel,
>
> I appreciate many of your reviews so far and I much prefer keeping
> things technical but that is very difficult to do when I get Intel
> developers calling my implementation "most AMD-specific solution
> possible" and objecting to an implementation because their hardware
> cannot support it.  Can you help me with a more charitable
> interpretation of what has been happening?

This is upstream. It's your job to show that this can be done,
reasonably, on other devices. This doesn't need to be an intel device,
you can pretty much pick any other driver stack and show that
sufficiently many of them can support what you want to do. But as long
as all I can see is something that only works on AMD, it's not useful
as an upstreamable resource management thing.

This has _nothing_ to do with Intel (I think over the past 25 years or
so intel has implemented all 4 versions of gpu splitting that I
listed, but not entirely sure).

So again pls less tribal fighting, more collaboration. If you can't do
that, let's pick nouveau/nvidia as arbitrary neutral ground.

> Perhaps the following questions can help keep the discussion technical:
> 1)  Is it possible to implement non-work-conserving distribution of
> GPU without spatial sharing?  (If yes, I'd love to hear a suggestion,
> if not...question 2.)
> 2)  If spatial sharing is required to support GPU HPC use cases, what
> would you implement if you have the hardware support today?

The thing we can currently do in upstream (from how I'm understanding
hw) is assign entire PCI devices to containers, so essentially only
the entire /dev/dri/* cdev. That works, and it works across all
drivers we have in upstream right now.

Anything more fine-grained I don't think is currently possible,
because everyone has a different idea of how to split up gpus. It
would be nice to have it, but in upstream, cross-vendor, I'm just not
seeing it happen right now.
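
For reference, the sketch below only enumerates what "the entire
/dev/dri/* cdev" granularity means in practice: the character device nodes
(card*/renderD*) and their major:minor numbers, which are the unit a
container is given today via device passthrough or allow-listing.  It uses
standard userspace APIs only and nothing from the proposed controller.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
	DIR *d = opendir("/dev/dri");
	struct dirent *e;
	struct stat st;
	char path[512];

	if (!d) {
		perror("/dev/dri");
		return 1;
	}
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "/dev/dri/%s", e->d_name);
		if (stat(path, &st) == 0 && S_ISCHR(st.st_mode))
			printf("%-28s %u:%u\n", path,
			       (unsigned int)major(st.st_rdev),
			       (unsigned int)minor(st.st_rdev));
	}
	closedir(d);
	return 0;
}
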
-Daniel

Kenny Ho April 14, 2020, 2:29 p.m. UTC | #17
On Tue, Apr 14, 2020 at 10:04 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> This has _nothing_ to do with Intel (I think over the past 25 years or
> so intel has implemented all 4 versions of gpu splitting that I
> listed, but not entirely sure).
>
> So again pls less tribal fighting, more collaboration. If you can't do
> that, let's pick nouveau/nvidia as arbitrary neutral ground.

So are you saying Intel has implemented a form of masking before?  I
don't think we need to just pick a vendor as a neutral ground.  The
idea of spatial sharing vs time sharing is not vendor specific... it's
not even GPU specific.  This is why I asked the two questions below.

> > Perhaps the following questions can help keep the discussion technical:
> > 1)  Is it possible to implement non-work-conserving distribution of
> > GPU without spatial sharing?  (If yes, I'd love to hear a suggestion,
> > if not...question 2.)
> > 2)  If spatial sharing is required to support GPU HPC use cases, what
> > would you implement if you have the hardware support today?
>
> The thing we can currently do in upstream (from how I'm understanding
> hw) is assign entire PCI devices to containers, so essentially only
> the entire /dev/dri/* cdev. That works, and it works across all
> drivers we have in upstream right now.
>
> Anything more fine-grained I don't think is currently possible,
> because everyone has a different idea of how to split up gpus. It
> would be nice to have it, but in upstream, cross-vendor, I'm just not
> seeing it happen right now.

I understand the reality, but what would you implement to support the
concept (GPU in HPC, which you said you are not against) if you have
the hw support today?  How would you support low-jitter/low-latency
sharing of a single GPU if you have whatever hardware support you need
today?

Regards,
Kenny


> > On Tue, Apr 14, 2020 at 9:26 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Tue, Apr 14, 2020 at 3:14 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > > >
> > > > Ok.  I was hoping you can clarify the contradiction between the
> > > > existance of the spec below and your "not something any other gpu can
> > > > reasonably support" statement.  I mean, OneAPI is Intel's spec and
> > > > doesn't that at least make SubDevice support "reasonable" for one more
> > > > vendor?
> > > >
> > > > Partisanship aside, as a drm co-maintainer, do you really not see the
> > > > need for non-work-conserving way of distributing GPU as a resource?
> > > > You recognized the latencies involved (although that's really just
> > > > part of the story... time sharing is never going to be good enough
> > > > even if your switching cost is zero.)  As a drm co-maintainer, are you
> > > > suggesting GPU has no place in the HPC use case?
> > >
> > >  So I did chat with people and my understanding for how this subdevice
> > > stuff works is roughly, from least to most fine grained support:
> > > - Not possible at all, hw doesn't have any such support
> > > - The hw is actually not a single gpu, but a bunch of chips behind a
> > > magic bridge/interconnect, and there's a scheduler load-balancing
> > > stuff and you can't actually run on all "cores" in parallel with one
> > > compute/3d job. So subdevices just give you some of these cores, but
> > > from client api pov they're exactly as powerful as the full device. So
> > > this kinda works like assigning an entire NUMA node, including all the
> > > cpu cores and memory bandwidth and everything.
> > > - Hw has multiple "engines" which share resources (like compute cores
> > > or whatever) behind the scenes. There's no control over how this
> > > sharing works really, and whether you have guarantees about minimal
> > > execution resources or not. This kinda works like hyperthreading.
> > > - Then finally we have the CU mask thing amdgpu has. Which works like
> > > what you're proposing, works on amd.
> > >
> > > So this isn't something that I think we should standardize in a
> > > resource management framework like cgroups. Because it's a complete
> > > mess. Note that _all_ the above things (including the "no subdevices"
> > > one) are valid implementations of "subdevices" in the various specs.
> > >
> > > Now on your question on "why was this added to various standards?"
> > > because opencl has that too (and the rocm thing, and everything else
> > > it seems). What I heard is that a few people pushed really hard, and
> > > no one objected hard enough (because not having subdevices is a
> > > standards compliant implementation), so that's why it happened. Just
> > > because it's in various standards doesn't mean that a) it's actually
> > > standardized in a useful fashion and b) something we should just
> > > blindly adopt.
> > >
> > > Also like where exactly did you understand that I'm against gpus in
> > > HPC uses cases. Approaching this in a slightly less tribal way would
> > > really, really help to get something landed (which I'd like to see
> > > happen, personally). Always spinning this as an Intel vs AMD thing
> > > like you do here with every reply really doesn't help moving this in.
> > >
> > > So yeah stricter isolation is something customers want, it's just not
> > > something we can really give out right now at a level below the
> > > device.
> > > -Daniel
> > >
> > > >
> > > > Regards,
> > > > Kenny
> > > >
> > > > On Tue, Apr 14, 2020 at 8:52 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > On Tue, Apr 14, 2020 at 2:47 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > > > > > On Tue, Apr 14, 2020 at 8:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > > > My understanding from talking with a few other folks is that
> > > > > > > the cpumask-style CU-weight thing is not something any other gpu can
> > > > > > > reasonably support (and we have about 6+ of those in-tree)
> > > > > >
> > > > > > How does Intel plan to support the SubDevice API as described in your
> > > > > > own spec here:
> > > > > > https://spec.oneapi.com/versions/0.7/oneL0/core/INTRO.html#subdevice-support
> > > > >
> > > > > I can't talk about whether future products might or might not support
> > > > > stuff and in what form exactly they might support stuff or not support
> > > > > stuff. Or why exactly that's even in the spec there or not.
> > > > >
> > > > > Geez
> > > > > -Daniel
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter April 14, 2020, 3:01 p.m. UTC | #18
On Tue, Apr 14, 2020 at 4:29 PM Kenny Ho <y2kenny@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 10:04 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > This has _nothing_ to do with Intel (I think over the past 25 years or
> > so intel has implemented all 4 versions of gpu splitting that I
> > listed, but I'm not entirely sure).
> >
> > So again pls less tribal fighting, more collaboration. If you can't do
> > that, let's pick nouveau/nvidia as arbitrary neutral ground.
>
> So are you saying Intel has implemented a form of masking before?  I
> don't think we need to just pick a vendor as a neutral ground.  The
> idea of spatial sharing vs time sharing is not vendor specific... it's
> not even GPU specific.  This is why I asked the two questions below.
>
> > > Perhaps the following questions can help keep the discussion technical:
> > > 1)  Is it possible to implement non-work-conserving distribution of
> > > GPU without spatial sharing?  (If yes, I'd love to hear a suggestion,
> > > if not...question 2.)
> > > 2)  If spatial sharing is required to support GPU HPC use cases, what
> > > would you implement if you had the hardware support today?
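To make the work-conserving distinction in question 1 concrete, here is a
tiny, purely hypothetical sketch (the CU counts, weights and helper name are
made up and are not an interface from this series): with weights, an idle
group's share is handed to whoever is busy, whereas a fixed spatial
partition keeps its reserved CUs idle but gives predictable, low-jitter
capacity.

/* Hypothetical illustration only: two cgroups with weights 1:3 on a
 * 64-CU device. */
#include <stdio.h>

/* Work-conserving: only groups with pending work split the device,
 * proportionally to weight, so nothing sits idle. */
static unsigned int wc_share(unsigned int total_cus, unsigned int weight,
                             unsigned int active_weight_sum)
{
    return active_weight_sum ? total_cus * weight / active_weight_sum : 0;
}

int main(void)
{
    unsigned int total = 64;

    /* Both groups busy: A gets 16 CUs, B gets 48 CUs. */
    printf("busy/busy: %u %u\n",
           wc_share(total, 1, 1 + 3), wc_share(total, 3, 1 + 3));

    /* B idle: a work-conserving scheme hands all 64 CUs to A... */
    printf("busy/idle, work-conserving: %u\n", wc_share(total, 1, 1));

    /* ...whereas a non-work-conserving spatial partition (say a fixed
     * 16-CU mask for A) leaves B's 48 CUs unused, in exchange for
     * predictable capacity and isolation. */
    printf("busy/idle, spatial partition: %u\n", 16);
    return 0;
}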
> >
> > The thing we can currently do in upstream (from how I understand the
> > hw) is assign entire PCI devices to containers, so essentially only
> > the entire /dev/dri/* cdev. That works, and it works across all the
> > drivers we have in upstream right now.
> >
> > Anything more fine-grained I don't think is currently possible,
> > because everyone has a different idea of how to split up gpus. It
> > would be nice to have it, but in upstream, cross-vendor, I'm just not
> > seeing it happen right now.
>
> I understand the reality, but what would you implement to support the
> concept (GPU in HPC, which you said you are not against) if you had
> the hw support today?  How would you support low-jitter/low-latency
> sharing of a single GPU if you had whatever hardware support you
> needed today?

Whatever works on my gpu.

But there's a huge difference between what I can do for Intel, with my
Intel hat on, and ship in some random intel-only repo or as DKMS, and
what makes sense to push to upstream. Upstream it needs to be
cross-vendor and have reasonably clear semantics, so that admins
understand it no matter whether they plug in an amd, nvidia or whatever
other gpu.

Yes, this sucks, but as long as all the hw vendors insist on
differentiating here, there's not much we can do. Maybe in the future
the VF stuff might help, but I'm not super hopeful that's actually
going to work out all that well. And the VF stuff at least has the
same granularity as what we can already do today: assigning an entire
/dev/dri/render* node to a container.
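For reference, a minimal sketch of what that "whole render node"
granularity amounts to on the container side. The paths and mode below are
illustrative assumptions, and the matching device-cgroup allow rule (the v1
devices controller or a cgroup-v2 device BPF program) that a real container
runtime would also need is omitted:

/* Replicate the host's render node inside a container rootfs so a
 * containerized process can open it. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
    struct stat st;

    if (stat("/dev/dri/renderD128", &st) < 0) {
        perror("stat host render node");
        return 1;
    }

    /* Create a matching character device node in the container's /dev. */
    if (mknod("/path/to/container/rootfs/dev/dri/renderD128",
              S_IFCHR | 0666, st.st_rdev) < 0) {
        perror("mknod in container rootfs");
        return 1;
    }

    printf("render node is %u:%u\n", major(st.st_rdev), minor(st.st_rdev));
    return 0;
}

Anything finer-grained than handing out that one node is exactly where the
cross-vendor story stops today.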

If you want something more fine-grained, then you (as a user) need one
kind of container isolation for amd, a different one for nvidia, a
different one for intel, a different one for $next_vendor, and so on.
We can't just wish that there's a standard way to manage this when
there isn't. And merging non-standard, per-vendor (maybe even
per-generation?) ways of splitting up gpus with cgroups just isn't
going to work in upstream.

And really that's not a huge deal, because on the userspace side for
HPC it's the exact same sorry state of affairs, with cuda, rocm and
the oneapi effort from intel (not counting the various things vendors
have tried to pull off on the soc side of things, where there's even
more fun). Standardizing the kernel management while you still need
different container images (these userspace stacks generally have a
really hard time co-existing) isn't solving any real-world user
problems.

So yeah, it sucks if you're a gpu compute user in some kind of server
setting :-/ And there's not really much I can do to fix this, except
tell vendors that everyone doing their own thing won't work (in
upstream, that is; it'll work just fine in all the vendor driver trees
and stacks, can't stop that).
-Daniel

> Regards,
> Kenny