Message ID | 20190829060533.32315-1-Kenny.Ho@amd.com (mailing list archive) |
---|---|
Series | new cgroup controller for gpu/drm subsystem |
Hello,

I just glanced through the interface and don't have enough context to give any kind of detailed review yet. I'll try to read up and understand more, and would greatly appreciate it if you can give me some pointers on the resources being controlled and what the actual use cases would look like. That said, I have some basic concerns.

* The TTM vs. GEM distinction seems to be an internal implementation detail rather than anything relating to underlying physical resources. Provided that's the case, I'm afraid these internal constructs being used as primary resource control objects likely isn't the right approach. Whether a given driver uses one or the other internal abstraction layer shouldn't determine how resources are represented at the userland interface layer.

* While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users, and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg: putting control of different uses of memory under separate knobs made the whole thing pretty useless. e.g. if you constrain all the knobs tightly enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how it's used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid the gpu controller is repeating the same mistakes.

Thanks.
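To make the memcg comparison concrete: under cgroup1, admins had to tune several siloed memory knobs per group, while cgroup2 collapses them into one. A rough shell illustration follows (the knob names are the actual memcg interface files; the cgroup paths are made up for the example):

```sh
# cgroup1 memcg: separate limits for different uses of memory
echo 4G   > /sys/fs/cgroup/memory/app/memory.limit_in_bytes          # user memory
echo 1G   > /sys/fs/cgroup/memory/app/memory.kmem.limit_in_bytes     # kernel memory
echo 256M > /sys/fs/cgroup/memory/app/memory.kmem.tcp.limit_in_bytes # tcp buffers

# cgroup2 memcg: one limit on physical memory, however it is used
echo 4G > /sys/fs/cgroup/app/memory.max
```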
On Fri, Aug 30, 2019 at 09:28:57PM -0700, Tejun Heo wrote:
> Hello,
>
> I just glanced through the interface and don't have enough context to
> give any kind of detailed review yet. [...]
>
> * The TTM vs. GEM distinction seems to be an internal implementation
>   detail rather than anything relating to underlying physical resources.
>   [...] Whether a given driver uses one or the other internal
>   abstraction layer shouldn't determine how resources are represented
>   at the userland interface layer.

Yeah, there's another RFC series from Brian Welty to abstract this away as a memory region concept for gpus.

> * While breaking up and applying control to different types of internal
>   objects may seem attractive to folks who work day in and day out with
>   the subsystem, they aren't all that useful to users, and the siloed
>   controls are likely to make the whole mechanism a lot less useful.
>   [...] For memcg, what has to be allocated and controlled is physical
>   memory, no matter how it's used. It's not like you can go buy more
>   "socket" memory. At least from the looks of it, I'm afraid the gpu
>   controller is repeating the same mistakes.

We do have quite a pile of different memories and ranges, so I don't think we're doing the same mistake here. But it is maybe a bit too complicated, and exposes stuff that most users really don't care about.
-Daniel
On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
> This is a follow-up to the RFC I made previously to introduce a cgroup
> controller for the GPU/DRM subsystem [v1,v2,v3]. The goal is to be able to
> provide resource management for GPU resources using things like containers.
>
> With this RFC v4, I am hoping to have some consensus on a merge plan. I believe
> the GEM related resources (drm.buffer.*) introduced in the previous RFC and,
> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are
> uncontroversial and ready to move out of RFC and into a more formal review. I
> will continue to work on the memory backend resources (drm.memory.*).
>
> The cover letter from v1 is copied below for reference.
>
> [v1]: https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
> [v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html

So looking at all this, it doesn't seem to have changed much, and the old discussion didn't really conclude anywhere (aside from some details).

One more open thought that crossed my mind, having read a ton of ttm again recently: How does this all interact with the ttm global limits? I'd say the ttm global limits are the ur-cgroup we have in drm, and not looking at that seems kinda bad.
-Daniel

> v4:
> Unchanged (no review needed)
> * drm.memory.*/ttm resources (Patches 9-13; I am still working on memory
>   bandwidth and shrinker)
> Based on feedback on v3:
> * updated nomenclature to drmcg
> * embedded per-device drmcg properties into drm_device
> * split GEM buffer related commits into stats and limit
> * renamed functions to align with convention
> * combined buffer accounting and check into a try_charge function
> * support buffer stats without limit enforcement
> * removed GEM buffer sharing limitation
> * updated documentation
> New features:
> * introduced the logical GPU concept
> * example implementation with AMD KFD
>
> v3:
> Based on feedback on v2:
> * removed .help type file from v2
> * conform to cgroup convention for default and max handling
> * conform to cgroup convention for addressing device specific limits (with major:minor)
> New functionality:
> * adopted memparse for memory size related attributes
> * added a macro to marshal drmcgrp cftype private data (DRMCG_CTF_PRIV, etc.)
> * added ttm buffer usage stats (per cgroup, for system, tt, vram)
> * added ttm buffer usage limit (per cgroup, for vram)
> * added per cgroup bandwidth stats and limiting (burst and average bandwidth)
>
> v2:
> * removed the vendoring concepts
> * added a limit on total buffer allocation
> * added a limit on the maximum size of a buffer allocation
>
> v1: cover letter
>
> The purpose of this patch series is to start a discussion for a generic cgroup
> controller for the drm subsystem. The design proposed here is a very early one.
> We are hoping to engage the community as we develop the idea.
>
> Background
> ==========
> Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of
> tasks, and all their future children, into hierarchical groups with specialized
> behaviour, such as accounting for or limiting the resources which processes in
> a cgroup can access [1]. Weights, limits, protections and allocations are the
> main resource distribution models. Existing cgroup controllers include cpu,
> memory, io, rdma, and more. cgroup is one of the foundational technologies that
> enables the popular container application deployment and management method.
>
> Direct Rendering Manager/drm contains code intended to support the needs of
> complex graphics devices. Graphics drivers in the kernel may make use of DRM
> functions to make tasks like memory management, interrupt handling and DMA
> easier, and to provide a uniform interface to applications. DRM has also
> developed beyond traditional graphics applications to support compute/GPGPU
> applications.
>
> Motivations
> ===========
> As GPUs grow beyond the realm of desktop/workstation graphics into areas like
> data center clusters and IoT, there is an increasing need to monitor and
> regulate GPUs as a resource like cpu, memory and io.
>
> Matt Roper from Intel began working on a similar idea in early 2018 [2] for the
> purpose of managing GPU priority using the cgroup hierarchy. While that
> particular use case may not warrant a standalone drm cgroup controller, there
> are other use cases where having one can be useful [3]. Monitoring GPU
> resources such as VRAM and buffers, CUs (compute units, AMD's nomenclature) /
> EUs (execution units, Intel's nomenclature), and GPU job scheduling [4] can
> help sysadmins get a better understanding of an application's usage profile.
> Further regulation of the aforementioned resources can also help sysadmins
> optimize workload deployment on limited GPU resources.
>
> With the increased importance of machine learning, data science and other
> cloud-based applications, GPUs are already in production use in data centers
> today [5,6,7]. Existing GPU resource management is very coarse-grained,
> however, as sysadmins are only able to distribute workloads on a per-GPU basis
> [8]. An alternative is to use GPU virtualization (with or without SR-IOV), but
> it generally acts on the entire GPU instead of on specific resources in a GPU.
> With a drm cgroup controller, we can enable alternate, fine-grained, sub-GPU
> resource management (in addition to what may be available via GPU
> virtualization).
>
> In addition to production use, the DRM cgroup can also help with testing
> graphics application robustness by providing a means to artificially limit the
> DRM resources available to the applications.
>
> Challenges
> ==========
> While there is common infrastructure in DRM that is shared across many vendors
> (the scheduler [4], for example), there are also aspects of DRM that are vendor
> specific. To accommodate this, we borrowed the mechanism used by cgroup to
> handle different kinds of cgroup controllers.
>
> Resources for DRM are also often device (GPU) specific instead of system
> specific, and a system may contain more than one GPU. For this, we borrowed
> some of the ideas from the RDMA cgroup controller.
>
> Approach
> ========
> To experiment with the idea of a DRM cgroup, we would like to start with basic
> accounting and statistics, then continue to iterate and add regulating
> mechanisms into the driver.
>
> [1] https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
> [2] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
> [3] https://www.spinics.net/lists/cgroups/msg20720.html
> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler
> [5] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
> [6] https://blog.openshift.com/gpu-accelerated-sql-queries-with-postgresql-pg-strom-in-openshift-3-10/
> [7] https://github.com/RadeonOpenCompute/k8s-device-plugin
> [8] https://github.com/kubernetes/kubernetes/issues/52757
>
> Kenny Ho (16):
>   drm: Add drm_minor_for_each
>   cgroup: Introduce cgroup for drm subsystem
>   drm, cgroup: Initialize drmcg properties
>   drm, cgroup: Add total GEM buffer allocation stats
>   drm, cgroup: Add peak GEM buffer allocation stats
>   drm, cgroup: Add GEM buffer allocation count stats
>   drm, cgroup: Add total GEM buffer allocation limit
>   drm, cgroup: Add peak GEM buffer allocation limit
>   drm, cgroup: Add TTM buffer allocation stats
>   drm, cgroup: Add TTM buffer peak usage stats
>   drm, cgroup: Add per cgroup bw measure and control
>   drm, cgroup: Add soft VRAM limit
>   drm, cgroup: Allow more aggressive memory reclaim
>   drm, cgroup: Introduce lgpu as DRM cgroup resource
>   drm, cgroup: add update trigger after limit change
>   drm/amdgpu: Integrate with DRM cgroup
>
>  Documentation/admin-guide/cgroup-v2.rst       |  163 +-
>  Documentation/cgroup-v1/drm.rst               |    1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   29 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |    6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |    3 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |    6 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |    3 +
>  .../amd/amdkfd/kfd_process_queue_manager.c    |  140 ++
>  drivers/gpu/drm/drm_drv.c                     |   26 +
>  drivers/gpu/drm/drm_gem.c                     |   16 +-
>  drivers/gpu/drm/drm_internal.h                |    4 -
>  drivers/gpu/drm/ttm/ttm_bo.c                  |   93 ++
>  drivers/gpu/drm/ttm/ttm_bo_util.c             |    4 +
>  include/drm/drm_cgroup.h                      |  122 ++
>  include/drm/drm_device.h                      |    7 +
>  include/drm/drm_drv.h                         |   23 +
>  include/drm/drm_gem.h                         |   13 +-
>  include/drm/ttm/ttm_bo_api.h                  |    2 +
>  include/drm/ttm/ttm_bo_driver.h               |   10 +
>  include/linux/cgroup_drm.h                    |  151 ++
>  include/linux/cgroup_subsys.h                 |    4 +
>  init/Kconfig                                  |    5 +
>  kernel/cgroup/Makefile                        |    1 +
>  kernel/cgroup/drm.c                           | 1367 +++++++++++++++++
>  25 files changed, 2193 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/cgroup-v1/drm.rst
>  create mode 100644 include/drm/drm_cgroup.h
>  create mode 100644 include/linux/cgroup_drm.h
>  create mode 100644 kernel/cgroup/drm.c
>
> --
> 2.22.0
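As a concrete illustration of the conventions described in the changelog (per-device limits addressed by major:minor, sizes parsed with memparse), usage might look something like the sketch below. The exact knob names are inferred from the drm.buffer.* naming and standard cgroup conventions, not confirmed against the patches; 226 is the DRM character device major:

```sh
# Cap total GEM buffer allocations for this cgroup on device 226:0.
# memparse-style suffixes (K/M/G) are accepted; "max" removes the limit.
mkdir /sys/fs/cgroup/gpu-job
echo "226:0 512M" > /sys/fs/cgroup/gpu-job/drm.buffer.total.max
echo "226:0 max"  > /sys/fs/cgroup/gpu-job/drm.buffer.total.max

# Read back per-device allocation stats for profiling.
cat /sys/fs/cgroup/gpu-job/drm.buffer.total.stats
```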
Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
>> This is a follow-up to the RFC I made previously to introduce a cgroup
>> controller for the GPU/DRM subsystem [v1,v2,v3]. [...]
> So looking at all this, it doesn't seem to have changed much, and the old
> discussion didn't really conclude anywhere (aside from some details).
>
> One more open thought that crossed my mind, having read a ton of ttm
> again recently: How does this all interact with the ttm global limits?
> I'd say the ttm global limits are the ur-cgroup we have in drm, and not
> looking at that seems kinda bad.

At least my hope was to completely replace the ttm globals with the limitations here when this is ready.

Christian.

> -Daniel
>> [snip]
On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian <Christian.Koenig@amd.com> wrote:
> Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> > One more open thought that crossed my mind, having read a ton of ttm
> > again recently: How does this all interact with the ttm global limits?
> > I'd say the ttm global limits are the ur-cgroup we have in drm, and not
> > looking at that seems kinda bad.
>
> At least my hope was to completely replace the ttm globals with the
> limitations here when this is ready.

You need more, at least some kind of shrinker to cut down bos placed in system memory when we're under memory pressure. Which drags in a pretty epic amount of locking lols (see i915's shrinker fun, where we attempt that). Probably another good idea to share at least some concepts, maybe even code.
-Daniel

> [snip]
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > * While breaking up and applying control to different types of internal
> >   objects may seem attractive to folks who work day in and day out with
> >   the subsystem, they aren't all that useful to users, and the siloed
> >   controls are likely to make the whole mechanism a lot less useful.
> >   [...] At least from the looks of it, I'm afraid the gpu controller is
> >   repeating the same mistakes.
>
> We do have quite a pile of different memories and ranges, so I don't
> think we're doing the same mistake here. But it is maybe a bit too

I see. One thing which caught my eye was the system memory control. Shouldn't that be controlled by memcg? Is there something special about system memory used by gpus?

> complicated, and exposes stuff that most users really don't care about.

Could be from me not knowing much about gpus, but it definitely looks too complex to me. I don't see how users would be able to allocate vram, system memory and GART with reasonable accuracy. memcg on cgroup2 deals with just a single number, and that's already plenty challenging.

Thanks.
Hi Tejun,

Thanks for looking into this. I can definitely help where I can, and I am sure other experts will jump in if I start misrepresenting the reality :) (as Daniel has already done.)

Regarding your points, my understanding is that there isn't really a TTM vs GEM situation anymore (there is an lwn.net article about that, but it is more than a decade old.) I believe GEM is the common interface at this point, and more and more features are being refactored into it. For example, AMD's driver uses TTM internally, but things are exposed via the GEM interface. This GEM resource is actually the single-number resource you just referred to. A GEM buffer (the drm.buffer.* resources) can be backed by VRAM, system memory or other types of memory. The more fine-grained control is the drm.memory.* resources, which still need more discussion. (Some of the functionalities in TTM are being refactored into the GEM level; I have seen some patches that make TTM a subclass of GEM.)

This RFC can be grouped into 3 areas, and they are fairly independent, so they can be reviewed separately: high level device memory control (buffer.*), fine-grained memory control and bandwidth (memory.*), and compute resources (lgpu.*). I think the memory.* resources are the most controversial part, but I think they're still needed.

Perhaps an analogy may help. A system has CPUs and memory, and within memory, it can be backed by RAM or swap. A GPU device has lgpus and buffers, and within the buffers, it can be backed by VRAM, system RAM or even swap. As for setting the right amount, I think that's where the profiling aspect of the *.stats comes in. And while one can't necessarily buy more VRAM, it is still a useful knob to adjust if the intention is to pack more work into a GPU device with predictable performance. This research on various GPU workloads may be of interest:

A Taxonomy of GPGPU Performance Scaling
http://www.computermachines.org/joe/posters/iiswc2015_taxonomy.pdf
http://www.computermachines.org/joe/publications/pdfs/iiswc2015_taxonomy.pdf

(summary: GPU workloads can be memory bound or compute bound, so it's possible to pack different workloads together to improve utilization.)

Regards,
Kenny

On Tue, Sep 3, 2019 at 2:50 PM Tejun Heo <tj@kernel.org> wrote:
> [snip]
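To picture how the buffer.* accounting Kenny describes could hang together, here is a minimal, self-contained sketch of a hierarchical try-charge in the spirit of the "combined buffer accounting and check into a try_charge function" item from the v4 changelog. All names are illustrative, and real code would need locking:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-device buffer accounting state. */
struct drmcg_device_resource {
	struct drmcg_device_resource *parent; /* NULL at the root */
	uint64_t bo_limit;   /* per-device buffer limit, in bytes */
	uint64_t bo_usage;   /* total GEM buffer bytes charged so far */
};

/*
 * Walk from the allocating cgroup up to the root and fail the whole
 * charge if any ancestor would exceed its limit; only then commit.
 */
static bool drmcg_try_charge(struct drmcg_device_resource *res, uint64_t size)
{
	struct drmcg_device_resource *p;

	for (p = res; p; p = p->parent)
		if (p->bo_usage + size > p->bo_limit)
			return false;	/* over limit somewhere up the tree */

	for (p = res; p; p = p->parent)
		p->bo_usage += size;	/* commit the charge at every level */
	return true;
}

static void drmcg_uncharge(struct drmcg_device_resource *res, uint64_t size)
{
	struct drmcg_device_resource *p;

	for (p = res; p; p = p->parent)
		p->bo_usage -= size;
}
```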
On Tue, Sep 3, 2019 at 5:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> > At least my hope was to completely replace the ttm globals with the
> > limitations here when this is ready.
>
> You need more, at least some kind of shrinker to cut down bos placed in
> system memory when we're under memory pressure. Which drags in a pretty
> epic amount of locking lols (see i915's shrinker fun, where we attempt
> that). Probably another good idea to share at least some concepts,
> maybe even code.

I am still looking into your shrinker suggestion, so the memory.* resources are untouched from RFC v3. The main change for the buffer.* resources is the removal of the buffer sharing restriction, as you suggested, plus additional documentation of that behaviour. (I may have neglected to mention it in the cover.) The other key part of RFC v4 is the "logical GPU/lgpu" concept. I am hoping to get it out there early for feedback while I continue to work on the memory.* parts.

Kenny

> [snip]
On Tue, Sep 3, 2019 at 8:50 PM Tejun Heo <tj@kernel.org> wrote:
> I see. One thing which caught my eye was the system memory control.
> Shouldn't that be controlled by memcg? Is there something special
> about system memory used by gpus?

I think system memory separate from vram makes sense. For one, vram is like 10x+ faster than system memory, so we definitely want to have good control on that. But maybe we only want one vram bucket overall for the entire system?

The trouble with system memory is that gpu tasks pin that memory to prep execution. There are two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with direct reclaim recursion, which often means we can't free memory, and we're angering the oom killer a lot. Plus it introduces real bad latency spikes everywhere (gpu workloads are occasionally really slow, think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.

I do think a global system memory limit to tame the shrinker, without the ttm approach of possibly just wasting half your memory, could be useful.

> > complicated, and exposes stuff that most users really don't care about.
>
> Could be from me not knowing much about gpus, but it definitely looks too
> complex to me. I don't see how users would be able to allocate vram,
> system memory and GART with reasonable accuracy. memcg on cgroup2
> deals with just a single number, and that's already plenty challenging.

Yeah, especially wrt GART and some of the other more specialized things, I don't think there's any modern gpu where you can actually run out of that stuff. At least not before you run out of every other kind of memory (GART is just a remapping table to make system memory visible to the gpu).

I'm also not sure of the bw limits, given all the fun we have on the block io cgroups side. Aside from that, the current bw limit only controls the bw the kernel uses; userspace can submit unlimited amounts of copying commands that use the same pcie links directly to the gpu, bypassing this cg knob. Also, controlling execution time for gpus is very tricky, since they work a lot more like a block io device, or maybe a network controller with packet scheduling, than a cpu.
-Daniel
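To contrast the two approaches Daniel lists, here is a minimal sketch of the ttm-style flat global limit (illustrative names, not the actual ttm_memory code, which tracks several zones); unlike the hierarchical per-cgroup charging sketched earlier, there is a single system-wide bucket:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a ttm-style global cap on gpu-pinned system memory:
 * one system-wide bucket, conventionally set to 50% of RAM.
 */
struct gpu_global_mem {
	uint64_t limit;	/* e.g. total_ram_bytes / 2 */
	uint64_t used;	/* bytes currently pinned for the gpu */
};

static bool gpu_global_reserve(struct gpu_global_mem *glob, uint64_t size)
{
	if (glob->used + size > glob->limit)
		return false;	/* caller must evict buffers or fail */
	glob->used += size;
	return true;
}

static void gpu_global_release(struct gpu_global_mem *glob, uint64_t size)
{
	glob->used -= size;
}
```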
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> I think system memory separate from vram makes sense. [...]
>
> I do think a global system memory limit to tame the shrinker, without
> the ttm approach of possibly just wasting half your memory, could be
> useful.

Hmm... what'd be the fundamental difference from slab or socket memory, which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?

> I'm also not sure of the bw limits, given all the fun we have on the
> block io cgroups side. [...] Also, controlling execution time for
> gpus is very tricky, since they work a lot more like a block io device,
> or maybe a network controller with packet scheduling, than a cpu.

At the system level, it just gets folded into cpu time, which isn't perfect but is usually a good enough approximation of compute related dynamic resources. Can gpu do something similar or at least start with that?

Thanks.
On Fri, Sep 6, 2019 at 5:23 PM Tejun Heo <tj@kernel.org> wrote:
> Hmm... what'd be the fundamental difference from slab or socket memory,
> which are handled through memcg? Does system memory used by GPUs have
> further global restrictions in addition to the amount of physical
> memory used?

Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...

I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate. The overall gpu memory controller would still be outside of memcg, I think, since that would include swapped-out gpu objects, and stuff in special memory regions like vram.

> At the system level, it just gets folded into cpu time, which isn't
> perfect but is usually a good enough approximation of compute related
> dynamic resources. Can gpu do something similar or at least start with
> that?

So generally there's a pile of engines, often of different types (e.g. amd hw has an entire pile of copy engines), with some ill-defined sharing characteristics for some (often compute/render engines use the same shader cores underneath), kinda like hyperthreading. So at that detail it's all extremely hw specific, and probably too hard to control in a useful way for users. And I'm not sure we can really do a reasonable knob for overall gpu usage, e.g. if we include all the copy engines, but the workloads are only running on compute engines, then you might only get 10% overall utilization by engine-time, while the shaders (which are most of the chip area/power consumption) are actually at 100%. On top, with many userspace apis those engines are an internal implementation detail of a more abstract gpu device (e.g. opengl), but with others, this is all fully exposed (like vulkan).

Plus the kernel needs to use at least the copy engines for vram management itself, and you really can't take that away. Although Kenny here has some proposal for a separate cgroup resource just for that.

I just think it's all a bit too ill-defined, and we might be better off nailing the memory side first and getting some real world experience with this stuff. For context, there's not even a cross-driver standard for how priorities are handled; that's all driver-specific interfaces.
-Daniel
Hello, Daniel.

On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> Sometimes, but that would be specific resources (kinda like vram),
> e.g. CMA regions used by a gpu. But probably not something you'll run
> in a datacenter and want cgroups for ...
>
> I guess we could try to integrate with the memcg group controller. One
> trouble is that aside from i915 most gpu drivers do not really have a
> full shrinker, so not sure how that would all integrate.

So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).

> The overall gpu memory controller would still be outside of memcg, I
> think, since that would include swapped-out gpu objects, and stuff in
> special memory regions like vram.

Yeah, for resources which are on the GPU itself or hard limitations arising from it. In general, we wanna make cgroup controllers control something real and concrete, as in physical resources.

> I just think it's all a bit too ill-defined, and we might be better off
> nailing the memory side first and getting some real world experience
> with this stuff. For context, there's not even a cross-driver standard
> for how priorities are handled; that's all driver-specific interfaces.

I see. Yeah, figuring it out as this develops makes sense to me. One thing I wanna raise is that in general we don't want to expose device or implementation details in the cgroup interface. What we want expressed there are the intentions of the user. The more internal details we expose, the more we end up getting tied down to the specific implementation, which we should avoid, especially given the early stage of development.

Thanks.
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
> So, while it'd be great to have shrinkers in the longer term, it's not a
> strict requirement to be accounted in memcg. It already accounts a
> lot of memory which isn't reclaimable (a lot of slabs and socket
> buffer).

Yeah, having a shrinker is preferred, but the memory should better be reclaimable in some form. If not by any other means, then at least bound to a user process context so that it goes away when a task is killed by the OOM killer. If that is not the case, then we cannot really charge it, because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible compared to the overall size. But from the discussion it seems that these buffers are really large.
Hello, Michal.

On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> Yeah, having a shrinker is preferred, but the memory should better be
> reclaimable in some form. [...] We can tolerate it to some degree if the
> amount of memory charged like that is negligible compared to the overall
> size. But from the discussion it seems that these buffers are really
> large.

Yeah, oom kills should be able to reduce the usage; however, please note that tmpfs, among other things, can already escape this restriction, and we can have cgroups which are over max and empty. It's obviously not ideal, but the system doesn't melt down from it either.

Thanks.
On Tue 10-09-19 09:03:29, Tejun Heo wrote:
> Yeah, oom kills should be able to reduce the usage; however, please
> note that tmpfs, among other things, can already escape this
> restriction, and we can have cgroups which are over max and empty.
> It's obviously not ideal, but the system doesn't melt down from it
> either.

Right, and that is a reason why access to tmpfs should be restricted when containing a workload with memcg. My understanding of this particular feature is that memcg should be the primary containment method, and that's why I brought this up.
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> Yeah, having a shrinker is preferred, but the memory should better be
> reclaimable in some form. If not by any other means, then at least bound
> to a user process context so that it goes away when a task is killed
> by the OOM killer. If that is not the case, then we cannot really charge
> it, because then the memcg controller is of no use. We can tolerate it to
> some degree if the amount of memory charged like that is negligible
> compared to the overall size. But from the discussion it seems that these
> buffers are really large.

I think we can just make "must have a shrinker" a requirement for the system memory cgroup thing for gpu buffers. There's mild locking inversion fun to be had when typing one, but I think the problem is well-understood enough that this isn't a huge hurdle to climb over. And it should give admins an easier to manage system, since it works more like what they know already.
-Daniel
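To sketch what the "must have a shrinker" requirement involves, here is a rough example against the 2019-era kernel shrinker API; the gpu_driver_* hooks are hypothetical stand-ins for a driver's real object lists and eviction logic:

```c
#include <linux/shrinker.h>

/* Hypothetical driver hooks standing in for real eviction logic. */
extern unsigned long gpu_driver_count_evictable_pages(void);
extern unsigned long gpu_driver_evict_pages(unsigned long nr);

/* Report how many objects could be freed right now. */
static unsigned long gpu_shrink_count(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	return gpu_driver_count_evictable_pages();
}

/*
 * Free up to sc->nr_to_scan objects and report how many were freed.
 * The hard part in real drivers is doing this without recursing into
 * direct reclaim while holding gpu locks (the "locking lols" above).
 */
static unsigned long gpu_shrink_scan(struct shrinker *shrinker,
				     struct shrink_control *sc)
{
	return gpu_driver_evict_pages(sc->nr_to_scan);
}

static struct shrinker gpu_shrinker = {
	.count_objects = gpu_shrink_count,
	.scan_objects  = gpu_shrink_scan,
	.seeks         = DEFAULT_SEEKS,
};

/* Called from driver init to hook into memory-pressure reclaim. */
int gpu_register_shrinker(void)
{
	return register_shrinker(&gpu_shrinker);
}
```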