Message ID | 20190829060533.32315-1-Kenny.Ho@amd.com (mailing list archive) |
---|---|
Series | new cgroup controller for gpu/drm subsystem |
Hello,

I just glanced through the interface and don't have enough context to give any kind of detailed review yet. I'll try to read up and understand more, and would greatly appreciate it if you can give me some pointers on the resources being controlled and what the actual use cases would look like. That said, I have some basic concerns.

* The TTM vs. GEM distinction seems to be an internal implementation detail rather than anything relating to underlying physical resources. Provided that's the case, I'm afraid these internal constructs being used as primary resource control objects likely isn't the right approach. Whether a given driver uses one or the other internal abstraction layer shouldn't determine how resources are represented at the userland interface layer.

* While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users, and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg: putting control of different uses of memory under separate knobs made the whole thing pretty useless. e.g. if you constrain all the knobs tightly enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how it's used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid the gpu controller is repeating the same mistakes.

Thanks.
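To make the memcg comparison concrete: under cgroup1, admins had to tune several siloed memory knobs per group, while cgroup2 collapses them into one. A rough shell illustration follows (the knob names are the actual memcg interface files; the cgroup paths are made up for the example):

```sh
# cgroup1 memcg: separate limits for different uses of memory
echo 4G   > /sys/fs/cgroup/memory/app/memory.limit_in_bytes          # user memory
echo 1G   > /sys/fs/cgroup/memory/app/memory.kmem.limit_in_bytes     # kernel memory
echo 256M > /sys/fs/cgroup/memory/app/memory.kmem.tcp.limit_in_bytes # tcp buffers

# cgroup2 memcg: one limit on physical memory, however it is used
echo 4G > /sys/fs/cgroup/app/memory.max
```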
On Fri, Aug 30, 2019 at 09:28:57PM -0700, Tejun Heo wrote:
> Hello,
>
> I just glanced through the interface and don't have enough context to
> give any kind of detailed review yet. [...]
>
> * The TTM vs. GEM distinction seems to be an internal implementation
>   detail rather than anything relating to underlying physical resources.
>   [...] Whether a given driver uses one or the other internal
>   abstraction layer shouldn't determine how resources are represented
>   at the userland interface layer.

Yeah, there's another RFC series from Brian Welty to abstract this away as a memory region concept for gpus.

> * While breaking up and applying control to different types of internal
>   objects may seem attractive to folks who work day in and day out with
>   the subsystem, they aren't all that useful to users, and the siloed
>   controls are likely to make the whole mechanism a lot less useful.
>   [...] For memcg, what has to be allocated and controlled is physical
>   memory, no matter how it's used. It's not like you can go buy more
>   "socket" memory. At least from the looks of it, I'm afraid the gpu
>   controller is repeating the same mistakes.

We do have quite a pile of different memories and ranges, so I don't think we're doing the same mistake here. But it is maybe a bit too complicated, and exposes stuff that most users really don't care about.
-Daniel
On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
> This is a follow-up to the RFC I made previously to introduce a cgroup
> controller for the GPU/DRM subsystem [v1,v2,v3]. The goal is to be able to
> provide resource management for GPU resources using things like containers.
>
> With this RFC v4, I am hoping to have some consensus on a merge plan. I believe
> the GEM related resources (drm.buffer.*) introduced in the previous RFC and,
> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are
> uncontroversial and ready to move out of RFC and into a more formal review. I
> will continue to work on the memory backend resources (drm.memory.*).
>
> The cover letter from v1 is copied below for reference.
>
> [v1]: https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
> [v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html

So looking at all this, it doesn't seem to have changed much, and the old discussion didn't really conclude anywhere (aside from some details).

One more open thought that crossed my mind, having read a ton of ttm again recently: How does this all interact with the ttm global limits? I'd say the ttm global limits are the ur-cgroup we have in drm, and not looking at that seems kinda bad.
-Daniel

> v4:
> Unchanged (no review needed)
> * drm.memory.*/ttm resources (Patches 9-13; I am still working on memory
>   bandwidth and shrinker)
> Based on feedback on v3:
> * updated nomenclature to drmcg
> * embedded per-device drmcg properties into drm_device
> * split GEM buffer related commits into stats and limit
> * renamed functions to align with convention
> * combined buffer accounting and check into a try_charge function
> * support buffer stats without limit enforcement
> * removed GEM buffer sharing limitation
> * updated documentation
> New features:
> * introduced the logical GPU concept
> * example implementation with AMD KFD
>
> v3:
> Based on feedback on v2:
> * removed .help type file from v2
> * conform to cgroup convention for default and max handling
> * conform to cgroup convention for addressing device specific limits (with major:minor)
> New functionality:
> * adopted memparse for memory size related attributes
> * added a macro to marshal drmcgrp cftype private data (DRMCG_CTF_PRIV, etc.)
> * added ttm buffer usage stats (per cgroup, for system, tt, vram)
> * added ttm buffer usage limit (per cgroup, for vram)
> * added per cgroup bandwidth stats and limiting (burst and average bandwidth)
>
> v2:
> * removed the vendoring concepts
> * added a limit on total buffer allocation
> * added a limit on the maximum size of a buffer allocation
>
> v1: cover letter
>
> The purpose of this patch series is to start a discussion for a generic cgroup
> controller for the drm subsystem. The design proposed here is a very early one.
> We are hoping to engage the community as we develop the idea.
>
> Background
> ==========
> Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of
> tasks, and all their future children, into hierarchical groups with specialized
> behaviour, such as accounting for or limiting the resources which processes in
> a cgroup can access [1]. Weights, limits, protections and allocations are the
> main resource distribution models. Existing cgroup controllers include cpu,
> memory, io, rdma, and more. cgroup is one of the foundational technologies that
> enables the popular container application deployment and management method.
>
> Direct Rendering Manager/drm contains code intended to support the needs of
> complex graphics devices. Graphics drivers in the kernel may make use of DRM
> functions to make tasks like memory management, interrupt handling and DMA
> easier, and to provide a uniform interface to applications. DRM has also
> developed beyond traditional graphics applications to support compute/GPGPU
> applications.
>
> Motivations
> ===========
> As GPUs grow beyond the realm of desktop/workstation graphics into areas like
> data center clusters and IoT, there is an increasing need to monitor and
> regulate GPUs as a resource like cpu, memory and io.
>
> Matt Roper from Intel began working on a similar idea in early 2018 [2] for the
> purpose of managing GPU priority using the cgroup hierarchy. While that
> particular use case may not warrant a standalone drm cgroup controller, there
> are other use cases where having one can be useful [3]. Monitoring GPU
> resources such as VRAM and buffers, CUs (compute units, AMD's nomenclature) /
> EUs (execution units, Intel's nomenclature), and GPU job scheduling [4] can
> help sysadmins get a better understanding of an application's usage profile.
> Further regulation of the aforementioned resources can also help sysadmins
> optimize workload deployment on limited GPU resources.
>
> With the increased importance of machine learning, data science and other
> cloud-based applications, GPUs are already in production use in data centers
> today [5,6,7]. Existing GPU resource management is very coarse-grained,
> however, as sysadmins are only able to distribute workloads on a per-GPU basis
> [8]. An alternative is to use GPU virtualization (with or without SR-IOV), but
> it generally acts on the entire GPU instead of on specific resources in a GPU.
> With a drm cgroup controller, we can enable alternate, fine-grained, sub-GPU
> resource management (in addition to what may be available via GPU
> virtualization).
>
> In addition to production use, the DRM cgroup can also help with testing
> graphics application robustness by providing a means to artificially limit the
> DRM resources available to the applications.
>
> Challenges
> ==========
> While there is common infrastructure in DRM that is shared across many vendors
> (the scheduler [4], for example), there are also aspects of DRM that are vendor
> specific. To accommodate this, we borrowed the mechanism used by cgroup to
> handle different kinds of cgroup controllers.
>
> Resources for DRM are also often device (GPU) specific instead of system
> specific, and a system may contain more than one GPU. For this, we borrowed
> some of the ideas from the RDMA cgroup controller.
>
> Approach
> ========
> To experiment with the idea of a DRM cgroup, we would like to start with basic
> accounting and statistics, then continue to iterate and add regulating
> mechanisms into the driver.
>
> [1] https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
> [2] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
> [3] https://www.spinics.net/lists/cgroups/msg20720.html
> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler
> [5] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
> [6] https://blog.openshift.com/gpu-accelerated-sql-queries-with-postgresql-pg-strom-in-openshift-3-10/
> [7] https://github.com/RadeonOpenCompute/k8s-device-plugin
> [8] https://github.com/kubernetes/kubernetes/issues/52757
>
> Kenny Ho (16):
>   drm: Add drm_minor_for_each
>   cgroup: Introduce cgroup for drm subsystem
>   drm, cgroup: Initialize drmcg properties
>   drm, cgroup: Add total GEM buffer allocation stats
>   drm, cgroup: Add peak GEM buffer allocation stats
>   drm, cgroup: Add GEM buffer allocation count stats
>   drm, cgroup: Add total GEM buffer allocation limit
>   drm, cgroup: Add peak GEM buffer allocation limit
>   drm, cgroup: Add TTM buffer allocation stats
>   drm, cgroup: Add TTM buffer peak usage stats
>   drm, cgroup: Add per cgroup bw measure and control
>   drm, cgroup: Add soft VRAM limit
>   drm, cgroup: Allow more aggressive memory reclaim
>   drm, cgroup: Introduce lgpu as DRM cgroup resource
>   drm, cgroup: add update trigger after limit change
>   drm/amdgpu: Integrate with DRM cgroup
>
>  Documentation/admin-guide/cgroup-v2.rst       |  163 +-
>  Documentation/cgroup-v1/drm.rst               |    1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   29 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |    6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |    3 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |    6 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |    3 +
>  .../amd/amdkfd/kfd_process_queue_manager.c    |  140 ++
>  drivers/gpu/drm/drm_drv.c                     |   26 +
>  drivers/gpu/drm/drm_gem.c                     |   16 +-
>  drivers/gpu/drm/drm_internal.h                |    4 -
>  drivers/gpu/drm/ttm/ttm_bo.c                  |   93 ++
>  drivers/gpu/drm/ttm/ttm_bo_util.c             |    4 +
>  include/drm/drm_cgroup.h                      |  122 ++
>  include/drm/drm_device.h                      |    7 +
>  include/drm/drm_drv.h                         |   23 +
>  include/drm/drm_gem.h                         |   13 +-
>  include/drm/ttm/ttm_bo_api.h                  |    2 +
>  include/drm/ttm/ttm_bo_driver.h               |   10 +
>  include/linux/cgroup_drm.h                    |  151 ++
>  include/linux/cgroup_subsys.h                 |    4 +
>  init/Kconfig                                  |    5 +
>  kernel/cgroup/Makefile                        |    1 +
>  kernel/cgroup/drm.c                           | 1367 +++++++++++++++++
>  25 files changed, 2193 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/cgroup-v1/drm.rst
>  create mode 100644 include/drm/drm_cgroup.h
>  create mode 100644 include/linux/cgroup_drm.h
>  create mode 100644 kernel/cgroup/drm.c
>
> --
> 2.22.0
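As a concrete illustration of the conventions described in the changelog (per-device limits addressed by major:minor, sizes parsed with memparse), usage might look something like the sketch below. The exact knob names are inferred from the drm.buffer.* naming and standard cgroup conventions, not confirmed against the patches; 226 is the DRM character device major:

```sh
# Cap total GEM buffer allocations for this cgroup on device 226:0.
# memparse-style suffixes (K/M/G) are accepted; "max" removes the limit.
mkdir /sys/fs/cgroup/gpu-job
echo "226:0 512M" > /sys/fs/cgroup/gpu-job/drm.buffer.total.max
echo "226:0 max"  > /sys/fs/cgroup/gpu-job/drm.buffer.total.max

# Read back per-device allocation stats for profiling.
cat /sys/fs/cgroup/gpu-job/drm.buffer.total.stats
```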
Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
>> This is a follow-up to the RFC I made previously to introduce a cgroup
>> controller for the GPU/DRM subsystem [v1,v2,v3]. [...]
> So looking at all this, it doesn't seem to have changed much, and the old
> discussion didn't really conclude anywhere (aside from some details).
>
> One more open thought that crossed my mind, having read a ton of ttm
> again recently: How does this all interact with the ttm global limits?
> I'd say the ttm global limits are the ur-cgroup we have in drm, and not
> looking at that seems kinda bad.

At least my hope was to completely replace the ttm globals with the limitations here when this is ready.

Christian.

> -Daniel
>> [snip]
On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian <Christian.Koenig@amd.com> wrote:
> Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> > One more open thought that crossed my mind, having read a ton of ttm
> > again recently: How does this all interact with the ttm global limits?
> > I'd say the ttm global limits are the ur-cgroup we have in drm, and not
> > looking at that seems kinda bad.
>
> At least my hope was to completely replace the ttm globals with the
> limitations here when this is ready.

You need more, at least some kind of shrinker to cut down bos placed in system memory when we're under memory pressure. Which drags in a pretty epic amount of locking lols (see i915's shrinker fun, where we attempt that). Probably another good idea to share at least some concepts, maybe even code.
-Daniel

> [snip]
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > * While breaking up and applying control to different types of internal
> >   objects may seem attractive to folks who work day in and day out with
> >   the subsystem, they aren't all that useful to users, and the siloed
> >   controls are likely to make the whole mechanism a lot less useful.
> >   [...] At least from the looks of it, I'm afraid the gpu controller is
> >   repeating the same mistakes.
>
> We do have quite a pile of different memories and ranges, so I don't
> think we're doing the same mistake here. But it is maybe a bit too

I see. One thing which caught my eye was the system memory control. Shouldn't that be controlled by memcg? Is there something special about system memory used by gpus?

> complicated, and exposes stuff that most users really don't care about.

Could be from me not knowing much about gpus, but it definitely looks too complex to me. I don't see how users would be able to allocate vram, system memory and GART with reasonable accuracy. memcg on cgroup2 deals with just a single number, and that's already plenty challenging.

Thanks.
Hi Tejun,

Thanks for looking into this. I can definitely help where I can, and I am sure other experts will jump in if I start misrepresenting the reality :) (as Daniel has already done.)

Regarding your points, my understanding is that there isn't really a TTM vs GEM situation anymore (there is an lwn.net article about that, but it is more than a decade old.) I believe GEM is the common interface at this point, and more and more features are being refactored into it. For example, AMD's driver uses TTM internally, but things are exposed via the GEM interface. This GEM resource is actually the single-number resource you just referred to. A GEM buffer (the drm.buffer.* resources) can be backed by VRAM, system memory or other types of memory. The more fine-grained control is the drm.memory.* resources, which still need more discussion. (Some of the functionalities in TTM are being refactored into the GEM level; I have seen some patches that make TTM a subclass of GEM.)

This RFC can be grouped into 3 areas, and they are fairly independent, so they can be reviewed separately: high level device memory control (buffer.*), fine-grained memory control and bandwidth (memory.*), and compute resources (lgpu.*). I think the memory.* resources are the most controversial part, but I think they're still needed.

Perhaps an analogy may help. A system has CPUs and memory, and within memory, it can be backed by RAM or swap. A GPU device has lgpus and buffers, and within the buffers, it can be backed by VRAM, system RAM or even swap. As for setting the right amount, I think that's where the profiling aspect of the *.stats comes in. And while one can't necessarily buy more VRAM, it is still a useful knob to adjust if the intention is to pack more work into a GPU device with predictable performance. This research on various GPU workloads may be of interest:

A Taxonomy of GPGPU Performance Scaling
http://www.computermachines.org/joe/posters/iiswc2015_taxonomy.pdf
http://www.computermachines.org/joe/publications/pdfs/iiswc2015_taxonomy.pdf

(summary: GPU workloads can be memory bound or compute bound, so it's possible to pack different workloads together to improve utilization.)

Regards,
Kenny

On Tue, Sep 3, 2019 at 2:50 PM Tejun Heo <tj@kernel.org> wrote:
> [snip]
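To picture how the buffer.* accounting Kenny describes could hang together, here is a minimal, self-contained sketch of a hierarchical try-charge in the spirit of the "combined buffer accounting and check into a try_charge function" item from the v4 changelog. All names are illustrative, and real code would need locking:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-device buffer accounting state. */
struct drmcg_device_resource {
	struct drmcg_device_resource *parent; /* NULL at the root */
	uint64_t bo_limit;   /* per-device buffer limit, in bytes */
	uint64_t bo_usage;   /* total GEM buffer bytes charged so far */
};

/*
 * Walk from the allocating cgroup up to the root and fail the whole
 * charge if any ancestor would exceed its limit; only then commit.
 */
static bool drmcg_try_charge(struct drmcg_device_resource *res, uint64_t size)
{
	struct drmcg_device_resource *p;

	for (p = res; p; p = p->parent)
		if (p->bo_usage + size > p->bo_limit)
			return false;	/* over limit somewhere up the tree */

	for (p = res; p; p = p->parent)
		p->bo_usage += size;	/* commit the charge at every level */
	return true;
}

static void drmcg_uncharge(struct drmcg_device_resource *res, uint64_t size)
{
	struct drmcg_device_resource *p;

	for (p = res; p; p = p->parent)
		p->bo_usage -= size;
}
```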
On Tue, Sep 3, 2019 at 5:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> > At least my hope was to completely replace the ttm globals with the
> > limitations here when this is ready.
>
> You need more, at least some kind of shrinker to cut down bos placed in
> system memory when we're under memory pressure. Which drags in a pretty
> epic amount of locking lols (see i915's shrinker fun, where we attempt
> that). Probably another good idea to share at least some concepts,
> maybe even code.

I am still looking into your shrinker suggestion, so the memory.* resources are untouched from RFC v3. The main change for the buffer.* resources is the removal of the buffer sharing restriction, as you suggested, plus additional documentation of that behaviour. (I may have neglected to mention it in the cover.) The other key part of RFC v4 is the "logical GPU/lgpu" concept. I am hoping to get it out there early for feedback while I continue to work on the memory.* parts.

Kenny

> [snip]
On Tue, Sep 3, 2019 at 8:50 PM Tejun Heo <tj@kernel.org> wrote:
> I see. One thing which caught my eye was the system memory control.
> Shouldn't that be controlled by memcg? Is there something special
> about system memory used by gpus?

I think system memory separate from vram makes sense. For one, vram is like 10x+ faster than system memory, so we definitely want to have good control on that. But maybe we only want one vram bucket overall for the entire system?

The trouble with system memory is that gpu tasks pin that memory to prep execution. There are two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with direct reclaim recursion, which often means we can't free memory, and we're angering the oom killer a lot. Plus it introduces real bad latency spikes everywhere (gpu workloads are occasionally really slow, think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.

I do think a global system memory limit to tame the shrinker, without the ttm approach of possibly just wasting half your memory, could be useful.

> > complicated, and exposes stuff that most users really don't care about.
>
> Could be from me not knowing much about gpus, but it definitely looks too
> complex to me. I don't see how users would be able to allocate vram,
> system memory and GART with reasonable accuracy. memcg on cgroup2
> deals with just a single number, and that's already plenty challenging.

Yeah, especially wrt GART and some of the other more specialized things, I don't think there's any modern gpu where you can actually run out of that stuff. At least not before you run out of every other kind of memory (GART is just a remapping table to make system memory visible to the gpu).

I'm also not sure of the bw limits, given all the fun we have on the block io cgroups side. Aside from that, the current bw limit only controls the bw the kernel uses; userspace can submit unlimited amounts of copying commands that use the same pcie links directly to the gpu, bypassing this cg knob. Also, controlling execution time for gpus is very tricky, since they work a lot more like a block io device, or maybe a network controller with packet scheduling, than a cpu.
-Daniel
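To contrast the two approaches Daniel lists, here is a minimal sketch of the ttm-style flat global limit (illustrative names, not the actual ttm_memory code, which tracks several zones); unlike the hierarchical per-cgroup charging sketched earlier, there is a single system-wide bucket:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a ttm-style global cap on gpu-pinned system memory:
 * one system-wide bucket, conventionally set to 50% of RAM.
 */
struct gpu_global_mem {
	uint64_t limit;	/* e.g. total_ram_bytes / 2 */
	uint64_t used;	/* bytes currently pinned for the gpu */
};

static bool gpu_global_reserve(struct gpu_global_mem *glob, uint64_t size)
{
	if (glob->used + size > glob->limit)
		return false;	/* caller must evict buffers or fail */
	glob->used += size;
	return true;
}

static void gpu_global_release(struct gpu_global_mem *glob, uint64_t size)
{
	glob->used -= size;
}
```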
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> I think system memory separate from vram makes sense. [...]
>
> I do think a global system memory limit to tame the shrinker, without
> the ttm approach of possibly just wasting half your memory, could be
> useful.

Hmm... what'd be the fundamental difference from slab or socket memory, which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?

> I'm also not sure of the bw limits, given all the fun we have on the
> block io cgroups side. [...] Also, controlling execution time for
> gpus is very tricky, since they work a lot more like a block io device,
> or maybe a network controller with packet scheduling, than a cpu.

At the system level, it just gets folded into cpu time, which isn't perfect but is usually a good enough approximation of compute related dynamic resources. Can gpu do something similar or at least start with that?

Thanks.
On Fri, Sep 6, 2019 at 5:23 PM Tejun Heo <tj@kernel.org> wrote:
> Hmm... what'd be the fundamental difference from slab or socket memory,
> which are handled through memcg? Does system memory used by GPUs have
> further global restrictions in addition to the amount of physical
> memory used?

Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...

I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate. The overall gpu memory controller would still be outside of memcg, I think, since that would include swapped-out gpu objects, and stuff in special memory regions like vram.

> At the system level, it just gets folded into cpu time, which isn't
> perfect but is usually a good enough approximation of compute related
> dynamic resources. Can gpu do something similar or at least start with
> that?

So generally there's a pile of engines, often of different types (e.g. amd hw has an entire pile of copy engines), with some ill-defined sharing characteristics for some (often compute/render engines use the same shader cores underneath), kinda like hyperthreading. So at that detail it's all extremely hw specific, and probably too hard to control in a useful way for users. And I'm not sure we can really do a reasonable knob for overall gpu usage, e.g. if we include all the copy engines, but the workloads are only running on compute engines, then you might only get 10% overall utilization by engine-time, while the shaders (which are most of the chip area/power consumption) are actually at 100%. On top, with many userspace apis those engines are an internal implementation detail of a more abstract gpu device (e.g. opengl), but with others, this is all fully exposed (like vulkan).

Plus the kernel needs to use at least the copy engines for vram management itself, and you really can't take that away. Although Kenny here has some proposal for a separate cgroup resource just for that.

I just think it's all a bit too ill-defined, and we might be better off nailing the memory side first and getting some real world experience with this stuff. For context, there's not even a cross-driver standard for how priorities are handled; that's all driver-specific interfaces.
-Daniel
Hello, Daniel.

On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> Sometimes, but that would be specific resources (kinda like vram),
> e.g. CMA regions used by a gpu. But probably not something you'll run
> in a datacenter and want cgroups for ...
>
> I guess we could try to integrate with the memcg group controller. One
> trouble is that aside from i915 most gpu drivers do not really have a
> full shrinker, so not sure how that would all integrate.

So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).

> The overall gpu memory controller would still be outside of memcg, I
> think, since that would include swapped-out gpu objects, and stuff in
> special memory regions like vram.

Yeah, for resources which are on the GPU itself or hard limitations arising from it. In general, we wanna make cgroup controllers control something real and concrete, as in physical resources.

> I just think it's all a bit too ill-defined, and we might be better off
> nailing the memory side first and getting some real world experience
> with this stuff. For context, there's not even a cross-driver standard
> for how priorities are handled; that's all driver-specific interfaces.

I see. Yeah, figuring it out as this develops makes sense to me. One thing I wanna raise is that in general we don't want to expose device or implementation details in the cgroup interface. What we want expressed there are the intentions of the user. The more internal details we expose, the more we end up getting tied down to the specific implementation, which we should avoid, especially given the early stage of development.

Thanks.
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
> So, while it'd be great to have shrinkers in the longer term, it's not a
> strict requirement to be accounted in memcg. It already accounts a
> lot of memory which isn't reclaimable (a lot of slabs and socket
> buffer).

Yeah, having a shrinker is preferred, but the memory should better be reclaimable in some form. If not by any other means, then at least bound to a user process context so that it goes away when a task is killed by the OOM killer. If that is not the case, then we cannot really charge it, because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible compared to the overall size. But from the discussion it seems that these buffers are really large.
Hello, Michal.

On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> Yeah, having a shrinker is preferred, but the memory should better be
> reclaimable in some form. [...] We can tolerate it to some degree if the
> amount of memory charged like that is negligible compared to the overall
> size. But from the discussion it seems that these buffers are really
> large.

Yeah, oom kills should be able to reduce the usage; however, please note that tmpfs, among other things, can already escape this restriction, and we can have cgroups which are over max and empty. It's obviously not ideal, but the system doesn't melt down from it either.

Thanks.
On Tue 10-09-19 09:03:29, Tejun Heo wrote:
> Yeah, oom kills should be able to reduce the usage; however, please
> note that tmpfs, among other things, can already escape this
> restriction, and we can have cgroups which are over max and empty.
> It's obviously not ideal, but the system doesn't melt down from it
> either.

Right, and that is a reason why access to tmpfs should be restricted when containing a workload with memcg. My understanding of this particular feature is that memcg should be the primary containment method, and that's why I brought this up.
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> Yeah, having a shrinker is preferred, but the memory should better be
> reclaimable in some form. If not by any other means, then at least bound
> to a user process context so that it goes away when a task is killed
> by the OOM killer. If that is not the case, then we cannot really charge
> it, because then the memcg controller is of no use. We can tolerate it to
> some degree if the amount of memory charged like that is negligible
> compared to the overall size. But from the discussion it seems that these
> buffers are really large.

I think we can just make "must have a shrinker" a requirement for the system memory cgroup thing for gpu buffers. There's mild locking inversion fun to be had when typing one, but I think the problem is well-understood enough that this isn't a huge hurdle to climb over. And it should give admins an easier to manage system, since it works more like what they know already.
-Daniel
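To sketch what the "must have a shrinker" requirement involves, here is a rough example against the 2019-era kernel shrinker API; the gpu_driver_* hooks are hypothetical stand-ins for a driver's real object lists and eviction logic:

```c
#include <linux/shrinker.h>

/* Hypothetical driver hooks standing in for real eviction logic. */
extern unsigned long gpu_driver_count_evictable_pages(void);
extern unsigned long gpu_driver_evict_pages(unsigned long nr);

/* Report how many objects could be freed right now. */
static unsigned long gpu_shrink_count(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	return gpu_driver_count_evictable_pages();
}

/*
 * Free up to sc->nr_to_scan objects and report how many were freed.
 * The hard part in real drivers is doing this without recursing into
 * direct reclaim while holding gpu locks (the "locking lols" above).
 */
static unsigned long gpu_shrink_scan(struct shrinker *shrinker,
				     struct shrink_control *sc)
{
	return gpu_driver_evict_pages(sc->nr_to_scan);
}

static struct shrinker gpu_shrinker = {
	.count_objects = gpu_shrink_count,
	.scan_objects  = gpu_shrink_scan,
	.seeks         = DEFAULT_SEEKS,
};

/* Called from driver init to hook into memory-pressure reclaim. */
int gpu_register_shrinker(void)
{
	return register_shrinker(&gpu_shrinker);
}
```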