[RFC,0/4] Add support for DRM cgroup memory accounting.

Message ID: 20230503083500.645848-1-maarten.lankhorst@linux.intel.com

Message

Maarten Lankhorst May 3, 2023, 8:34 a.m. UTC
RFC as I'm looking for comments.

For long running compute, it can be beneficial to partition the GPU memory
between cgroups, so each cgroup can use its maximum amount of memory without
interfering with other scheduled jobs. Done properly, this can alleviate the
need for eviction, which might result in a job being terminated if the GPU
doesn't support mid-thread preemption or recoverable page faults.

This is done by adding a bunch of knobs to cgroup:
drm.capacity: Shows maximum capacity of each resource region.
drm.max: Display or limit max amount of memory.
drm.current: Current amount of memory in use.

TTM has not been made cgroup aware yet, so instead of evicting from
the current cgroup to stay within the cgroup limits, it simply returns
the error -ENOSPC to userspace.
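
As a rough illustration of what the TTM patch does (a sketch only; the wrapper
name and exact call site are made up, the real change is a few lines in
ttm_bo.c):

  /*
   * Sketch of the -EAGAIN -> -ENOSPC translation described above.
   * ttm_resource_alloc() is the real TTM entry point; the wrapper is
   * purely illustrative.
   */
  static int example_bo_alloc_resource(struct ttm_buffer_object *bo,
                                       const struct ttm_place *place,
                                       struct ttm_resource **res)
  {
      int ret = ttm_resource_alloc(bo, place, res);

      /*
       * TTM is not cgroup aware yet, so it cannot evict from the
       * cgroup that went over its limit; report the charge failure
       * as -ENOSPC to userspace instead.
       */
      if (ret == -EAGAIN)
          return -ENOSPC;

      return ret;
  }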

I've used Tvrtko's cgroup controller series as a base, but it implemented
scheduling weight, not memory accounting, so I only ended up keeping the
base patch.

Xe is not upstream yet, so the driver specific patch will only apply on
https://gitlab.freedesktop.org/drm/xe/kernel

Maarten Lankhorst (3):
  drm/cgroup: Add memory accounting to DRM cgroup
  drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  drm/xe: Add support for the drm cgroup

Tvrtko Ursulin (1):
  cgroup: Add the DRM cgroup controller

 Documentation/admin-guide/cgroup-v2.rst    |  46 ++
 Documentation/gpu/drm-compute.rst          |  54 ++
 drivers/gpu/drm/ttm/ttm_bo.c               |   4 +-
 drivers/gpu/drm/xe/xe_device.c             |   4 +
 drivers/gpu/drm/xe/xe_device_types.h       |   4 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       |  21 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |   5 +
 include/linux/cgroup_drm.h                 |  90 ++++
 include/linux/cgroup_subsys.h              |   4 +
 init/Kconfig                               |   7 +
 kernel/cgroup/Makefile                     |   1 +
 kernel/cgroup/drm.c                        | 557 +++++++++++++++++++++
 12 files changed, 794 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/drm-compute.rst
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

Comments

Tejun Heo May 5, 2023, 7:50 p.m. UTC | #1
Hello,

On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
> RFC as I'm looking for comments.
> 
> For long running compute, it can be beneficial to partition the GPU memory
> between cgroups, so each cgroup can use its maximum amount of memory without
> interfering with other scheduled jobs. Done properly, this can alleviate the
> need for eviction, which might result in a job being terminated if the GPU
> doesn't support mid-thread preemption or recoverable page faults.
> 
> This is done by adding a bunch of knobs to cgroup:
> drm.capacity: Shows maximum capacity of each resource region.
> drm.max: Display or limit max amount of memory.
> drm.current: Current amount of memory in use.
> 
> TTM has not been made cgroup aware yet, so instead of evicting from
> the current cgroup to stay within the cgroup limits, it simply returns
> the error -ENOSPC to userspace.
> 
> I've used Tvrtko's cgroup controller series as a base, but it implemented
> scheduling weight, not memory accounting, so I only ended up keeping the
> base patch.
> 
> Xe is not upstream yet, so the driver specific patch will only apply on
> https://gitlab.freedesktop.org/drm/xe/kernel

Some high-level feedback.

* There have been multiple attempts at this but the track record is kinda
  poor. People don't seem to agree what should constitute DRM memory and how
  they should be accounted / controlled.

* I like Tvrtko's scheduling patchset because it exposes a generic interface
  which makes sense regardless of hardware details and then each driver can
  implement the configured control in whatever way they can. However, even
  for that, there doesn't seem much buy-in from other drivers.

* This proposal seems narrowly scoped trying to solve a specific problem
  which may not translate to different hardware configurations. Please let
  me know if I got that wrong, but if that's the case, I think a better and
  easier approach might be just being a part of the misc controller. That
  doesn't require much extra code and should be able to provide everything
  necessary for statically limiting specific resources.

Thanks.
Maarten Lankhorst May 10, 2023, 2:59 p.m. UTC | #2
Hey,

On 2023-05-05 21:50, Tejun Heo wrote:
> Hello,
>
> On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
>> RFC as I'm looking for comments.
>>
>> For long running compute, it can be beneficial to partition the GPU memory
>> between cgroups, so each cgroup can use its maximum amount of memory without
>> interfering with other scheduled jobs. Done properly, this can alleviate the
>> need for eviction, which might result in a job being terminated if the GPU
>> doesn't support mid-thread preemption or recoverable page faults.
>>
>> This is done by adding a bunch of knobs to cgroup:
>> drm.capacity: Shows maximum capacity of each resource region.
>> drm.max: Display or limit max amount of memory.
>> drm.current: Current amount of memory in use.
>>
>> TTM has not been made cgroup aware yet, so instead of evicting from
>> the current cgroup to stay within the cgroup limits, it simply returns
>> the error -ENOSPC to userspace.
>>
>> I've used Tvrtko's cgroup controller series as a base, but it implemented
>> scheduling weight, not memory accounting, so I only ended up keeping the
>> base patch.
>>
>> Xe is not upstream yet, so the driver specific patch will only apply on
>> https://gitlab.freedesktop.org/drm/xe/kernel
> Some high-level feedback.
>
> * There have been multiple attempts at this but the track record is kinda
>    poor. People don't seem to agree what should constitute DRM memory and how
>    they should be accounted / controlled.

Thanks for the feedback.

I think that for a lot of drivers, what counts as VRAM may mean something different, but the
intention is that it is accounted in the same way. Most drivers use TTM, which has a standard
way of allocating memory and a standard way of evicting VRAM.

This makes it very useful for the use case I'm looking at: long running compute.
When you have long running jobs, you don't want them to be interrupted because a completely
unrelated process needs some VRAM and one of the compute job's buffers gets evicted.

Some hardware does not support mid-thread preemption or page fault recovery; this means that
when memory is evicted, the compute job is terminated.

The full problem statement is in drm-compute.rst in the memory accounting patch.

> * I like Tvrtko's scheduling patchset because it exposes a generic interface
>    which makes sense regardless of hardware details and then each driver can
>    implement the configured control in whatever way they can. However, even
>    for that, there doesn't seem much buy-in from other drivers.

Yeah, that is correct. But it tries to solve a different part of the problem.

> * This proposal seems narrowly scoped trying to solve a specific problem
>    which may not translate to different hardware configurations. Please let
>    me know if I got that wrong, but if that's the case, I think a better and
>    easier approach might be just being a part of the misc controller. That
>    doesn't require much extra code and should be able to provide everything
>    necessary for statically limiting specific resources.

The misc controller is not granular enough. A single computer may have any number of
graphics cards, some of them with multiple regions of vram inside a single card.

For compute and shared hosting you might want to limit the usage of a single memory
region on a single card, and then apply the same limits to the remaining regions and
cards too, to prevent triggering eviction.

The current version doesn't handle eviction correctly, because I was still working
on it and wanted to post an RFC. As a result, when a resource limit is hit it will
either evict the device's entire memory or get stuck in a loop. With some changes, the
next version will not have this bug; this results in a few changes to the core code. [1]

In the next version, I will move all the code for handling the resource limit to
TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.

The effect of moving the code to TTM is that it makes the code even more generic
for drivers that have vram and use TTM. When using TTM, you only have to describe your
VRAM, update some fields in the TTM manager and (un)register your device with the
cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
nouveau. [2]

If you want to add a knob for scheduling weight for a process, it makes sense to
also add resource usage as a knob; otherwise the effect of that knob is very
limited. So it would make sense even for Tvrtko's originally proposed use case.

Cheers,
~Maarten

--------
[1] Compared to this version:
  static inline int drmcg_try_charge(struct drmcgroup_state **drmcs,
+                                  struct drmcgroup_state **limitcs,
                                    struct drmcgroup_device *cgdev,
                                    u32 index, u64 size)

This now returns which cgroup's limit is hit on -EAGAIN.

+bool drmcs_grouped(struct drmcgroup_state *limitcs,
+                  struct drmcgroup_state *testcs);
Tells if testcs is the same as limitcs, or a subgroup of it. This allows us to
skip evicting when it's unneeded. If we want to add a min, it will make sense
to pass the size too, to skip some subcgroups below min.

+void drmcs_put(struct drmcgroup_state *drmcs);
Drops the limitcs ref.
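
To make the intended flow concrete, here is a minimal sketch of how these
helpers could fit together in the next version (a sketch only, not the actual
code; example_evict_one() is a hypothetical placeholder for TTM's eviction
walk):

  /* Hypothetical: evict one buffer charged to limitcs or a subgroup of it. */
  static bool example_evict_one(struct drmcgroup_device *cgdev, u32 index,
                                struct drmcgroup_state *limitcs);

  static int example_charge_with_eviction(struct drmcgroup_state *drmcs,
                                          struct drmcgroup_device *cgdev,
                                          u32 index, u64 size)
  {
      struct drmcgroup_state *limitcs;
      int ret;

      for (;;) {
          ret = drmcg_try_charge(&drmcs, &limitcs, cgdev, index, size);
          if (ret != -EAGAIN)
              return ret;    /* charged, or a hard error */

          /*
           * limitcs is the cgroup whose limit was hit; only buffers
           * belonging to limitcs or one of its subgroups (tested with
           * drmcs_grouped()) are worth evicting to make room.
           */
          if (!example_evict_one(cgdev, index, limitcs)) {
              drmcs_put(limitcs);
              return -ENOSPC;    /* nothing evictable is left */
          }
          drmcs_put(limitcs);
      }
  }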
-------------------
[2] With the next version, I can very easily implement the cgroup handling on amdgpu too:
- embed a struct drmcgroup_device inside amdgpu_device.
- In amdgpu_vram_mgr_init, populate the struct drmcgroup_device.regions[0] for vram,
   and set ttm_resource_manager->cg to &adev->drmcgroup_device
- Call drmcg_register_device afterwards, and drmcg_unregister_device after cleaning up vram.

So if anyone wants to limit VRAM on amdgpu, qxl or nouveau (left as an exercise for the reader)
afterwards, it will work as intended, while the driver doesn't have to be cgroup aware. A rough
sketch of those steps is below.
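
A very rough sketch of the amdgpu steps from [2] (field and function names
follow the list above where they exist; everything else, including the region
fields and return types, is an assumption and not final):

  struct amdgpu_device_example {
      /* ... existing amdgpu_device members ... */
      struct drmcgroup_device drmcgroup_device;
  };

  static int example_amdgpu_vram_mgr_init(struct amdgpu_device_example *adev,
                                          struct ttm_resource_manager *man,
                                          u64 vram_size)
  {
      /* Describe the vram region so the cgroup controller can account it
       * (the .name/.size fields are assumed, not the actual layout). */
      adev->drmcgroup_device.regions[0].name = "vram0";
      adev->drmcgroup_device.regions[0].size = vram_size;

      /* Let TTM charge allocations from this manager against the cgroup. */
      man->cg = &adev->drmcgroup_device;

      return drmcg_register_device(&adev->drmcgroup_device);
  }

  static void example_amdgpu_vram_mgr_fini(struct amdgpu_device_example *adev)
  {
      /* Unregister after the vram manager has been cleaned up. */
      drmcg_unregister_device(&adev->drmcgroup_device);
  }
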
Tejun Heo May 10, 2023, 6:46 p.m. UTC | #3
Hello,

On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
> The misc controller is not granular enough. A single computer may have any number of
> graphics cards, some of them with multiple regions of vram inside a single card.

Extending the misc controller to support dynamic keys shouldn't be that
difficult.

...
> In the next version, I will move all the code for handling the resource limit to
> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
> 
> The effect of moving the code to TTM, is that it will make the code even more generic
> for drivers that have vram and use TTM. When using TTM, you only have to describe your
> VRAM, update some fields in the TTM manager and (un)register your device with the
> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
> nouveau. [2]
> 
> If you want to add a knob for scheduling weight for a process, it makes sense to
> also add resource usage as a knob, otherwise the effect of that knob is very
> limited. So even for Tvrtko's original proposed usecase, it would make sense.

It does make sense but unlike Tvrtko's scheduling weights what's being
proposed doesn't seem to encapsulate GPU memory resource in a generic enough
manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
specific knowledge of how a specific GPU operates to say "this guy should
get 2x processing power over that guy". This more or less holds for other
major resources including CPU, memory and IO. What you're proposing seems a
lot more tied to hardware details and users would have to know a lot more
about how memory is configured on that particular GPU.

Now, if this is inherent to how all, or at least most, GPUs operate, sure,
but otherwise let's start small in terms of interface and not take up space
which should be for something universal. If this turns out to be the way,
expanding to take up the generic interface space isn't difficult.

I don't know GPU space so please educate me where I'm wrong.

Thanks.
Maarten Lankhorst May 11, 2023, 10:03 a.m. UTC | #4
Hey,

On 2023-05-10 20:46, Tejun Heo wrote:
> Hello,
>
> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>> The misc controller is not granular enough. A single computer may have any number of
>> graphics cards, some of them with multiple regions of vram inside a single card.
> Extending the misc controller to support dynamic keys shouldn't be that
> difficult.
>
> ...
>> In the next version, I will move all the code for handling the resource limit to
>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>
>> The effect of moving the code to TTM, is that it will make the code even more generic
>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>> VRAM, update some fields in the TTM manager and (un)register your device with the
>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>> nouveau. [2]
>>
>> If you want to add a knob for scheduling weight for a process, it makes sense to
>> also add resource usage as a knob, otherwise the effect of that knob is very
>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
> It does make sense but unlike Tvrtko's scheduling weights what's being
> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
> specific knowledge of how a specific GPU operates to say "this guy should
> get 2x processing power over that guy". This more or less holds for other
> major resources including CPU, memory and IO. What you're proposing seems a
> lot more tied to hardware details and users would have to know a lot more
> about how memory is configured on that particular GPU.

There's not much need to know the specifics of a card, but there might
be a need to know the workload to determine what allocation limits to set.

I've left region to be implementation specific, but it would make sense to
standardise it.
TTM, the layer used by drivers that support VRAM, has the following regions:
* sysmem - All allocated system memory; includes evicted VRAM.
* mapped - All physical system memory that is mapped to the GPU; when unbound it
           moves back to sysmem. When evicting VRAM to sysmem, it's temporarily
           mapped here.
* vramN - All VRAM regions of the device.
* driver specific regions - probably doesn't make sense to put in a cgroup at all;
  this includes stolen from the PoC.

That leaves the question: which regions would make sense for a cgroup?
Since vramN can be moved to mapped and sysmem (VRAM eviction, suspend/resume,
driver_madvise), it is debatable whether we should include the other regions,
since things become complicated fast.

For the first iteration, I focus on a single category, vramN.

Even without knowing anything about a GPU, it is easy to partition its
memory like that.

If you can assign a weight for the scheduler, then you can also partition its
vram by parsing drm.capacity for the total amount and then splitting it across
cgroups.
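
For example, a minimal userspace sketch of such partitioning (the cgroup
mount path, the child cgroup names and the "$device region.vram0=$value" key
format are assumptions based on this series, not final uapi):

  #include <stdio.h>

  int main(void)
  {
      static const char *children[] = {
          "/sys/fs/cgroup/jobA/drm.max",
          "/sys/fs/cgroup/jobB/drm.max",
      };
      unsigned long long capacity = 0;
      char dev[64] = "";
      FILE *f = fopen("/sys/fs/cgroup/drm.capacity", "r");
      int i;

      /* Read the first device entry, e.g. "0000:03:00.0 region.vram0=8589934592". */
      if (!f || fscanf(f, "%63s region.vram0=%llu", dev, &capacity) != 2)
          return 1;
      fclose(f);

      /* Give each job half of vram0 on that device. */
      for (i = 0; i < 2; i++) {
          f = fopen(children[i], "w");
          if (!f)
              return 1;
          fprintf(f, "%s region.vram0=%llu\n", dev, capacity / 2);
          fclose(f);
      }
      return 0;
  }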


> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
> but otherwise let's start small in terms of interface and not take up space
> which should be for something universal. If this turns out to be the way,
> expanding to take up the generic interface space isn't difficult.
>
> I don't know GPU space so please educate me where I'm wrong.

Most GPUs have dedicated vram that works roughly in the same way, while some
integrated chips like i915 or arm only use shared memory from the host system.
I would say amd, nvidia and intel chips with dedicated memory all handle vram
roughly in the same way.

I hope this explains it a little bit more,

~Maarten
Tvrtko Ursulin May 11, 2023, 10:14 a.m. UTC | #5
On 10/05/2023 19:46, Tejun Heo wrote:
> Hello,
> 
> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>> The misc controller is not granular enough. A single computer may have any number of
>> graphics cards, some of them with multiple regions of vram inside a single card.
> 
> Extending the misc controller to support dynamic keys shouldn't be that
> difficult.
> 
> ...
>> In the next version, I will move all the code for handling the resource limit to
>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>
>> The effect of moving the code to TTM, is that it will make the code even more generic
>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>> VRAM, update some fields in the TTM manager and (un)register your device with the
>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>> nouveau. [2]
>>
>> If you want to add a knob for scheduling weight for a process, it makes sense to
>> also add resource usage as a knob, otherwise the effect of that knob is very
>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
> 
> It does make sense but unlike Tvrtko's scheduling weights what's being
> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
> specific knowledge of how a specific GPU operates to say "this guy should
> get 2x processing power over that guy". This more or less holds for other
> major resources including CPU, memory and IO. What you're proposing seems a
> lot more tied to hardware details and users would have to know a lot more
> about how memory is configured on that particular GPU.
> 
> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
> but otherwise let's start small in terms of interface and not take up space
> which should be for something universal. If this turns out to be the way,
> expanding to take up the generic interface space isn't difficult.
> 
> I don't know GPU space so please educate me where I'm wrong.

I was hoping it might be passable in principle (in some form, pending 
discussion on semantics) given how Maarten's proposal starts with only 
very specific per-device, per-memory-region controls, which are 
applicable to many devices. And a hard limit at that, which probably has 
less complicated semantics, or at least a less complicated implementation.

My understanding of the proposal is that the allocation either fits, or 
it evicts from the parent's hierarchy (if possible) and then fits, or it 
fails. Which sounds simple enough.

I do however agree that it is a limited use case. So from the negative 
side of the debating camp I have to ask whether this use case could simply 
be satisfied by providing a driver/device-global overcommit yes/no 
control. In other words, is it really important to partition the vram 
space ahead of time, and from the kernel side too? Wouldn't the relevant 
(highly specialised) applications work just fine with global overcommit 
disabled, even if the option to configure their maximum allowed working 
set from the userspace side was needed?

Or if we conclude cgroup controller is the way to go, would adding less 
specific limits make it more palatable? I am thinking here some generic 
"device memory resident". Not per device, not per memory region. So 
userspace/admins have some chance of configuring generic limits. That 
would require coming up with more generic use cases though so another 
thing to discuss. Like who would use that and for what.

Assuming also we can agree that "device memory resident" is a 
stable/solid concept across drm. Should be easier than for integrated 
GPUs, for which I have to admit I currently don't remember if 
allocations are already consistently covered by the memory controller. 
Even if they are, ownership is probably wrong.

Group ownership is possibly a concern in this proposal too. Because I 
remember the previous attempt of adding some drm stats to memcg 
explained that for instance on Android all buffers are allocated by a 
central process and then handed over to other processes. So transferring 
ownership was explained as critical.

Regards,

Tvrtko

P.S.
On the matter of naming the uapi interface - in any case I wouldn't use 
the "unqualified" drm namespace such as drm.max/current/capacity. I 
think all those should include a specific prefix to signify it is about 
memory. In some way.
Maarten Lankhorst May 11, 2023, 7:58 p.m. UTC | #6
Hey,

On 2023-05-11 12:14, Tvrtko Ursulin wrote:
>
> On 10/05/2023 19:46, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
>>> The misc controller is not granular enough. A single computer may have any number of
>>> graphics cards, some of them with multiple regions of vram inside a single card.
>>
>> Extending the misc controller to support dynamic keys shouldn't be that
>> difficult.
>>
>> ...
>>> In the next version, I will move all the code for handling the resource limit to
>>> TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.
>>>
>>> The effect of moving the code to TTM, is that it will make the code even more generic
>>> for drivers that have vram and use TTM. When using TTM, you only have to describe your
>>> VRAM, update some fields in the TTM manager and (un)register your device with the
>>> cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
>>> nouveau. [2]
>>>
>>> If you want to add a knob for scheduling weight for a process, it makes sense to
>>> also add resource usage as a knob, otherwise the effect of that knob is very
>>> limited. So even for Tvrtko's original proposed usecase, it would make sense.
>>
>> It does make sense but unlike Tvrtko's scheduling weights what's being
>> proposed doesn't seem to encapsulate GPU memory resource in a generic enough
>> manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
>> specific knowledge of how a specific GPU operates to say "this guy should
>> get 2x processing power over that guy". This more or less holds for other
>> major resources including CPU, memory and IO. What you're proposing seems a
>> lot more tied to hardware details and users would have to know a lot more
>> about how memory is configured on that particular GPU.
>>
>> Now, if this is inherent to how all, or at least most, GPUs operate, sure,
>> but otherwise let's start small in terms of interface and not take up space
>> which should be for something universal. If this turns out to be the way,
>> expanding to take up the generic interface space isn't difficult.
>>
>> I don't know GPU space so please educate me where I'm wrong.
>
> I was hoping it might be passable in principle (in some form, pending discussion on semantics) given how Maarten's proposal starts with only very specific per-device-per-memory regions controls, which is applicable to many devices. And hard limit at that, which probably has less complicated semantics, or at least implementation.
>
> My understanding of the proposal is that the allocation either fits, or it evicts from the parent's hierarchy (if possible) and then fits, or it fails. Which sounds simple enough.

Yeah, for vram it's that simple. I think for the mapped and sysmem regions we may need the
possibility to ignore the limits as well, since it's possible to move to those regions as a
result of eviction. We probably don't want eviction to fail because of limits that are too low.

> I do however agree that it is a limited use case. So from the negative side of the debating camp I have to ask if this use case could be simply satisfied by providing a driver/device global over commit yes/no control? In other words, is it really important to partition the vram space ahead of time, and from the kernel side too? Wouldn't the relevant (highly specialised) applications work just fine with global over commit disabled? Even if the option to configure their maximum allowed working set from the userspace side was needed.

Disabling overcommit? Do you mean pinning the workload's memory? This causes a denial of service if
done without limits, and that's what we're trying to avoid. There is no need for complete immunity
from eviction. Overcommit is still useful inside the cgroup itself; we only want immunity from being
evicted by another process performing another workload.

> Or if we conclude cgroup controller is the way to go, would adding less specific limits make it more palatable? I am thinking here some generic "device memory resident". Not per device, not per memory region. So userspace/admins have some chance of configuring generic limits. That would require coming up with more generic use cases though so another thing to discuss. Like who would use that and for what.

You would run into ambiguity with that. I think it's fine to assume any number of vram regions. In most cases, the number is 1.

I don't see that number going much higher; 2 is already on the high side. A simple controller could just halve each region separately.

> Assuming also we can agree that "device memory resident" is a stable/solid concept across drm. Should be easier than for integrated GPUs, for which I have to admit I currently don't remember if allocations are already consistently covered by the memory controller. Even if they are, ownership is probably wrong.

Likely, especially when evicting another process' memory. That needs some thought; we likely have to keep the original cgroup as the owner.

> Group ownership is possibly a concern in this proposal too. Because I remember the previous attempt of adding some drm stats to memcg explained that for instance on Android all buffers are allocated by a central process and then handed over to other processes. So transferring ownership was explained as critical.

Is this done using dma-buf? Ownership could be handed over on import then. If not, what is the mechanism they use?
> Regards,
>
> Tvrtko
>
> P.S.
> On the matter of naming the uapi interface - in any case I wouldn't use the "unqualified" drm namespace such as drm.max/current/capacity. I think all those should include a specific prefix to signify it is about memory. In some way.

I've deliberately added the region to each key, so drm.capacity/max/current contains entries of the form "$pciid region.vram0=$value". That way, if we want to add non-memory resources, it becomes possible to do so by choosing a different prefix.
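
Purely as an illustration (the device identifiers and the "max" keyword are
assumptions here, not final uapi), a drm.max file could then contain entries
like:

  0000:03:00.0 region.vram0=4294967296
  0000:07:00.0 region.vram0=max region.vram1=max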

For example, the number of engines that can be created could also be added to those files, but that might be more driver specific.

I tried to keep it generic like that; this way I didn't immediately pollute the entire namespace.

Of course, if needed, we can make it drm.region_max etc. and drop the region prefix from the cgroup key entries instead.

Cheers,

Maarten