[0/7] kernel/cgroups: Add "dev" memory accounting cgroup.

Message ID: 20241023075302.27194-1-maarten.lankhorst@linux.intel.com

Message
Maarten Lankhorst Oct. 23, 2024, 7:52 a.m. UTC
New submission!
I've added documentation for each call and integrated the renaming from
drm cgroup to dev cgroup, based on Maxime Ripard's work.

Maxime has been testing this with dma-buf heaps and v4l2 too, and it seems to work.
For this initial submission, I've decided to add only the smallest enablement possible,
to reduce the chance of breaking things.

The API has changed slightly, from "$name region.$regionname=$limit" in files called
dev.min/low/max, to "$subsystem/$name $regionname=$limit" in files called dev.region.min/low/max.
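
As an illustration (the device and region names here are made up, purely to show the
syntax), limiting a cgroup's vram usage on a single drm device would then look
something like:

  echo "drm/0000:03:00.0 vram0=2G" > dev.region.max

with "drm" as the subsystem, "0000:03:00.0" as the device name and "vram0" as the region.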

This hopefully allows us to extend the API later on with the possibility of
setting scheduler weights on the device, like in

https://blogs.igalia.com/tursulin/drm-scheduling-cgroup-controller/
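
Purely as a hypothetical illustration of why the "$subsystem/$name" prefix helps
(nothing in this series implements it), a future weight knob could reuse the same
addressing, e.g.:

  echo "drm/0000:03:00.0 100" > dev.weight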

Maarten Lankhorst (5):
  kernel/cgroup: Add "dev" memory accounting cgroup
  drm/ttm: Handle cgroup based eviction in TTM
  drm/xe: Implement cgroup for vram
  drm/amdgpu: Add cgroups implementation
  [HACK] drm/xe: Hack to test with mapped pages instead of vram.

Maxime Ripard (2):
  drm/drv: Add drmm cgroup registration for dev cgroups.
  [DISCUSSION] drm/gem: Add cgroup memory accounting

 Documentation/admin-guide/cgroup-v2.rst       |  51 +
 Documentation/core-api/cgroup.rst             |   9 +
 Documentation/core-api/index.rst              |   1 +
 Documentation/gpu/drm-compute.rst             |  54 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |   6 +
 drivers/gpu/drm/drm_drv.c                     |  32 +-
 drivers/gpu/drm/drm_gem.c                     |   4 +
 drivers/gpu/drm/drm_gem_dma_helper.c          |   4 +
 drivers/gpu/drm/ttm/tests/ttm_bo_test.c       |  18 +-
 .../gpu/drm/ttm/tests/ttm_bo_validate_test.c  |   4 +-
 drivers/gpu/drm/ttm/tests/ttm_resource_test.c |   2 +-
 drivers/gpu/drm/ttm/ttm_bo.c                  |  57 +-
 drivers/gpu/drm/ttm/ttm_resource.c            |  24 +-
 drivers/gpu/drm/xe/xe_device.c                |   4 +
 drivers/gpu/drm/xe/xe_device_types.h          |   4 +
 drivers/gpu/drm/xe/xe_ttm_sys_mgr.c           |  14 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c          |  10 +
 include/drm/drm_device.h                      |   4 +
 include/drm/drm_drv.h                         |   4 +
 include/drm/drm_gem.h                         |   2 +
 include/drm/ttm/ttm_resource.h                |  16 +-
 include/linux/cgroup_dev.h                    |  91 ++
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/page_counter.h                  |   2 +-
 init/Kconfig                                  |   7 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/dev.c                           | 893 ++++++++++++++++++
 mm/page_counter.c                             |   4 +-
 30 files changed, 1307 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/core-api/cgroup.rst
 create mode 100644 Documentation/gpu/drm-compute.rst
 create mode 100644 include/linux/cgroup_dev.h
 create mode 100644 kernel/cgroup/dev.c

Comments

Tejun Heo Oct. 23, 2024, 7:40 p.m. UTC | #1
Hello,

On Wed, Oct 23, 2024 at 09:52:53AM +0200, Maarten Lankhorst wrote:
> New submission!
> I've added documentation for each call, and integrated the renaming from
> drm cgroup to dev cgroup, based on maxime ripard's work.
> 
> Maxime has been testing this with dma-buf heaps and v4l2 too, and it seems to work.
> In the initial submission, I've decided to only add the smallest enablement possible,
> to have less chance of breaking things.
> 
> The API has been changed slightly, from "$name region.$regionname=$limit" in a file called
> dev.min/low/max to "$subsystem/$name $regionname=$limit" in a file called dev.region.min/low/max.
> 
> This hopefully allows us to perhaps extend the API later on with the possibility to
> set scheduler weights on the device, like in
> 
> https://blogs.igalia.com/tursulin/drm-scheduling-cgroup-controller/
> 
> Maarten Lankhorst (5):
>   kernel/cgroup: Add "dev" memory accounting cgroup

Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
with the devices controller from cgroup1. While cgroup1 is mostly
deprecated, the same features are provided through BPF in systemd using the
same terminologies, so this is going to be really confusing.

What happened with Tvrtko's weighted implementation? I've seen many proposed
patchsets in this area, but as far as I could see none could establish
consensus among the GPU crowd, and that's one of the reasons why nothing ever
landed. Is the aim of this patchset to establish such consensus?

If reaching consensus doesn't seem feasible in a predictable timeframe, my
suggestion is just extending the misc controller. If the only way forward
here is fragmented vendor-specific implementations, let's throw them into
the misc controller.

Thanks.
Maxime Ripard Oct. 24, 2024, 7:20 a.m. UTC | #2
Hi Tejun,

Thanks a lot for your review.

On Wed, Oct 23, 2024 at 09:40:28AM -1000, Tejun Heo wrote:
> On Wed, Oct 23, 2024 at 09:52:53AM +0200, Maarten Lankhorst wrote:
> > New submission!
> > I've added documentation for each call, and integrated the renaming from
> > drm cgroup to dev cgroup, based on maxime ripard's work.
> > 
> > Maxime has been testing this with dma-buf heaps and v4l2 too, and it seems to work.
> > In the initial submission, I've decided to only add the smallest enablement possible,
> > to have less chance of breaking things.
> > 
> > The API has been changed slightly, from "$name region.$regionname=$limit" in a file called
> > dev.min/low/max to "$subsystem/$name $regionname=$limit" in a file called dev.region.min/low/max.
> > 
> > This hopefully allows us to perhaps extend the API later on with the possibility to
> > set scheduler weights on the device, like in
> > 
> > https://blogs.igalia.com/tursulin/drm-scheduling-cgroup-controller/
> > 
> > Maarten Lankhorst (5):
> >   kernel/cgroup: Add "dev" memory accounting cgroup
> 
> Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
> with the devices controller from cgroup1. While cgroup1 is mostly
> deprecated, the same features are provided through BPF in systemd using the
> same terminologies, so this is going to be really confusing.

Yeah, I agree. We switched to dev because we want to support more than
just DRM: all DMA-able memory. We have patches adding support for v4l2
and dma-buf heaps, so using the name DRM didn't feel great either.

Do you have a better name in mind? "device memory"? "dma memory"?

> What happened with Tvrtko's weighted implementation? I've seen many proposed
> patchsets in this area but as far as I could see none could establish
> consensus among GPU crowd and that's one of the reasons why nothing ever
> landed. Is the aim of this patchset establishing such consensus?

Yeah, we have a consensus by now I think. Valve, Intel, Google, and Red
Hat have been involved in that series and we all agree on the implementation.

Tvrtko aims at a different feature set though: this one is about memory
allocation limits, Tvrtko's about scheduling.

Scheduling doesn't make much sense for things outside of DRM (and even
within DRM, only for a fraction of devices), and it's pretty much orthogonal. So
I guess you can expect another series from Tvrtko, but I don't think
they should be considered equivalent or dependent on each other.

> If reaching consensus doesn't seem feasible in a predictable timeframe, my
> suggesstion is just extending the misc controller. If the only way forward
> here is fragmented vendor(s)-specific implementations, let's throw them into
> the misc controller.

I don't think we have a fragmented implementation here, at all. The last
patch especially implements it for all devices implementing the GEM
interface in DRM, which would be around 100 drivers from various vendors.

It's marked as a discussion because we don't quite know how to plumb it
in for all drivers in the current DRM framework, but it's very much what
we want to achieve.

Maxime
Tejun Heo Oct. 24, 2024, 5:06 p.m. UTC | #3
Hello,

On Thu, Oct 24, 2024 at 09:20:43AM +0200, Maxime Ripard wrote:
...
> > Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
> > with the devices controller from cgroup1. While cgroup1 is mostly
> > deprecated, the same features are provided through BPF in systemd using the
> > same terminologies, so this is going to be really confusing.
> 
> Yeah, I agree. We switched to dev because we want to support more than
> just DRM, but all DMA-able memory. We have patches to support for v4l2
> and dma-buf heaps, so using the name DRM didn't feel great either.
> 
> Do you have a better name in mind? "device memory"? "dma memory"?

Maybe just dma (I think the term isn't used heavily anymore, so the word is
kinda open)? But, hopefully, others have better ideas.

> > What happened with Tvrtko's weighted implementation? I've seen many proposed
> > patchsets in this area but as far as I could see none could establish
> > consensus among GPU crowd and that's one of the reasons why nothing ever
> > landed. Is the aim of this patchset establishing such consensus?
> 
> Yeah, we have a consensus by now I think. Valve, Intel, Google, and Red
> Hat have been involved in that series and we all agree on the implementation.

That's great to hear.

> Tvrtko aims at a different feature set though: this one is about memory
> allocation limits, Tvrtko's about scheduling.
> 
> Scheduling doesn't make much sense for things outside of DRM (and even
> for a fraction of all DRM devices), and it's pretty much orthogonal. So
> i guess you can expect another series from Tvrtko, but I don't think
> they should be considered equivalent or dependent on each other.

Yeah, I get that this is about memory and that is about processing capacity,
so is the plan to go with separate controllers for each? Or would it be
better to present both under the same controller interface? Even if they're
going to be separate controllers, we at least want to be aligned on how
devices and their configurations are presented in the two controllers.

Thanks.
Maxime Ripard Oct. 28, 2024, 10:05 a.m. UTC | #4
On Thu, Oct 24, 2024 at 07:06:36AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 24, 2024 at 09:20:43AM +0200, Maxime Ripard wrote:
> ...
> > > Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
> > > with the devices controller from cgroup1. While cgroup1 is mostly
> > > deprecated, the same features are provided through BPF in systemd using the
> > > same terminologies, so this is going to be really confusing.
> > 
> > Yeah, I agree. We switched to dev because we want to support more than
> > just DRM, but all DMA-able memory. We have patches to support for v4l2
> > and dma-buf heaps, so using the name DRM didn't feel great either.
> > 
> > Do you have a better name in mind? "device memory"? "dma memory"?
> 
> Maybe just dma (I think the term isn't used heavily anymore, so the word is
> kinda open)? But, hopefully, others have better ideas.
> 
> > > What happened with Tvrtko's weighted implementation? I've seen many proposed
> > > patchsets in this area but as far as I could see none could establish
> > > consensus among GPU crowd and that's one of the reasons why nothing ever
> > > landed. Is the aim of this patchset establishing such consensus?
> > 
> > Yeah, we have a consensus by now I think. Valve, Intel, Google, and Red
> > Hat have been involved in that series and we all agree on the implementation.
> 
> That's great to hear.
> 
> > Tvrtko aims at a different feature set though: this one is about memory
> > allocation limits, Tvrtko's about scheduling.
> > 
> > Scheduling doesn't make much sense for things outside of DRM (and even
> > for a fraction of all DRM devices), and it's pretty much orthogonal. So
> > i guess you can expect another series from Tvrtko, but I don't think
> > they should be considered equivalent or dependent on each other.
> 
> Yeah, I get that this is about memory and that is about processing capacity,
> so the plan is going for separate controllers for each? Or would it be
> better to present both under the same controller interface? Even if they're
> going to be separate controllers, we at least want to be aligned on how
> devices and their configurations are presented in the two controllers.

It's still up in the air, I think.

My personal opinion is that only DRM (and accel) devices really care
about scheduling constraints anyway, so it wouldn't (have to) be as
generic as this one.

And if we were to call it dma, then the naming becomes a bit weird, since
DMA doesn't have much to do with scheduling.

But I guess it's just another instance of the "naming is hard" problem :)

Maxime
Johannes Weiner Oct. 29, 2024, 8:38 p.m. UTC | #5
On Mon, Oct 28, 2024 at 11:05:48AM +0100, Maxime Ripard wrote:
> On Thu, Oct 24, 2024 at 07:06:36AM -1000, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, Oct 24, 2024 at 09:20:43AM +0200, Maxime Ripard wrote:
> > ...
> > > > Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
> > > > with the devices controller from cgroup1. While cgroup1 is mostly
> > > > deprecated, the same features are provided through BPF in systemd using the
> > > > same terminologies, so this is going to be really confusing.
> > > 
> > > Yeah, I agree. We switched to dev because we want to support more than
> > > just DRM, but all DMA-able memory. We have patches to support for v4l2
> > > and dma-buf heaps, so using the name DRM didn't feel great either.
> > > 
> > > Do you have a better name in mind? "device memory"? "dma memory"?
> > 
> > Maybe just dma (I think the term isn't used heavily anymore, so the word is
> > kinda open)? But, hopefully, others have better ideas.
> > 
> > > > What happened with Tvrtko's weighted implementation? I've seen many proposed
> > > > patchsets in this area but as far as I could see none could establish
> > > > consensus among GPU crowd and that's one of the reasons why nothing ever
> > > > landed. Is the aim of this patchset establishing such consensus?
> > > 
> > > Yeah, we have a consensus by now I think. Valve, Intel, Google, and Red
> > > Hat have been involved in that series and we all agree on the implementation.
> > 
> > That's great to hear.
> > 
> > > Tvrtko aims at a different feature set though: this one is about memory
> > > allocation limits, Tvrtko's about scheduling.
> > > 
> > > Scheduling doesn't make much sense for things outside of DRM (and even
> > > for a fraction of all DRM devices), and it's pretty much orthogonal. So
> > > i guess you can expect another series from Tvrtko, but I don't think
> > > they should be considered equivalent or dependent on each other.
> > 
> > Yeah, I get that this is about memory and that is about processing capacity,
> > so the plan is going for separate controllers for each? Or would it be
> > better to present both under the same controller interface? Even if they're
> > going to be separate controllers, we at least want to be aligned on how
> > devices and their configurations are presented in the two controllers.
> 
> It's still up in the air, I think.
> 
> My personal opinion is that there's only DRM (and accel) devices that
> really care about scheduling constraints anyway, so it wouldn't (have
> to) be as generic as this one.

If they represent different resources that aren't always controlled in
conjunction, it makes sense to me to have separate controllers as well.

Especially if a merged version would have separate control files for
each resource anyway (dev.region.*, dev.weight, etc.).

> And if we would call it dma, then the naming becomes a bit weird since
> DMA doesn't have much to do with scheduling.
> 
> But I guess it's just another instance of the "naming is hard" problem :)

Yes, it would be good to have something catchy, easy on the eyes, and
vaguely familiar. devcomp(ute), devproc, devcpu, and devcycles all kind of
suck. drm and gpu seem too specific for a set that includes npus and
potentially other accelerators in the future.

I don't think we want to go full devspace & devtime, either, though.

How about dmem for this one and dpu for the other, for device
memory and device processing unit, respectively?
Maxime Ripard Nov. 6, 2024, 10:31 a.m. UTC | #6
On Tue, Oct 29, 2024 at 04:38:34PM -0400, Johannes Weiner wrote:
> On Mon, Oct 28, 2024 at 11:05:48AM +0100, Maxime Ripard wrote:
> > On Thu, Oct 24, 2024 at 07:06:36AM -1000, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, Oct 24, 2024 at 09:20:43AM +0200, Maxime Ripard wrote:
> > > ...
> > > > > Yeah, let's not use "dev" name for this. As Waiman pointed out, it conflicts
> > > > > with the devices controller from cgroup1. While cgroup1 is mostly
> > > > > deprecated, the same features are provided through BPF in systemd using the
> > > > > same terminologies, so this is going to be really confusing.
> > > > 
> > > > Yeah, I agree. We switched to dev because we want to support more than
> > > > just DRM, but all DMA-able memory. We have patches to support for v4l2
> > > > and dma-buf heaps, so using the name DRM didn't feel great either.
> > > > 
> > > > Do you have a better name in mind? "device memory"? "dma memory"?
> > > 
> > > Maybe just dma (I think the term isn't used heavily anymore, so the word is
> > > kinda open)? But, hopefully, others have better ideas.
> > > 
> > > > > What happened with Tvrtko's weighted implementation? I've seen many proposed
> > > > > patchsets in this area but as far as I could see none could establish
> > > > > consensus among GPU crowd and that's one of the reasons why nothing ever
> > > > > landed. Is the aim of this patchset establishing such consensus?
> > > > 
> > > > Yeah, we have a consensus by now I think. Valve, Intel, Google, and Red
> > > > Hat have been involved in that series and we all agree on the implementation.
> > > 
> > > That's great to hear.
> > > 
> > > > Tvrtko aims at a different feature set though: this one is about memory
> > > > allocation limits, Tvrtko's about scheduling.
> > > > 
> > > > Scheduling doesn't make much sense for things outside of DRM (and even
> > > > for a fraction of all DRM devices), and it's pretty much orthogonal. So
> > > > i guess you can expect another series from Tvrtko, but I don't think
> > > > they should be considered equivalent or dependent on each other.
> > > 
> > > Yeah, I get that this is about memory and that is about processing capacity,
> > > so the plan is going for separate controllers for each? Or would it be
> > > better to present both under the same controller interface? Even if they're
> > > going to be separate controllers, we at least want to be aligned on how
> > > devices and their configurations are presented in the two controllers.
> > 
> > It's still up in the air, I think.
> > 
> > My personal opinion is that there's only DRM (and accel) devices that
> > really care about scheduling constraints anyway, so it wouldn't (have
> > to) be as generic as this one.
> 
> If they represent different resources that aren't always controlled in
> conjunction, it makes sense to me to have separate controllers as well.
> 
> Especially if a merged version would have separate control files for
> each resource anyway (dev.region.*, dev.weight etc.)
> 
> > And if we would call it dma, then the naming becomes a bit weird since
> > DMA doesn't have much to do with scheduling.
> > 
> > But I guess it's just another instance of the "naming is hard" problem :)
> 
> Yes it would be good to have something catchy, easy on the eyes, and
> vaguely familiar. devcomp(ute), devproc, devcpu, devcycles all kind of
> suck. drm and gpu seem too specific for a set that includes npus and
> potentially other accelerators in the future.
> 
> I don't think we want to go full devspace & devtime, either, though.
> 
> How about dmem for this one, and dpu for the other one. For device
> memory and device processing unit, respectively.

dmem sounds great to me, does everyone agree?

Maxime
Tejun Heo Nov. 6, 2024, 6:20 p.m. UTC | #7
On Wed, Nov 06, 2024 at 11:31:49AM +0100, Maxime Ripard wrote:
...
> > How about dmem for this one, and dpu for the other one. For device
> > memory and device processing unit, respectively.
> 
> dmem sounds great to me, does everyone agree?

Sounds good to me.

Thanks.
Maarten Lankhorst Nov. 13, 2024, 2:58 p.m. UTC | #8
Hey,

Den 2024-11-06 kl. 19:20, skrev Tejun Heo:
> On Wed, Nov 06, 2024 at 11:31:49AM +0100, Maxime Ripard wrote:
> ...
>>> How about dmem for this one, and dpu for the other one. For device
>>> memory and device processing unit, respectively.
>>
>> dmem sounds great to me, does everyone agree?
> 
> Sounds good to me.
> 
> Thanks.
> 
Thanks for all the feedback and discussion. I checked mostly on patchwork, so
I missed the discussion here. Fortunately it's only been about naming. :)

I'm thinking of adding a 'high' knob as well, which would work similarly
to 'high' in the normal memory controller (so not calculated proportionally
like 'max', but simply allowing an allocation as long as (usage + allocated) < max).

Recursively, of course.
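
Roughly, with made-up numbers: if the limit for a region is vram0=1G and
current usage is 768M, a new 128M allocation would be fine (768M + 128M =
896M < 1G), while a new 512M allocation would push it over the limit.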

Cheers,
~Maarten
Tejun Heo Nov. 13, 2024, 6:29 p.m. UTC | #9
Hello,

On Wed, Nov 13, 2024 at 03:58:25PM +0100, Maarten Lankhorst wrote:
...
> Thanks for all feedback and discussion. I checked mostly on patchwork so I
> missed the discussion here. Fortunately it's only been about naming. :)
> 
> I'm thinking of adding a 'high' knob as well, that will work similarly to
> high in normal mem controller. (so not proportionally calculated like 'max',
> but (usage + allocated) < max = ok.
> 
> Recursively of course.

I'd be cautious about adding knobs. These being published API, it's easy to
paint oneself into a corner. I suggest starting with what's essential.

Thanks.