[RFC,0/4] uapi, drm: Add and implement RLIMIT_GPUPRIO

Message ID: 20230403194058.25958-1-joshua@froggi.es
From: Joshua Ashton <joshua@froggi.es>
To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Subject: [RFC PATCH 0/4] uapi, drm: Add and implement RLIMIT_GPUPRIO
Date: Mon, 3 Apr 2023 20:40:54 +0100

Joshua Ashton April 3, 2023, 7:40 p.m. UTC

Hello all!

I would like to propose a new API for allowing processes to control
the priority of GPU queues similar to RLIMIT_NICE/RLIMIT_RTPRIO.

The main reason for this is for compositors such as Gamescope and
SteamVR vrcompositor to be able to create realtime async compute
queues on AMD without the need of CAP_SYS_NICE.

The current situation is bad for a few reasons, one being that in order
to setcap the executable, typically one must run as root, which involves
a pretty high privilege escalation in order to achieve one
small feat: a realtime async compute queue for VR or a compositor.
The executable cannot be setcap'ed inside a
container, nor can the setcap'ed executable be run in a container with
NO_NEW_PRIVS.

I go into more detail in the description in
`uapi: Add RLIMIT_GPUPRIO`.

My initial proposal here is to add a new RLIMIT, `RLIMIT_GPUPRIO`,
which seems to make most initial sense to me to solve the problem.
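
As a rough illustration of the analogy (not taken from the posted patches),
the sketch below shows how the existing RLIMIT_NICE value maps to a nice
floor, as documented in getrlimit(2), next to a hypothetical check for a GPU
priority limit. The function `gpuprio_request_allowed()` and its "requested
must not exceed the limit, unless CAP_SYS_NICE is held" semantics are
assumptions made for this example only.

```c
#include <stdbool.h>
#include <sys/resource.h>

/*
 * Existing behaviour, per getrlimit(2): RLIMIT_NICE stores 20 - nice,
 * so the lowest nice value a process may set is 20 - rlim_cur.
 */
int nice_floor_from_rlimit(rlim_t rlim_cur)
{
	return 20 - (int)rlim_cur;
}

/*
 * HYPOTHETICAL: the semantics below are an assumption for this sketch,
 * not the logic of the posted series. A request is allowed if it does
 * not exceed the per-process limit, or if the caller already holds
 * CAP_SYS_NICE (the existing gate).
 */
bool gpuprio_request_allowed(unsigned int requested, unsigned int limit,
			     bool has_cap_sys_nice)
{
	return has_cap_sys_nice || requested <= limit;
}
```

With a check of this shape, an unprivileged compositor whose limit had been
raised by a session manager could request an elevated queue priority without
the executable itself carrying a file capability.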

I am definitely not set on this being the best formulation, however,
or on whether this should be linked to DRM (in terms of its scheduler
priority enum/definitions) in any way, and would really like other
people's opinions across the stack on this.

One initial concern is that potentially this RLIMIT could outlive
the lifespan of DRM. It sounds crazy saying it right now, but it is something
that definitely popped into my mind when touching `resource.h`. :-)

Anyway, please let me know what you think!
Definitely open to any feedback and advice you may have. :D

Thanks!
 - Joshie

Joshua Ashton (4):
  drm/scheduler: Add DRM_SCHED_PRIORITY_VERY_HIGH
  drm/scheduler: Split out drm_sched_priority to own file
  uapi: Add RLIMIT_GPUPRIO
  drm/amd/amdgpu: Check RLIMIT_GPUPRIO in priority permissions

 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 13 ++++++--
 drivers/gpu/drm/msm/msm_gpu.h           |  2 +-
 fs/proc/base.c                          |  1 +
 include/asm-generic/resource.h          |  3 +-
 include/drm/drm_sched_priority.h        | 41 +++++++++++++++++++++++++
 include/drm/gpu_scheduler.h             | 14 +--------
 include/uapi/asm-generic/resource.h     |  3 +-
 7 files changed, 58 insertions(+), 19 deletions(-)
 create mode 100644 include/drm/drm_sched_priority.h

Christian König April 3, 2023, 7:54 p.m. UTC | #1

Am 03.04.23 um 21:40 schrieb Joshua Ashton:
> [SNIP]
> Anyway, please let me know what you think!
> Definitely open to any feedback and advice you may have. :D

Well the basic problem is that higher priority queues can be used to 
starve low priority queues.

This starvation in turn is very, very bad for memory management, since the 
dma_fences the GPU scheduler deals with have very strong restrictions.

Even exposing this under CAP_SYS_NICE is questionable, so we will most 
likely have to NAK this.

Regards,
Christian.

Joshua Ashton April 3, 2023, 8:15 p.m. UTC | #2

On 4/3/23 20:54, Christian König wrote:
> Am 03.04.23 um 21:40 schrieb Joshua Ashton:
>> [SNIP]
>> Anyway, please let me know what you think!
>> Definitely open to any feedback and advice you may have. :D
> 
> Well the basic problem is that higher priority queues can be used to 
> starve low priority queues.
> 
> This starvation in turn is very very bad for memory management since the 
> dma_fence's the GPU scheduler deals with have very strong restrictions.
> 
> Even exposing this under CAP_SYS_NICE is questionable, so we will most 
> likely have to NAK this.

This is already exposed with CAP_SYS_NICE and is relied on by SteamVR 
for async reprojection and Gamescope's composite path on Steam Deck.

Having a high priority async compute queue is really really important 
and advantageous for these tasks.

The majority of use cases for something like this are going to be 
compositors, which do a really tiny amount of work per frame but are 
incredibly latency-dependent (as they depend on latching onto buffers 
just before vblank to do their work).

Starving and surpassing work on other queues is kind of the entire 
point. Gamescope and SteamVR do it on ACE as well so GFX work can run 
alongside it.

- Joshie

Christian König April 4, 2023, 8:50 a.m. UTC | #3

Adding a bunch of people who have been involved in this before.

Am 03.04.23 um 22:15 schrieb Joshua Ashton:
> On 4/3/23 20:54, Christian König wrote:
>> Am 03.04.23 um 21:40 schrieb Joshua Ashton:
>>> [SNIP]
>>> Anyway, please let me know what you think!
>>> Definitely open to any feedback and advice you may have. :D
>>
>> Well the basic problem is that higher priority queues can be used to 
>> starve low priority queues.
>>
>> This starvation in turn is very very bad for memory management since 
>> the dma_fence's the GPU scheduler deals with have very strong 
>> restrictions.
>>
>> Even exposing this under CAP_SYS_NICE is questionable, so we will 
>> most likely have to NAK this.
>
> This is already exposed with CAP_SYS_NICE and is relied on by SteamVR 
> for async reprojection and Gamescope's composite path on Steam Deck.

Yeah, I know, I was the one who designed that :)

>
> Having a high priority async compute queue is really really important 
> and advantageous for these tasks.
>
> The majority of usecases for something like this is going to be a 
> compositor which does some really tiny amount of work per-frame but is 
> incredibly latency dependent (as it depends on latching onto buffers 
> just before vblank to do it's work)
>
> Starving and surpassing work on other queues is kind of the entire 
> point. Gamescope and SteamVR do it on ACE as well so GFX work can run 
> alongside it.

Yes, unfortunately exactly that.

The problem is that our memory management is designed around the idea 
that submissions to the hardware are guaranteed to finish at some point 
in the future.

Once we have functionality that allows the time some work needs to 
finish on the hardware to be extended indefinitely, we have 
a major problem at hand.

What we could do is to make the GPU scheduler more clever and make sure 
that while higher priority submissions get precedence and can even 
preempt low priority submissions we still guarantee some forward 
progress for everybody.
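
One possible shape for such a guarantee, shown purely as a toy model and not
as DRM scheduler code: strict priority selection with a starvation cap. The
names `toy_queue`, `toy_pick_queue()` and the `STARVE_LIMIT` value are
hypothetical, and the sketch ignores hardware preemption and the dma_fence
rules that make the real problem hard.

```c
#define NUM_PRIO 3       /* 0 = lowest, NUM_PRIO - 1 = highest */
#define STARVE_LIMIT 4   /* hypothetical: max consecutive times bypassed */

struct toy_queue {
	int pending;     /* jobs waiting in this queue */
	int skipped;     /* picks that bypassed it while it had work */
};

/*
 * Pick the next queue to run. Normally the highest-priority queue with
 * pending work wins, but a queue that has been bypassed STARVE_LIMIT
 * times is served first, so it cannot be starved forever.
 * Returns the queue index, or -1 if nothing is pending.
 */
int toy_pick_queue(struct toy_queue q[NUM_PRIO])
{
	int pick = -1;

	/* First, look for a starved queue (lowest priority first). */
	for (int i = 0; i < NUM_PRIO; i++) {
		if (q[i].pending > 0 && q[i].skipped >= STARVE_LIMIT) {
			pick = i;
			break;
		}
	}

	/* Otherwise take the highest priority with pending work. */
	if (pick < 0) {
		for (int i = NUM_PRIO - 1; i >= 0; i--) {
			if (q[i].pending > 0) {
				pick = i;
				break;
			}
		}
	}
	if (pick < 0)
		return -1;

	/* Account for the pick: bypassed queues age, the winner resets. */
	for (int i = 0; i < NUM_PRIO; i++) {
		if (i == pick) {
			q[i].pending--;
			q[i].skipped = 0;
		} else if (q[i].pending > 0) {
			q[i].skipped++;
		}
	}
	return pick;
}
```

In this model a busy high-priority queue still wins four picks out of five,
while the low-priority queue is guaranteed a slot after at most
`STARVE_LIMIT` bypasses.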

Luben has been looking into a similar problem internally at AMD as well; 
maybe he has some idea here, but I doubt that the solution will be simple.

Regards,
Christian.

Tvrtko Ursulin April 4, 2023, 10:45 a.m. UTC | #4

Hi,

On 03/04/2023 20:40, Joshua Ashton wrote:
> [SNIP]
> Anyway, please let me know what you think!
> Definitely open to any feedback and advice you may have. :D

Interesting! I have tried to solve a similar problem twice in the past already.

First time I was proposing to tie nice to DRM scheduling priority [1] - if the latter has been left at default - drawing the analogy with the nice+ionice handling. That was rejected and I was nudged towards the cgroups route.

So with that second attempt I implemented a hierarchical opaque drm.priority cgroup controller [2]. I think it would allow you to solve your use case too by placing your compositor in a cgroup with an elevated priority level.

Implementation wise in my proposal it was left to individual drivers to "meld" the opaque cgroup drm.priority with the driver specific priority concept.

That wasn't too popular either, with the feedback (AFAIR) being that priority is too subsystem-specific a concept.

Finally I was left with a weight based drm cgroup controller, exactly following the controls of the CPU and IO ones, but with much looser runtime guarantees. [3]
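
To make the weight-based idea concrete, here is a minimal sketch of
proportional sharing as CPU and IO cgroup weights are usually interpreted:
each client's nominal share is its weight divided by the sum of all weights.
The helper `toy_share_permille()` is hypothetical and only illustrates the
arithmetic; it is not the controller from [3], which has to account GPU time
after the fact and therefore gives much looser guarantees.

```c
#include <stddef.h>

/*
 * Nominal share of client `idx`, in parts per thousand, given
 * cgroup-style weights. Returns 0 if all weights are zero.
 */
unsigned int toy_share_permille(const unsigned int *weights, size_t n,
				size_t idx)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < n; i++)
		total += weights[i];
	if (total == 0)
		return 0;

	return (unsigned int)((1000ULL * weights[idx]) / total);
}
```

For weights of 100, 100 and 200, the third client is entitled to half of the
contended resource and the other two to a quarter each.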

I don't think this last one works for your use case, at least not in the current state of DRM scheduling capability, where the implementation is a "bit" too reactive for realtime.

Depending on how the discussion around your rlimit proposal goes, perhaps one alternative could be to go the cgroup route and add an attribute like drm.realtime. That perhaps sounds abstract and generic enough to be passable. Built as a simplification of [2] it wouldn't be too complicated.

On the actual proposal of RLIMIT_GPUPRIO...

The name would be problematic since we have generic hw accelerators (not just GPUs) under the DRM subsystem. Perhaps RLIMIT_DRMPRIO would be better, but I think you will need to copy some more mailing lists and people on that one, because I can imagine one or two more fundamental questions this opens up, as you have alluded to in your cover letter as well.

Regards,

Tvrtko

[1] https://lore.kernel.org/dri-devel/20220407152806.3387898-1-tvrtko.ursulin@linux.intel.com/T/
[2] https://lore.kernel.org/lkml/20221019173254.3361334-4-tvrtko.ursulin@linux.intel.com/T/#u
[3] https://lore.kernel.org/lkml/20230314141904.1210824-1-tvrtko.ursulin@linux.intel.com/

Luben Tuikov April 5, 2023, 12:13 a.m. UTC | #5

Hi!

On 2023-04-04 04:50, Christian König wrote:
> Adding a bunch of people who have been involved in this before.
> 
> Am 03.04.23 um 22:15 schrieb Joshua Ashton:
>> On 4/3/23 20:54, Christian König wrote:
>>> Am 03.04.23 um 21:40 schrieb Joshua Ashton:
>>>> [SNIP]
>>>> Anyway, please let me know what you think!
>>>> Definitely open to any feedback and advice you may have. :D
>>>
>>> Well the basic problem is that higher priority queues can be used to 
>>> starve low priority queues.
>>>
>>> This starvation in turn is very very bad for memory management since 
>>> the dma_fence's the GPU scheduler deals with have very strong 
>>> restrictions.
>>>
>>> Even exposing this under CAP_SYS_NICE is questionable, so we will 
>>> most likely have to NAK this.
>>
>> This is already exposed with CAP_SYS_NICE and is relied on by SteamVR 
>> for async reprojection and Gamescope's composite path on Steam Deck.
> 
> Yeah, I know I was the one who designed that :)
> 
>>
>> Having a high priority async compute queue is really really important 
>> and advantageous for these tasks.
>>
>> The majority of usecases for something like this is going to be a 
>> compositor which does some really tiny amount of work per-frame but is 
>> incredibly latency dependent (as it depends on latching onto buffers 
>> just before vblank to do it's work)

There seems to be a dependency here. Is it possible to express this
dependency so that this work is done on vblank, then whoever needs
this, can latch onto vblank and get scheduled and completed before the vblank?

The problem generally is "We need to do some work B in order to satisfy
some condition in work A. Let's raise the ``priority'' of work B so that
if A needs it, when it needs it, it is ready." Or something to that effect.

The system would be much more responsive and run optimally, if such
dependencies are expressed directly, as opposed to trying to game
the scheduler and add more and more priorities, one on top of the other,
every so often.

It's okay to have priorities when tasks are independent and unrelated. But
when they do depend on each other directly, or indirectly (as in when memory
allocation or freeing is concerned), thus creating priority inversion,
then the best scheduler is the fair, oldest-ready-first scheduling, which
is the default GPU scheduler in DRM at the moment (for the last few months).
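
As a toy model of oldest-ready-first selection (a sketch only; the real
scheduler's data structures are not reproduced here, and `toy_entity` and
`toy_pick_oldest_ready()` are hypothetical names), the picker below ignores
priority entirely and runs whichever ready entity has been waiting longest:

```c
#include <stddef.h>
#include <stdint.h>

struct toy_entity {
	int ready;              /* non-zero if the entity has a runnable job */
	uint64_t oldest_job_ts; /* submission time of its oldest pending job */
};

/*
 * Return the index of the ready entity whose oldest job was submitted
 * first, or -1 if no entity is ready. Ties go to the lower index.
 */
int toy_pick_oldest_ready(const struct toy_entity *e, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!e[i].ready)
			continue;
		if (best < 0 || e[i].oldest_job_ts < e[best].oldest_job_ts)
			best = (int)i;
	}
	return best;
}
```

Because selection depends only on submission time, no entity with pending
work can be bypassed indefinitely, which is the property that matters when
jobs depend on each other.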

>> Starving and surpassing work on other queues is kind of the entire 
>> point. Gamescope and SteamVR do it on ACE as well so GFX work can run 
>> alongside it.

Are there no dependencies between them?

I mean if they're independent, we already have run queues with
different priorities. But if they're dependent, perhaps
we can express this explicitly so that we don't starve
other tasks/queues...

Regards,
Luben

Daniel Vetter April 5, 2023, 8:28 a.m. UTC | #6

On Tue, 4 Apr 2023 at 12:45, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> Hi,
>
> On 03/04/2023 20:40, Joshua Ashton wrote:
> > [SNIP]
> > Anyway, please let me know what you think!
> > Definitely open to any feedback and advice you may have. :D
>
> Interesting! I tried to solved the similar problem two times in the past already.
>
> First time I was proposing to tie nice to DRM scheduling priority [1] - if the latter has been left at default - drawing the analogy with the nice+ionice handling. That was rejected and I was nudged towards the cgroups route.
>
> So with that second attempt I implemented a hierarchical opaque drm.priority cgroup controller [2]. I think it would allow you to solve your use case too by placing your compositor in a cgroup with an elevated priority level.
>
> Implementation wise in my proposal it was left to individual drivers to "meld" the opaque cgroup drm.priority with the driver specific priority concept.
>
> That too wasn't too popular with the feedback (AFAIR) that the priority is a too subsystem specific concept.
>
> Finally I was left with a weight based drm cgroup controller, exactly following the controls of the CPU and IO ones, but with much looser runtime guarantees. [3]
>
> I don't think this last one works for your use case, at least not at the current state for drm scheduling capability, where the implementation is a "bit" too reactive for realtime.
>
> Depending on how the discussion around your rlimit proposal goes, perhaps one alternative could be to go the cgroup route and add an attribute like drm.realtime. That perhaps sounds abstract and generic enough to be passable. Built as a simplification of [2] it wouldn't be too complicated.
>
> On the actual proposal of RLIMIT_GPUPRIO...
>
> The name would be problematic since we have generic hw accelerators (not just GPUs) under the DRM subsystem. Perhaps RLIMIT_DRMPRIO would be better but I think you will need to copy some more mailing lists and people on that one. Because I can imagine one or two more fundamental questions this opens up, as you have eluded in your cover letter as well.

So I don't want to get into the bikeshed, I think Tvrtko summarized
pretty well that this is a hard problem with lots of attempts (I think
some more from amd too). I think what we need are two pieces here
really:
- A solid summary of all the previous attempts from everyone in this
space of trying to manage gpu compute resources (all the various
cgroup attempts, sched priority), listing the pros/cons. There's
also the fdinfo stuff just for reporting gpu usage which blew up kinda
badly and didn't have much discussion among all the stakeholders.
- Everyone on cc who's doing new drivers using drm/sched, or using
that currently (which I think is everyone really). So that's like
etnaviv, lima, amd, intel with the new xe, probably the new nouveau
driver too, panfrost, asahi. Please cc everyone.

Unless we do have some actual rough consensus in this space across all
stakeholders I think all we'll achieve is just yet another rfc that
goes nowhere. Or maybe something like the minimal fdinfo stuff
(minimal I guess to avoid wider discussion) which then blew up because
it wasn't thought out well enough.

Adding at least some of the people who probably should be cc'ed on
this. Please add more.

Cheers, Daniel


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Tvrtko Ursulin April 5, 2023, 9:10 a.m. UTC | #7

On 05/04/2023 09:28, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 12:45, Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>> [SNIP]
> 
> So I don't want to get into the bikeshed, I think Tvrtko summarized
> pretty well that this is a hard problem with lots of attempts (I think
> some more from amd too). I think what we need are two pieces here
> really:
> - A solid summary of all the previous attempts from everyone in this
> space of trying to manage gpu compute resources (all the various
> cgroup attempts, sched priority), listening the pros/cons. There's
> also the fdinfo stuff just for reporting gpu usage which blew up kinda
> badly and didn't have much discussion among all the stakeholders.
> - Everyone on cc who's doing new drivers using drm/sched (which I
> think is everyone really, or using that currently. So that's like
> etnaviv, lima, amd, intel with the new xe, probably new nouveau driver
> too, amd ofc, panfrost, asahi. Please cc everyone.
> 
> Unless we do have some actual rough consens in this space across all
> stakeholders I think all we'll achieve is just yet another rfc that
> goes nowhere. Or maybe something like the minimal fdinfo stuff
> (minimal I guess to avoid wider discussion) which then blew up because
> it wasn't thought out well enough.

On the particular point of how fdinfo allegedly blew up - are you referring 
to client usage stats? If so, this would be the first time I have heard about 
any problems in that space, which would be "a bit" surprising given it's 
the thing I drove standardisation of. All I have heard were positive 
comments: both "works for us" from driver implementors and positives 
from the users.

Regards,

Tvrtko

Daniel Vetter April 5, 2023, 9:13 a.m. UTC | #8

On Wed, 5 Apr 2023 at 11:11, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 05/04/2023 09:28, Daniel Vetter wrote:
> > On Tue, 4 Apr 2023 at 12:45, Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> >> [SNIP]
> >
> > So I don't want to get into the bikeshed, I think Tvrtko summarized
> > pretty well that this is a hard problem with lots of attempts (I think
> > some more from amd too). I think what we need are two pieces here
> > really:
> > - A solid summary of all the previous attempts from everyone in this
> > space of trying to manage gpu compute resources (all the various
> > cgroup attempts, sched priority), listening the pros/cons. There's
> > also the fdinfo stuff just for reporting gpu usage which blew up kinda
> > badly and didn't have much discussion among all the stakeholders.
> > - Everyone on cc who's doing new drivers using drm/sched (which I
> > think is everyone really, or using that currently. So that's like
> > etnaviv, lima, amd, intel with the new xe, probably new nouveau driver
> > too, amd ofc, panfrost, asahi. Please cc everyone.
> >
> > Unless we do have some actual rough consens in this space across all
> > stakeholders I think all we'll achieve is just yet another rfc that
> > goes nowhere. Or maybe something like the minimal fdinfo stuff
> > (minimal I guess to avoid wider discussion) which then blew up because
> > it wasn't thought out well enough.
>
> On the particular point how fdinfo allegedly blew up - are you referring
> to client usage stats? If so this would be the first time I hear about
> any problems in that space. Which would be "a bit" surprising given it's
> the thing I drove standardisation of. All I heard were positive
> comments. Both "works for us" from driver implementors and positives
> from the users.

The drm/sched implementation blew up. Not the overall spec or the i915
implementation. See the reverts in -rc5 and drm-misc-next.

I think a tad more coordination and maybe more shared code for
drm/sched using drivers is probably what we want for this. Or at least a
bit more cross-driver collaboration than here, where one side reverts
while the other pushes more patches.
-Daniel
