[RFC,v3,00/12] DRM scheduling cgroup controller

Message ID 20230112165609.1083270-1-tvrtko.ursulin@linux.intel.com

Message

Tvrtko Ursulin Jan. 12, 2023, 4:55 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

This series contains a proposal for a DRM scheduling cgroup controller which
implements weight based, hierarchical GPU time budgeting, similar in concept
to some of the existing controllers.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this to process nice, the feedback
mostly was that people wanted cgroups. So here it is - in the world of
heterogeneous computing pipelines I think it is time to do something about this
gap.

Code is not finished but should survive some light experimentation. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies on are well enough established and stable.

*) Specifically under ChromeOS, which uses cgroups to control CPU bandwidth for
   VMs based on the window focus status. It can be demonstrated how GPU
   scheduling control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict between this proposal and any efforts to implement
a memory usage based controller. The skeleton DRM cgroup controller is
deliberately a pure skeleton patch to which any further functionality can be
added with no real conflicts. [In fact, perhaps scheduling is even easier to
deal with than memory accounting.]

Structure of the series is as follows:

  1-2) Improve client ownership tracking in DRM core.
    3) Add a skeleton DRM cgroup controller with no functionality.
  4-9) Lay down some infrastructure to enable the controller.
   10) The controller itself.
11-12) i915 support for the controller.

The proposal defines a delegation of duties between three parties: the cgroup
controller, DRM core and individual drivers. Two-way communication interfaces
are then defined to enable the delegation to work.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogeneous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the tasks belonging to each group to detect the over budget condition, at which
point it invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and a 2nd API to
receive notification from the cgroup controller when the group enters or exits
the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in the best-effort manner only. There are no guarantees that the soft limits
will be respected.
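
For a rough idea of the shape of this interface, here is a sketch of what the
per-driver ops could look like. The member names below are illustrative
stand-ins, not the exact definitions from the drm_cgroup_ops patches later in
the series:

	struct drm_cgroup_ops {
		/*
		 * Report how much GPU time this client has used so far.
		 * Called from the controller's periodic scanning loop.
		 */
		u64 (*active_time_us)(struct drm_file *file_priv);

		/*
		 * Notify the client that its group went over (or came back
		 * under) budget. Drivers act on this in a best-effort
		 * manner, for example by adjusting scheduling priority.
		 */
		void (*signal_budget)(struct drm_file *file_priv,
				      u64 usage_us, u64 budget_us);
	};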

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  drm.weight
	Standard cgroup weight based control [1, 10000] used to configure the
	relative distribution of GPU time between the sibling groups (see the
	worked example below).

  drm.period_us (Most probably only a debugging aid during the RFC phase.)
	An integer representing the period with which the controller should look
	at the GPU usage by the group and potentially send the over/under budget
	signal.
	Value of zero (default) disables the soft limit checking.
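
As a worked example of the weight semantics: take two sibling groups A and B
competing over the same GPU, with drm.weight of 100 and 50 respectively. Over
one scanning period P the controller budgets A at P * 100 / (100 + 50) = 2/3
of the GPU time and B at P * 50 / (100 + 50) = 1/3, with budget unused by one
sibling propagated to the other.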

This builds upon the per client GPU utilisation work which landed recently for a
few drivers. My thinking is that, in principle, the intersection of drivers
which support both that and some sort of scheduling control, like priorities,
could also in theory support this controller.

Another really interesting angle for this controller is that it mimics the same
control method used by the CPU scheduler, that is, proportional/weight based
GPU time budgeting. This makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogeneous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes like this:

 * DRM cgroup controller implements the control files and the scanning loop
   (sketched below).
 * DRM core is required to track all DRM clients belonging to processes so it
   can answer, when asked, how much GPU time a process is using.
 * DRM core also provides a callback which the controller will call when a
   certain process is over budget.
 * Individual drivers need to implement two similar hooks, but which work for
   a single DRM client. Over budget callback and GPU utilisation query.
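
In rough pseudo-code the mechanism, with the scanning loop at its centre, looks
like the sketch below. All names here are illustrative only, not the actual
implementation from kernel/cgroup/drm.c:

	/* Illustrative pseudo-code; runs once per drm.period_us. */
	static void drmcg_scan(struct drm_cgroup *root, u64 period_us)
	{
		struct drm_cgroup *grp;

		/* 1) Sum per-process GPU time for every group in the tree. */
		for_each_descendant(grp, root)
			grp->usage_us = sum_member_process_gpu_time(grp);

		/* 2) Split the period between siblings by relative weight,
		 *    propagating budget unused by idle groups. */
		distribute_budget(root, period_us);

		/* 3) Notify groups which are over their allocation; the
		 *    drivers then act on this in a best-effort manner. */
		for_each_descendant(grp, root)
			if (grp->usage_us > grp->budget_us)
				drmcg_signal_over_budget(grp);
	}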

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost as effective as the static priority
control. I say almost because the design, where budget control depends on
periodic usage scanning, has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

There are also interesting conversations to be had around mental models for
what GPU usage means as a single number when faced with GPUs which have
different execution engines. To an extent this is similar to the multi-core and
cgroup CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance, the i915
over-budget callback will not do anything unless the client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.
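
For illustration, the shape of that i915 logic is roughly as below; the helper
names are hypothetical, not the actual code from the last two patches:

	/* Sketch of the i915 over budget handling described above. */
	static void i915_drm_client_over_budget(struct i915_drm_client *client)
	{
		/*
		 * Only throttle when the client is busy on an oversubscribed
		 * physical engine, i.e. it is actually competing with someone
		 * for the same GPU resource. Otherwise being over budget is
		 * harmless and no action is taken.
		 */
		if (!client_on_oversubscribed_engine(client))
			return;

		/* Best effort: lower the client's scheduling priority. */
		i915_lower_client_priority(client);
	}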

Example usage from within a Linux desktop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Linux distributions like Ubuntu already use cgroups heavily for
session management, and that could easily be extended with the DRM controller.

After logging into the system graphically we can enable the DRM controller
throughout the cgroups hierarchy:

echo +drm > /sys/fs/cgroup/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.subtree_control
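
If it all worked, drm will now be listed as an enabled controller for the
per-user slice, which can be double-checked with, for instance:

cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers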

Next we will open two SSH sessions, just so separate cgroups are handily created
by systemd for this experiment.

Roughly simultaneously we run the following two benchmarks in each session
respectively:

1)
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

(The only reason for vsync off here is that I struggled to find an easily
runnable and demanding enough benchmark, or a screen large enough to make even
simpler ones demanding.)

With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.

The premise here is that one of these GPU intensive benchmarks is intended to
be run by the user at lower priority. Imagine kicking off some background
compute processing and continuing to use the UI for other tasks. Hence the user
will now re-run the test by first lowering the weight control of the first
session (DRM cgroup):

1)
echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

In this case we will see that GpuTest has recorded 208fps (~18% down) and
GfxBenchmark 114fps (~19% up), demonstrating that even a very simple approach of
wiring up i915 to the DRM cgroup controller can enable external GPU scheduling
control.

* Note here that default weight is 100, so setting 50 for the background session
  is asking the controller to give it half as much GPU bandwidth.

* Also note that in the RFC stage the DRM controller itself boots in a disabled
  state and needs to be explicitly enabled by setting a scanning period such as:
  echo 1000000 | sudo tee /sys/fs/cgroup/drm.period_us.

v2:
 * Prefaced the series with some core DRM work as suggested by Christian.
 * Dropped the priority based controller for now.
 * Dropped the introspection cgroup controller file.
 * Implemented unused budget sharing/propagation.
 * Some small fixes/tweak as per review feedback and in general.

v3:
 * Dropped one upstreamed patch.
 * Logging cleanup (use DRM macros where available).

Tvrtko Ursulin (12):
  drm: Track clients by tgid and not tid
  drm: Update file owner during use
  cgroup: Add the DRM cgroup controller
  drm/cgroup: Track clients per owning process
  drm/cgroup: Allow safe external access to file_priv
  drm/cgroup: Add ability to query drm cgroup GPU time
  drm/cgroup: Add over budget signalling callback
  drm/cgroup: Only track clients which are providing drm_cgroup_ops
  cgroup/drm: Client exit hook
  cgroup/drm: Introduce weight based drm cgroup control
  drm/i915: Wire up with drm controller GPU time query
  drm/i915: Implement cgroup controller over budget throttling

 Documentation/admin-guide/cgroup-v2.rst       |  37 ++
 drivers/gpu/drm/Kconfig                       |   1 +
 drivers/gpu/drm/Makefile                      |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   6 +-
 drivers/gpu/drm/drm_auth.c                    |   3 +-
 drivers/gpu/drm/drm_cgroup.c                  | 203 +++++++
 drivers/gpu/drm/drm_debugfs.c                 |  12 +-
 drivers/gpu/drm/drm_file.c                    |  60 +-
 drivers/gpu/drm/drm_ioctl.c                   |   3 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  38 +-
 drivers/gpu/drm/i915/i915_driver.c            |  11 +
 drivers/gpu/drm/i915/i915_drm_client.c        | 209 ++++++-
 drivers/gpu/drm/i915/i915_drm_client.h        |  13 +
 drivers/gpu/drm/nouveau/nouveau_drm.c         |   5 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c           |   6 +-
 include/drm/drm_clients.h                     |  47 ++
 include/drm/drm_drv.h                         |  36 ++
 include/drm/drm_file.h                        |  17 +-
 include/linux/cgroup_drm.h                    |  13 +
 include/linux/cgroup_subsys.h                 |   4 +
 init/Kconfig                                  |   8 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/drm.c                           | 570 ++++++++++++++++++
 23 files changed, 1273 insertions(+), 31 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_cgroup.c
 create mode 100644 include/drm/drm_clients.h
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

Comments

Michal Koutný Jan. 23, 2023, 3:42 p.m. UTC | #1
Hello Tvrtko.

Interesting work.

On Thu, Jan 12, 2023 at 04:55:57PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> Because of the heterogeneous hardware and driver DRM capabilities, soft limits
> are implemented as a loose co-operative (bi-directional) interface between the
> controller and DRM core.

IIUC, this periodic scanning, calculating and applying could be partly
implemented with userspace utilities. (As you write, these limits are
best effort only, so it sounds to me such a total implementation is
unnecessary.)

I think a better approach would be to avoid the async querying and
instead require implementing explicit foo_charge_time(client, dur) API
(similar to how other controllers achieve this).
Your argument is the heterogeneity of devices -- does it mean there are
devices/drivers that can't implement such synchronous charging?

> DRM core provides an API to query per process GPU utilization and a 2nd API to
> receive notification from the cgroup controller when the group enters or exits
> the over budget condition.

The return value of foo_charge_time() would substitute such a
notification synchronously. (By extension all clients in an affected
cgroup could be notified to achieve some broader actions.)
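
(For concreteness, the kind of API meant here could look like the sketch below;
the names are illustrative only, not an existing interface:)

	/*
	 * Hypothetical synchronous charging: the driver charges GPU time as
	 * work completes and learns immediately, e.g. via a non-zero return
	 * value, whether the charged group went over budget.
	 */
	int drm_cgroup_charge_time(struct drm_file *client, u64 dur_ns);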

> Individual DRM drivers which implement the interface are expected to act on this
> in the best-effort manner only. There are no guarantees that the soft limits
> will be respected.

Back to original concern -- must all code reside in the kernel when it's
essentially advisory resource control?

>  * DRM core is required to track all DRM clients belonging to processes so it
>    can answer, when asked, how much GPU time a process is using.
>  [...]
>  * Individual drivers need to implement two similar hooks, but which work for
>    a single DRM client. Over budget callback and GPU utilisation query.

This information is eventually aggregated for each process in a cgroup.
(And the action is carried out on a single client, not a process.)
The per-process tracking seems like an additional indirection.
Could the clients be associated directly with the DRM cgroup? [1]


Regards,
Michal

[1] I understand that sending a fd of a client is a regular operation, so
    I'm not sure how cross-cg migrations would have to be handled in any
    case.
Tvrtko Ursulin Jan. 25, 2023, 6:11 p.m. UTC | #2
Hi,

On 23/01/2023 15:42, Michal Koutný wrote:
> Hello Tvrtko.
> 
> Interesting work.

Thanks!

> On Thu, Jan 12, 2023 at 04:55:57PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
>> Because of the heterogeneous hardware and driver DRM capabilities, soft limits
>> are implemented as a loose co-operative (bi-directional) interface between the
>> controller and DRM core.
> 
> IIUC, this periodic scanning, calculating and applying could be partly
> implemented with userspace utilities. (As you write, these limits are
> best effort only, so it sounds to me such a total implementation is
> unnecessary.)

I don't immediately see how you envisage the half-userspace
implementation would look in terms of what functionality/new APIs
would be provided by the kernel?

> I think a better approach would be to avoid the async querying and
> instead require implementing explicit foo_charge_time(client, dur) API
> (similar to how other controllers achieve this).
> Your argument is the heterogeneity of devices -- does it mean there are
> devices/drivers that can't implement such synchronous charging?

Problem there is to find a suitable point to charge at. If for a moment 
we limit the discussion to i915, out of the box we could have charging
happening at several thousand times per second to effectively never. 
This is to illustrate the GPU context execution dynamics which range 
from many small packets of work to multi-minute, or longer. For the 
latter to be accounted for we'd still need some periodic scanning, which 
would then perhaps go per driver. For the former we'd have thousands of 
needless updates per second.

Hence my thinking was to pay both the cost of accounting and collecting 
the usage data once per actionable event, where the latter is controlled 
by some reasonable scanning period/frequency.

In addition to that, a few DRM drivers already support GPU usage 
querying via fdinfo, so that being externally triggered, it is next to 
trivial to wire all those DRM drivers into such a common DRM cgroup
controller framework. All that every driver needs to implement on top is 
the "over budget" callback.

>> DRM core provides an API to query per process GPU utilization and a 2nd API to
>> receive notification from the cgroup controller when the group enters or exits
>> the over budget condition.
> 
> The return value of foo_charge_time() would substitute such a
> notification synchronously. (By extension all clients in an affected
> cgroup could be notified to achieve some broader actions.)

Right, it is doable in theory, but as mentioned above some rate limit
would have to be added. And the notification would still need to have 
unused budget propagation through the tree, so it wouldn't work to 
localize the action to the single cgroup (the one getting the charge).

>> Individual DRM drivers which implement the interface are expected to act on this
>> in the best-effort manner only. There are no guarantees that the soft limits
>> will be respected.
> 
> Back to original concern -- must all code reside in the kernel when it's
> essentially advisory resource control?
> 
>>   * DRM core is required to track all DRM clients belonging to processes so it
>>     can answer, when asked, how much GPU time a process is using.
>>   [...]
>>   * Individual drivers need to implement two similar hooks, but which work for
>>     a single DRM client. Over budget callback and GPU utilisation query.
> 
> This information is eventually aggregated for each process in a cgroup.
> (And the action is carried out on a single client, not a process.)
> The per-process tracking seems like an additional indirection.
> Could the clients be associated directly with the DRM cgroup? [1]

I think you could be right here - with some deeper integration with the 
cgroup subsystem this could probably be done. It would require moving 
the list of drm clients into the cgroup css state itself. Let me try and 
sketch that out in the following weeks because it would be a nice 
simplification if it indeed worked out.

Regards,

Tvrtko

> 
> 
> Regards,
> Michal
> 
> [1] I understand that sending a fd of a client is a regular operation, so
>     I'm not sure how cross-cg migrations would have to be handled in any
>     case.
Michal Koutný Jan. 26, 2023, 1 p.m. UTC | #3
On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> I don't immediately see how you envisage the half-userspace implementation
> would look in terms of what functionality/new APIs would be provided by
> the kernel?

Output:
	drm.stat (with consumed time(s))

Input:
	drm.throttle (alternatives)
	- a) writing 0,1 (in rough analogy to your proposed
	     notifications)
	- b) writing duration (in loose analogy to memory.reclaim)
	     - for how long GPU work should be backed off

A userspace agent would sit between these two; it'd do the measurement
and calculation depending on given policies (weighting, throttling) and
apply the respective controls.

(In resemblance of e.g. https://denji.github.io/cpulimit/)
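
(For illustration, such an agent could be as simple as the sketch below, where
drm.stat and drm.throttle are the hypothetical files proposed above, not an
existing ABI:)

#!/bin/sh
# Throttle a group when its GPU time consumption exceeds a per-second budget.
CG=/sys/fs/cgroup/background.slice  # hypothetical group to police
BUDGET=500000                       # allowed GPU microseconds per second
prev=$(cat "$CG/drm.stat")          # hypothetical output file
while sleep 1; do
	now=$(cat "$CG/drm.stat")
	if [ $((now - prev)) -gt "$BUDGET" ]; then
		echo 1 > "$CG/drm.throttle"  # hypothetical input file: back off
	else
		echo 0 > "$CG/drm.throttle"  # resume
	fi
	prev=$now
done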

> Problem there is to find a suitable point to charge at. If for a moment we
> limit the discussion to i915, out of the box we could have charging
> happening at several thousand times per second to effectively never. This is
> to illustrate the GPU context execution dynamics which range from many small
> packets of work to multi-minute, or longer. For the latter to be accounted
> for we'd still need some periodic scanning, which would then perhaps go per
> driver. For the former we'd have thousands of needless updates per second.
> 
> Hence my thinking was to pay both the cost of accounting and collecting the
> usage data once per actionable event, where the latter is controlled by some
> reasonable scanning period/frequency.
> 
> In addition to that, a few DRM drivers already support GPU usage querying
> via fdinfo, so that being externally triggered, it is next to trivial to
> wire all those DRM drivers into such a common DRM cgroup controller framework.
> All that every driver needs to implement on top is the "over budget"
> callback.

I'd also like to show a comparison with the CPU accounting and controller.
There is tick-based (~sampling) measurement of various components of CPU
time (task_group_account_field()). But the actual scheduling (weights)
or throttling is based on precise accounting (update_curr()).

So, if the goal is to have precise and guaranteed limits, it shouldn't
(cannot) be based on sampling. OTOH, if it must be sampling based due to
the variability of the device landscape, it could be an advisory mechanism
with a userspace component.

My 0.02€,
Michal
Tejun Heo Jan. 26, 2023, 5:04 p.m. UTC | #4
Hello,

On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
> On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> > I don't immediately see how you envisage the half-userspace implementation
> > would look in terms of what functionality/new APIs would be provided by
> > the kernel?
> 
> Output:
> 	drm.stat (with consumed time(s))
> 
> Input:
> 	drm.throttle (alternatives)
> 	- a) writing 0,1 (in rough analogy to your proposed
> 	     notifications)
> 	- b) writing duration (in loose analogy to memory.reclaim)
> 	     - for how long GPU work should be backed off
> 
> A userspace agent would sit between these two; it'd do the measurement
> and calculation depending on given policies (weighting, throttling) and
> apply the respective controls.
> 
> (In resemblance of e.g. https://denji.github.io/cpulimit/)

Yeah, things like this can be done from userspace but if we're gonna build
the infrastructure to allow that in gpu drivers and so on, I don't see why
we wouldn't add a generic in-kernel control layer if we can implement a
proper weight based control. We can of course also expose .max style
interface to allow userspace to do whatever they wanna do with it.

> > Problem there is to find a suitable point to charge at. If for a moment we
> > limit the discussion to i915, out of the box we could have charging
> > happening at several thousand times per second to effectively never. This is
> > to illustrate the GPU context execution dynamics which range from many small
> > packets of work to multi-minute, or longer. For the latter to be accounted
> > for we'd still need some periodic scanning, which would then perhaps go per
> > driver. For the former we'd have thousands of needless updates per second.
> > 
> > Hence my thinking was to pay both the cost of accounting and collecting the
> > usage data once per actionable event, where the latter is controlled by some
> > reasonable scanning period/frequency.
> > 
> > In addition to that, a few DRM drivers already support GPU usage querying
> > via fdinfo, so that being externally triggered, it is next to trivial to
> > wire all those DRM drivers into such a common DRM cgroup controller framework.
> > All that every driver needs to implement on top is the "over budget"
> > callback.
> 
> I'd also like to show a comparison with the CPU accounting and controller.
> There is tick-based (~sampling) measurement of various components of CPU
> time (task_group_account_field()). But the actual scheduling (weights)
> or throttling is based on precise accounting (update_curr()).
> 
> So, if the goal is to have precise and guaranteed limits, it shouldn't
> (cannot) be based on sampling. OTOH, if it must be sampling based due to
> the variability of the device landscape, it could be an advisory mechanism
> with a userspace component.

As for the specific control mechanism, yeah, charge based interface would be
more conventional and my suspicion is that transposing the current
implementation that way likely isn't too difficult. It just pushes "am I
over the limit?" decisions to the specific drivers with the core layer
telling them how much under/over budget they are. I'm curious what other gpu
driver folks think about the current RFC tho. Is at least AMD on board with
the approach?

Thanks.
Tvrtko Ursulin Jan. 26, 2023, 5:57 p.m. UTC | #5
Hi,

(Two replies in one, hope you will manage to navigate it.)

On 26/01/2023 17:04, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
>> On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
>>> I don't immediately see how you envisage the half-userspace implementation
>>> would look in terms of what functionality/new APIs would be provided by
>>> the kernel?
>>
>> Output:
>> 	drm.stat (with consumed time(s))
>>
>> Input:
>> 	drm.throttle (alternatives)
>> 	- a) writing 0,1 (in rough analogy to your proposed
>> 	     notifications)
>> 	- b) writing duration (in loose analogy to memory.reclaim)
>> 	     - for how long GPU work should be backed off
>>
>> A userspace agent would sit between these two; it'd do the measurement
>> and calculation depending on given policies (weighting, throttling) and
>> apply the respective controls.

Right, I wouldn't recommend drm.throttle as ABI since my idea is to
enable drivers to do as good a job as they individually can. E.g. some may
be able to be much smarter than simple throttling, or some may start off
simpler and later gain a better implementation. Some may even have
differing capability or granularity depending on the GPU model they are
driving, like in the case of i915.

So even if the RFC shows just a simple i915 implementation, the
controller itself shouldn't prevent a smarter approach (via exposed
ABI). Nor does this simple i915 implementation work equally well for
all supported GPU generations! This will be a theme common for many DRM
drivers.

Secondly, doing this in userspace would require the ability to get some 
sort of an atomic snapshot of the whole tree hierarchy to account for 
changes in layout of the tree and task migrations. Or some retry logic 
with some added ABI fields to enable it.

Even then I think the only thing we would be able to move to userspace
is the tree-walking logic, and that does not sound like much kernel code
saved to trade for the increased inefficiency.

>> (In resemblance of e.g. https://denji.github.io/cpulimit/)
> 
> Yeah, things like this can be done from userspace but if we're gonna build
> the infrastructure to allow that in gpu drivers and so on, I don't see why
> we wouldn't add a generic in-kernel control layer if we can implement a
> proper weight based control. We can of course also expose .max style
> interface to allow userspace to do whatever they wanna do with it.

Yes agreed, and to stress again, the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to
be compatible in concept with the CPU controller and also avoid baking
in the controlling method to individual drivers.

>>> Problem there is to find a suitable point to charge at. If for a moment we
>>> limit the discussion to i915, out of the box we could have charging
>>> happening at several thousand times per second to effectively never. This is
>>> to illustrate the GPU context execution dynamics which range from many small
>>> packets of work to multi-minute, or longer. For the latter to be accounted
>>> for we'd still need some periodic scanning, which would then perhaps go per
>>> driver. For the former we'd have thousands of needless updates per second.
>>>
>>> Hence my thinking was to pay both the cost of accounting and collecting the
>>> usage data once per actionable event, where the latter is controlled by some
>>> reasonable scanning period/frequency.
>>>
>>> In addition to that, a few DRM drivers already support GPU usage querying
>>> via fdinfo, so that being externally triggered, it is next to trivial to
>>> wire all those DRM drivers into such a common DRM cgroup controller framework.
>>> All that every driver needs to implement on top is the "over budget"
>>> callback.
>>
>> I'd also like to show a comparison with the CPU accounting and controller.
>> There is tick-based (~sampling) measurement of various components of CPU
>> time (task_group_account_field()). But the actual scheduling (weights)
>> or throttling is based on precise accounting (update_curr()).
>>
>> So, if the goal is to have precise and guaranteed limits, it shouldn't
>> (cannot) be based on sampling. OTOH, if it must be sampling based due to
>> the variability of the device landscape, it could be an advisory mechanism
>> with a userspace component.

I don't think precise and guaranteed limits are feasible given the
heterogeneous nature of DRM driver capabilities, but I also don't think
sticking a userspace component in the middle is the way to go.

> As for the specific control mechanism, yeah, charge based interface would be
> more conventional and my suspicion is that transposing the current
> implementation that way likely isn't too difficult. It just pushes "am I
> over the limit?" decisions to the specific drivers with the core layer
> telling them how much under/over budget they are. I'm curious what other 

As I have tried to explain in my previous reply, I don't think real time
charging is feasible, because the frequency of charging events can be both
too high and too low. Too high in that it doesn't bring value apart from
increased processing time, since it is not useful to send out
notifications at the same rate; and too low in the sense that some sort
of periodic query would then be needed in every driver implementation to
enable all classes of GPU clients to be properly handled.

I just don't see any positives to the charging approach in the DRM 
landscape, but for sure see some negatives. (If we ignore the positive 
of it being a more typical approach, but then I think that is not enough 
to outweigh the negatives.)

> gpu
> driver folks think about the current RFC tho. Is at least AMD on board with
> the approach?

Yes I am keenly awaiting comments from the DRM colleagues as well.

Regards,

Tvrtko

P.S. Note that Michal's idea to simplify client tracking is on my TODO
list. If that works out, some patches, and the series itself actually, would
become even simpler.
Tvrtko Ursulin Jan. 26, 2023, 6:14 p.m. UTC | #6
On 26/01/2023 17:57, Tvrtko Ursulin wrote:
> On 26/01/2023 17:04, Tejun Heo wrote:

>> driver folks think about the current RFC tho. Is at least AMD on board 
>> with
>> the approach?
> 
> Yes I am keenly awaiting comments from the DRM colleagues as well.

Forgot to mention one thing on this point which may interest AMD.

Some time ago I tested the super primitive "throttling via lowering the 
scheduling priority" on a GuC based i915 GPU, so only three supported 
priority levels, and FWIW it can be somewhat effective.

It certainly was effective for my main use case which is "run this GPU 
workload in the background while I use the GPU for something else".

The actual test was along the lines of running a GPU hog in parallel to 
an interactive client which can measure dropped frames.

With equal drm.weights the interactive client was seeing ~10 (i915 
pre-GuC) or ~27 (i915 GuC) dropped frames per second (60 fps target). 
With the GPU hog drm.weight lowered to 1:10 that dropped to ~3 dropped 
frames per second (all 3 before the over budget condition was noticed by 
the controller).

Main take here is that improved user experience is possible even with 
this primitive throttling method and even on GPUs which support only 
three scheduling priority levels.

Although the main thing still is that individual drivers are completely free
to improve their method of handling the over budget signal. Nothing
in the controller itself should be precluding that.

Regards,

Tvrtko
Michal Koutný Jan. 27, 2023, 10:04 a.m. UTC | #7
On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> So even if the RFC shows just a simple i915 implementation, the controller
> itself shouldn't prevent a smarter approach (via exposed ABI).

scan/query + over budget notification is IMO limited in guarantees.

> [...]
> Yes agreed, and to stress again, the ABI as proposed does not preclude
> changing from scanning to charging or whatever. The idea was for it to be
> compatible in concept with the CPU controller and also avoid baking in the
> controlling method to individual drivers.
> [...]

But I submit to your point of rather not exposing this via cgroup API
for possible future refinements.

> Secondly, doing this in userspace would require the ability to get some sort
> of an atomic snapshot of the whole tree hierarchy to account for changes in
> layout of the tree and task migrations. Or some retry logic with some added
> ABI fields to enable it.

Note that the proposed implementation is susceptible to miscount due to
concurrent tree modifications and task migrations too (scanning may not
converge under frequent cgroup layout modifications, and migrating tasks
may be summed 0 or >1 times). While in-kernel implementation may assure
the snapshot view, it'd come at cost. (Read: since the mechanism isn't
precise anyway, I don't suggest a fully synchronized scanning.)

Regards,
Michal
Tvrtko Ursulin Jan. 27, 2023, 11:40 a.m. UTC | #8
On 27/01/2023 10:04, Michal Koutný wrote:
> On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
>> So even if the RFC shows just a simple i915 implementation, the controller
>> itself shouldn't prevent a smarter approach (via exposed ABI).
> 
> scan/query + over budget notification is IMO limited in guarantees.

It is, yes. I tried to stress that it is not a hard guarantee in any
shape or form and that the "quality" of adhering to the allocated
budget will depend on individual hw and sw capabilities.

But it is what I believe is the best approach given a) how different in 
scheduling capability GPU drivers are and b) the very fact there isn't a 
central scheduling entity as opposed to the CPU side of things.

It is just not possible to do a hard guarantee system. GPUs do not
preempt as nicely and easily as CPUs do. And the frequency of
context switches varies widely from too fast to "never", so again,
charging would have several problems to overcome which would make the
whole setup IMHO pointless.

And not only do some GPUs not preempt nicely, some even can't do any of
this, period. Even if we stay within the lineage of hardware
supported by only i915, we have three distinct categories: 1) can't do 
any of this, 2a) can do fine grained priority based scheduling with 
reasonable preemption capability, 2b) ditto but without reasonable 
preemption capability, and 3) like 2a) and 2b) but with the scheduler in 
the firmware and currently supporting coarse priority based scheduling.

Shall I also mention that a single cgroup can contain multiple GPU 
clients, all using different GPUs with a different mix of the above 
listed challenges?

The main point is, should someone prove me wrong and come up with a smarter way
way at some point in the future, then "drm.weight" as an ABI remains 
compatible and the improvement can happen completely under the hood. In 
the mean time users get external control, and _some_ ability to improve 
the user experience with the scenarios such as I described yesterday.

>> [...]
>> Yes agreed, and to stress again, the ABI as proposed does not preclude
>> changing from scanning to charging or whatever. The idea was for it to be
>> compatible in concept with the CPU controller and also avoid baking in the
>> controlling method to individual drivers.
>> [...]
> 
> But I submit to your point of rather not exposing this via cgroup API
> for possible future refinements.

Ack.

>> Secondly, doing this in userspace would require the ability to get some sort
>> of an atomic snapshot of the whole tree hierarchy to account for changes in
>> layout of the tree and task migrations. Or some retry logic with some added
>> ABI fields to enable it.
> 
> Note that the proposed implementation is susceptible to miscount due to
> concurrent tree modifications and task migrations too (scanning may not
> converge under frequent cgroup layout modifications, and migrating tasks
> may be summed 0 or >1 times). While in-kernel implementation may assure
> the snapshot view, it'd come at cost. (Read: since the mechanism isn't
> precise anyway, I don't suggest a fully synchronized scanning.)

The part that scanning may not converge in my _current implementation_ 
is true. For instance if clients would be constantly coming and going, 
for that I took a shortcut of not bothering to accumulate usage on 
process/client exit, and I just wait for a stable two periods to look at 
the current state. I reckon this is possibly okay for the real world.

Cgroup tree hierarchy modifications being the reason for not converging
can also happen, but I thought I could hand wave that away as not a realistic
scenario. Perhaps I am not imaginative enough?

Under or over-accounting for migrating tasks I don't think can happen 
since I am explicitly handling that.

Regards,

Tvrtko
Michal Koutný Jan. 27, 2023, 1 p.m. UTC | #9
On Fri, Jan 27, 2023 at 11:40:58AM +0000, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> The main point is, should someone prove me wrong and come up with a smarter way
> at some point in the future, then "drm.weight" as an ABI remains compatible
> and the improvement can happen completely under the hood. In the mean time
> users get external control, and _some_ ability to improve the user
> experience with the scenarios such as I described yesterday.

I'm on board now.

(I've done a mental switch of likening this rather to existing IO
control (higher variance of quanta size, worse preemption, limited
effect) than CPU control.)


> Cgroup tree hierarchy modifications being the reason for not converging can
> also happen, but I thought I can hand wave that as not a realistic scenario.
> Perhaps I am not imaginative enough?

My suggestion: simply skip offlined drmcgs instead of restarting the whole
iteration. (A respawning cgroup-wrapped job or an intentionally adverse
respawner could, in my imagination, cause that.)
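
(Roughly, as a sketch:)

	/* Sketch: skip offlined groups during the scan instead of restarting
	 * the whole tree walk when the layout changes underneath us. */
	rcu_read_lock();
	css_for_each_descendant_pre(pos, &root->css) {
		if (!(pos->flags & CSS_ONLINE))
			continue;
		/* ...accumulate this group's GPU time... */
	}
	rcu_read_unlock();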

> Under or over-accounting for migrating tasks I don't think can happen since
> I am explicitly handling that.

I'll reply to the patch for better context...

Regards,
Michal