[RFC,v2,0/5] Waitboost drm syncobj waits

Message-Id: <20230210130647.580135-1-tvrtko.ursulin@linux.intel.com> (mailing list archive)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Date: Fri, 10 Feb 2023 13:06:42 +0000
Subject: [Intel-gfx] [RFC v2 0/5] Waitboost drm syncobj waits

Tvrtko Ursulin Feb. 10, 2023, 1:06 p.m. UTC

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

In i915 we have this concept of "wait boosting" where we give a priority boost
for instance to fences which are actively waited upon from userspace. This has
its pros and cons and can certainly be discussed at length. However, the fact is
some workloads really like it.

Problem is that with the arrival of drm syncobj and a new userspace waiting
entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
this mini series really (really) quickly to see if some discussion can be had.

It adds a concept of "wait count" to dma fence, which is incremented for every
explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
dma_fence_add_callback but from explicit/userspace wait paths).

Individual drivers can then inspect this via dma_fence_wait_count() and decide
to wait boost the waits on such fences.
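
As a rough illustration of the wait count idea, here is a minimal userspace model in C. It is not code from the series: `toy_fence` and all `toy_*` helpers are made-up names, and the real `struct dma_fence` handles locking and callbacks that this sketch leaves out. It only shows the accounting: explicit waiters bump a counter, an already signalled fence is not counted, and a driver can look at the counter when deciding whether to boost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct dma_fence; only the fields this sketch needs. */
struct toy_fence {
	atomic_int wait_count;	/* explicit (userspace-originated) waiters */
	bool signaled;
};

/* Explicit wait path: count the waiter unless the fence already signalled. */
static bool toy_fence_add_wait(struct toy_fence *f)
{
	if (f->signaled)
		return false;	/* nothing to wait for, so nothing to count */
	atomic_fetch_add(&f->wait_count, 1);
	return true;
}

/* Called when an explicit waiter goes away. */
static void toy_fence_remove_wait(struct toy_fence *f)
{
	int old = atomic_fetch_sub(&f->wait_count, 1);

	assert(old > 0);	/* an underflow would mean unbalanced accounting */
	(void)old;
}

/* Models the dma_fence_wait_count() accessor named in the cover letter. */
static int toy_fence_wait_count(struct toy_fence *f)
{
	return atomic_load(&f->wait_count);
}

/* Driver-side policy: boost only fences somebody is explicitly waiting on. */
static bool toy_driver_should_boost(struct toy_fence *f)
{
	return !f->signaled && toy_fence_wait_count(f) > 0;
}
```

What a driver does with a non-zero count is deliberately left as policy here; the model only makes the signal available.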

Again, quickly put together and smoke tested only - no guarantees whatsoever and
I will rely on interested parties to test and report if it even works or how
well.

v2:
 * Small fixups based on CI feedback:
    * Handle decrement correctly for already signalled case while adding callback.
    * Remove i915 assert which was making sure struct i915_request does not grow.
 * Split out the i915 patch into three separate functional changes.

Tvrtko Ursulin (5):
  dma-fence: Track explicit waiters
  drm/syncobj: Mark syncobj waits as external waiters
  drm/i915: Waitboost external waits
  drm/i915: Mark waits as explicit
  drm/i915: Wait boost requests waited upon by others

 drivers/dma-buf/dma-fence.c               | 102 ++++++++++++++++------
 drivers/gpu/drm/drm_syncobj.c             |   6 +-
 drivers/gpu/drm/i915/gt/intel_engine_pm.c |   1 -
 drivers/gpu/drm/i915/i915_request.c       |  13 ++-
 include/linux/dma-fence.h                 |  14 +++
 5 files changed, 101 insertions(+), 35 deletions(-)

Rob Clark Feb. 14, 2023, 7:14 p.m. UTC | #1

On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> In i915 we have this concept of "wait boosting" where we give a priority boost
> for instance to fences which are actively waited upon from userspace. This has
> its pros and cons and can certainly be discussed at length. However, the fact is
> some workloads really like it.
>
> Problem is that with the arrival of drm syncobj and a new userspace waiting
> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> this mini series really (really) quickly to see if some discussion can be had.
>
> It adds a concept of "wait count" to dma fence, which is incremented for every
> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> dma_fence_add_callback but from explicit/userspace wait paths).

I was thinking about a similar thing, but in the context of dma_fence
(or rather sync_file) fd poll()ing.  How does the kernel differentiate
between "housekeeping" poll()ers that don't want to trigger a boost but
simply want to know when to do cleanup, and waiters who are waiting
with some urgency?  I think we could use EPOLLPRI for this purpose.

Not sure how that translates to waits via the syncobj.  But I think we
want to let userspace give some hint about urgent vs housekeeping
waits.

Also, on a related topic: https://lwn.net/Articles/868468/

BR,
-R

Rob Clark Feb. 14, 2023, 7:26 p.m. UTC | #2

On Tue, Feb 14, 2023 at 11:14 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
> >
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > In i915 we have this concept of "wait boosting" where we give a priority boost
> > for instance to fences which are actively waited upon from userspace. This has
> > its pros and cons and can certainly be discussed at length. However, the fact is
> > some workloads really like it.
> >
> > Problem is that with the arrival of drm syncobj and a new userspace waiting
> > entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > this mini series really (really) quickly to see if some discussion can be had.
> >
> > It adds a concept of "wait count" to dma fence, which is incremented for every
> > explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > dma_fence_add_callback but from explicit/userspace wait paths).
>
> I was thinking about a similar thing, but in the context of dma_fence
> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> between "housekeeping" poll()ers that don't want to trigger boost but
> simply know when to do cleanup, and waiters who are waiting with some
> urgency.  I think we could use EPOLLPRI for this purpose.
>
> Not sure how that translates to waits via the syncobj.  But I think we
> want to let userspace give some hint about urgent vs housekeeping
> waits.

So probably the syncobj equiv of this would be to add something along
the lines of DRM_SYNCOBJ_WAIT_FLAGS_WAIT_PRI

BR,
-R

Tvrtko Ursulin Feb. 16, 2023, 11:19 a.m. UTC | #3

On 14/02/2023 19:14, Rob Clark wrote:
> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> In i915 we have this concept of "wait boosting" where we give a priority boost
>> for instance to fences which are actively waited upon from userspace. This has
>> its pros and cons and can certainly be discussed at length. However, the fact is
>> some workloads really like it.
>>
>> Problem is that with the arrival of drm syncobj and a new userspace waiting
>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
>> this mini series really (really) quickly to see if some discussion can be had.
>>
>> It adds a concept of "wait count" to dma fence, which is incremented for every
>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
>> dma_fence_add_callback but from explicit/userspace wait paths).
> 
> I was thinking about a similar thing, but in the context of dma_fence
> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> between "housekeeping" poll()ers that don't want to trigger boost but
> simply know when to do cleanup, and waiters who are waiting with some
> urgency.  I think we could use EPOLLPRI for this purpose.

Sounds plausible to allow distinguishing the two.

I wasn't aware one can set POLLPRI in pollfd.events but it appears it could be allowed:

/* Event types that can be polled for.  These bits may be set in `events'
    to indicate the interesting event types; they will appear in `revents'
    to indicate the status of the file descriptor.  */
#define POLLIN          0x001           /* There is data to read.  */
#define POLLPRI         0x002           /* There is urgent data to read.  */
#define POLLOUT         0x004           /* Writing now will not block.  */

> Not sure how that translates to waits via the syncobj.  But I think we
> want to let userspace give some hint about urgent vs housekeeping
> waits.

Probably DRM_SYNCOBJ_WAIT_FLAGS_<something>.

Both look like easy additions on top of my series. It would be just a matter of dma_fence_add_callback vs dma_fence_add_wait_callback based on flags, as that's how I called the "explicit userspace wait" one.

It would require userspace changes to make use of it but that is probably okay, or even preferable, since it makes the thing less of a heuristic. What I don't know however is how feasible it is to wire it up with say OpenCL, OpenGL or Vulkan, to allow application writers to distinguish between housekeeping vs performance sensitive waits.
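
A toy sketch of the flag-based choice between the two callback paths, in plain C so it can be compiled outside the kernel. Everything here is an assumption made for illustration: `MODEL_WAIT_FLAG_URGENT` stands in for the not yet named DRM_SYNCOBJ_WAIT_FLAGS_<something>, and the `model_*` functions only imitate the accounting difference between dma_fence_add_callback and dma_fence_add_wait_callback.

```c
#include <stdint.h>

/* Hypothetical flag; no such uapi bit is defined by the series. */
#define MODEL_WAIT_FLAG_URGENT (1u << 0)

struct model_fence {
	unsigned int callbacks;		/* all registered callbacks */
	unsigned int wait_count;	/* those added via the explicit wait path */
};

/* Imitates dma_fence_add_callback: housekeeping, no boost hint. */
static void model_add_callback(struct model_fence *f)
{
	f->callbacks++;
}

/* Imitates dma_fence_add_wait_callback: also counts an explicit waiter. */
static void model_add_wait_callback(struct model_fence *f)
{
	f->callbacks++;
	f->wait_count++;
}

/* The wait entry point picks the path purely from the userspace flags. */
static void model_syncobj_wait_setup(struct model_fence *f, uint32_t flags)
{
	if (flags & MODEL_WAIT_FLAG_URGENT)
		model_add_wait_callback(f);
	else
		model_add_callback(f);
}
```

With this shape, an existing caller that passes no new flag keeps today's behaviour, which is the property the opt-in approach relies on.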

> Also, on a related topic: https://lwn.net/Articles/868468/

Right, I missed that one.

One thing to mention is that my motivation here wasn't strictly waits relating to frame presentation but clvk workloads which constantly move between the CPU and GPU. Even outside the compute domain, I think this is a workload characteristic where waitboost in general helps. The concept of a deadline could still be used I guess, just setting it to some artificially early value when the actual time does not exist. But from scanning that discussion, it seems the proposal got bogged down in interactions between mode setting and stuff?

Regards,

Tvrtko

Rob Clark Feb. 16, 2023, 3:43 p.m. UTC | #4

On Thu, Feb 16, 2023 at 3:19 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 14/02/2023 19:14, Rob Clark wrote:
> > On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> >>
> >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >> In i915 we have this concept of "wait boosting" where we give a priority boost
> >> for instance to fences which are actively waited upon from userspace. This has
> >> its pros and cons and can certainly be discussed at length. However, the fact is
> >> some workloads really like it.
> >>
> >> Problem is that with the arrival of drm syncobj and a new userspace waiting
> >> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> >> this mini series really (really) quickly to see if some discussion can be had.
> >>
> >> It adds a concept of "wait count" to dma fence, which is incremented for every
> >> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> >> dma_fence_add_callback but from explicit/userspace wait paths).
> >
> > I was thinking about a similar thing, but in the context of dma_fence
> > (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> > between "housekeeping" poll()ers that don't want to trigger boost but
> > simply know when to do cleanup, and waiters who are waiting with some
> > urgency.  I think we could use EPOLLPRI for this purpose.
>
> Sounds plausible to allow distinguishing the two.
>
> I wasn't aware one can set POLLPRI in pollfd.events but it appears it could be allowed:
>
> /* Event types that can be polled for.  These bits may be set in `events'
>     to indicate the interesting event types; they will appear in `revents'
>     to indicate the status of the file descriptor.  */
> #define POLLIN          0x001           /* There is data to read.  */
> #define POLLPRI         0x002           /* There is urgent data to read.  */
> #define POLLOUT         0x004           /* Writing now will not block.  */
>
> > Not sure how that translates to waits via the syncobj.  But I think we
> > want to let userspace give some hint about urgent vs housekeeping
> > waits.
>
> Probably DRM_SYNCOBJ_WAIT_FLAGS_<something>.
>
> Both look easy additions on top of my series. It would be just a matter of dma_fence_add_callback vs dma_fence_add_wait_callback based on flags, as that's how I called the "explicit userspace wait" one.
>
> It would require userspace changes to make use of it but that is probably okay, or even preferable, since it makes the thing less of a heuristic. What I don't know however is how feasible is to wire it up with say OpenCL, OpenGL or Vulkan, to allow application writers distinguish between house keeping vs performance sensitive waits.
>

I think to start with, we consider API level waits as
POLLPRI/DRM_SYNCOBJ_WAIT_PRI until someone types up an extension to
give the app control.  I guess most housekeeping waits will be within
the driver.

(I could see the argument for making "PRI" the default and having a
new flag for non-boosting waits.. but POLLPRI is also some sort of
precedent)

> > Also, on a related topic: https://lwn.net/Articles/868468/
>
> Right, I missed that one.
>
> One thing to mention is that my motivation here wasn't strictly waits relating to frame presentation but clvk workloads which constantly move between the CPU and GPU. Even outside the compute domain, I think this is a workload characteristic where waitboost in general helps. The concept of deadline could still be used I guess, just setting it for some artificially early value, when the actual time does not exist. But scanning that discussion seems the proposal got bogged down in interactions between mode setting and stuff?
>

Yeah, it isn't _exactly_ the same thing but it is the same class of
problem where GPU stalling on something else sends the freq in the
wrong direction.  Probably we could consider wait-boosting as simply
an immediate deadline to unify the two things.
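
One way to picture that unification is a small C model in which a fence carries a single "earliest deadline" value and a wait boost is just a deadline of "now". All names below (`NO_DEADLINE`, `apply_deadline`, `apply_wait_boost`, `should_raise_freq`) are invented for this sketch, and the frequency policy is a placeholder rather than anything a real driver does.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_DEADLINE INT64_MAX	/* sentinel: nobody asked for a deadline */

/* Keep the earlier of two deadlines (times in nanoseconds). */
static int64_t earliest(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/* A presentation-style hint: finish by a known future time. */
static int64_t apply_deadline(int64_t current, int64_t deadline_ns)
{
	return earliest(current, deadline_ns);
}

/* A wait boost expressed as "the deadline is right now". */
static int64_t apply_wait_boost(int64_t current, int64_t now_ns)
{
	return earliest(current, now_ns);
}

/* Placeholder policy: ramp up once the deadline is within margin_ns. */
static bool should_raise_freq(int64_t deadline_ns, int64_t now_ns,
			      int64_t margin_ns)
{
	if (deadline_ns == NO_DEADLINE)
		return false;
	return deadline_ns - now_ns <= margin_ns;
}
```

Both kinds of hint then feed the same decision, and a driver or firmware governor only has to reason about one input.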

BR,
-R


Rodrigo Vivi Feb. 16, 2023, 6:19 p.m. UTC | #5

On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
> >
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > In i915 we have this concept of "wait boosting" where we give a priority boost
> > for instance to fences which are actively waited upon from userspace. This has
> > its pros and cons and can certainly be discussed at length. However, the fact is
> > some workloads really like it.
> >
> > Problem is that with the arrival of drm syncobj and a new userspace waiting
> > entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > this mini series really (really) quickly to see if some discussion can be had.
> >
> > It adds a concept of "wait count" to dma fence, which is incremented for every
> > explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > dma_fence_add_callback but from explicit/userspace wait paths).
> 
> I was thinking about a similar thing, but in the context of dma_fence
> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> between "housekeeping" poll()ers that don't want to trigger boost but
> simply know when to do cleanup, and waiters who are waiting with some
> urgency.  I think we could use EPOLLPRI for this purpose.
> 
> Not sure how that translates to waits via the syncobj.  But I think we
> want to let userspace give some hint about urgent vs housekeeping
> waits.

Should the hint be on the waits, or should the hints be on the executed
context?

In the end we need some way to quickly ramp-up the frequency to avoid
the execution bubbles.

waitboost is trying to guess that, but in some cases it guesses wrong
and wastes power.

btw, this is something that other drivers might need:

https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
Cc: Alex Deucher <alexander.deucher@amd.com>

Rob Clark Feb. 16, 2023, 7:59 p.m. UTC | #6

On Thu, Feb 16, 2023 at 10:20 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> > On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> > >
> > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > >
> > > In i915 we have this concept of "wait boosting" where we give a priority boost
> > > for instance to fences which are actively waited upon from userspace. This has
> > > its pros and cons and can certainly be discussed at length. However, the fact is
> > > some workloads really like it.
> > >
> > > Problem is that with the arrival of drm syncobj and a new userspace waiting
> > > entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > > this mini series really (really) quickly to see if some discussion can be had.
> > >
> > > It adds a concept of "wait count" to dma fence, which is incremented for every
> > > explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > > dma_fence_add_callback but from explicit/userspace wait paths).
> >
> > I was thinking about a similar thing, but in the context of dma_fence
> > (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> > between "housekeeping" poll()ers that don't want to trigger boost but
> > simply know when to do cleanup, and waiters who are waiting with some
> > urgency.  I think we could use EPOLLPRI for this purpose.
> >
> > Not sure how that translates to waits via the syncobj.  But I think we
> > want to let userspace give some hint about urgent vs housekeeping
> > waits.
>
> Should the hint be on the waits, or should the hints be on the executed
> context?

I think it should be on the wait, because different waits may be for
different purposes.  Ideally this could be exposed at the app API
level, but I guess the first step is to expose it to userspace.

BR,
-R

Tvrtko Ursulin Feb. 17, 2023, 12:56 p.m. UTC | #7

On 16/02/2023 18:19, Rodrigo Vivi wrote:
> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>
>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> In i915 we have this concept of "wait boosting" where we give a priority boost
>>> for instance to fences which are actively waited upon from userspace. This has
>>> its pros and cons and can certainly be discussed at length. However, the fact is
>>> some workloads really like it.
>>>
>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
>>> this mini series really (really) quickly to see if some discussion can be had.
>>>
>>> It adds a concept of "wait count" to dma fence, which is incremented for every
>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
>>> dma_fence_add_callback but from explicit/userspace wait paths).
>>
>> I was thinking about a similar thing, but in the context of dma_fence
>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
>> between "housekeeping" poll()ers that don't want to trigger boost but
>> simply know when to do cleanup, and waiters who are waiting with some
>> urgency.  I think we could use EPOLLPRI for this purpose.
>>
>> Not sure how that translates to waits via the syncobj.  But I think we
>> want to let userspace give some hint about urgent vs housekeeping
>> waits.
> 
> Should the hint be on the waits, or should the hints be on the executed
> context?
> 
> In the end we need some way to quickly ramp-up the frequency to avoid
> the execution bubbles.
> 
> waitboost is trying to guess that, but in some cases it guesses wrong
> and wastes power.

Do we have a list of workloads which shows who benefits and who loses 
from the current implementation of waitboost?

> btw, this is something that other drivers might need:
> 
> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> Cc: Alex Deucher <alexander.deucher@amd.com>

I have several issues with the context hint if it would directly 
influence frequency selection in the "more power" direction.

First of all, assume a context hint would replace the waitboost. Which 
applications would need to set it to restore the lost performance and 
how would they set it?

Then I don't even think userspace necessarily knows. Think of a layer 
like OpenCL. It doesn't really know in advance the profile of 
submissions vs waits. It depends on the CPU vs GPU speed, so hardware 
generation, and the actual size of the workload which can be influenced 
by the application (or user) and not the library.

The approach also lends itself well to an "arms race" where every 
application can say "Me me me, I am the most important workload there is!".

The last concern I also have with the proposal to expose deadlines 
or high priority waits as explicit uapi knobs. Both come under the "what 
the application told us it will do" category vs what it actually does. So I 
think it is slightly weaker than basing decisions on waits.

The current waitboost is a bit detached from that problem because when 
we waitboost for flips we _know_ it is an actual framebuffer in the flip 
chain. When we waitboost for waits we also know someone is waiting. We 
are not trusting userspace telling us this will be a buffer in the flip 
chain or that this is a context which will have a certain duty-cycle.

But yes, even if the input is truthful, the latter is still only a 
heuristic because nothing says all waits are important. AFAIU it just 
happened to work well in the past.

I do understand I am effectively arguing for more heuristics, which may 
sound a bit against the common wisdom. This is because in general I 
think the logic to do the right thing, be it in the driver or in the 
firmware, can work best if it has a holistic view. Simply put, it needs 
to have more inputs to the decisions it is making.

That is what my series is proposing - adding a common signal of "someone 
in userspace is waiting". What happens with that signal need not be 
defined (promised) in the uapi contract.

Say you route it to SLPC logic. It doesn't need to do with it what 
legacy i915 is doing today. It just needs to do something which works 
best for the majority of workloads. It can even ignore it if that works for it.

Finally, back to the immediate problem: when people replace the OpenCL 
NEO driver with clvk, performance tanks. The former does waits using 
i915-specific ioctls and so triggers waitboost, while the latter waits on 
syncobj, so there is no waitboost and performance is bad. What short-term 
solution can we come up with? Or do we say not to use clvk? I also wonder 
if other Vulkan based stuff is perhaps missing those easy performance gains.

Perhaps strictly speaking Rob's proposal and mine are not mutually 
exclusive. Yes, I could piggyback on his with an "immediate deadline for 
waits" idea, but they could also be separate concepts if we concluded 
that a "someone is waiting" signal is useful to have. Or if it takes too 
long to upstream the full deadline idea.

Regards,

Tvrtko

Rob Clark Feb. 17, 2023, 2:55 p.m. UTC | #8

On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> > On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> >> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> >> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>
> >>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>
> >>> In i915 we have this concept of "wait boosting" where we give a priority boost
> >>> for instance to fences which are actively waited upon from userspace. This has
> >>> its pros and cons and can certainly be discussed at length. However, the fact is
> >>> some workloads really like it.
> >>>
> >>> Problem is that with the arrival of drm syncobj and a new userspace waiting
> >>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> >>> this mini series really (really) quickly to see if some discussion can be had.
> >>>
> >>> It adds a concept of "wait count" to dma fence, which is incremented for every
> >>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> >>> dma_fence_add_callback but from explicit/userspace wait paths).
> >>
> >> I was thinking about a similar thing, but in the context of dma_fence
> >> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> >> between "housekeeping" poll()ers that don't want to trigger boost but
> >> simply know when to do cleanup, and waiters who are waiting with some
> >> urgency.  I think we could use EPOLLPRI for this purpose.
> >>
> >> Not sure how that translates to waits via the syncobj.  But I think we
> >> want to let userspace give some hint about urgent vs housekeeping
> >> waits.
> >
> > Should the hint be on the waits, or should the hints be on the executed
> > context?
> >
> > In the end we need some way to quickly ramp-up the frequency to avoid
> > the execution bubbles.
> >
> > waitboost is trying to guess that, but in some cases it guesses wrong
> > and wastes power.
>
> Do we have a list of workloads which shows who benefits and who loses
> from the current implementation of waitboost?
> > btw, this is something that other drivers might need:
> >
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> > Cc: Alex Deucher <alexander.deucher@amd.com>
>
> I have several issues with the context hint if it would directly
> influence frequency selection in the "more power" direction.
>
> First of all, assume a context hint would replace the waitboost. Which
> applications would need to set it to restore the lost performance and
> how would they set it?
>
> Then I don't even think userspace necessarily knows. Think of a layer
> like OpenCL. It doesn't really know in advance the profile of
> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> generation, and the actual size of the workload which can be influenced
> by the application (or user) and not the library.
>
> The approach also lends itself well for the "arms race" where every
> application can say "Me me me, I am the most important workload there is!".

since there is discussion happening in two places:

https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433

What I think you might want is a ctx boost_mask which lets an app or
driver disable certain boost signals/classes.  Fence waits would be
one class of boost, and hypothetical other signals like touchscreen
(or other) input events could be another.  A compute workload might
be interested in fence wait boosts but couldn't care less about input
events.
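
To make the boost_mask idea a bit more concrete, here is a rough sketch
in C.  Every name in it (enum boost_class, struct ctx_boost and the two
helpers) is made up for illustration; none of this is existing drm or
i915 uapi, and the real thing could look quite different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boost classes; these names do not exist in the kernel. */
enum boost_class {
	BOOST_FENCE_WAIT = 1u << 0, /* someone is blocked waiting on a fence */
	BOOST_INPUT      = 1u << 1, /* touchscreen or other input event */
};

/* Hypothetical per-context state: a set bit means that class is disabled. */
struct ctx_boost {
	uint32_t disabled_mask;
};

/* Opt the context out of one or more boost classes. */
static inline void ctx_boost_disable(struct ctx_boost *ctx, uint32_t classes)
{
	ctx->disabled_mask |= classes;
}

/* Returns true if a signal of the given class may boost this context. */
static inline bool ctx_boost_allowed(const struct ctx_boost *ctx,
				     enum boost_class cls)
{
	return !(ctx->disabled_mask & cls);
}
```

A compute context would call ctx_boost_disable(&ctx, BOOST_INPUT) once
at creation and keep fence wait boosts.  Everything is enabled by
default, so existing userspace sees no change.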

> The last concern is for me shared with the proposal to expose deadlines
> or high priority waits as explicit uapi knobs. Both come under the "what
> application told us it will do" category vs what it actually does. So I
> think it is slightly weaker than basing decisions of waits.
>
> The current waitboost is a bit detached from that problem because when
> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> chain. When we waitboost for waits we also know someone is waiting. We
> are not trusting userspace telling us this will be a buffer in the flip
> chain or that this is a context which will have a certain duty-cycle.
>
> But yes, even if the input is truthful, latter is still only a
> heuristics because nothing says all waits are important. AFAIU it just
> happened to work well in the past.
>
> I do understand I am effectively arguing for more heuristics, which may
> sound a bit against the common wisdom. This is because in general I
> think the logic to do the right thing, be it in the driver or in the
> firmware, can work best if it has a holistic view. Simply put it needs
> to have more inputs to the decisions it is making.
>
> That is what my series is proposing - adding a common signal of "someone
> in userspace is waiting". What happens with that signal needs not be
> defined (promised) in the uapi contract.
>
> Say you route it to SLPC logic. It doesn't need to do with it what
> legacy i915 is doing today. It just needs to do something which works
> best for majority of workloads. It can even ignore it if that works for it.
>
> Finally, back to the immediate problem is when people replace the OpenCL
> NEO driver with clvk that performance tanks. Because former does waits
> using i915 specific ioctls and so triggers waitboost, latter waits on
> syncobj so no waitboost and performance is bad. What short term solution
> can we come up with? Or we say to not use clvk? I also wonder if other
> Vulkan based stuff is perhaps missing those easy performance gains..
>
> Perhaps strictly speaking Rob's and mine proposal are not mutually
> exclusive. Yes I could piggy back on his with an "immediate deadline for
> waits" idea, but they could also be separate concepts if we concluded
> "someone is waiting" signal is useful to have. Or it takes to long to
> upstream the full deadline idea.

Let me re-spin my series and add the syncobj wait flag and i915 bits
adapted from your patches.  I think the basic idea of deadlines
(which includes "I want it NOW" ;-)) isn't controversial, but the
original idea got caught up in some bikeshed (what about compositors
that wait on fences in userspace to decide which surfaces to update in
the next frame), plus me getting busy and generally not having a good
plan for how to leverage this from VM guests (which is becoming
increasingly important for CrOS).  I think I can build on some ongoing
virtgpu fencing improvement work to solve the latter.  But now that we
have a 2nd use-case for this, it makes sense to respin.

BR,
-R

> Regards,
>
> Tvrtko
>
> >>
> >> Also, on a related topic: https://lwn.net/Articles/868468/
> >>
> >> BR,
> >> -R
> >>
> >>> Individual drivers can then inspect this via dma_fence_wait_count() and decide
> >>> to wait boost the waits on such fences.
> >>>
> >>> Again, quickly put together and smoke tested only - no guarantees whatsoever and
> >>> I will rely on interested parties to test and report if it even works or how
> >>> well.
> >>>
> >>> v2:
> >>>   * Small fixups based on CI feedback:
> >>>      * Handle decrement correctly for already signalled case while adding callback.
> >>>      * Remove i915 assert which was making sure struct i915_request does not grow.
> >>>   * Split out the i915 patch into three separate functional changes.
> >>>
> >>> Tvrtko Ursulin (5):
> >>>    dma-fence: Track explicit waiters
> >>>    drm/syncobj: Mark syncobj waits as external waiters
> >>>    drm/i915: Waitboost external waits
> >>>    drm/i915: Mark waits as explicit
> >>>    drm/i915: Wait boost requests waited upon by others
> >>>
> >>>   drivers/dma-buf/dma-fence.c               | 102 ++++++++++++++++------
> >>>   drivers/gpu/drm/drm_syncobj.c             |   6 +-
> >>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c |   1 -
> >>>   drivers/gpu/drm/i915/i915_request.c       |  13 ++-
> >>>   include/linux/dma-fence.h                 |  14 +++
> >>>   5 files changed, 101 insertions(+), 35 deletions(-)
> >>>
> >>> --
> >>> 2.34.1
> >>>

Tvrtko Ursulin Feb. 17, 2023, 4:03 p.m. UTC | #9

On 17/02/2023 14:55, Rob Clark wrote:
> On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 16/02/2023 18:19, Rodrigo Vivi wrote:
>>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
>>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
>>>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>>>
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
>>>>> for instance to fences which are actively waited upon from userspace. This has
>>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
>>>>> some workloads really like it.
>>>>>
>>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
>>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
>>>>> this mini series really (really) quickly to see if some discussion can be had.
>>>>>
>>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
>>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
>>>>> dma_fence_add_callback but from explicit/userspace wait paths).
>>>>
>>>> I was thinking about a similar thing, but in the context of dma_fence
>>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
>>>> between "housekeeping" poll()ers that don't want to trigger boost but
>>>> simply know when to do cleanup, and waiters who are waiting with some
>>>> urgency.  I think we could use EPOLLPRI for this purpose.
>>>>
>>>> Not sure how that translates to waits via the syncobj.  But I think we
>>>> want to let userspace give some hint about urgent vs housekeeping
>>>> waits.
>>>
>>> Should the hint be on the waits, or should the hints be on the executed
>>> context?
>>>
>>> In the end we need some way to quickly ramp-up the frequency to avoid
>>> the execution bubbles.
>>>
>>> waitboost is trying to guess that, but in some cases it guess wrong
>>> and waste power.
>>
>> Do we have a list of workloads which shows who benefits and who loses
>> from the current implementation of waitboost?
>>> btw, this is something that other drivers might need:
>>>
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>
>> I have several issues with the context hint if it would directly
>> influence frequency selection in the "more power" direction.
>>
>> First of all, assume a context hint would replace the waitboost. Which
>> applications would need to set it to restore the lost performance and
>> how would they set it?
>>
>> Then I don't even think userspace necessarily knows. Think of a layer
>> like OpenCL. It doesn't really know in advance the profile of
>> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
>> generation, and the actual size of the workload which can be influenced
>> by the application (or user) and not the library.
>>
>> The approach also lends itself well for the "arms race" where every
>> application can say "Me me me, I am the most important workload there is!".
> 
> since there is discussion happening in two places:
> 
> https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> 
> What I think you might want is a ctx boost_mask which lets an app or
> driver disable certain boost signals/classes.  Where fence waits is
> one class of boost, but hypothetical other signals like touchscreen
> (or other) input events could be another class of boost.  A compute
> workload might be interested in fence wait boosts but could care less
> about input events.

I think only apps have any chance of knowing whether their use of a
library is latency sensitive or not, which means new library extensions
and their adoption. So I have strong reservations about whether that
route is feasible.

Or we tie it to priority, which many drivers do. Normal and above gets the
boosting, and whatever lowered itself does not (akin to SCHED_IDLE/SCHED_BATCH).
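
As a minimal sketch of that rule, assuming a three-level context
priority (enum ctx_prio and wait_may_boost() are hypothetical names,
not taken from any driver):

```c
#include <stdbool.h>

/*
 * Hypothetical context priority levels. LOW stands in for a context that
 * lowered itself, in the spirit of SCHED_IDLE/SCHED_BATCH on the CPU side.
 */
enum ctx_prio {
	CTX_PRIO_LOW = -1,
	CTX_PRIO_NORMAL = 0,
	CTX_PRIO_HIGH = 1,
};

/* Waits boost only for contexts at normal priority or above. */
static inline bool wait_may_boost(enum ctx_prio prio)
{
	return prio >= CTX_PRIO_NORMAL;
}
```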

A related note is that we lack any external control over our scheduling
decisions, so we really do suck compared to other scheduling domains like
CPU and IO.

>> The last concern is for me shared with the proposal to expose deadlines
>> or high priority waits as explicit uapi knobs. Both come under the "what
>> application told us it will do" category vs what it actually does. So I
>> think it is slightly weaker than basing decisions of waits.
>>
>> The current waitboost is a bit detached from that problem because when
>> we waitboost for flips we _know_ it is an actual framebuffer in the flip
>> chain. When we waitboost for waits we also know someone is waiting. We
>> are not trusting userspace telling us this will be a buffer in the flip
>> chain or that this is a context which will have a certain duty-cycle.
>>
>> But yes, even if the input is truthful, latter is still only a
>> heuristics because nothing says all waits are important. AFAIU it just
>> happened to work well in the past.
>>
>> I do understand I am effectively arguing for more heuristics, which may
>> sound a bit against the common wisdom. This is because in general I
>> think the logic to do the right thing, be it in the driver or in the
>> firmware, can work best if it has a holistic view. Simply put it needs
>> to have more inputs to the decisions it is making.
>>
>> That is what my series is proposing - adding a common signal of "someone
>> in userspace is waiting". What happens with that signal needs not be
>> defined (promised) in the uapi contract.
>>
>> Say you route it to SLPC logic. It doesn't need to do with it what
>> legacy i915 is doing today. It just needs to do something which works
>> best for majority of workloads. It can even ignore it if that works for it.
>>
>> Finally, back to the immediate problem is when people replace the OpenCL
>> NEO driver with clvk that performance tanks. Because former does waits
>> using i915 specific ioctls and so triggers waitboost, latter waits on
>> syncobj so no waitboost and performance is bad. What short term solution
>> can we come up with? Or we say to not use clvk? I also wonder if other
>> Vulkan based stuff is perhaps missing those easy performance gains..
>>
>> Perhaps strictly speaking Rob's and mine proposal are not mutually
>> exclusive. Yes I could piggy back on his with an "immediate deadline for
>> waits" idea, but they could also be separate concepts if we concluded
>> "someone is waiting" signal is useful to have. Or it takes to long to
>> upstream the full deadline idea.
> 
> Let me re-spin my series and add the syncobj wait flag and i915 bits

I think the wait flag is questionable unless it is inverted to mark waits
which can be de-prioritized (again, the same parallel with SCHED_IDLE/BATCH).
Having a flag which "makes things faster" IMO should require elevated
privilege (to avoid the "arms race"), at which point I fear it quickly
becomes uninteresting.
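
For illustration only, the inverted form could look like the sketch
below. WAIT_FLAG_LOW_PRIORITY and wait_requests_boost() are invented
names and are not part of the drm syncobj uapi; the point is just that
a wait with no flags keeps today's behaviour and boosts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wait flag: the caller says this wait is not urgent. */
#define WAIT_FLAG_LOW_PRIORITY (1u << 0)

/* A wait boosts unless userspace explicitly opted out of boosting. */
static inline bool wait_requests_boost(uint32_t wait_flags)
{
	return !(wait_flags & WAIT_FLAG_LOW_PRIORITY);
}
```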

> adapted from your patches..  I think the basic idea of deadlines
> (which includes "I want it NOW" ;-)) isn't controversial, but the
> original idea got caught up in some bikeshed (what about compositors
> that wait on fences in userspace to decide which surfaces to update in
> the next frame), plus me getting busy and generally not having a good
> plan for how to leverage this from VM guests (which is becoming
> increasingly important for CrOS).  I think I can build on some ongoing
> virtgpu fencing improvement work to solve the latter.  But now that we
> have a 2nd use-case for this, it makes sense to respin.

Sure, I was looking at the old version already. It is interesting. But
IMO it also needs quite a bit more work to approach what the name of the
feature implies. It would need proper deadline-based sched job picking,
and even then drm sched is mostly just a frontend. So once jobs are past
runnable status and handed over to the backend, without further driver
work it probably wouldn't be very effective beyond very lightly loaded
systems.

Then if we fast forward to a world where schedulers perhaps become fully
deadline aware (we even had this for i915 a few years back), the
question will be whether equating waits with immediate deadlines still
works. Maybe not too well, because we wouldn't be able to distinguish
the "someone is waiting" signal from the otherwise propagated deadlines.
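
A toy model of keeping the two signals apart, as a sketch only. struct
toy_fence is not struct dma_fence and all helper names are invented;
the waiter counter is loosely modelled on the "wait count" from this
series, and the deadline field on the deadline proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy fence carrying both signals separately. */
struct toy_fence {
	unsigned int wait_count; /* explicit (userspace) waiters right now */
	bool has_deadline;       /* true once any deadline was propagated */
	uint64_t deadline_ns;    /* earliest propagated deadline, if any */
};

/* Called when an explicit wait starts. */
static inline void toy_fence_add_waiter(struct toy_fence *f)
{
	f->wait_count++;
}

/* Called when an explicit wait ends; never underflows. */
static inline void toy_fence_remove_waiter(struct toy_fence *f)
{
	if (f->wait_count)
		f->wait_count--;
}

/* Keep only the earliest deadline seen so far. */
static inline void toy_fence_set_deadline(struct toy_fence *f,
					  uint64_t deadline_ns)
{
	if (!f->has_deadline || deadline_ns < f->deadline_ns) {
		f->has_deadline = true;
		f->deadline_ns = deadline_ns;
	}
}

/* "Someone is waiting" stays observable regardless of any deadline. */
static inline bool toy_fence_has_waiters(const struct toy_fence *f)
{
	return f->wait_count > 0;
}
```

With separate fields a driver can still ask "is anyone waiting?" on a
fence that already carries a deadline, which is the information that
gets lost if a wait is folded into an immediate deadline.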

Regards,

Tvrtko

Rob Clark Feb. 17, 2023, 5 p.m. UTC | #10

On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 17/02/2023 14:55, Rob Clark wrote:
> > On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> >>
> >>
> >> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> >>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> >>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> >>>> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>>>
> >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>
> >>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
> >>>>> for instance to fences which are actively waited upon from userspace. This has
> >>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
> >>>>> some workloads really like it.
> >>>>>
> >>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
> >>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> >>>>> this mini series really (really) quickly to see if some discussion can be had.
> >>>>>
> >>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
> >>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> >>>>> dma_fence_add_callback but from explicit/userspace wait paths).
> >>>>
> >>>> I was thinking about a similar thing, but in the context of dma_fence
> >>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> >>>> between "housekeeping" poll()ers that don't want to trigger boost but
> >>>> simply know when to do cleanup, and waiters who are waiting with some
> >>>> urgency.  I think we could use EPOLLPRI for this purpose.
> >>>>
> >>>> Not sure how that translates to waits via the syncobj.  But I think we
> >>>> want to let userspace give some hint about urgent vs housekeeping
> >>>> waits.
> >>>
> >>> Should the hint be on the waits, or should the hints be on the executed
> >>> context?
> >>>
> >>> In the end we need some way to quickly ramp-up the frequency to avoid
> >>> the execution bubbles.
> >>>
> >>> waitboost is trying to guess that, but in some cases it guess wrong
> >>> and waste power.
> >>
> >> Do we have a list of workloads which shows who benefits and who loses
> >> from the current implementation of waitboost?
> >>> btw, this is something that other drivers might need:
> >>>
> >>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> >>> Cc: Alex Deucher <alexander.deucher@amd.com>
> >>
> >> I have several issues with the context hint if it would directly
> >> influence frequency selection in the "more power" direction.
> >>
> >> First of all, assume a context hint would replace the waitboost. Which
> >> applications would need to set it to restore the lost performance and
> >> how would they set it?
> >>
> >> Then I don't even think userspace necessarily knows. Think of a layer
> >> like OpenCL. It doesn't really know in advance the profile of
> >> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> >> generation, and the actual size of the workload which can be influenced
> >> by the application (or user) and not the library.
> >>
> >> The approach also lends itself well for the "arms race" where every
> >> application can say "Me me me, I am the most important workload there is!".
> >
> > since there is discussion happening in two places:
> >
> > https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> >
> > What I think you might want is a ctx boost_mask which lets an app or
> > driver disable certain boost signals/classes.  Where fence waits is
> > one class of boost, but hypothetical other signals like touchscreen
> > (or other) input events could be another class of boost.  A compute
> > workload might be interested in fence wait boosts but could care less
> > about input events.
>
> I think it can only be apps which could have any chance knowing whether
> their use of a library is latency sensitive or not. Which means new
> library extensions and their adoption. So I have some strong reservation
> that route is feasible.
>
> Or we tie with priority which many drivers do. Normal and above gets the
> boosting and what lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).

yeah, that sounds reasonable.

> Related note is that we lack any external control of our scheduling
> decisions so we really do suck compared to other scheduling domains like
> CPU and IO etc.
>
> >> The last concern is for me shared with the proposal to expose deadlines
> >> or high priority waits as explicit uapi knobs. Both come under the "what
> >> application told us it will do" category vs what it actually does. So I
> >> think it is slightly weaker than basing decisions of waits.
> >>
> >> The current waitboost is a bit detached from that problem because when
> >> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> >> chain. When we waitboost for waits we also know someone is waiting. We
> >> are not trusting userspace telling us this will be a buffer in the flip
> >> chain or that this is a context which will have a certain duty-cycle.
> >>
> >> But yes, even if the input is truthful, latter is still only a
> >> heuristics because nothing says all waits are important. AFAIU it just
> >> happened to work well in the past.
> >>
> >> I do understand I am effectively arguing for more heuristics, which may
> >> sound a bit against the common wisdom. This is because in general I
> >> think the logic to do the right thing, be it in the driver or in the
> >> firmware, can work best if it has a holistic view. Simply put it needs
> >> to have more inputs to the decisions it is making.
> >>
> >> That is what my series is proposing - adding a common signal of "someone
> >> in userspace is waiting". What happens with that signal needs not be
> >> defined (promised) in the uapi contract.
> >>
> >> Say you route it to SLPC logic. It doesn't need to do with it what
> >> legacy i915 is doing today. It just needs to do something which works
> >> best for majority of workloads. It can even ignore it if that works for it.
> >>
> >> Finally, back to the immediate problem is when people replace the OpenCL
> >> NEO driver with clvk that performance tanks. Because former does waits
> >> using i915 specific ioctls and so triggers waitboost, latter waits on
> >> syncobj so no waitboost and performance is bad. What short term solution
> >> can we come up with? Or we say to not use clvk? I also wonder if other
> >> Vulkan based stuff is perhaps missing those easy performance gains..
> >>
> >> Perhaps strictly speaking Rob's and mine proposal are not mutually
> >> exclusive. Yes I could piggy back on his with an "immediate deadline for
> >> waits" idea, but they could also be separate concepts if we concluded
> >> "someone is waiting" signal is useful to have. Or it takes to long to
> >> upstream the full deadline idea.
> >
> > Let me re-spin my series and add the syncobj wait flag and i915 bits
>
> I think wait flag is questionable unless it is inverted to imply waits
> which can be de-prioritized (again same parallel with SCHED_IDLE/BATCH).
> Having a flag which "makes things faster" IMO should require elevated
> privilege (to avoid the "arms race") at which point I fear it quickly
> becomes uninteresting.

I guess you could make the argument in either direction.  Making the
default behavior ramp up clocks could be a power regression.

I also think the "arms race" scenario isn't really as much of a
problem as you think.  There aren't _that_ many things using the GPU
at the same time (compared to # of things using CPU).   And a lot of
mobile games throttle framerate to avoid draining your battery too
quickly (after all, if your battery is dead you can't keep buying loot
boxes or whatever).

> > adapted from your patches..  I think the basic idea of deadlines
> > (which includes "I want it NOW" ;-)) isn't controversial, but the
> > original idea got caught up in some bikeshed (what about compositors
> > that wait on fences in userspace to decide which surfaces to update in
> > the next frame), plus me getting busy and generally not having a good
> > plan for how to leverage this from VM guests (which is becoming
> > increasingly important for CrOS).  I think I can build on some ongoing
> > virtgpu fencing improvement work to solve the latter.  But now that we
> > have a 2nd use-case for this, it makes sense to respin.
>
> Sure, I was looking at the old version already. It is interesting. But
> also IMO needs quite a bit more work to approach achieving what is
> implied from the name of the feature. It would need proper deadline
> based sched job picking, and even then drm sched is mostly just a
> frontend. So once past runnable status and jobs handed over to backend,
> without further driver work it probably wouldn't be very effective past
> very lightly loaded systems.

Yes, but all of that is not part of dma_fence ;-)

A pretty common challenging usecase is still the single fullscreen
game, where scheduling isn't the problem, but landing at an
appropriate GPU freq absolutely is.  (UI workloads are perhaps more
interesting from a scheduler standpoint, but they generally aren't
challenging from a load/freq standpoint.)

Fwiw, the original motivation of the series was to implement something
akin to i915 pageflip boosting without having to abandon the atomic
helpers.  (And, I guess it would also let i915 preserve that feature
if it switched to atomic helpers.  I'm unsure if there are still other
things blocking i915's migration.)

> Then if we fast forward to a world where schedulers perhaps become fully
> deadline aware (we even had this for i915 few years back) then the
> question will be does equating waits with immediate deadlines still
> works. Maybe not too well because we wouldn't have the ability to
> distinguish between the "someone is waiting" signal from the otherwise
> propagated deadlines.

Is there any other way to handle a wait boost than expressing it as an
ASAP deadline?
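
Roughly what I mean, as a sketch with made-up names (struct
sketch_fence, sketch_set_deadline() and sketch_wait_boost() are
illustrative only, not the API from my series): a blocking wait simply
sets a deadline of "now", and the fence keeps the earliest deadline it
has seen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative fence state; not struct dma_fence. */
struct sketch_fence {
	bool has_deadline;
	uint64_t deadline_ns;
};

/* Record a deadline hint, keeping the earliest one seen so far. */
static inline void sketch_set_deadline(struct sketch_fence *f, uint64_t t_ns)
{
	if (!f->has_deadline || t_ns < f->deadline_ns) {
		f->has_deadline = true;
		f->deadline_ns = t_ns;
	}
}

/* A wait boost expressed as an ASAP deadline: the deadline is "now". */
static inline void sketch_wait_boost(struct sketch_fence *f, uint64_t now_ns)
{
	sketch_set_deadline(f, now_ns);
}
```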

BR,
-R

>
> Regards,
>
> Tvrtko

Rodrigo Vivi Feb. 17, 2023, 8:45 p.m. UTC | #11

On Fri, Feb 17, 2023 at 09:00:49AM -0800, Rob Clark wrote:
> On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
> >
> >
> > On 17/02/2023 14:55, Rob Clark wrote:
> > > On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> > > <tvrtko.ursulin@linux.intel.com> wrote:
> > >>
> > >>
> > >> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> > >>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> > >>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> > >>>> <tvrtko.ursulin@linux.intel.com> wrote:
> > >>>>>
> > >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > >>>>>
> > >>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
> > >>>>> for instance to fences which are actively waited upon from userspace. This has
> > >>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
> > >>>>> some workloads really like it.
> > >>>>>
> > >>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
> > >>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > >>>>> this mini series really (really) quickly to see if some discussion can be had.
> > >>>>>
> > >>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
> > >>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > >>>>> dma_fence_add_callback but from explicit/userspace wait paths).
> > >>>>
> > >>>> I was thinking about a similar thing, but in the context of dma_fence
> > >>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> > >>>> between "housekeeping" poll()ers that don't want to trigger boost but
> > >>>> simply know when to do cleanup, and waiters who are waiting with some
> > >>>> urgency.  I think we could use EPOLLPRI for this purpose.
> > >>>>
> > >>>> Not sure how that translates to waits via the syncobj.  But I think we
> > >>>> want to let userspace give some hint about urgent vs housekeeping
> > >>>> waits.
> > >>>
> > >>> Should the hint be on the waits, or should the hints be on the executed
> > >>> context?
> > >>>
> > >>> In the end we need some way to quickly ramp-up the frequency to avoid
> > >>> the execution bubbles.
> > >>>
> > >>> waitboost is trying to guess that, but in some cases it guess wrong
> > >>> and waste power.
> > >>
> > >> Do we have a list of workloads which shows who benefits and who loses
> > >> from the current implementation of waitboost?
> > >>> btw, this is something that other drivers might need:
> > >>>
> > >>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> > >>> Cc: Alex Deucher <alexander.deucher@amd.com>
> > >>
> > >> I have several issues with the context hint if it would directly
> > >> influence frequency selection in the "more power" direction.
> > >>
> > >> First of all, assume a context hint would replace the waitboost. Which
> > >> applications would need to set it to restore the lost performance and
> > >> how would they set it?
> > >>
> > >> Then I don't even think userspace necessarily knows. Think of a layer
> > >> like OpenCL. It doesn't really know in advance the profile of
> > >> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> > >> generation, and the actual size of the workload which can be influenced
> > >> by the application (or user) and not the library.
> > >>
> > >> The approach also lends itself well for the "arms race" where every
> > >> application can say "Me me me, I am the most important workload there is!".
> > >
> > > since there is discussion happening in two places:
> > >
> > > https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> > >
> > > What I think you might want is a ctx boost_mask which lets an app or
> > > driver disable certain boost signals/classes.  Where fence waits is
> > > one class of boost, but hypothetical other signals like touchscreen
> > > (or other) input events could be another class of boost.  A compute
> > > workload might be interested in fence wait boosts but could care less
> > > about input events.
> >
> > I think it can only be apps which could have any chance knowing whether
> > their use of a library is latency sensitive or not. Which means new
> > library extensions and their adoption. So I have some strong reservation
> > that route is feasible.
> >
> > Or we tie with priority which many drivers do. Normal and above gets the
> > boosting and what lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).
> 
> yeah, that sounds reasonable.
> 

In that GitLab issue discussion Emma Anholt was against using priority
to influence frequency, since priority should be more about latency.

Or are we talking about a different priority here?

> > Related note is that we lack any external control of our scheduling
> > decisions so we really do suck compared to other scheduling domains like
> > CPU and IO etc.
> >
> > >> The last concern is for me shared with the proposal to expose deadlines
> > >> or high priority waits as explicit uapi knobs. Both come under the "what
> > >> application told us it will do" category vs what it actually does. So I
> > >> think it is slightly weaker than basing decisions of waits.
> > >>
> > >> The current waitboost is a bit detached from that problem because when
> > >> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> > >> chain. When we waitboost for waits we also know someone is waiting. We
> > >> are not trusting userspace telling us this will be a buffer in the flip
> > >> chain or that this is a context which will have a certain duty-cycle.
> > >>
> > >> But yes, even if the input is truthful, latter is still only a
> > >> heuristics because nothing says all waits are important. AFAIU it just
> > >> happened to work well in the past.
> > >>
> > >> I do understand I am effectively arguing for more heuristics, which may
> > >> sound a bit against the common wisdom. This is because in general I
> > >> think the logic to do the right thing, be it in the driver or in the
> > >> firmware, can work best if it has a holistic view. Simply put it needs
> > >> to have more inputs to the decisions it is making.
> > >>
> > >> That is what my series is proposing - adding a common signal of "someone
> > >> in userspace is waiting". What happens with that signal needs not be
> > >> defined (promised) in the uapi contract.
> > >>
> > >> Say you route it to SLPC logic. It doesn't need to do with it what
> > >> legacy i915 is doing today. It just needs to do something which works
> > >> best for majority of workloads. It can even ignore it if that works for it.
> > >>
> > >> Finally, back to the immediate problem is when people replace the OpenCL
> > >> NEO driver with clvk that performance tanks. Because former does waits
> > >> using i915 specific ioctls and so triggers waitboost, latter waits on
> > >> syncobj so no waitboost and performance is bad. What short term solution
> > >> can we come up with? Or we say to not use clvk? I also wonder if other
> > >> Vulkan based stuff is perhaps missing those easy performance gains..
> > >>
> > >> Perhaps strictly speaking Rob's and mine proposal are not mutually
> > >> exclusive. Yes I could piggy back on his with an "immediate deadline for
> > >> waits" idea, but they could also be separate concepts if we concluded
> > >> "someone is waiting" signal is useful to have. Or it takes to long to
> > >> upstream the full deadline idea.
> > >
> > > Let me re-spin my series and add the syncobj wait flag and i915 bits
> >
> > I think wait flag is questionable unless it is inverted to imply waits
> > which can be de-prioritized (again same parallel with SCHED_IDLE/BATCH).
> > Having a flag which "makes things faster" IMO should require elevated
> > privilege (to avoid the "arms race") at which point I fear it quickly
> > becomes uninteresting.
> 
> I guess you could make the argument in either direction.  Making the
> default behavior ramp up clocks could be a power regression.

Yep, exactly the media / video conference case.

> 
> I also think the "arms race" scenario isn't really as much of a
> problem as you think.  There aren't _that_ many things using the GPU
> at the same time (compared to # of things using CPU).   And a lot of
> mobile games throttle framerate to avoid draining your battery too
> quickly (after all, if your battery is dead you can't keep buying loot
> boxes or whatever).

Very good point.

And in the GPU case they rely a lot on profiles, which, btw, seems
to be the Radeon solution: they boost the freq if the high performance
profile is selected and don't care about the execution bubbles if a low
or mid profile is selected, or something like that.

> 
> > > adapted from your patches..  I think the basic idea of deadlines
> > > (which includes "I want it NOW" ;-)) isn't controversial, but the
> > > original idea got caught up in some bikeshed (what about compositors
> > > that wait on fences in userspace to decide which surfaces to update in
> > > the next frame), plus me getting busy and generally not having a good
> > > plan for how to leverage this from VM guests (which is becoming
> > > increasingly important for CrOS).  I think I can build on some ongoing
> > > virtgpu fencing improvement work to solve the latter.  But now that we
> > > have a 2nd use-case for this, it makes sense to respin.
> >
> > Sure, I was looking at the old version already. It is interesting. But
> > also IMO needs quite a bit more work to approach achieving what is
> > implied from the name of the feature. It would need proper deadline
> > based sched job picking, and even then drm sched is mostly just a
> > frontend. So once past runnable status and jobs handed over to backend,
> > without further driver work it probably wouldn't be very effective past
> > very lightly loaded systems.
> 
> Yes, but all of that is not part of dma_fence ;-)
> 
> A pretty common challenging usecase is still the single fullscreen
> game, where scheduling isn't the problem, but landing at an
> appropriate GPU freq absolutely is.  (UI workloads are perhaps more
> interesting from a scheduler standpoint, but they generally aren't
> challenging from a load/freq standpoint.)
> 
> Fwiw, the original motivation of the series was to implement something
> akin to i915 pageflip boosting without having to abandon the atomic
> helpers.  (And, I guess it would also let i915 preserve that feature
> if it switched to atomic helpers.. I'm unsure if there are still other
> things blocking i915's migration.)
> 
> > Then if we fast forward to a world where schedulers perhaps become fully
> > deadline aware (we even had this for i915 few years back) then the
> > question will be does equating waits with immediate deadlines still
> > works. Maybe not too well because we wouldn't have the ability to
> > distinguish between the "someone is waiting" signal from the otherwise
> > propagated deadlines.
> 
> Is there any other way to handle a wait boost than expressing it as an
> ASAP deadline?
> 
> BR,
> -R
> 
> >
> > Regards,
> >
> > Tvrtko

Rob Clark Feb. 17, 2023, 11:38 p.m. UTC | #12

On Fri, Feb 17, 2023 at 12:45 PM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> On Fri, Feb 17, 2023 at 09:00:49AM -0800, Rob Clark wrote:
> > On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> > >
> > >
> > > On 17/02/2023 14:55, Rob Clark wrote:
> > > > On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> > > > <tvrtko.ursulin@linux.intel.com> wrote:
> > > >>
> > > >>
> > > >> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> > > >>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> > > >>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> > > >>>> <tvrtko.ursulin@linux.intel.com> wrote:
> > > >>>>>
> > > >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > >>>>>
> > > >>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
> > > >>>>> for instance to fences which are actively waited upon from userspace. This has
> > > >>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
> > > >>>>> some workloads really like it.
> > > >>>>>
> > > >>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
> > > >>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > > >>>>> this mini series really (really) quickly to see if some discussion can be had.
> > > >>>>>
> > > >>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
> > > >>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > > >>>>> dma_fence_add_callback but from explicit/userspace wait paths).
> > > >>>>
> > > >>>> I was thinking about a similar thing, but in the context of dma_fence
> > > >>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> > > >>>> between "housekeeping" poll()ers that don't want to trigger boost but
> > > >>>> simply know when to do cleanup, and waiters who are waiting with some
> > > >>>> urgency.  I think we could use EPOLLPRI for this purpose.
> > > >>>>
> > > >>>> Not sure how that translates to waits via the syncobj.  But I think we
> > > >>>> want to let userspace give some hint about urgent vs housekeeping
> > > >>>> waits.
> > > >>>
> > > >>> Should the hint be on the waits, or should the hints be on the executed
> > > >>> context?
> > > >>>
> > > >>> In the end we need some way to quickly ramp-up the frequency to avoid
> > > >>> the execution bubbles.
> > > >>>
> > > >>> waitboost is trying to guess that, but in some cases it guess wrong
> > > >>> and waste power.
> > > >>
> > > >> Do we have a list of workloads which shows who benefits and who loses
> > > >> from the current implementation of waitboost?
> > > >>> btw, this is something that other drivers might need:
> > > >>>
> > > >>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> > > >>> Cc: Alex Deucher <alexander.deucher@amd.com>
> > > >>
> > > >> I have several issues with the context hint if it would directly
> > > >> influence frequency selection in the "more power" direction.
> > > >>
> > > >> First of all, assume a context hint would replace the waitboost. Which
> > > >> applications would need to set it to restore the lost performance and
> > > >> how would they set it?
> > > >>
> > > >> Then I don't even think userspace necessarily knows. Think of a layer
> > > >> like OpenCL. It doesn't really know in advance the profile of
> > > >> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> > > >> generation, and the actual size of the workload which can be influenced
> > > >> by the application (or user) and not the library.
> > > >>
> > > >> The approach also lends itself well for the "arms race" where every
> > > >> application can say "Me me me, I am the most important workload there is!".
> > > >
> > > > since there is discussion happening in two places:
> > > >
> > > > https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> > > >
> > > > What I think you might want is a ctx boost_mask which lets an app or
> > > > driver disable certain boost signals/classes.  Where fence waits is
> > > > one class of boost, but hypothetical other signals like touchscreen
> > > > (or other) input events could be another class of boost.  A compute
> > > > workload might be interested in fence wait boosts but could care less
> > > > about input events.
> > >
> > > I think it can only be apps which could have any chance knowing whether
> > > their use of a library is latency sensitive or not. Which means new
> > > library extensions and their adoption. So I have some strong reservation
> > > that route is feasible.
> > >
> > > Or we tie with priority which many drivers do. Normal and above gets the
> > > boosting and what lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).
> >
> > yeah, that sounds reasonable.
> >
>
> on that gitlab-issue discussion Emma Anholt was against using the priority
> to influence frequency since that should be more about latency.
>
> or we are talking about something different priority here?

I was thinking of skipping the boost only on the lowest priority, while
still boosting on norm/high priority.

But it's not something I feel too strongly about.  I.e. deciding on
policy doesn't affect, or need to block, getting the dma_fence and
syncobj plumbing in place.
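
As a rough illustration of the policy discussed here, the following is a toy userspace model in Python, not kernel code. The `Priority` levels and the `should_waitboost` helper are hypothetical names made up for this sketch; the only thing taken from the thread is the rule itself: boost waited-upon work unless the context explicitly lowered its own priority.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical context priority levels, for illustration only.
    LOW = -1
    NORMAL = 0
    HIGH = 1


def should_waitboost(ctx_priority: Priority, has_userspace_waiter: bool) -> bool:
    """Boost only when someone is waiting and the context did not lower itself."""
    if not has_userspace_waiter:
        return False
    # Normal and above get the boost; a context that lowered itself does not.
    return ctx_priority >= Priority.NORMAL
```

The point of keeping this as a separate predicate is that the policy can change without touching the plumbing that reports whether a userspace waiter exists.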

BR,
-R

> > > Related note is that we lack any external control of our scheduling
> > > decisions so we really do suck compared to other scheduling domains like
> > > CPU and IO etc.
> > >
> > > >> The last concern is for me shared with the proposal to expose deadlines
> > > >> or high priority waits as explicit uapi knobs. Both come under the "what
> > > >> application told us it will do" category vs what it actually does. So I
> > > >> think it is slightly weaker than basing decisions of waits.
> > > >>
> > > >> The current waitboost is a bit detached from that problem because when
> > > >> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> > > >> chain. When we waitboost for waits we also know someone is waiting. We
> > > >> are not trusting userspace telling us this will be a buffer in the flip
> > > >> chain or that this is a context which will have a certain duty-cycle.
> > > >>
> > > >> But yes, even if the input is truthful, latter is still only a
> > > >> heuristics because nothing says all waits are important. AFAIU it just
> > > >> happened to work well in the past.
> > > >>
> > > >> I do understand I am effectively arguing for more heuristics, which may
> > > >> sound a bit against the common wisdom. This is because in general I
> > > >> think the logic to do the right thing, be it in the driver or in the
> > > >> firmware, can work best if it has a holistic view. Simply put it needs
> > > >> to have more inputs to the decisions it is making.
> > > >>
> > > >> That is what my series is proposing - adding a common signal of "someone
> > > >> in userspace is waiting". What happens with that signal needs not be
> > > >> defined (promised) in the uapi contract.
> > > >>
> > > >> Say you route it to SLPC logic. It doesn't need to do with it what
> > > >> legacy i915 is doing today. It just needs to do something which works
> > > >> best for majority of workloads. It can even ignore it if that works for it.
> > > >>
> > > >> Finally, back to the immediate problem is when people replace the OpenCL
> > > >> NEO driver with clvk that performance tanks. Because former does waits
> > > >> using i915 specific ioctls and so triggers waitboost, latter waits on
> > > >> syncobj so no waitboost and performance is bad. What short term solution
> > > >> can we come up with? Or we say to not use clvk? I also wonder if other
> > > >> Vulkan based stuff is perhaps missing those easy performance gains..
> > > >>
> > > >> Perhaps strictly speaking Rob's and mine proposal are not mutually
> > > >> exclusive. Yes I could piggy back on his with an "immediate deadline for
> > > >> waits" idea, but they could also be separate concepts if we concluded
> > > >> "someone is waiting" signal is useful to have. Or it takes to long to
> > > >> upstream the full deadline idea.
> > > >
> > > > Let me re-spin my series and add the syncobj wait flag and i915 bits
> > >
> > > I think wait flag is questionable unless it is inverted to imply waits
> > > which can be de-prioritized (again same parallel with SCHED_IDLE/BATCH).
> > > Having a flag which "makes things faster" IMO should require elevated
> > > privilege (to avoid the "arms race") at which point I fear it quickly
> > > becomes uninteresting.
> >
> > I guess you could make the argument in either direction.  Making the
> > default behavior ramp up clocks could be a power regression.
>
> yeap, exactly the media / video conference case.
>
> >
> > I also think the "arms race" scenario isn't really as much of a
> > problem as you think.  There aren't _that_ many things using the GPU
> > at the same time (compared to # of things using CPU).   And a lot of
> > mobile games throttle framerate to avoid draining your battery too
> > quickly (after all, if your battery is dead you can't keep buying loot
> > boxes or whatever).
>
> Very good point.
>
> And in the GPU case they rely a lot on the profiles. Which btw, seems
> to be the Radeon solution. They boost the freq if the high performance
> profile is selected and don't care about the execution bubbles if low
> or mid profiles are selected, or something like that.
>
> >
> > > > adapted from your patches..  I think the basic idea of deadlines
> > > > (which includes "I want it NOW" ;-)) isn't controversial, but the
> > > > original idea got caught up in some bikeshed (what about compositors
> > > > that wait on fences in userspace to decide which surfaces to update in
> > > > the next frame), plus me getting busy and generally not having a good
> > > > plan for how to leverage this from VM guests (which is becoming
> > > > increasingly important for CrOS).  I think I can build on some ongoing
> > > > virtgpu fencing improvement work to solve the latter.  But now that we
> > > > have a 2nd use-case for this, it makes sense to respin.
> > >
> > > Sure, I was looking at the old version already. It is interesting. But
> > > also IMO needs quite a bit more work to approach achieving what is
> > > implied from the name of the feature. It would need proper deadline
> > > based sched job picking, and even then drm sched is mostly just a
> > > frontend. So once past runnable status and jobs handed over to backend,
> > > without further driver work it probably wouldn't be very effective past
> > > very lightly loaded systems.
> >
> > Yes, but all of that is not part of dma_fence ;-)
> >
> > A pretty common challenging usecase is still the single fullscreen
> > game, where scheduling isn't the problem, but landing at an
> > appropriate GPU freq absolutely is.  (UI workloads are perhaps more
> > interesting from a scheduler standpoint, but they generally aren't
> > challenging from a load/freq standpoint.)
> >
> > Fwiw, the original motivation of the series was to implement something
> > akin to i915 pageflip boosting without having to abandon the atomic
> > helpers.  (And, I guess it would also let i915 preserve that feature
> > if it switched to atomic helpers.. I'm unsure if there are still other
> > things blocking i915's migration.)
> >
> > > Then if we fast forward to a world where schedulers perhaps become fully
> > > deadline aware (we even had this for i915 few years back) then the
> > > question will be does equating waits with immediate deadlines still
> > > works. Maybe not too well because we wouldn't have the ability to
> > > distinguish between the "someone is waiting" signal from the otherwise
> > > propagated deadlines.
> >
> > Is there any other way to handle a wait boost than expressing it as an
> > ASAP deadline?
> >
> > BR,
> > -R
> >
> > >
> > > Regards,
> > >
> > > Tvrtko

Tvrtko Ursulin Feb. 20, 2023, 11:33 a.m. UTC | #13

On 17/02/2023 20:45, Rodrigo Vivi wrote:
> On Fri, Feb 17, 2023 at 09:00:49AM -0800, Rob Clark wrote:
>> On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>
>>>
>>> On 17/02/2023 14:55, Rob Clark wrote:
>>>> On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
>>>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>> On 16/02/2023 18:19, Rodrigo Vivi wrote:
>>>>>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
>>>>>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
>>>>>>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>>>>>>
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>
>>>>>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
>>>>>>>> for instance to fences which are actively waited upon from userspace. This has
>>>>>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
>>>>>>>> some workloads really like it.
>>>>>>>>
>>>>>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
>>>>>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
>>>>>>>> this mini series really (really) quickly to see if some discussion can be had.
>>>>>>>>
>>>>>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
>>>>>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
>>>>>>>> dma_fence_add_callback but from explicit/userspace wait paths).
>>>>>>>
>>>>>>> I was thinking about a similar thing, but in the context of dma_fence
>>>>>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
>>>>>>> between "housekeeping" poll()ers that don't want to trigger boost but
>>>>>>> simply know when to do cleanup, and waiters who are waiting with some
>>>>>>> urgency.  I think we could use EPOLLPRI for this purpose.
>>>>>>>
>>>>>>> Not sure how that translates to waits via the syncobj.  But I think we
>>>>>>> want to let userspace give some hint about urgent vs housekeeping
>>>>>>> waits.
>>>>>>
>>>>>> Should the hint be on the waits, or should the hints be on the executed
>>>>>> context?
>>>>>>
>>>>>> In the end we need some way to quickly ramp-up the frequency to avoid
>>>>>> the execution bubbles.
>>>>>>
>>>>>> waitboost is trying to guess that, but in some cases it guess wrong
>>>>>> and waste power.
>>>>>
>>>>> Do we have a list of workloads which shows who benefits and who loses
>>>>> from the current implementation of waitboost?
>>>>>> btw, this is something that other drivers might need:
>>>>>>
>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>
>>>>> I have several issues with the context hint if it would directly
>>>>> influence frequency selection in the "more power" direction.
>>>>>
>>>>> First of all, assume a context hint would replace the waitboost. Which
>>>>> applications would need to set it to restore the lost performance and
>>>>> how would they set it?
>>>>>
>>>>> Then I don't even think userspace necessarily knows. Think of a layer
>>>>> like OpenCL. It doesn't really know in advance the profile of
>>>>> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
>>>>> generation, and the actual size of the workload which can be influenced
>>>>> by the application (or user) and not the library.
>>>>>
>>>>> The approach also lends itself well for the "arms race" where every
>>>>> application can say "Me me me, I am the most important workload there is!".
>>>>
>>>> since there is discussion happening in two places:
>>>>
>>>> https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
>>>>
>>>> What I think you might want is a ctx boost_mask which lets an app or
>>>> driver disable certain boost signals/classes.  Where fence waits is
>>>> one class of boost, but hypothetical other signals like touchscreen
>>>> (or other) input events could be another class of boost.  A compute
>>>> workload might be interested in fence wait boosts but could care less
>>>> about input events.
>>>
>>> I think it can only be apps which could have any chance knowing whether
>>> their use of a library is latency sensitive or not. Which means new
>>> library extensions and their adoption. So I have some strong reservation
>>> that route is feasible.
>>>
>>> Or we tie with priority which many drivers do. Normal and above gets the
>>> boosting and what lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).
>>
>> yeah, that sounds reasonable.
>>
> 
> on that gitlab-issue discussion Emma Anholt was against using the priority
> to influence frequency since that should be more about latency.
> 
> or we are talking about something different priority here?

As Rob already explained - I was suggesting skipping waitboost for 
contexts which explicitly made themselves low priority. I don't see a 
controversial angle there.

>>> Related note is that we lack any external control of our scheduling
>>> decisions so we really do suck compared to other scheduling domains like
>>> CPU and IO etc.
>>>
>>>>> The last concern is for me shared with the proposal to expose deadlines
>>>>> or high priority waits as explicit uapi knobs. Both come under the "what
>>>>> application told us it will do" category vs what it actually does. So I
>>>>> think it is slightly weaker than basing decisions of waits.
>>>>>
>>>>> The current waitboost is a bit detached from that problem because when
>>>>> we waitboost for flips we _know_ it is an actual framebuffer in the flip
>>>>> chain. When we waitboost for waits we also know someone is waiting. We
>>>>> are not trusting userspace telling us this will be a buffer in the flip
>>>>> chain or that this is a context which will have a certain duty-cycle.
>>>>>
>>>>> But yes, even if the input is truthful, latter is still only a
>>>>> heuristics because nothing says all waits are important. AFAIU it just
>>>>> happened to work well in the past.
>>>>>
>>>>> I do understand I am effectively arguing for more heuristics, which may
>>>>> sound a bit against the common wisdom. This is because in general I
>>>>> think the logic to do the right thing, be it in the driver or in the
>>>>> firmware, can work best if it has a holistic view. Simply put it needs
>>>>> to have more inputs to the decisions it is making.
>>>>>
>>>>> That is what my series is proposing - adding a common signal of "someone
>>>>> in userspace is waiting". What happens with that signal needs not be
>>>>> defined (promised) in the uapi contract.
>>>>>
>>>>> Say you route it to SLPC logic. It doesn't need to do with it what
>>>>> legacy i915 is doing today. It just needs to do something which works
>>>>> best for majority of workloads. It can even ignore it if that works for it.
>>>>>
>>>>> Finally, back to the immediate problem is when people replace the OpenCL
>>>>> NEO driver with clvk that performance tanks. Because former does waits
>>>>> using i915 specific ioctls and so triggers waitboost, latter waits on
>>>>> syncobj so no waitboost and performance is bad. What short term solution
>>>>> can we come up with? Or we say to not use clvk? I also wonder if other
>>>>> Vulkan based stuff is perhaps missing those easy performance gains..
>>>>>
>>>>> Perhaps strictly speaking Rob's and mine proposal are not mutually
>>>>> exclusive. Yes I could piggy back on his with an "immediate deadline for
>>>>> waits" idea, but they could also be separate concepts if we concluded
>>>>> "someone is waiting" signal is useful to have. Or it takes to long to
>>>>> upstream the full deadline idea.
>>>>
>>>> Let me re-spin my series and add the syncobj wait flag and i915 bits
>>>
>>> I think wait flag is questionable unless it is inverted to imply waits
>>> which can be de-prioritized (again same parallel with SCHED_IDLE/BATCH).
>>> Having a flag which "makes things faster" IMO should require elevated
>>> privilege (to avoid the "arms race") at which point I fear it quickly
>>> becomes uninteresting.
>>
>> I guess you could make the argument in either direction.  Making the
>> default behavior ramp up clocks could be a power regression.
> 
> yeap, exactly the media / video conference case.

Yeah, I agree. And since not all media use cases are the same, and
neither are all compute contexts, someone somewhere will need to run a
series of workloads to get power and performance numbers. Ideally that
someone would be the entity for which it makes sense to look at all use
cases, from server room to client, and 3d, media and compute for both.
If we could get the capability to run this in some automated fashion,
akin to CI, we would even have a chance to keep making good decisions in
the future.

Or we do some one-off testing for this instance, but we would still need
a range of workloads and parts to do it properly.

>> I also think the "arms race" scenario isn't really as much of a
>> problem as you think.  There aren't _that_ many things using the GPU
>> at the same time (compared to # of things using CPU).   And a lot of
>> mobile games throttle framerate to avoid draining your battery too
>> quickly (after all, if your battery is dead you can't keep buying loot
>> boxes or whatever).
> 
> Very good point.

On this one I still disagree, from the point of view that it is not
good uapi if we allow everyone to select themselves for priority
handling (of one flavour or another).

> And in the GPU case they rely a lot on the profiles. Which btw, seems
> to be the Radeon solution. They boost the freq if the high performance
> profile is selected and don't care about the execution bubbles if low
> or mid profiles are selected, or something like that.

Profile as something which controls the waitboost globally? What would
be the mechanism for communicating it to the driver?

Also, how would that reconcile the fact that waitboost harms some
workloads but helps others? If, for the latter, it improves not only
performance but also efficiency, then assuming the "battery" profile must
mean "waitboost off" would be leaving battery life on the table.
Conversely, if the "on a/c - max performance" profile meant a global
"waitboost on", it might not always give the best performance either,
if it causes thermal throttling.

Regards,

Tvrtko

Tvrtko Ursulin Feb. 20, 2023, 12:22 p.m. UTC | #14

On 17/02/2023 17:00, Rob Clark wrote:
> On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:

[snip]

>>> adapted from your patches..  I think the basic idea of deadlines
>>> (which includes "I want it NOW" ;-)) isn't controversial, but the
>>> original idea got caught up in some bikeshed (what about compositors
>>> that wait on fences in userspace to decide which surfaces to update in
>>> the next frame), plus me getting busy and generally not having a good
>>> plan for how to leverage this from VM guests (which is becoming
>>> increasingly important for CrOS).  I think I can build on some ongoing
>>> virtgpu fencing improvement work to solve the latter.  But now that we
>>> have a 2nd use-case for this, it makes sense to respin.
>>
>> Sure, I was looking at the old version already. It is interesting. But
>> also IMO needs quite a bit more work to approach achieving what is
>> implied from the name of the feature. It would need proper deadline
>> based sched job picking, and even then drm sched is mostly just a
>> frontend. So once past runnable status and jobs handed over to backend,
>> without further driver work it probably wouldn't be very effective past
>> very lightly loaded systems.
> 
> Yes, but all of that is not part of dma_fence ;-)

:) Okay.

Having said that, do we need to take a step back and think about whether
adding deadlines to dma-fences is turning them into something too
different from what they were? Going from a pure synchronisation
primitive towards scheduling paradigms. Just to brainstorm whether there
could be any unintended consequences. I should mention this in your RFC
thread, actually.

> A pretty common challenging usecase is still the single fullscreen
> game, where scheduling isn't the problem, but landing at an
> appropriate GPU freq absolutely is.  (UI workloads are perhaps more
> interesting from a scheduler standpoint, but they generally aren't
> challenging from a load/freq standpoint.)

Challenging as in picking the right operating point? Latency (and so
user perceived UI smoothness) might be impacted due to the missing
waitboost for anything syncobj related. I don't know if anything exists
to measure that currently, though. Assuming it is measurable, the
question would then be whether it is perceivable.

> Fwiw, the original motivation of the series was to implement something
> akin to i915 pageflip boosting without having to abandon the atomic
> helpers.  (And, I guess it would also let i915 preserve that feature
> if it switched to atomic helpers.. I'm unsure if there are still other
> things blocking i915's migration.)

Question for display folks I guess.

>> Then if we fast forward to a world where schedulers perhaps become fully
>> deadline aware (we even had this for i915 few years back) then the
>> question will be does equating waits with immediate deadlines still
>> works. Maybe not too well because we wouldn't have the ability to
>> distinguish between the "someone is waiting" signal from the otherwise
>> propagated deadlines.
> 
> Is there any other way to handle a wait boost than expressing it as an
> ASAP deadline?

A leading question or just a question? Nothing springs to my mind at the 
moment.

Regards,

Tvrtko

Rob Clark Feb. 20, 2023, 3:45 p.m. UTC | #15

On Mon, Feb 20, 2023 at 4:22 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 17/02/2023 17:00, Rob Clark wrote:
> > On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
>
> [snip]
>
> >>> adapted from your patches..  I think the basic idea of deadlines
> >>> (which includes "I want it NOW" ;-)) isn't controversial, but the
> >>> original idea got caught up in some bikeshed (what about compositors
> >>> that wait on fences in userspace to decide which surfaces to update in
> >>> the next frame), plus me getting busy and generally not having a good
> >>> plan for how to leverage this from VM guests (which is becoming
> >>> increasingly important for CrOS).  I think I can build on some ongoing
> >>> virtgpu fencing improvement work to solve the latter.  But now that we
> >>> have a 2nd use-case for this, it makes sense to respin.
> >>
> >> Sure, I was looking at the old version already. It is interesting. But
> >> also IMO needs quite a bit more work to approach achieving what is
> >> implied from the name of the feature. It would need proper deadline
> >> based sched job picking, and even then drm sched is mostly just a
> >> frontend. So once past runnable status and jobs handed over to backend,
> >> without further driver work it probably wouldn't be very effective past
> >> very lightly loaded systems.
> >
> > Yes, but all of that is not part of dma_fence ;-)
>
> :) Okay.
>
> Having said that, do we need a step back to think about whether adding
> deadline to dma-fences is not making them something too much different
> to what they were? Going from purely synchronisation primitive more
> towards scheduling paradigms. Just to brainstorm if there will not be
> any unintended consequences. I should mention this in your RFC thread
> actually.

Perhaps "deadline" isn't quite the right name, but I haven't thought
of anything better.  It is really a hint to the fence signaller about
how soon the waiter is interested in a result, so the driver can factor
that into freq scaling decisions.  Maybe "goal" or some other term would
be better?

I guess that can factor into scheduling decisions as well.. but we
already have priority for that.  My main interest is freq mgmt.

(Thankfully we don't have performance and efficiency cores to worry
about, like CPUs ;-))

> > A pretty common challenging usecase is still the single fullscreen
> > game, where scheduling isn't the problem, but landing at an
> > appropriate GPU freq absolutely is.  (UI workloads are perhaps more
> > interesting from a scheduler standpoint, but they generally aren't
> > challenging from a load/freq standpoint.)
>
> Challenging as in picking the right operating point? Might be latency
> impacted (and so user perceived UI smoothness) due missing waitboost for
> anything syncobj related. I don't know if anything to measure that
> exists currently though. Assuming it is measurable then the question
> would be is it perceivable.
>
> > Fwiw, the original motivation of the series was to implement something
> > akin to i915 pageflip boosting without having to abandon the atomic
> > helpers.  (And, I guess it would also let i915 preserve that feature
> > if it switched to atomic helpers.. I'm unsure if there are still other
> > things blocking i915's migration.)
>
> Question for display folks I guess.
>
> >> Then if we fast forward to a world where schedulers perhaps become fully
> >> deadline aware (we even had this for i915 few years back) then the
> >> question will be does equating waits with immediate deadlines still
> >> works. Maybe not too well because we wouldn't have the ability to
> >> distinguish between the "someone is waiting" signal from the otherwise
> >> propagated deadlines.
> >
> > Is there any other way to handle a wait boost than expressing it as an
> > ASAP deadline?
>
> A leading question or just a question? Nothing springs to my mind at the
> moment.

Just a question.  The immediate deadline is the only thing that makes
sense to me, but that could be because I'm looking at it from the
perspective of also trying to handle the case where missing vblank
reduces utilization and provides the wrong signal to gpufreq.. i915
already has a way to handle this internally, but it involves bypassing
the atomic helpers, which isn't a thing I want to encourage other
drivers to do.  And completely doesn't work for situations where the
gpu and display are separate devices.

BR,
-R

> Regards,
>
> Tvrtko

Rob Clark Feb. 20, 2023, 3:52 p.m. UTC | #16

On Mon, Feb 20, 2023 at 3:33 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 17/02/2023 20:45, Rodrigo Vivi wrote:
> > On Fri, Feb 17, 2023 at 09:00:49AM -0800, Rob Clark wrote:
> >> On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> >> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>
> >>>
> >>> On 17/02/2023 14:55, Rob Clark wrote:
> >>>> On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> >>>> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>>>
> >>>>>
> >>>>> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> >>>>>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> >>>>>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> >>>>>>> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>>>>>>
> >>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>>>
> >>>>>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
> >>>>>>>> for instance to fences which are actively waited upon from userspace. This has
> >>>>>>>> it's pros and cons and can certainly be discussed at lenght. However fact is
> >>>>>>>> some workloads really like it.
> >>>>>>>>
> >>>>>>>> Problem is that with the arrival of drm syncobj and a new userspace waiting
> >>>>>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> >>>>>>>> this mini series really (really) quickly to see if some discussion can be had.
> >>>>>>>>
> >>>>>>>> It adds a concept of "wait count" to dma fence, which is incremented for every
> >>>>>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> >>>>>>>> dma_fence_add_callback but from explicit/userspace wait paths).
> >>>>>>>
> >>>>>>> I was thinking about a similar thing, but in the context of dma_fence
> >>>>>>> (or rather sync_file) fd poll()ing.  How does the kernel differentiate
> >>>>>>> between "housekeeping" poll()ers that don't want to trigger boost but
> >>>>>>> simply know when to do cleanup, and waiters who are waiting with some
> >>>>>>> urgency.  I think we could use EPOLLPRI for this purpose.
> >>>>>>>
> >>>>>>> Not sure how that translates to waits via the syncobj.  But I think we
> >>>>>>> want to let userspace give some hint about urgent vs housekeeping
> >>>>>>> waits.
> >>>>>>
> >>>>>> Should the hint be on the waits, or should the hints be on the executed
> >>>>>> context?
> >>>>>>
> >>>>>> In the end we need some way to quickly ramp-up the frequency to avoid
> >>>>>> the execution bubbles.
> >>>>>>
> >>>>>> waitboost is trying to guess that, but in some cases it guess wrong
> >>>>>> and waste power.
> >>>>>
> >>>>> Do we have a list of workloads which shows who benefits and who loses
> >>>>> from the current implementation of waitboost?
> >>>>>
> >>>>>> btw, this is something that other drivers might need:
> >>>>>>
> >>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> >>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
> >>>>>
> >>>>> I have several issues with the context hint if it would directly
> >>>>> influence frequency selection in the "more power" direction.
> >>>>>
> >>>>> First of all, assume a context hint would replace the waitboost. Which
> >>>>> applications would need to set it to restore the lost performance and
> >>>>> how would they set it?
> >>>>>
> >>>>> Then I don't even think userspace necessarily knows. Think of a layer
> >>>>> like OpenCL. It doesn't really know in advance the profile of
> >>>>> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> >>>>> generation, and the actual size of the workload which can be influenced
> >>>>> by the application (or user) and not the library.
> >>>>>
> >>>>> The approach also lends itself well for the "arms race" where every
> >>>>> application can say "Me me me, I am the most important workload there is!".
> >>>>
> >>>> since there is discussion happening in two places:
> >>>>
> >>>> https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> >>>>
> >>>> What I think you might want is a ctx boost_mask which lets an app or
> >>>> driver disable certain boost signals/classes.  Where fence waits is
> >>>> one class of boost, but hypothetical other signals like touchscreen
> >>>> (or other) input events could be another class of boost.  A compute
> >>>> workload might be interested in fence wait boosts but could care less
> >>>> about input events.
> >>>
> >>> I think it can only be apps which could have any chance knowing whether
> >>> their use of a library is latency sensitive or not. Which means new
> >>> library extensions and their adoption. So I have some strong reservation
> >>> that route is feasible.
> >>>
> >>> Or we tie with priority which many drivers do. Normal and above gets the
> >>> boosting and what lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).
> >>
> >> yeah, that sounds reasonable.
> >>
> >
> > on that gitlab-issue discussion Emma Anholt was against using the priority
> > to influence frequency since that should be more about latency.
> >
> > or we are talking about something different priority here?
>
> As Rob already explained - I was suggesting skipping waitboost for
> contexts which explicitly made themselves low priority. I don't see a
> controversial angle there.
>
> >>> Related note is that we lack any external control of our scheduling
> >>> decisions so we really do suck compared to other scheduling domains like
> >>> CPU and IO etc.
> >>>
> >>>>> The last concern is for me shared with the proposal to expose deadlines
> >>>>> or high priority waits as explicit uapi knobs. Both come under the "what
> >>>>> application told us it will do" category vs what it actually does. So I
> >>>>> think it is slightly weaker than basing decisions of waits.
> >>>>>
> >>>>> The current waitboost is a bit detached from that problem because when
> >>>>> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> >>>>> chain. When we waitboost for waits we also know someone is waiting. We
> >>>>> are not trusting userspace telling us this will be a buffer in the flip
> >>>>> chain or that this is a context which will have a certain duty-cycle.
> >>>>>
> >>>>> But yes, even if the input is truthful, latter is still only a
> >>>>> heuristics because nothing says all waits are important. AFAIU it just
> >>>>> happened to work well in the past.
> >>>>>
> >>>>> I do understand I am effectively arguing for more heuristics, which may
> >>>>> sound a bit against the common wisdom. This is because in general I
> >>>>> think the logic to do the right thing, be it in the driver or in the
> >>>>> firmware, can work best if it has a holistic view. Simply put it needs
> >>>>> to have more inputs to the decisions it is making.
> >>>>>
> >>>>> That is what my series is proposing - adding a common signal of "someone
> >>>>> in userspace is waiting". What happens with that signal needs not be
> >>>>> defined (promised) in the uapi contract.
> >>>>>
> >>>>> Say you route it to SLPC logic. It doesn't need to do with it what
> >>>>> legacy i915 is doing today. It just needs to do something which works
> >>>>> best for majority of workloads. It can even ignore it if that works for it.
> >>>>>
> >>>>> Finally, back to the immediate problem is when people replace the OpenCL
> >>>>> NEO driver with clvk that performance tanks. Because former does waits
> >>>>> using i915 specific ioctls and so triggers waitboost, latter waits on
> >>>>> syncobj so no waitboost and performance is bad. What short term solution
> >>>>> can we come up with? Or we say to not use clvk? I also wonder if other
> >>>>> Vulkan based stuff is perhaps missing those easy performance gains..
> >>>>>
> >>>>> Perhaps strictly speaking Rob's and mine proposal are not mutually
> >>>>> exclusive. Yes I could piggy back on his with an "immediate deadline for
> >>>>> waits" idea, but they could also be separate concepts if we concluded
> >>>>> "someone is waiting" signal is useful to have. Or it takes to long to
> >>>>> upstream the full deadline idea.
> >>>>
> >>>> Let me re-spin my series and add the syncobj wait flag and i915 bits
> >>>
> >>> I think wait flag is questionable unless it is inverted to imply waits
> >>> which can be de-prioritized (again same parallel with SCHED_IDLE/BATCH).
> >>> Having a flag which "makes things faster" IMO should require elevated
> >>> privilege (to avoid the "arms race") at which point I fear it quickly
> >>> becomes uninteresting.
> >>
> >> I guess you could make the argument in either direction.  Making the
> >> default behavior ramp up clocks could be a power regression.
> >
> > yeap, exactly the media / video conference case.
>
> Yeah I agree. And as not all media use cases are the same, as are not
> all compute contexts someone somewhere will need to run a series of
> workloads for power and performance numbers. Ideally that someone would
> be the entity for which it makes sense to look at all use cases, from
> server room to client, 3d, media and compute for both. If we could get
> the capability to run this in some automated fashion, akin to CI, we
> would even have a chance to keep making good decisions in the future.
>
> Or we do some one off testing for this instance, but we still need a
> range of workloads and parts to do it properly..
>
> >> I also think the "arms race" scenario isn't really as much of a
> >> problem as you think.  There aren't _that_ many things using the GPU
> >> at the same time (compared to # of things using CPU).   And a lot of
> >> mobile games throttle framerate to avoid draining your battery too
> >> quickly (after all, if your battery is dead you can't keep buying loot
> >> boxes or whatever).
> >
> > Very good point.
>
> On this one I still disagree from the point of view that it does not
> make it good uapi if we allow everyone to select themselves for priority
> handling (one flavour or the other).

There is plenty of precedent for userspace giving hints to the kernel
about scheduling and freq mgmt.  Like schedutil uclamp stuff.
Although I think that is all based on cgroups.

In the fence/syncobj case, I think we need per-wait hints.. because
for a single process the driver will be doing both housekeeping waits
and potentially urgent waits.  There may also be some room for some
cgroup or similar knobs to control things like what max priority an
app can ask for, and whether or how aggressively the kernel responds
to the "deadline" hints.  So as far as "arms race", I don't think I'd
change anything about my "fence deadline" proposal.. but that it might
just be one piece of the overall puzzle.
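
A minimal sketch of what I mean by per-wait hints, as a userspace-style
model rather than real kernel code.  Everything here (wait_hint,
hinted_fence, the helper names) is hypothetical; the only idea taken from
the thread is that housekeeping and urgent waits on the same fence should
be told apart, and that only urgent ones count towards a boost.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-wait hint; not an existing uapi flag. */
enum wait_hint { WAIT_HOUSEKEEPING, WAIT_URGENT };

/* Toy stand-in for a fence that tracks only urgent waiters. */
struct hinted_fence {
	bool signaled;
	unsigned int urgent_waiters;
};

/* Returns true if this wait was counted and must be passed to wait_end(). */
static bool hinted_wait_begin(struct hinted_fence *f, enum wait_hint hint)
{
	if (hint != WAIT_URGENT || f->signaled)
		return false;
	f->urgent_waiters++;
	return true;
}

static void hinted_wait_end(struct hinted_fence *f, bool counted)
{
	if (!counted)
		return;
	assert(f->urgent_waiters > 0);
	f->urgent_waiters--;
}

/* The signaller would consult this when deciding whether to ramp clocks. */
static bool hinted_should_boost(const struct hinted_fence *f)
{
	return !f->signaled && f->urgent_waiters > 0;
}
```

With this shape a cleanup thread can wait on the same fence as a
latency-sensitive one without triggering a boost on its own.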

BR,
-R

> > And in the GPU case they rely a lot on the profiles. Which btw, seems
> > to be the Radeon solution. They boost the freq if the high performance
> > profile is selected and don't care about the execution bubbles if low
> > or mid profiles are selected, or something like that.
>
> Profile as something which controls the waitboost globally? What would
> be the mechanism for communicating it to the driver?
>
> Also, how would that reconcile the fact waitboost harms some workloads
> but helps others? If the latter not only improves the performance but
> also efficiency then assuming "battery" profile must mean "waitboost
> off" would be leaving battery life on the table. Conversely, if the "on
> a/c - max performance", would be global "waitboost on", then it could
> even be possible it wouldn't always be truly best performance if it
> causes thermal throttling.
>
> Regards,
>
> Tvrtko

Tvrtko Ursulin Feb. 20, 2023, 3:56 p.m. UTC | #17

On 20/02/2023 15:45, Rob Clark wrote:
> On Mon, Feb 20, 2023 at 4:22 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 17/02/2023 17:00, Rob Clark wrote:
>>> On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
>>> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>> [snip]
>>
>>>>> adapted from your patches..  I think the basic idea of deadlines
>>>>> (which includes "I want it NOW" ;-)) isn't controversial, but the
>>>>> original idea got caught up in some bikeshed (what about compositors
>>>>> that wait on fences in userspace to decide which surfaces to update in
>>>>> the next frame), plus me getting busy and generally not having a good
>>>>> plan for how to leverage this from VM guests (which is becoming
>>>>> increasingly important for CrOS).  I think I can build on some ongoing
>>>>> virtgpu fencing improvement work to solve the latter.  But now that we
>>>>> have a 2nd use-case for this, it makes sense to respin.
>>>>
>>>> Sure, I was looking at the old version already. It is interesting. But
>>>> also IMO needs quite a bit more work to approach achieving what is
>>>> implied from the name of the feature. It would need proper deadline
>>>> based sched job picking, and even then drm sched is mostly just a
>>>> frontend. So once past runnable status and jobs handed over to backend,
>>>> without further driver work it probably wouldn't be very effective past
>>>> very lightly loaded systems.
>>>
>>> Yes, but all of that is not part of dma_fence ;-)
>>
>> :) Okay.
>>
>> Having said that, do we need a step back to think about whether adding
>> deadline to dma-fences is not making them something too much different
>> to what they were? Going from purely synchronisation primitive more
>> towards scheduling paradigms. Just to brainstorm if there will not be
>> any unintended consequences. I should mention this in your RFC thread
>> actually.
> 
> Perhaps "deadline" isn't quite the right name, but I haven't thought
> of anything better.  It is really a hint to the fence signaller about
> how soon it is interested in a result so the driver can factor that
> into freq scaling decisions.  Maybe "goal" or some other term would be
> better?

Don't know, no strong opinion on the name at the moment. For me it was 
more about the change of what type of side channel data is getting 
attached to dma-fence and whether it changes what the primitive is for.

> I guess that can factor into scheduling decisions as well.. but we
> already have priority for that.  My main interest is freq mgmt.
> 
> (Thankfully we don't have performance and efficiency cores to worry
> about, like CPUs ;-))
> 
>>> A pretty common challenging usecase is still the single fullscreen
>>> game, where scheduling isn't the problem, but landing at an
>>> appropriate GPU freq absolutely is.  (UI workloads are perhaps more
>>> interesting from a scheduler standpoint, but they generally aren't
>>> challenging from a load/freq standpoint.)
>>
>> Challenging as in picking the right operating point? Might be latency
>> impacted (and so user perceived UI smoothness) due missing waitboost for
>> anything syncobj related. I don't know if anything to measure that
>> exists currently though. Assuming it is measurable then the question
>> would be is it perceivable.
>>
>>> Fwiw, the original motivation of the series was to implement something
>>> akin to i915 pageflip boosting without having to abandon the atomic
>>> helpers.  (And, I guess it would also let i915 preserve that feature
>>> if it switched to atomic helpers.. I'm unsure if there are still other
>>> things blocking i915's migration.)
>>
>> Question for display folks I guess.
>>
>>>> Then if we fast forward to a world where schedulers perhaps become fully
>>>> deadline aware (we even had this for i915 few years back) then the
>>>> question will be does equating waits with immediate deadlines still
>>>> works. Maybe not too well because we wouldn't have the ability to
>>>> distinguish between the "someone is waiting" signal from the otherwise
>>>> propagated deadlines.
>>>
>>> Is there any other way to handle a wait boost than expressing it as an
>>> ASAP deadline?
>>
>> A leading question or just a question? Nothing springs to my mind at the
>> moment.
> 
> Just a question.  The immediate deadline is the only thing that makes
> sense to me, but that could be because I'm looking at it from the
> perspective of also trying to handle the case where missing vblank
> reduces utilization and provides the wrong signal to gpufreq.. i915
> already has a way to handle this internally, but it involves bypassing
> the atomic helpers, which isn't a thing I want to encourage other
> drivers to do.  And completely doesn't work for situations where the
> gpu and display are separate devices.

Right, there is yet another angle to discuss with Daniel here who AFAIR 
was a bit against i915 priority inheritance going past a single device 
instance. In which case DRI_PRIME=1 would lose the ability to boost 
frame buffer dependency chains. Opens up the question of deadline 
inheritance across different drivers too. Or perhaps Daniel would be 
okay with this working if implemented at the dma-fence layer.
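
To illustrate what deadline inheritance at the fence layer could look
like, a toy model follows.  It is an assumption-laden sketch: chain_fence
and chain_set_deadline are invented names, each fence has at most one
dependency, and the chain is assumed to be acyclic.  It is not how
dma-fence works today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHAIN_NO_DEADLINE UINT64_MAX

/* Hypothetical fence with a single dependency, possibly on another device. */
struct chain_fence {
	const char *owner;        /* label for the driver that signals it */
	uint64_t deadline_ns;     /* CHAIN_NO_DEADLINE if no hint was given */
	struct chain_fence *dep;  /* fence this one waits for, or NULL */
};

/*
 * Apply a deadline hint to a fence and to everything it depends on.
 * Only ever tightens: a later (looser) deadline never overrides an
 * earlier (tighter) one already recorded on a fence.
 */
static void chain_set_deadline(struct chain_fence *f, uint64_t deadline_ns)
{
	for (; f != NULL; f = f->dep) {
		if (deadline_ns < f->deadline_ns)
			f->deadline_ns = deadline_ns;
	}
}
```

Because the walk only follows fence pointers, it does not care which
driver owns each fence, which is the property the DRI_PRIME case needs.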

Regards,

Tvrtko

Tvrtko Ursulin Feb. 20, 2023, 4:44 p.m. UTC | #18

On 20/02/2023 15:52, Rob Clark wrote:
> On Mon, Feb 20, 2023 at 3:33 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 17/02/2023 20:45, Rodrigo Vivi wrote:

[snip]

>> Yeah I agree. And as not all media use cases are the same, as are not
>> all compute contexts someone somewhere will need to run a series of
>> workloads for power and performance numbers. Ideally that someone would
>> be the entity for which it makes sense to look at all use cases, from
>> server room to client, 3d, media and compute for both. If we could get
>> the capability to run this in some automated fashion, akin to CI, we
>> would even have a chance to keep making good decisions in the future.
>>
>> Or we do some one off testing for this instance, but we still need a
>> range of workloads and parts to do it properly..
>>
>>>> I also think the "arms race" scenario isn't really as much of a
>>>> problem as you think.  There aren't _that_ many things using the GPU
>>>> at the same time (compared to # of things using CPU).   And a lot of
>>>> mobile games throttle framerate to avoid draining your battery too
>>>> quickly (after all, if your battery is dead you can't keep buying loot
>>>> boxes or whatever).
>>>
>>> Very good point.
>>
>> On this one I still disagree from the point of view that it does not
>> make it good uapi if we allow everyone to select themselves for priority
>> handling (one flavour or the other).
> 
> There is plenty of precedent for userspace giving hints to the kernel
> about scheduling and freq mgmt.  Like schedutil uclamp stuff.
> Although I think that is all based on cgroups.

I knew about SCHED_DEADLINE and that it requires CAP_SYS_NICE, but I did 
not know about uclamp. Quick experiment with uclampset suggests it 
indeed does not require elevated privilege. If that is indeed so, it is 
good enough for me as a precedent.

It appears to work using sched_setscheduler so maybe we could define
something similar in i915/xe, per context or per client, not sure.

Maybe it would start as a primitive implementation but the uapi would 
not preclude making it smart(er) afterwards. Or passing along to GuC to
do its thing with it.

> In the fence/syncobj case, I think we need per-wait hints.. because
> for a single process the driver will be doing both housekeeping waits
> and potentially urgent waits.  There may also be some room for some
> cgroup or similar knobs to control things like what max priority an
> app can ask for, and whether or how aggressively the kernel responds
> to the "deadline" hints.  So as far as "arms race", I don't think I'd

Per-wait hints are okay I guess, even with "I am important" in their name,
if sched_setscheduler allows raising uclamp.min just like that. In which
case cgroup limits to mimic cpu uclamp also make sense.

> change anything about my "fence deadline" proposal.. but that it might
> just be one piece of the overall puzzle.

That SCHED_DEADLINE requires CAP_SYS_NICE does not worry you?

Regards,

Tvrtko

Tvrtko Ursulin Feb. 20, 2023, 4:51 p.m. UTC | #19

On 20/02/2023 16:44, Tvrtko Ursulin wrote:
> 
> On 20/02/2023 15:52, Rob Clark wrote:
>> On Mon, Feb 20, 2023 at 3:33 AM Tvrtko Ursulin
>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>
>>>
>>> On 17/02/2023 20:45, Rodrigo Vivi wrote:
> 
> [snip]
> 
>>> Yeah I agree. And as not all media use cases are the same, as are not
>>> all compute contexts someone somewhere will need to run a series of
>>> workloads for power and performance numbers. Ideally that someone would
>>> be the entity for which it makes sense to look at all use cases, from
>>> server room to client, 3d, media and compute for both. If we could get
>>> the capability to run this in some automated fashion, akin to CI, we
>>> would even have a chance to keep making good decisions in the future.
>>>
>>> Or we do some one off testing for this instance, but we still need a
>>> range of workloads and parts to do it properly..
>>>
>>>>> I also think the "arms race" scenario isn't really as much of a
>>>>> problem as you think.  There aren't _that_ many things using the GPU
>>>>> at the same time (compared to # of things using CPU).   And a lot of
>>>>> mobile games throttle framerate to avoid draining your battery too
>>>>> quickly (after all, if your battery is dead you can't keep buying loot
>>>>> boxes or whatever).
>>>>
>>>> Very good point.
>>>
>>> On this one I still disagree from the point of view that it does not
>>> make it good uapi if we allow everyone to select themselves for priority
>>> handling (one flavour or the other).
>>
>> There is plenty of precedent for userspace giving hints to the kernel
>> about scheduling and freq mgmt.  Like schedutil uclamp stuff.
>> Although I think that is all based on cgroups.
> 
> I knew about SCHED_DEADLINE and that it requires CAP_SYS_NICE, but I did 
> not know about uclamp. Quick experiment with uclampset suggests it 
> indeed does not require elevated privilege. If that is indeed so, it is 
> good enough for me as a precedent.
> 
> It appears to work using sched_setscheduler so maybe could define 
> something similar in i915/xe, per context or per client, not sure.
> 
> Maybe it would start as a primitive implementation but the uapi would 
> not preclude making it smart(er) afterwards. Or passing along to GuC to 
> do it's thing with it.

Hmmm, having said that, how would we fix clvk performance using that? We
would either need the library to do a new step when creating contexts,
or allow external control so an outside entity can do it. And then the
question is what that entity would base the decision on. Is it possible to
know which, for instance, Chrome tab will be (or is) using clvk so that tab
management code does it?

Regards,

Tvrtko

>> In the fence/syncobj case, I think we need per-wait hints.. because
>> for a single process the driver will be doing both housekeeping waits
>> and potentially urgent waits.  There may also be some room for some
>> cgroup or similar knobs to control things like what max priority an
>> app can ask for, and whether or how aggressively the kernel responds
>> to the "deadline" hints.  So as far as "arms race", I don't think I'd
> 
> Per wait hints are okay I guess even with "I am important" in their name 
> if sched_setscheduler allows raising uclamp.min just like that. In which 
> case cgroup limits to mimick cpu uclamp also make sense.
> 
>> change anything about my "fence deadline" proposal.. but that it might
>> just be one piece of the overall puzzle.
> 
> That SCHED_DEADLINE requires CAP_SYS_NICE does not worry you?
> 
> Regards,
> 
> Tvrtko

Rob Clark Feb. 20, 2023, 5:07 p.m. UTC | #20

On Mon, Feb 20, 2023 at 8:44 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 20/02/2023 15:52, Rob Clark wrote:
> > On Mon, Feb 20, 2023 at 3:33 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> >>
> >>
> >> On 17/02/2023 20:45, Rodrigo Vivi wrote:
>
> [snip]
>
> >> Yeah I agree. And as not all media use cases are the same, as are not
> >> all compute contexts someone somewhere will need to run a series of
> >> workloads for power and performance numbers. Ideally that someone would
> >> be the entity for which it makes sense to look at all use cases, from
> >> server room to client, 3d, media and compute for both. If we could get
> >> the capability to run this in some automated fashion, akin to CI, we
> >> would even have a chance to keep making good decisions in the future.
> >>
> >> Or we do some one off testing for this instance, but we still need a
> >> range of workloads and parts to do it properly..
> >>
> >>>> I also think the "arms race" scenario isn't really as much of a
> >>>> problem as you think.  There aren't _that_ many things using the GPU
> >>>> at the same time (compared to # of things using CPU).   And a lot of
> >>>> mobile games throttle framerate to avoid draining your battery too
> >>>> quickly (after all, if your battery is dead you can't keep buying loot
> >>>> boxes or whatever).
> >>>
> >>> Very good point.
> >>
> >> On this one I still disagree from the point of view that it does not
> >> make it good uapi if we allow everyone to select themselves for priority
> >> handling (one flavour or the other).
> >
> > There is plenty of precedent for userspace giving hints to the kernel
> > about scheduling and freq mgmt.  Like schedutil uclamp stuff.
> > Although I think that is all based on cgroups.
>
> I knew about SCHED_DEADLINE and that it requires CAP_SYS_NICE, but I did
> not know about uclamp. Quick experiment with uclampset suggests it
> indeed does not require elevated privilege. If that is indeed so, it is
> good enough for me as a precedent.
>
> It appears to work using sched_setscheduler so maybe could define
> something similar in i915/xe, per context or per client, not sure.
>
> Maybe it would start as a primitive implementation but the uapi would
> not preclude making it smart(er) afterwards. Or passing along to GuC to
> do it's thing with it.
>
> > In the fence/syncobj case, I think we need per-wait hints.. because
> > for a single process the driver will be doing both housekeeping waits
> > and potentially urgent waits.  There may also be some room for some
> > cgroup or similar knobs to control things like what max priority an
> > app can ask for, and whether or how aggressively the kernel responds
> > to the "deadline" hints.  So as far as "arms race", I don't think I'd
>
> Per wait hints are okay I guess even with "I am important" in their name
> if sched_setscheduler allows raising uclamp.min just like that. In which
> case cgroup limits to mimick cpu uclamp also make sense.
>
> > change anything about my "fence deadline" proposal.. but that it might
> > just be one piece of the overall puzzle.
>
> That SCHED_DEADLINE requires CAP_SYS_NICE does not worry you?

This gets to why the name "fence deadline" is perhaps not the best..
it really isn't meant to be analogous to SCHED_DEADLINE, but rather
just a hint to the driver about what userspace is doing.  Maybe we
just document it more strongly as a hint?

BR,
-R

> Regards,
>
> Tvrtko

Rob Clark Feb. 20, 2023, 5:14 p.m. UTC | #21

On Mon, Feb 20, 2023 at 8:51 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 20/02/2023 16:44, Tvrtko Ursulin wrote:
> >
> > On 20/02/2023 15:52, Rob Clark wrote:
> >> On Mon, Feb 20, 2023 at 3:33 AM Tvrtko Ursulin
> >> <tvrtko.ursulin@linux.intel.com> wrote:
> >>>
> >>>
> >>> On 17/02/2023 20:45, Rodrigo Vivi wrote:
> >
> > [snip]
> >
> >>> Yeah I agree. And as not all media use cases are the same, as are not
> >>> all compute contexts someone somewhere will need to run a series of
> >>> workloads for power and performance numbers. Ideally that someone would
> >>> be the entity for which it makes sense to look at all use cases, from
> >>> server room to client, 3d, media and compute for both. If we could get
> >>> the capability to run this in some automated fashion, akin to CI, we
> >>> would even have a chance to keep making good decisions in the future.
> >>>
> >>> Or we do some one off testing for this instance, but we still need a
> >>> range of workloads and parts to do it properly..
> >>>
> >>>>> I also think the "arms race" scenario isn't really as much of a
> >>>>> problem as you think.  There aren't _that_ many things using the GPU
> >>>>> at the same time (compared to # of things using CPU).   And a lot of
> >>>>> mobile games throttle framerate to avoid draining your battery too
> >>>>> quickly (after all, if your battery is dead you can't keep buying loot
> >>>>> boxes or whatever).
> >>>>
> >>>> Very good point.
> >>>
> >>> On this one I still disagree from the point of view that it does not
> >>> make it good uapi if we allow everyone to select themselves for priority
> >>> handling (one flavour or the other).
> >>
> >> There is plenty of precedent for userspace giving hints to the kernel
> >> about scheduling and freq mgmt.  Like schedutil uclamp stuff.
> >> Although I think that is all based on cgroups.
> >
> > I knew about SCHED_DEADLINE and that it requires CAP_SYS_NICE, but I did
> > not know about uclamp. Quick experiment with uclampset suggests it
> > indeed does not require elevated privilege. If that is indeed so, it is
> > good enough for me as a precedent.
> >
> > It appears to work using sched_setscheduler so maybe could define
> > something similar in i915/xe, per context or per client, not sure.
> >
> > Maybe it would start as a primitive implementation but the uapi would
> > not preclude making it smart(er) afterwards. Or passing along to GuC to
> > do it's thing with it.
>
> Hmmm having said that, how would we fix clvk performance using that? We
> would either need the library to do a new step when creating contexts,
> or allow external control so outside entity can do it. And then the
> question is based on what it decides to do it? Is it possible to know
> which, for instance, Chrome tab will be (or is) using clvk so that tab
> management code does it?

I am not sure.. the clvk usage is, I think, not actually in chrome
itself, but something camera related?

Presumably we could build some cgroup knobs to control how the driver
reacts to the "deadline" hints (ie. ignore them completely, or impose
some upper limit on how much freq boost will be applied, etc).  I
think this sort of control of how the driver responds to hints
probably fits best with cgroups, as that is how we are already
implementing similar tuning for cpufreq/sched.  (Ie. foreground app or
tab gets moved to a different cgroup.)  But admittedly I haven't
looked too closely at how cgroups work on the kernel side.
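
Roughly what I have in mind, as a toy model.  The policy names and the
group_policy struct are invented for illustration; no such cgroup
controller or knob exists, and how a real one would be wired up is
exactly the part I have not looked at.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup setting for how boost hints are treated. */
enum hint_policy { HINT_IGNORE, HINT_CAPPED, HINT_FULL };

struct group_policy {
	enum hint_policy policy;
	uint32_t max_boost_mhz;  /* only consulted for HINT_CAPPED */
};

/*
 * Given the current frequency and the frequency a hint asks for, return
 * the frequency the driver would actually target for this group.  Hints
 * never lower the frequency; they can only raise it, subject to policy.
 */
static uint32_t apply_boost(struct group_policy g, uint32_t cur_mhz,
			    uint32_t requested_mhz)
{
	uint32_t boost;

	if (g.policy == HINT_IGNORE || requested_mhz <= cur_mhz)
		return cur_mhz;

	boost = requested_mhz - cur_mhz;
	if (g.policy == HINT_CAPPED && boost > g.max_boost_mhz)
		boost = g.max_boost_mhz;

	return cur_mhz + boost;
}
```

A background group could then be set to ignore hints, while the
foreground one gets them in full or with a cap.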

BR,
-R

> Regards,
>
> Tvrtko
>
> >> In the fence/syncobj case, I think we need per-wait hints.. because
> >> for a single process the driver will be doing both housekeeping waits
> >> and potentially urgent waits.  There may also be some room for some
> >> cgroup or similar knobs to control things like what max priority an
> >> app can ask for, and whether or how aggressively the kernel responds
> >> to the "deadline" hints.  So as far as "arms race", I don't think I'd
> >
> > Per wait hints are okay I guess even with "I am important" in their name
> > if sched_setscheduler allows raising uclamp.min just like that. In which
> > case cgroup limits to mimick cpu uclamp also make sense.
> >
> >> change anything about my "fence deadline" proposal.. but that it might
> >> just be one piece of the overall puzzle.
> >
> > That SCHED_DEADLINE requires CAP_SYS_NICE does not worry you?
> >
> > Regards,
> >
> > Tvrtko
