Message ID | 20230308155322.344664-2-robdclark@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | dma-fence: Deadline awareness | expand |
On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > Add a way to hint to the fence signaler of an upcoming deadline, such as > vblank, which the fence waiter would prefer not to miss. This is to aid > the fence signaler in making power management decisions, like boosting > frequency as the deadline approaches and awareness of missing deadlines > so that can be factored in to the frequency scaling. > > v2: Drop dma_fence::deadline and related logic to filter duplicate > deadlines, to avoid increasing dma_fence size. The fence-context > implementation will need similar logic to track deadlines of all > the fences on the same timeline. [ckoenig] > v3: Clarify locking wrt. set_deadline callback > v4: Clarify in docs comment that this is a hint > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > v6: More docs > v7: Fix typo, clarify past deadlines > > Signed-off-by: Rob Clark <robdclark@chromium.org> > Reviewed-by: Christian König <christian.koenig@amd.com> > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > --- Hi Rob! > Documentation/driver-api/dma-buf.rst | 6 +++ > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > include/linux/dma-fence.h | 22 +++++++++++ > 3 files changed, 87 insertions(+) > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > index 622b8156d212..183e480d8cea 100644 > --- a/Documentation/driver-api/dma-buf.rst > +++ b/Documentation/driver-api/dma-buf.rst > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > .. kernel-doc:: drivers/dma-buf/dma-fence.c > :doc: fence signalling annotation > > +DMA Fence Deadline Hints > +~~~~~~~~~~~~~~~~~~~~~~~~ > + > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > + :doc: deadline hints > + > DMA Fences Functions Reference > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > index 0de0482cd36e..f177c56269bb 100644 > --- a/drivers/dma-buf/dma-fence.c > +++ b/drivers/dma-buf/dma-fence.c > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > } > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > +/** > + * DOC: deadline hints > + * > + * In an ideal world, it would be possible to pipeline a workload sufficiently > + * that a utilization based device frequency governor could arrive at a minimum > + * frequency that meets the requirements of the use-case, in order to minimize > + * power consumption. But in the real world there are many workloads which > + * defy this ideal. For example, but not limited to: > + * > + * * Workloads that ping-pong between device and CPU, with alternating periods > + * of CPU waiting for device, and device waiting on CPU. This can result in > + * devfreq and cpufreq seeing idle time in their respective domains and in > + * result reduce frequency. > + * > + * * Workloads that interact with a periodic time based deadline, such as double > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > + * missing a vblank deadline results in an *increase* in idle time on the GPU > + * (since it has to wait an additional vblank period), sending a signal to > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > + * needed. This is the use case I'd like to get some better understanding about how this series intends to work, as the problematic scheduling behavior triggered by missed deadlines has plagued compositing display servers for a long time. I apologize, I'm not a GPU driver developer, nor an OpenGL driver developer, so I will need some hand holding when it comes to understanding exactly what piece of software is responsible for communicating what piece of information. > + * > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > + * The deadline hint provides a way for the waiting driver, or userspace, to > + * convey an appropriate sense of urgency to the signaling driver. > + * > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > + * facing APIs). The time could either be some point in the future (such as > + * the vblank based deadline for page-flipping, or the start of a compositor's > + * composition cycle), or the current time to indicate an immediate deadline > + * hint (Ie. forward progress cannot be made until this fence is signaled). Is it guaranteed that a GPU driver will use the actual start of the vblank as the effective deadline? I have some memories of seing something about vblank evasion browsing driver code, which I might have misunderstood, but I have yet to find whether this is something userspace can actually expect to be something it can rely on. Can userspace set a deadline that targets the next vblank deadline before GPU work has been flushed e.g. at the start of a paint cycle, and still be sure that the kernel has the information it needs to know it should make its clocks increase their speed in time for when the actual work has been actually flushed? Or is it needed that the this deadline is set at the end? What I'm more or less trying to ask is, will a mode setting compositor be able to tell the kernel to boost its clocks at the time it knows is best, and how will it in practice achieve this? For example relying on the atomic mode setting commit setting the deadline is fundamentally flawed, since user space will at times want to purposefully delay committing until as late as possible, without doing so causing an increased risk of missing the deadline due to the kernel not speeding up clocks at the right time for GPU work that has already been flushed long ago. Relying on commits also has no effect on GPU work queued by a compositor drawing only to dma-bufs that are never intended to be presented using mode setting. How can we make sure a compositor can provide hints that the kernel will know to respect despite the compositor not being drm master? Jonas > + * > + * Multiple deadlines may be set on a given fence, even in parallel. See the > + * documentation for &dma_fence_ops.set_deadline. > + * > + * The deadline hint is just that, a hint. The driver that created the fence > + * may react by increasing frequency, making different scheduling choices, etc. > + * Or doing nothing at all. > + */ > + > +/** > + * dma_fence_set_deadline - set desired fence-wait deadline hint > + * @fence: the fence that is to be waited on > + * @deadline: the time by which the waiter hopes for the fence to be > + * signaled > + * > + * Give the fence signaler a hint about an upcoming deadline, such as > + * vblank, by which point the waiter would prefer the fence to be > + * signaled by. This is intended to give feedback to the fence signaler > + * to aid in power management decisions, such as boosting GPU frequency > + * if a periodic vblank deadline is approaching but the fence is not > + * yet signaled.. > + */ > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) > +{ > + if (fence->ops->set_deadline && !dma_fence_is_signaled(fence)) > + fence->ops->set_deadline(fence, deadline); > +} > +EXPORT_SYMBOL(dma_fence_set_deadline); > + > /** > * dma_fence_describe - Dump fence describtion into seq_file > * @fence: the 6fence to describe > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h > index 775cdc0b4f24..d54b595a0fe0 100644 > --- a/include/linux/dma-fence.h > +++ b/include/linux/dma-fence.h > @@ -257,6 +257,26 @@ struct dma_fence_ops { > */ > void (*timeline_value_str)(struct dma_fence *fence, > char *str, int size); > + > + /** > + * @set_deadline: > + * > + * Callback to allow a fence waiter to inform the fence signaler of > + * an upcoming deadline, such as vblank, by which point the waiter > + * would prefer the fence to be signaled by. This is intended to > + * give feedback to the fence signaler to aid in power management > + * decisions, such as boosting GPU frequency. > + * > + * This is called without &dma_fence.lock held, it can be called > + * multiple times and from any context. Locking is up to the callee > + * if it has some state to manage. If multiple deadlines are set, > + * the expectation is to track the soonest one. If the deadline is > + * before the current time, it should be interpreted as an immediate > + * deadline. > + * > + * This callback is optional. > + */ > + void (*set_deadline)(struct dma_fence *fence, ktime_t deadline); > }; > > void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > @@ -583,6 +603,8 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr) > return ret < 0 ? ret : 0; > } > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline); > + > struct dma_fence *dma_fence_get_stub(void); > struct dma_fence *dma_fence_allocate_private_stub(void); > u64 dma_fence_context_alloc(unsigned num); > -- > 2.39.2 >
On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > vblank, which the fence waiter would prefer not to miss. This is to aid > > the fence signaler in making power management decisions, like boosting > > frequency as the deadline approaches and awareness of missing deadlines > > so that can be factored in to the frequency scaling. > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > deadlines, to avoid increasing dma_fence size. The fence-context > > implementation will need similar logic to track deadlines of all > > the fences on the same timeline. [ckoenig] > > v3: Clarify locking wrt. set_deadline callback > > v4: Clarify in docs comment that this is a hint > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > v6: More docs > > v7: Fix typo, clarify past deadlines > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > Reviewed-by: Christian König <christian.koenig@amd.com> > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > --- > > Hi Rob! > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > include/linux/dma-fence.h | 22 +++++++++++ > > 3 files changed, 87 insertions(+) > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > index 622b8156d212..183e480d8cea 100644 > > --- a/Documentation/driver-api/dma-buf.rst > > +++ b/Documentation/driver-api/dma-buf.rst > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > :doc: fence signalling annotation > > > > +DMA Fence Deadline Hints > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > + :doc: deadline hints > > + > > DMA Fences Functions Reference > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > index 0de0482cd36e..f177c56269bb 100644 > > --- a/drivers/dma-buf/dma-fence.c > > +++ b/drivers/dma-buf/dma-fence.c > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > } > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > +/** > > + * DOC: deadline hints > > + * > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > + * that a utilization based device frequency governor could arrive at a minimum > > + * frequency that meets the requirements of the use-case, in order to minimize > > + * power consumption. But in the real world there are many workloads which > > + * defy this ideal. For example, but not limited to: > > + * > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > + * result reduce frequency. > > + * > > + * * Workloads that interact with a periodic time based deadline, such as double > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > + * (since it has to wait an additional vblank period), sending a signal to > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > + * needed. > > This is the use case I'd like to get some better understanding about how > this series intends to work, as the problematic scheduling behavior > triggered by missed deadlines has plagued compositing display servers > for a long time. > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > developer, so I will need some hand holding when it comes to > understanding exactly what piece of software is responsible for > communicating what piece of information. > > > + * > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > + * convey an appropriate sense of urgency to the signaling driver. > > + * > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > + * facing APIs). The time could either be some point in the future (such as > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > + * composition cycle), or the current time to indicate an immediate deadline > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > Is it guaranteed that a GPU driver will use the actual start of the > vblank as the effective deadline? I have some memories of seing > something about vblank evasion browsing driver code, which I might have > misunderstood, but I have yet to find whether this is something > userspace can actually expect to be something it can rely on. I guess you mean s/GPU driver/display driver/ ? It makes things more clear if we talk about them separately even if they happen to be the same device. Assuming that is what you mean, nothing strongly defines what the deadline is. In practice there is probably some buffering in the display controller. For ex, block based (including bandwidth compressed) formats, you need to buffer up a row of blocks to efficiently linearize for scanout. So you probably need to latch some time before you start sending pixel data to the display. But details like this are heavily implementation dependent. I think the most reasonable thing to target is start of vblank. Also, keep in mind the deadline hint is just that. It won't magically make the GPU finish by that deadline, but it gives the GPU driver information about lateness so it can realize if it needs to clock up. > Can userspace set a deadline that targets the next vblank deadline > before GPU work has been flushed e.g. at the start of a paint cycle, and > still be sure that the kernel has the information it needs to know it should > make its clocks increase their speed in time for when the actual work > has been actually flushed? Or is it needed that the this deadline is set > at the end? You need a fence to set the deadline, and for that work needs to be flushed. But you can't associate a deadline with work that the kernel is unaware of anyways. > What I'm more or less trying to ask is, will a mode setting compositor > be able to tell the kernel to boost its clocks at the time it knows is > best, and how will it in practice achieve this? The anticipated usage for a compositor is that, when you receive a <buf, fence> pair from an app, you immediately set a deadline for upcoming start-of-vblank on the fence fd passed from the app. (Or for implicit sync you can use DMA_BUF_IOCTL_EXPORT_SYNC_FILE). For the composite step, no need to set a deadline as this is already done on the kernel side in drm_atomic_helper_wait_for_fences(). > For example relying on the atomic mode setting commit setting the > deadline is fundamentally flawed, since user space will at times want to > purposefully delay committing until as late as possible, without doing > so causing an increased risk of missing the deadline due to the kernel > not speeding up clocks at the right time for GPU work that has already > been flushed long ago. Right, this is the point for exposing the ioctl to userspace. > Relying on commits also has no effect on GPU work queued by > a compositor drawing only to dma-bufs that are never intended to be > presented using mode setting. How can we make sure a compositor can > provide hints that the kernel will know to respect despite the > compositor not being drm master? It doesn't matter if there are indirect dependencies. Even if the compositor completely ignores deadline hints and fancy tricks like delaying composite decisions, the indirect dependency (app rendering) will delay the direct dependency (compositor rendering) of the page flip. So the driver will still see whether it is late or early compared to the deadline, allowing it to adjust freq in the appropriate direction for the next frame. BR, -R > > Jonas > > > + * > > + * Multiple deadlines may be set on a given fence, even in parallel. See the > > + * documentation for &dma_fence_ops.set_deadline. > > + * > > + * The deadline hint is just that, a hint. The driver that created the fence > > + * may react by increasing frequency, making different scheduling choices, etc. > > + * Or doing nothing at all. > > + */ > > + > > +/** > > + * dma_fence_set_deadline - set desired fence-wait deadline hint > > + * @fence: the fence that is to be waited on > > + * @deadline: the time by which the waiter hopes for the fence to be > > + * signaled > > + * > > + * Give the fence signaler a hint about an upcoming deadline, such as > > + * vblank, by which point the waiter would prefer the fence to be > > + * signaled by. This is intended to give feedback to the fence signaler > > + * to aid in power management decisions, such as boosting GPU frequency > > + * if a periodic vblank deadline is approaching but the fence is not > > + * yet signaled.. > > + */ > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) > > +{ > > + if (fence->ops->set_deadline && !dma_fence_is_signaled(fence)) > > + fence->ops->set_deadline(fence, deadline); > > +} > > +EXPORT_SYMBOL(dma_fence_set_deadline); > > + > > /** > > * dma_fence_describe - Dump fence describtion into seq_file > > * @fence: the 6fence to describe > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h > > index 775cdc0b4f24..d54b595a0fe0 100644 > > --- a/include/linux/dma-fence.h > > +++ b/include/linux/dma-fence.h > > @@ -257,6 +257,26 @@ struct dma_fence_ops { > > */ > > void (*timeline_value_str)(struct dma_fence *fence, > > char *str, int size); > > + > > + /** > > + * @set_deadline: > > + * > > + * Callback to allow a fence waiter to inform the fence signaler of > > + * an upcoming deadline, such as vblank, by which point the waiter > > + * would prefer the fence to be signaled by. This is intended to > > + * give feedback to the fence signaler to aid in power management > > + * decisions, such as boosting GPU frequency. > > + * > > + * This is called without &dma_fence.lock held, it can be called > > + * multiple times and from any context. Locking is up to the callee > > + * if it has some state to manage. If multiple deadlines are set, > > + * the expectation is to track the soonest one. If the deadline is > > + * before the current time, it should be interpreted as an immediate > > + * deadline. > > + * > > + * This callback is optional. > > + */ > > + void (*set_deadline)(struct dma_fence *fence, ktime_t deadline); > > }; > > > > void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > @@ -583,6 +603,8 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr) > > return ret < 0 ? ret : 0; > > } > > > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline); > > + > > struct dma_fence *dma_fence_get_stub(void); > > struct dma_fence *dma_fence_allocate_private_stub(void); > > u64 dma_fence_context_alloc(unsigned num); > > -- > > 2.39.2 > >
On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > From: Rob Clark <robdclark@chromium.org> > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > the fence signaler in making power management decisions, like boosting > > > frequency as the deadline approaches and awareness of missing deadlines > > > so that can be factored in to the frequency scaling. > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > implementation will need similar logic to track deadlines of all > > > the fences on the same timeline. [ckoenig] > > > v3: Clarify locking wrt. set_deadline callback > > > v4: Clarify in docs comment that this is a hint > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > v6: More docs > > > v7: Fix typo, clarify past deadlines > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > --- > > > > Hi Rob! > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > include/linux/dma-fence.h | 22 +++++++++++ > > > 3 files changed, 87 insertions(+) > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > index 622b8156d212..183e480d8cea 100644 > > > --- a/Documentation/driver-api/dma-buf.rst > > > +++ b/Documentation/driver-api/dma-buf.rst > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > :doc: fence signalling annotation > > > > > > +DMA Fence Deadline Hints > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > + > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > + :doc: deadline hints > > > + > > > DMA Fences Functions Reference > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > index 0de0482cd36e..f177c56269bb 100644 > > > --- a/drivers/dma-buf/dma-fence.c > > > +++ b/drivers/dma-buf/dma-fence.c > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > } > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > +/** > > > + * DOC: deadline hints > > > + * > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > + * that a utilization based device frequency governor could arrive at a minimum > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > + * power consumption. But in the real world there are many workloads which > > > + * defy this ideal. For example, but not limited to: > > > + * > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > + * result reduce frequency. > > > + * > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > + * (since it has to wait an additional vblank period), sending a signal to > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > + * needed. > > > > This is the use case I'd like to get some better understanding about how > > this series intends to work, as the problematic scheduling behavior > > triggered by missed deadlines has plagued compositing display servers > > for a long time. > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > developer, so I will need some hand holding when it comes to > > understanding exactly what piece of software is responsible for > > communicating what piece of information. > > > > > + * > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > + * convey an appropriate sense of urgency to the signaling driver. > > > + * > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > + * facing APIs). The time could either be some point in the future (such as > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > + * composition cycle), or the current time to indicate an immediate deadline > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > Is it guaranteed that a GPU driver will use the actual start of the > > vblank as the effective deadline? I have some memories of seing > > something about vblank evasion browsing driver code, which I might have > > misunderstood, but I have yet to find whether this is something > > userspace can actually expect to be something it can rely on. > > I guess you mean s/GPU driver/display driver/ ? It makes things more > clear if we talk about them separately even if they happen to be the > same device. Sure, sorry about being unclear about that. > > Assuming that is what you mean, nothing strongly defines what the > deadline is. In practice there is probably some buffering in the > display controller. For ex, block based (including bandwidth > compressed) formats, you need to buffer up a row of blocks to > efficiently linearize for scanout. So you probably need to latch some > time before you start sending pixel data to the display. But details > like this are heavily implementation dependent. I think the most > reasonable thing to target is start of vblank. The driver exposing those details would be quite useful for userspace though, so that it can delay committing updates to late, but not too late. Setting a deadline to be the vblank seems easy enough, but it isn't enough for scheduling the actual commit. > > Also, keep in mind the deadline hint is just that. It won't magically > make the GPU finish by that deadline, but it gives the GPU driver > information about lateness so it can realize if it needs to clock up. Sure, even if the GPU ramped up clocks to the max, if the job queue is too large, it won't magically invent more cycles to squeeze in. > > > Can userspace set a deadline that targets the next vblank deadline > > before GPU work has been flushed e.g. at the start of a paint cycle, and > > still be sure that the kernel has the information it needs to know it should > > make its clocks increase their speed in time for when the actual work > > has been actually flushed? Or is it needed that the this deadline is set > > at the end? > > You need a fence to set the deadline, and for that work needs to be > flushed. But you can't associate a deadline with work that the kernel > is unaware of anyways. That makes sense, but it might also a bit inadequate to have it as the only way to tell the kernel it should speed things up. Even with the trick i915 does, with GNOME Shell, we still end up with the feedback loop this series aims to mitigate. Doing triple buffering, i.e. delaying or dropping the first frame is so far the best work around that works, except doing other tricks that makes the kernel to ramp up its clock. Having to rely on choosing between latency and frame drops should ideally not have to be made. > > > What I'm more or less trying to ask is, will a mode setting compositor > > be able to tell the kernel to boost its clocks at the time it knows is > > best, and how will it in practice achieve this? > > The anticipated usage for a compositor is that, when you receive a > <buf, fence> pair from an app, you immediately set a deadline for > upcoming start-of-vblank on the fence fd passed from the app. (Or for > implicit sync you can use DMA_BUF_IOCTL_EXPORT_SYNC_FILE). For the > composite step, no need to set a deadline as this is already done on > the kernel side in drm_atomic_helper_wait_for_fences(). So it sounds like the new uapi will help compositors that do not draw with the intention of page flipping anything, and compositors that deliberately delay the commit. I suppose with proper target presentation time integration EGL/Vulkan WSI can set deadlines them as well. All that sounds like a welcomed improvement, but I'm not convinced it's enough to solve the problems we currently face. > > > For example relying on the atomic mode setting commit setting the > > deadline is fundamentally flawed, since user space will at times want to > > purposefully delay committing until as late as possible, without doing > > so causing an increased risk of missing the deadline due to the kernel > > not speeding up clocks at the right time for GPU work that has already > > been flushed long ago. > > Right, this is the point for exposing the ioctl to userspace. > > > Relying on commits also has no effect on GPU work queued by > > a compositor drawing only to dma-bufs that are never intended to be > > presented using mode setting. How can we make sure a compositor can > > provide hints that the kernel will know to respect despite the > > compositor not being drm master? > > It doesn't matter if there are indirect dependencies. Even if the > compositor completely ignores deadline hints and fancy tricks like > delaying composite decisions, the indirect dependency (app rendering) > will delay the direct dependency (compositor rendering) of the page > flip. So the driver will still see whether it is late or early > compared to the deadline, allowing it to adjust freq in the > appropriate direction for the next frame. Is it expected that WSI's will set their own deadlines, or should that be the job of the compositor? For example by using compositors using DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a deadline matching the vsync it most ideally will be committed to? Jonas > > BR, > -R > > > > > Jonas > > > > > + * > > > + * Multiple deadlines may be set on a given fence, even in parallel. See the > > > + * documentation for &dma_fence_ops.set_deadline. > > > + * > > > + * The deadline hint is just that, a hint. The driver that created the fence > > > + * may react by increasing frequency, making different scheduling choices, etc. > > > + * Or doing nothing at all. > > > + */ > > > + > > > +/** > > > + * dma_fence_set_deadline - set desired fence-wait deadline hint > > > + * @fence: the fence that is to be waited on > > > + * @deadline: the time by which the waiter hopes for the fence to be > > > + * signaled > > > + * > > > + * Give the fence signaler a hint about an upcoming deadline, such as > > > + * vblank, by which point the waiter would prefer the fence to be > > > + * signaled by. This is intended to give feedback to the fence signaler > > > + * to aid in power management decisions, such as boosting GPU frequency > > > + * if a periodic vblank deadline is approaching but the fence is not > > > + * yet signaled.. > > > + */ > > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) > > > +{ > > > + if (fence->ops->set_deadline && !dma_fence_is_signaled(fence)) > > > + fence->ops->set_deadline(fence, deadline); > > > +} > > > +EXPORT_SYMBOL(dma_fence_set_deadline); > > > + > > > /** > > > * dma_fence_describe - Dump fence describtion into seq_file > > > * @fence: the 6fence to describe > > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h > > > index 775cdc0b4f24..d54b595a0fe0 100644 > > > --- a/include/linux/dma-fence.h > > > +++ b/include/linux/dma-fence.h > > > @@ -257,6 +257,26 @@ struct dma_fence_ops { > > > */ > > > void (*timeline_value_str)(struct dma_fence *fence, > > > char *str, int size); > > > + > > > + /** > > > + * @set_deadline: > > > + * > > > + * Callback to allow a fence waiter to inform the fence signaler of > > > + * an upcoming deadline, such as vblank, by which point the waiter > > > + * would prefer the fence to be signaled by. This is intended to > > > + * give feedback to the fence signaler to aid in power management > > > + * decisions, such as boosting GPU frequency. > > > + * > > > + * This is called without &dma_fence.lock held, it can be called > > > + * multiple times and from any context. Locking is up to the callee > > > + * if it has some state to manage. If multiple deadlines are set, > > > + * the expectation is to track the soonest one. If the deadline is > > > + * before the current time, it should be interpreted as an immediate > > > + * deadline. > > > + * > > > + * This callback is optional. > > > + */ > > > + void (*set_deadline)(struct dma_fence *fence, ktime_t deadline); > > > }; > > > > > > void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > @@ -583,6 +603,8 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr) > > > return ret < 0 ? ret : 0; > > > } > > > > > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline); > > > + > > > struct dma_fence *dma_fence_get_stub(void); > > > struct dma_fence *dma_fence_allocate_private_stub(void); > > > u64 dma_fence_context_alloc(unsigned num); > > > -- > > > 2.39.2 > > >
On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > the fence signaler in making power management decisions, like boosting > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > so that can be factored in to the frequency scaling. > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > implementation will need similar logic to track deadlines of all > > > > the fences on the same timeline. [ckoenig] > > > > v3: Clarify locking wrt. set_deadline callback > > > > v4: Clarify in docs comment that this is a hint > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > v6: More docs > > > > v7: Fix typo, clarify past deadlines > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > --- > > > > > > Hi Rob! > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > 3 files changed, 87 insertions(+) > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > index 622b8156d212..183e480d8cea 100644 > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > :doc: fence signalling annotation > > > > > > > > +DMA Fence Deadline Hints > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > + > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > + :doc: deadline hints > > > > + > > > > DMA Fences Functions Reference > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > --- a/drivers/dma-buf/dma-fence.c > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > } > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > +/** > > > > + * DOC: deadline hints > > > > + * > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > + * power consumption. But in the real world there are many workloads which > > > > + * defy this ideal. For example, but not limited to: > > > > + * > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > + * result reduce frequency. > > > > + * > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > + * needed. > > > > > > This is the use case I'd like to get some better understanding about how > > > this series intends to work, as the problematic scheduling behavior > > > triggered by missed deadlines has plagued compositing display servers > > > for a long time. > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > developer, so I will need some hand holding when it comes to > > > understanding exactly what piece of software is responsible for > > > communicating what piece of information. > > > > > > > + * > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > + * > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > + * facing APIs). The time could either be some point in the future (such as > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > vblank as the effective deadline? I have some memories of seing > > > something about vblank evasion browsing driver code, which I might have > > > misunderstood, but I have yet to find whether this is something > > > userspace can actually expect to be something it can rely on. > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > clear if we talk about them separately even if they happen to be the > > same device. > > Sure, sorry about being unclear about that. > > > > > Assuming that is what you mean, nothing strongly defines what the > > deadline is. In practice there is probably some buffering in the > > display controller. For ex, block based (including bandwidth > > compressed) formats, you need to buffer up a row of blocks to > > efficiently linearize for scanout. So you probably need to latch some > > time before you start sending pixel data to the display. But details > > like this are heavily implementation dependent. I think the most > > reasonable thing to target is start of vblank. > > The driver exposing those details would be quite useful for userspace > though, so that it can delay committing updates to late, but not too > late. Setting a deadline to be the vblank seems easy enough, but it > isn't enough for scheduling the actual commit. I'm not entirely sure how that would even work.. but OTOH I think you are talking about something on the order of 100us? But that is a bit of another topic. > > > > Also, keep in mind the deadline hint is just that. It won't magically > > make the GPU finish by that deadline, but it gives the GPU driver > > information about lateness so it can realize if it needs to clock up. > > Sure, even if the GPU ramped up clocks to the max, if the job queue is > too large, it won't magically invent more cycles to squeeze in. > > > > > > Can userspace set a deadline that targets the next vblank deadline > > > before GPU work has been flushed e.g. at the start of a paint cycle, and > > > still be sure that the kernel has the information it needs to know it should > > > make its clocks increase their speed in time for when the actual work > > > has been actually flushed? Or is it needed that the this deadline is set > > > at the end? > > > > You need a fence to set the deadline, and for that work needs to be > > flushed. But you can't associate a deadline with work that the kernel > > is unaware of anyways. > > That makes sense, but it might also a bit inadequate to have it as the > only way to tell the kernel it should speed things up. Even with the > trick i915 does, with GNOME Shell, we still end up with the feedback > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > or dropping the first frame is so far the best work around that works, > except doing other tricks that makes the kernel to ramp up its clock. > Having to rely on choosing between latency and frame drops should > ideally not have to be made. Before you have a fence, the thing you want to be speeding up is the CPU, not the GPU. There are existing mechanisms for that. TBF I'm of the belief that there is still a need for input based cpu boost (and early wake-up trigger for GPU).. we have something like this in CrOS kernel. That is a bit of a different topic, but my point is that fence deadlines are just one of several things we need to optimize power/perf and responsiveness, rather than the single thing that solves every problem under the sun ;-) > > > > > What I'm more or less trying to ask is, will a mode setting compositor > > > be able to tell the kernel to boost its clocks at the time it knows is > > > best, and how will it in practice achieve this? > > > > The anticipated usage for a compositor is that, when you receive a > > <buf, fence> pair from an app, you immediately set a deadline for > > upcoming start-of-vblank on the fence fd passed from the app. (Or for > > implicit sync you can use DMA_BUF_IOCTL_EXPORT_SYNC_FILE). For the > > composite step, no need to set a deadline as this is already done on > > the kernel side in drm_atomic_helper_wait_for_fences(). > > So it sounds like the new uapi will help compositors that do not draw > with the intention of page flipping anything, and compositors that > deliberately delay the commit. I suppose with proper target presentation > time integration EGL/Vulkan WSI can set deadlines them as well. All that > sounds like a welcomed improvement, but I'm not convinced it's enough to > solve the problems we currently face. Yeah, like I mentioned this addresses one issue, giving the GPU kernel driver better information for freq mgmt. But there probably isn't one single solution for everything. > > > > > For example relying on the atomic mode setting commit setting the > > > deadline is fundamentally flawed, since user space will at times want to > > > purposefully delay committing until as late as possible, without doing > > > so causing an increased risk of missing the deadline due to the kernel > > > not speeding up clocks at the right time for GPU work that has already > > > been flushed long ago. > > > > Right, this is the point for exposing the ioctl to userspace. > > > > > Relying on commits also has no effect on GPU work queued by > > > a compositor drawing only to dma-bufs that are never intended to be > > > presented using mode setting. How can we make sure a compositor can > > > provide hints that the kernel will know to respect despite the > > > compositor not being drm master? > > > > It doesn't matter if there are indirect dependencies. Even if the > > compositor completely ignores deadline hints and fancy tricks like > > delaying composite decisions, the indirect dependency (app rendering) > > will delay the direct dependency (compositor rendering) of the page > > flip. So the driver will still see whether it is late or early > > compared to the deadline, allowing it to adjust freq in the > > appropriate direction for the next frame. > > Is it expected that WSI's will set their own deadlines, or should that > be the job of the compositor? For example by using compositors using > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > deadline matching the vsync it most ideally will be committed to? > I'm kind of assuming compositors, but if the WSI somehow has more information about ideal presentation time, then I suppose it could be in the WSI? I'll defer to folks who spend more time on WSI and compositors to hash out the details ;-) BR, -R > > Jonas > > > > > BR, > > -R > > > > > > > > Jonas > > > > > > > + * > > > > + * Multiple deadlines may be set on a given fence, even in parallel. See the > > > > + * documentation for &dma_fence_ops.set_deadline. > > > > + * > > > > + * The deadline hint is just that, a hint. The driver that created the fence > > > > + * may react by increasing frequency, making different scheduling choices, etc. > > > > + * Or doing nothing at all. > > > > + */ > > > > + > > > > +/** > > > > + * dma_fence_set_deadline - set desired fence-wait deadline hint > > > > + * @fence: the fence that is to be waited on > > > > + * @deadline: the time by which the waiter hopes for the fence to be > > > > + * signaled > > > > + * > > > > + * Give the fence signaler a hint about an upcoming deadline, such as > > > > + * vblank, by which point the waiter would prefer the fence to be > > > > + * signaled by. This is intended to give feedback to the fence signaler > > > > + * to aid in power management decisions, such as boosting GPU frequency > > > > + * if a periodic vblank deadline is approaching but the fence is not > > > > + * yet signaled.. > > > > + */ > > > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) > > > > +{ > > > > + if (fence->ops->set_deadline && !dma_fence_is_signaled(fence)) > > > > + fence->ops->set_deadline(fence, deadline); > > > > +} > > > > +EXPORT_SYMBOL(dma_fence_set_deadline); > > > > + > > > > /** > > > > * dma_fence_describe - Dump fence describtion into seq_file > > > > * @fence: the 6fence to describe > > > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h > > > > index 775cdc0b4f24..d54b595a0fe0 100644 > > > > --- a/include/linux/dma-fence.h > > > > +++ b/include/linux/dma-fence.h > > > > @@ -257,6 +257,26 @@ struct dma_fence_ops { > > > > */ > > > > void (*timeline_value_str)(struct dma_fence *fence, > > > > char *str, int size); > > > > + > > > > + /** > > > > + * @set_deadline: > > > > + * > > > > + * Callback to allow a fence waiter to inform the fence signaler of > > > > + * an upcoming deadline, such as vblank, by which point the waiter > > > > + * would prefer the fence to be signaled by. This is intended to > > > > + * give feedback to the fence signaler to aid in power management > > > > + * decisions, such as boosting GPU frequency. > > > > + * > > > > + * This is called without &dma_fence.lock held, it can be called > > > > + * multiple times and from any context. Locking is up to the callee > > > > + * if it has some state to manage. If multiple deadlines are set, > > > > + * the expectation is to track the soonest one. If the deadline is > > > > + * before the current time, it should be interpreted as an immediate > > > > + * deadline. > > > > + * > > > > + * This callback is optional. > > > > + */ > > > > + void (*set_deadline)(struct dma_fence *fence, ktime_t deadline); > > > > }; > > > > > > > > void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, > > > > @@ -583,6 +603,8 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr) > > > > return ret < 0 ? ret : 0; > > > > } > > > > > > > > +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline); > > > > + > > > > struct dma_fence *dma_fence_get_stub(void); > > > > struct dma_fence *dma_fence_allocate_private_stub(void); > > > > u64 dma_fence_context_alloc(unsigned num); > > > > -- > > > > 2.39.2 > > > >
On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > the fence signaler in making power management decisions, like boosting > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > implementation will need similar logic to track deadlines of all > > > > > the fences on the same timeline. [ckoenig] > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > v4: Clarify in docs comment that this is a hint > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > v6: More docs > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > --- > > > > > > > > Hi Rob! > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > :doc: fence signalling annotation > > > > > > > > > > +DMA Fence Deadline Hints > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > + > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > + :doc: deadline hints > > > > > + > > > > > DMA Fences Functions Reference > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > } > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > +/** > > > > > + * DOC: deadline hints > > > > > + * > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > + * power consumption. But in the real world there are many workloads which > > > > > + * defy this ideal. For example, but not limited to: > > > > > + * > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > + * result reduce frequency. > > > > > + * > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > + * needed. > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > this series intends to work, as the problematic scheduling behavior > > > > triggered by missed deadlines has plagued compositing display servers > > > > for a long time. > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > developer, so I will need some hand holding when it comes to > > > > understanding exactly what piece of software is responsible for > > > > communicating what piece of information. > > > > > > > > > + * > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > + * > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > vblank as the effective deadline? I have some memories of seing > > > > something about vblank evasion browsing driver code, which I might have > > > > misunderstood, but I have yet to find whether this is something > > > > userspace can actually expect to be something it can rely on. > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > clear if we talk about them separately even if they happen to be the > > > same device. > > > > Sure, sorry about being unclear about that. > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > deadline is. In practice there is probably some buffering in the > > > display controller. For ex, block based (including bandwidth > > > compressed) formats, you need to buffer up a row of blocks to > > > efficiently linearize for scanout. So you probably need to latch some > > > time before you start sending pixel data to the display. But details > > > like this are heavily implementation dependent. I think the most > > > reasonable thing to target is start of vblank. > > > > The driver exposing those details would be quite useful for userspace > > though, so that it can delay committing updates to late, but not too > > late. Setting a deadline to be the vblank seems easy enough, but it > > isn't enough for scheduling the actual commit. > > I'm not entirely sure how that would even work.. but OTOH I think you > are talking about something on the order of 100us? But that is a bit > of another topic. Yes, something like that. But yea, it's not really related. Scheduling commits closer to the deadline has more complex behavior than that too, e.g. the need for real time scheduling, and knowing how long it usually takes to create and commit and for the kernel to process. > 8-< *snip* 8-< > > > > > > You need a fence to set the deadline, and for that work needs to be > > > flushed. But you can't associate a deadline with work that the kernel > > > is unaware of anyways. > > > > That makes sense, but it might also a bit inadequate to have it as the > > only way to tell the kernel it should speed things up. Even with the > > trick i915 does, with GNOME Shell, we still end up with the feedback > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > or dropping the first frame is so far the best work around that works, > > except doing other tricks that makes the kernel to ramp up its clock. > > Having to rely on choosing between latency and frame drops should > > ideally not have to be made. > > Before you have a fence, the thing you want to be speeding up is the > CPU, not the GPU. There are existing mechanisms for that. Is there no benefit to let the GPU know earlier that it should speed up, so that when the job queue arrives, it's already up to speed? > > TBF I'm of the belief that there is still a need for input based cpu > boost (and early wake-up trigger for GPU).. we have something like > this in CrOS kernel. That is a bit of a different topic, but my point > is that fence deadlines are just one of several things we need to > optimize power/perf and responsiveness, rather than the single thing > that solves every problem under the sun ;-) Perhaps; but I believe it's a bit of a back channel of intent; the piece of the puzzle that has the information to know whether there is need actually speed up is the compositor, not the kernel. For example, pressing 'p' while a terminal is focused does not need high frequency clocks, it just needs the terminal emulator to draw a 'p' and the compositor to composite that update. Pressing <Super> may however trigger a non-trivial animation moving a lot of stuff around on screen, maybe triggering Wayland clients to draw and what not, and should most arguably have the ability to "warn" the kernel about the upcoming flood of work before it is already knocking on its door step. > 8-< *snip* 8-< > > > > Is it expected that WSI's will set their own deadlines, or should that > > be the job of the compositor? For example by using compositors using > > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > > deadline matching the vsync it most ideally will be committed to? > > > > I'm kind of assuming compositors, but if the WSI somehow has more > information about ideal presentation time, then I suppose it could be > in the WSI? I'll defer to folks who spend more time on WSI and > compositors to hash out the details ;-) With my compositor developer hat on, it might be best to let it be up to the compositor, it's the one that knows if a client's content will actually end up anywhere visible. Jonas > > BR, > -R
On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > implementation will need similar logic to track deadlines of all > > > > > > the fences on the same timeline. [ckoenig] > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > v6: More docs > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > --- > > > > > > > > > > Hi Rob! > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > + > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > + :doc: deadline hints > > > > > > + > > > > > > DMA Fences Functions Reference > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > } > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > +/** > > > > > > + * DOC: deadline hints > > > > > > + * > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > + * > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > + * result reduce frequency. > > > > > > + * > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > + * needed. > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > this series intends to work, as the problematic scheduling behavior > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > for a long time. > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > developer, so I will need some hand holding when it comes to > > > > > understanding exactly what piece of software is responsible for > > > > > communicating what piece of information. > > > > > > > > > > > + * > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > + * > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > vblank as the effective deadline? I have some memories of seing > > > > > something about vblank evasion browsing driver code, which I might have > > > > > misunderstood, but I have yet to find whether this is something > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > clear if we talk about them separately even if they happen to be the > > > > same device. > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > deadline is. In practice there is probably some buffering in the > > > > display controller. For ex, block based (including bandwidth > > > > compressed) formats, you need to buffer up a row of blocks to > > > > efficiently linearize for scanout. So you probably need to latch some > > > > time before you start sending pixel data to the display. But details > > > > like this are heavily implementation dependent. I think the most > > > > reasonable thing to target is start of vblank. > > > > > > The driver exposing those details would be quite useful for userspace > > > though, so that it can delay committing updates to late, but not too > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > isn't enough for scheduling the actual commit. > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > are talking about something on the order of 100us? But that is a bit > > of another topic. > > Yes, something like that. But yea, it's not really related. Scheduling > commits closer to the deadline has more complex behavior than that too, > e.g. the need for real time scheduling, and knowing how long it usually > takes to create and commit and for the kernel to process. > > > > > 8-< *snip* 8-< > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > flushed. But you can't associate a deadline with work that the kernel > > > > is unaware of anyways. > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > only way to tell the kernel it should speed things up. Even with the > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > or dropping the first frame is so far the best work around that works, > > > except doing other tricks that makes the kernel to ramp up its clock. > > > Having to rely on choosing between latency and frame drops should > > > ideally not have to be made. > > > > Before you have a fence, the thing you want to be speeding up is the > > CPU, not the GPU. There are existing mechanisms for that. > > Is there no benefit to let the GPU know earlier that it should speed up, > so that when the job queue arrives, it's already up to speed? Downstream we have input notifier that resumes the GPU so we can pipeline the 1-2ms it takes to boot up the GPU with userspace. But we wait to boost freq until we have cmdstream to submit, since that doesn't take as long. What needs help initially after input is all the stuff that happens on the CPU before the GPU can start to do anything ;-) Btw, I guess I haven't made this clear, dma-fence deadline is trying to help the steady-state situation, rather than the input-latency situation. It might take a frame or two of missed deadlines for gpufreq to arrive at a good steady-state freq. > > > > TBF I'm of the belief that there is still a need for input based cpu > > boost (and early wake-up trigger for GPU).. we have something like > > this in CrOS kernel. That is a bit of a different topic, but my point > > is that fence deadlines are just one of several things we need to > > optimize power/perf and responsiveness, rather than the single thing > > that solves every problem under the sun ;-) > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > of the puzzle that has the information to know whether there is need > actually speed up is the compositor, not the kernel. > > For example, pressing 'p' while a terminal is focused does not need high > frequency clocks, it just needs the terminal emulator to draw a 'p' and > the compositor to composite that update. Pressing <Super> may however > trigger a non-trivial animation moving a lot of stuff around on screen, > maybe triggering Wayland clients to draw and what not, and should most > arguably have the ability to "warn" the kernel about the upcoming flood > of work before it is already knocking on its door step. The super key is problematic, but not for the reason you think. It is because it is a case where we should boost on key-up instead of key-down.. and the second key-up event comes after the cpu-boost is already in it's cool-down period. But even if suboptimal in cases like this, it is still useful for touch/stylus cases where the slightest of lag is much more perceptible. This is getting off topic but I kinda favor coming up with some sort of static definition that userspace could give the kernel to let the kernel know what input to boost on. Or maybe something could be done with BPF? > > > > 8-< *snip* 8-< > > > > > > > Is it expected that WSI's will set their own deadlines, or should that > > > be the job of the compositor? For example by using compositors using > > > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > > > deadline matching the vsync it most ideally will be committed to? > > > > > > > I'm kind of assuming compositors, but if the WSI somehow has more > > information about ideal presentation time, then I suppose it could be > > in the WSI? I'll defer to folks who spend more time on WSI and > > compositors to hash out the details ;-) > > With my compositor developer hat on, it might be best to let it be up to > the compositor, it's the one that knows if a client's content will > actually end up anywhere visible. > wfm BR, -R > > Jonas > > > > > BR, > > -R
On Thu, Mar 16, 2023 at 5:29 PM Rob Clark <robdclark@gmail.com> wrote: > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > v6: More docs > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > --- > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > + > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > + :doc: deadline hints > > > > > > > + > > > > > > > DMA Fences Functions Reference > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > } > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > +/** > > > > > > > + * DOC: deadline hints > > > > > > > + * > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > + * > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > + * result reduce frequency. > > > > > > > + * > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > + * needed. > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > for a long time. > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > developer, so I will need some hand holding when it comes to > > > > > > understanding exactly what piece of software is responsible for > > > > > > communicating what piece of information. > > > > > > > > > > > > > + * > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > + * > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > clear if we talk about them separately even if they happen to be the > > > > > same device. > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > deadline is. In practice there is probably some buffering in the > > > > > display controller. For ex, block based (including bandwidth > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > time before you start sending pixel data to the display. But details > > > > > like this are heavily implementation dependent. I think the most > > > > > reasonable thing to target is start of vblank. > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > though, so that it can delay committing updates to late, but not too > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > isn't enough for scheduling the actual commit. > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > are talking about something on the order of 100us? But that is a bit > > > of another topic. > > > > Yes, something like that. But yea, it's not really related. Scheduling > > commits closer to the deadline has more complex behavior than that too, > > e.g. the need for real time scheduling, and knowing how long it usually > > takes to create and commit and for the kernel to process. Vblank can be really long, especially with VRR where the additional time you get to finish the frame comes from making vblank longer. Using the start of vblank as a deadline makes VRR useless. It really would be nice to have some feedback about the actual deadline from the kernel, maybe in `struct drm_event_vblank`. But yes, sorry, off topic... > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > is unaware of anyways. > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > only way to tell the kernel it should speed things up. Even with the > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > or dropping the first frame is so far the best work around that works, > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > Having to rely on choosing between latency and frame drops should > > > > ideally not have to be made. > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > CPU, not the GPU. There are existing mechanisms for that. > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > so that when the job queue arrives, it's already up to speed? > > Downstream we have input notifier that resumes the GPU so we can > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > wait to boost freq until we have cmdstream to submit, since that > doesn't take as long. What needs help initially after input is all > the stuff that happens on the CPU before the GPU can start to do > anything ;-) > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > to help the steady-state situation, rather than the input-latency > situation. It might take a frame or two of missed deadlines for > gpufreq to arrive at a good steady-state freq. The mutter issue also is about a suboptimal steady-state. Truth be told, I'm not sure if this fence deadline idea fixes the issue we're seeing or at least helps sometimes. It might, it might not. What annoys me is that the compositor *knows* before any work is submitted that some work will be submitted and when it has to finish. We could maximize the chances to get everything right but having to wait for a fence to materialize in the compositor to do anything about it is suboptimal. > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > boost (and early wake-up trigger for GPU).. we have something like > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > is that fence deadlines are just one of several things we need to > > > optimize power/perf and responsiveness, rather than the single thing > > > that solves every problem under the sun ;-) > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > of the puzzle that has the information to know whether there is need > > actually speed up is the compositor, not the kernel. > > > > For example, pressing 'p' while a terminal is focused does not need high > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > the compositor to composite that update. Pressing <Super> may however > > trigger a non-trivial animation moving a lot of stuff around on screen, > > maybe triggering Wayland clients to draw and what not, and should most > > arguably have the ability to "warn" the kernel about the upcoming flood > > of work before it is already knocking on its door step. > > The super key is problematic, but not for the reason you think. It is > because it is a case where we should boost on key-up instead of > key-down.. and the second key-up event comes after the cpu-boost is > already in it's cool-down period. But even if suboptimal in cases > like this, it is still useful for touch/stylus cases where the > slightest of lag is much more perceptible. > > This is getting off topic but I kinda favor coming up with some sort > of static definition that userspace could give the kernel to let the > kernel know what input to boost on. Or maybe something could be done > with BPF? Why? Do you think user space is so slow that it can't process the input events and then do a syscall? We need to have all input devices open anyway that can affect the system and know more about how they affect behavior than the kernel can ever know. > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > Is it expected that WSI's will set their own deadlines, or should that > > > > be the job of the compositor? For example by using compositors using > > > > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > > > > deadline matching the vsync it most ideally will be committed to? > > > > > > > > > > I'm kind of assuming compositors, but if the WSI somehow has more > > > information about ideal presentation time, then I suppose it could be > > > in the WSI? I'll defer to folks who spend more time on WSI and > > > compositors to hash out the details ;-) > > > > With my compositor developer hat on, it might be best to let it be up to > > the compositor, it's the one that knows if a client's content will > > actually end up anywhere visible. > > > > wfm > > BR, > -R > > > > > Jonas > > > > > > > > BR, > > > -R >
On Thu, Mar 16, 2023 at 3:22 PM Sebastian Wick <sebastian.wick@redhat.com> wrote: > > On Thu, Mar 16, 2023 at 5:29 PM Rob Clark <robdclark@gmail.com> wrote: > > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > v6: More docs > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > --- > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > + > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > + :doc: deadline hints > > > > > > > > + > > > > > > > > DMA Fences Functions Reference > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > } > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > +/** > > > > > > > > + * DOC: deadline hints > > > > > > > > + * > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > + * > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > + * result reduce frequency. > > > > > > > > + * > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > + * needed. > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > for a long time. > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > + * > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > + * > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > same device. > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > display controller. For ex, block based (including bandwidth > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > time before you start sending pixel data to the display. But details > > > > > > like this are heavily implementation dependent. I think the most > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > though, so that it can delay committing updates to late, but not too > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > isn't enough for scheduling the actual commit. > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > are talking about something on the order of 100us? But that is a bit > > > > of another topic. > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > commits closer to the deadline has more complex behavior than that too, > > > e.g. the need for real time scheduling, and knowing how long it usually > > > takes to create and commit and for the kernel to process. > > Vblank can be really long, especially with VRR where the additional > time you get to finish the frame comes from making vblank longer. > Using the start of vblank as a deadline makes VRR useless. It really > would be nice to have some feedback about the actual deadline from the > kernel, maybe in `struct drm_event_vblank`. note that here we are only talking about the difference between start/end of vblank and the deadline for the hw to latch a change for the next frame. (Which I _expect_ generally amounts to however long it takes to slurp in a row of tiles) > But yes, sorry, off topic... > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > > is unaware of anyways. > > > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > > only way to tell the kernel it should speed things up. Even with the > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > > or dropping the first frame is so far the best work around that works, > > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > > Having to rely on choosing between latency and frame drops should > > > > > ideally not have to be made. > > > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > > CPU, not the GPU. There are existing mechanisms for that. > > > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > > so that when the job queue arrives, it's already up to speed? > > > > Downstream we have input notifier that resumes the GPU so we can > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > > wait to boost freq until we have cmdstream to submit, since that > > doesn't take as long. What needs help initially after input is all > > the stuff that happens on the CPU before the GPU can start to do > > anything ;-) > > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > > to help the steady-state situation, rather than the input-latency > > situation. It might take a frame or two of missed deadlines for > > gpufreq to arrive at a good steady-state freq. > > The mutter issue also is about a suboptimal steady-state. > > Truth be told, I'm not sure if this fence deadline idea fixes the > issue we're seeing or at least helps sometimes. It might, it might > not. What annoys me is that the compositor *knows* before any work is > submitted that some work will be submitted and when it has to finish. > We could maximize the chances to get everything right but having to > wait for a fence to materialize in the compositor to do anything about > it is suboptimal. Why would the app not immediately send the fence+buf to the compositor as soon as it is submitted to the kernel on client process side? At any rate, it really doesn't matter how early the kernel finds out about the deadline, since the point is to let the kernel driver know if it is missing the deadline so that it doesn't mis-interpret stall time waiting for the _next_ vblank after the one we wanted. > > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > > boost (and early wake-up trigger for GPU).. we have something like > > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > > is that fence deadlines are just one of several things we need to > > > > optimize power/perf and responsiveness, rather than the single thing > > > > that solves every problem under the sun ;-) > > > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > > of the puzzle that has the information to know whether there is need > > > actually speed up is the compositor, not the kernel. > > > > > > For example, pressing 'p' while a terminal is focused does not need high > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > > the compositor to composite that update. Pressing <Super> may however > > > trigger a non-trivial animation moving a lot of stuff around on screen, > > > maybe triggering Wayland clients to draw and what not, and should most > > > arguably have the ability to "warn" the kernel about the upcoming flood > > > of work before it is already knocking on its door step. > > > > The super key is problematic, but not for the reason you think. It is > > because it is a case where we should boost on key-up instead of > > key-down.. and the second key-up event comes after the cpu-boost is > > already in it's cool-down period. But even if suboptimal in cases > > like this, it is still useful for touch/stylus cases where the > > slightest of lag is much more perceptible. > > > > This is getting off topic but I kinda favor coming up with some sort > > of static definition that userspace could give the kernel to let the > > kernel know what input to boost on. Or maybe something could be done > > with BPF? > > Why? Do you think user space is so slow that it can't process the > input events and then do a syscall? We need to have all input devices > open anyway that can affect the system and know more about how they > affect behavior than the kernel can ever know. Again this is getting off into a different topic. But my gut feel is that the shorter the path to input cpu freq boost, the better.. since however many extra cycles you add, they will be cycles with cpu (and probably ddr) at lowest freq BR, -R > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > Is it expected that WSI's will set their own deadlines, or should that > > > > > be the job of the compositor? For example by using compositors using > > > > > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > > > > > deadline matching the vsync it most ideally will be committed to? > > > > > > > > > > > > > I'm kind of assuming compositors, but if the WSI somehow has more > > > > information about ideal presentation time, then I suppose it could be > > > > in the WSI? I'll defer to folks who spend more time on WSI and > > > > compositors to hash out the details ;-) > > > > > > With my compositor developer hat on, it might be best to let it be up to > > > the compositor, it's the one that knows if a client's content will > > > actually end up anywhere visible. > > > > > > > wfm > > > > BR, > > -R > > > > > > > > Jonas > > > > > > > > > > > BR, > > > > -R > > >
On Thu, 16 Mar 2023 23:22:24 +0100 Sebastian Wick <sebastian.wick@redhat.com> wrote: > On Thu, Mar 16, 2023 at 5:29 PM Rob Clark <robdclark@gmail.com> wrote: > > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > v6: More docs > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > --- > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > + > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > + :doc: deadline hints > > > > > > > > + > > > > > > > > DMA Fences Functions Reference > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > } > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > +/** > > > > > > > > + * DOC: deadline hints > > > > > > > > + * > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > + * > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > + * result reduce frequency. > > > > > > > > + * > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > + * needed. > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > for a long time. > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > + * > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > + * > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > same device. > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > display controller. For ex, block based (including bandwidth > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > time before you start sending pixel data to the display. But details > > > > > > like this are heavily implementation dependent. I think the most > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > though, so that it can delay committing updates to late, but not too > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > isn't enough for scheduling the actual commit. > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > are talking about something on the order of 100us? But that is a bit > > > > of another topic. > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > commits closer to the deadline has more complex behavior than that too, > > > e.g. the need for real time scheduling, and knowing how long it usually > > > takes to create and commit and for the kernel to process. > > Vblank can be really long, especially with VRR where the additional > time you get to finish the frame comes from making vblank longer. > Using the start of vblank as a deadline makes VRR useless. It really > would be nice to have some feedback about the actual deadline from the > kernel, maybe in `struct drm_event_vblank`. Hi Sebastian, why would deadline at vblank beginning not be exactly what you want also with VRR? Let's say the frame misses that deadline on a VRR system, but the frame still completes and the flip makes it to the intended scanout cycle, thanks to VRR giving it more time. Why should that miss not be classified as a miss? If it is classified as a miss, then GPU freq and whatnot will be increased, which will increase the effective update rate of VRR. If it is not classified as a miss, then GPU might not speed up, and you end up with low update rate, even though you still don't skip scanout cycles thanks to VRR. I guess if you actually *want* VRR to run at a low rate in order to keep the GPU in a lower power demand, you could compute your "late deadline" from the minimum rate VRR timings, subtracting the vblank period from the estimated next flip completion timestamp? Thanks, pq
On 3/16/23 23:22, Sebastian Wick wrote: > On Thu, Mar 16, 2023 at 5:29 PM Rob Clark <robdclark@gmail.com> wrote: >> On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: >>> On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: >>>> On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: >>>>> On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: >>>>>> On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: >>>>>>> >>>>>>>> + * >>>>>>>> + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. >>>>>>>> + * The deadline hint provides a way for the waiting driver, or userspace, to >>>>>>>> + * convey an appropriate sense of urgency to the signaling driver. >>>>>>>> + * >>>>>>>> + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace >>>>>>>> + * facing APIs). The time could either be some point in the future (such as >>>>>>>> + * the vblank based deadline for page-flipping, or the start of a compositor's >>>>>>>> + * composition cycle), or the current time to indicate an immediate deadline >>>>>>>> + * hint (Ie. forward progress cannot be made until this fence is signaled). >>>>>>> >>>>>>> Is it guaranteed that a GPU driver will use the actual start of the >>>>>>> vblank as the effective deadline? I have some memories of seing >>>>>>> something about vblank evasion browsing driver code, which I might have >>>>>>> misunderstood, but I have yet to find whether this is something >>>>>>> userspace can actually expect to be something it can rely on. >>>>>> >>>>>> I guess you mean s/GPU driver/display driver/ ? It makes things more >>>>>> clear if we talk about them separately even if they happen to be the >>>>>> same device. >>>>> >>>>> Sure, sorry about being unclear about that. >>>>> >>>>>> >>>>>> Assuming that is what you mean, nothing strongly defines what the >>>>>> deadline is. In practice there is probably some buffering in the >>>>>> display controller. For ex, block based (including bandwidth >>>>>> compressed) formats, you need to buffer up a row of blocks to >>>>>> efficiently linearize for scanout. So you probably need to latch some >>>>>> time before you start sending pixel data to the display. But details >>>>>> like this are heavily implementation dependent. I think the most >>>>>> reasonable thing to target is start of vblank. >>>>> >>>>> The driver exposing those details would be quite useful for userspace >>>>> though, so that it can delay committing updates to late, but not too >>>>> late. Setting a deadline to be the vblank seems easy enough, but it >>>>> isn't enough for scheduling the actual commit. >>>> >>>> I'm not entirely sure how that would even work.. but OTOH I think you >>>> are talking about something on the order of 100us? But that is a bit >>>> of another topic. >>> >>> Yes, something like that. But yea, it's not really related. Scheduling >>> commits closer to the deadline has more complex behavior than that too, >>> e.g. the need for real time scheduling, and knowing how long it usually >>> takes to create and commit and for the kernel to process. > > Vblank can be really long, especially with VRR where the additional > time you get to finish the frame comes from making vblank longer. > Using the start of vblank as a deadline makes VRR useless. Not really. We normally still want to aim for start of vblank with VRR, which would result in the maximum refresh rate. Missing that target just incurs less of a penalty than with fixed refresh rate.
On Fri, 17 Mar 2023 11:09:21 +0200 Pekka Paalanen <ppaalanen@gmail.com> wrote: > On Thu, 16 Mar 2023 23:22:24 +0100 > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > Vblank can be really long, especially with VRR where the additional > > time you get to finish the frame comes from making vblank longer. Btw. VRR extends front porch, not vblank. Thanks, pq
On Fri, 17 Mar 2023 11:17:37 +0200 Pekka Paalanen <ppaalanen@gmail.com> wrote: > On Fri, 17 Mar 2023 11:09:21 +0200 > Pekka Paalanen <ppaalanen@gmail.com> wrote: > > > On Thu, 16 Mar 2023 23:22:24 +0100 > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > > Vblank can be really long, especially with VRR where the additional > > > time you get to finish the frame comes from making vblank longer. > > Btw. VRR extends front porch, not vblank. Need to correct myself too. vblank includes front porch, vsync does not. https://electronics.stackexchange.com/questions/166681/how-exactly-does-a-vga-cable-work Thanks, pq
On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote: > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > v6: More docs > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > --- > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > + > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > + :doc: deadline hints > > > > > > > + > > > > > > > DMA Fences Functions Reference > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > } > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > +/** > > > > > > > + * DOC: deadline hints > > > > > > > + * > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > + * > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > + * result reduce frequency. > > > > > > > + * > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > + * needed. > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > for a long time. > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > developer, so I will need some hand holding when it comes to > > > > > > understanding exactly what piece of software is responsible for > > > > > > communicating what piece of information. > > > > > > > > > > > > > + * > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > + * > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > clear if we talk about them separately even if they happen to be the > > > > > same device. > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > deadline is. In practice there is probably some buffering in the > > > > > display controller. For ex, block based (including bandwidth > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > time before you start sending pixel data to the display. But details > > > > > like this are heavily implementation dependent. I think the most > > > > > reasonable thing to target is start of vblank. > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > though, so that it can delay committing updates to late, but not too > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > isn't enough for scheduling the actual commit. > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > are talking about something on the order of 100us? But that is a bit > > > of another topic. > > > > Yes, something like that. But yea, it's not really related. Scheduling > > commits closer to the deadline has more complex behavior than that too, > > e.g. the need for real time scheduling, and knowing how long it usually > > takes to create and commit and for the kernel to process. > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > is unaware of anyways. > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > only way to tell the kernel it should speed things up. Even with the > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > or dropping the first frame is so far the best work around that works, > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > Having to rely on choosing between latency and frame drops should > > > > ideally not have to be made. > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > CPU, not the GPU. There are existing mechanisms for that. > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > so that when the job queue arrives, it's already up to speed? > > Downstream we have input notifier that resumes the GPU so we can > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > wait to boost freq until we have cmdstream to submit, since that > doesn't take as long. What needs help initially after input is all > the stuff that happens on the CPU before the GPU can start to do > anything ;-) How do you deal with boosting CPU speeds downstream? Does the input notifier do that too? > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > to help the steady-state situation, rather than the input-latency > situation. It might take a frame or two of missed deadlines for > gpufreq to arrive at a good steady-state freq. I'm just not sure it will help. Missed deadlines set at commit hasn't been enough in the past to let the kernel understand it should speed things up before the next frame (which will be a whole frame late without any triple buffering which should be a last resort), so I don't see how it will help by adding a userspace hook to do the same thing. I think input latency and steady state target frequency here is tightly linked; what we should aim for is to provide enough information at the right time so that it does *not* take a frame or two to of missed deadlines to arrive at the target frequency, as those missed deadlines either means either stuttering and/or lag. That it helps with the deliberately late commit I do understand, but we don't do that yet, but intend to when there is kernel uapi to lets us do so without negative consequences. > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > boost (and early wake-up trigger for GPU).. we have something like > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > is that fence deadlines are just one of several things we need to > > > optimize power/perf and responsiveness, rather than the single thing > > > that solves every problem under the sun ;-) > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > of the puzzle that has the information to know whether there is need > > actually speed up is the compositor, not the kernel. > > > > For example, pressing 'p' while a terminal is focused does not need high > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > the compositor to composite that update. Pressing <Super> may however > > trigger a non-trivial animation moving a lot of stuff around on screen, > > maybe triggering Wayland clients to draw and what not, and should most > > arguably have the ability to "warn" the kernel about the upcoming flood > > of work before it is already knocking on its door step. > > The super key is problematic, but not for the reason you think. It is > because it is a case where we should boost on key-up instead of > key-down.. and the second key-up event comes after the cpu-boost is > already in it's cool-down period. But even if suboptimal in cases > like this, it is still useful for touch/stylus cases where the > slightest of lag is much more perceptible. Other keys are even more problematic. Alt, for example, does nothing, Alt + Tab does some light rendering, but Alt + KeyAboveTab will, depending on the current active applications, suddenly trigger N Wayland surfaces to start rendering at the same time. > > This is getting off topic but I kinda favor coming up with some sort > of static definition that userspace could give the kernel to let the > kernel know what input to boost on. Or maybe something could be done > with BPF? I have hard time seeing any static information can be enough, it's depends too much on context what is expected to happen. And can a BPF program really help? Unless BPF programs that pulls some internal kernel strings to speed things up whenever userspace wants I don't see how it is that much better. I don't think userspace is necessarily too slow to actively particitpate in providing direct scheduling hints either. Input processing can, for example, be off loaded to a real time scheduled thread, and plumbing any hints about future expectations from rendering, windowing and layout subsystems will be significantly easier to plumb to a real time input thread than translated into static informations or BPF programs. Jonas
On Thu, Mar 16, 2023 at 11:59 PM Rob Clark <robdclark@gmail.com> wrote: > > On Thu, Mar 16, 2023 at 3:22 PM Sebastian Wick > <sebastian.wick@redhat.com> wrote: > > > > On Thu, Mar 16, 2023 at 5:29 PM Rob Clark <robdclark@gmail.com> wrote: > > > > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > > v6: More docs > > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > > --- > > > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > + > > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > + :doc: deadline hints > > > > > > > > > + > > > > > > > > > DMA Fences Functions Reference > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * DOC: deadline hints > > > > > > > > > + * > > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > > + * > > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > > + * result reduce frequency. > > > > > > > > > + * > > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > > + * needed. > > > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > > for a long time. > > > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > > > + * > > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > > + * > > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > > same device. > > > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > > display controller. For ex, block based (including bandwidth > > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > > time before you start sending pixel data to the display. But details > > > > > > > like this are heavily implementation dependent. I think the most > > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > > though, so that it can delay committing updates to late, but not too > > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > > isn't enough for scheduling the actual commit. > > > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > > are talking about something on the order of 100us? But that is a bit > > > > > of another topic. > > > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > > commits closer to the deadline has more complex behavior than that too, > > > > e.g. the need for real time scheduling, and knowing how long it usually > > > > takes to create and commit and for the kernel to process. > > > > Vblank can be really long, especially with VRR where the additional > > time you get to finish the frame comes from making vblank longer. > > Using the start of vblank as a deadline makes VRR useless. It really > > would be nice to have some feedback about the actual deadline from the > > kernel, maybe in `struct drm_event_vblank`. > > note that here we are only talking about the difference between > start/end of vblank and the deadline for the hw to latch a change for > the next frame. (Which I _expect_ generally amounts to however long > it takes to slurp in a row of tiles) > > > But yes, sorry, off topic... > > > > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > > > is unaware of anyways. > > > > > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > > > only way to tell the kernel it should speed things up. Even with the > > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > > > or dropping the first frame is so far the best work around that works, > > > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > > > Having to rely on choosing between latency and frame drops should > > > > > > ideally not have to be made. > > > > > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > > > CPU, not the GPU. There are existing mechanisms for that. > > > > > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > > > so that when the job queue arrives, it's already up to speed? > > > > > > Downstream we have input notifier that resumes the GPU so we can > > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > > > wait to boost freq until we have cmdstream to submit, since that > > > doesn't take as long. What needs help initially after input is all > > > the stuff that happens on the CPU before the GPU can start to do > > > anything ;-) > > > > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > > > to help the steady-state situation, rather than the input-latency > > > situation. It might take a frame or two of missed deadlines for > > > gpufreq to arrive at a good steady-state freq. > > > > The mutter issue also is about a suboptimal steady-state. > > > > Truth be told, I'm not sure if this fence deadline idea fixes the > > issue we're seeing or at least helps sometimes. It might, it might > > not. What annoys me is that the compositor *knows* before any work is > > submitted that some work will be submitted and when it has to finish. > > We could maximize the chances to get everything right but having to > > wait for a fence to materialize in the compositor to do anything about > > it is suboptimal. > > Why would the app not immediately send the fence+buf to the compositor > as soon as it is submitted to the kernel on client process side? Some apps just are not good at this. Reading back work from the GPU, taking a lot of CPU time to create the GPU work, etc. The other obvious offender: frame callbacks. Committing a buffer only happens after receiving a frame callback in FIFO/vsync mode which we try to schedule as close to the deadline as possible. The idea that the clients are able to submit all GPU work some time early, then immediately commit to show up in the compositor well before the deadline is very idealized. We're trying to get there but we also only have control over the WSI so bad apps will still be bad apps. > At any rate, it really doesn't matter how early the kernel finds out > about the deadline, since the point is to let the kernel driver know > if it is missing the deadline so that it doesn't mis-interpret stall > time waiting for the _next_ vblank after the one we wanted. That's a good point! Let's see how well this works in practice and how we can improve on that in the future. > > > > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > > > boost (and early wake-up trigger for GPU).. we have something like > > > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > > > is that fence deadlines are just one of several things we need to > > > > > optimize power/perf and responsiveness, rather than the single thing > > > > > that solves every problem under the sun ;-) > > > > > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > > > of the puzzle that has the information to know whether there is need > > > > actually speed up is the compositor, not the kernel. > > > > > > > > For example, pressing 'p' while a terminal is focused does not need high > > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > > > the compositor to composite that update. Pressing <Super> may however > > > > trigger a non-trivial animation moving a lot of stuff around on screen, > > > > maybe triggering Wayland clients to draw and what not, and should most > > > > arguably have the ability to "warn" the kernel about the upcoming flood > > > > of work before it is already knocking on its door step. > > > > > > The super key is problematic, but not for the reason you think. It is > > > because it is a case where we should boost on key-up instead of > > > key-down.. and the second key-up event comes after the cpu-boost is > > > already in it's cool-down period. But even if suboptimal in cases > > > like this, it is still useful for touch/stylus cases where the > > > slightest of lag is much more perceptible. > > > > > > This is getting off topic but I kinda favor coming up with some sort > > > of static definition that userspace could give the kernel to let the > > > kernel know what input to boost on. Or maybe something could be done > > > with BPF? > > > > Why? Do you think user space is so slow that it can't process the > > input events and then do a syscall? We need to have all input devices > > open anyway that can affect the system and know more about how they > > affect behavior than the kernel can ever know. > > Again this is getting off into a different topic. But my gut feel is > that the shorter the path to input cpu freq boost, the better.. since > however many extra cycles you add, they will be cycles with cpu (and > probably ddr) at lowest freq On the one hand, sure, that makes sense in theory. On the other hand, we won't know for sure until we try it and I suspect a RT thread in user space will be fast enough. > BR, > -R > > > > > > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > > Is it expected that WSI's will set their own deadlines, or should that > > > > > > be the job of the compositor? For example by using compositors using > > > > > > DMA_BUF_IOCTL_EXPORT_SYNC_FILE that you mentioned, using it to set a > > > > > > deadline matching the vsync it most ideally will be committed to? > > > > > > > > > > > > > > > > I'm kind of assuming compositors, but if the WSI somehow has more > > > > > information about ideal presentation time, then I suppose it could be > > > > > in the WSI? I'll defer to folks who spend more time on WSI and > > > > > compositors to hash out the details ;-) > > > > > > > > With my compositor developer hat on, it might be best to let it be up to > > > > the compositor, it's the one that knows if a client's content will > > > > actually end up anywhere visible. > > > > > > > > > > wfm > > > > > > BR, > > > -R > > > > > > > > > > > Jonas > > > > > > > > > > > > > > BR, > > > > > -R > > > > > >
On Fri, Mar 17, 2023 at 3:23 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote: > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > v6: More docs > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > --- > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > + > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > + :doc: deadline hints > > > > > > > > + > > > > > > > > DMA Fences Functions Reference > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > } > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > +/** > > > > > > > > + * DOC: deadline hints > > > > > > > > + * > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > + * > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > + * result reduce frequency. > > > > > > > > + * > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > + * needed. > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > for a long time. > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > + * > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > + * > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > same device. > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > display controller. For ex, block based (including bandwidth > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > time before you start sending pixel data to the display. But details > > > > > > like this are heavily implementation dependent. I think the most > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > though, so that it can delay committing updates to late, but not too > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > isn't enough for scheduling the actual commit. > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > are talking about something on the order of 100us? But that is a bit > > > > of another topic. > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > commits closer to the deadline has more complex behavior than that too, > > > e.g. the need for real time scheduling, and knowing how long it usually > > > takes to create and commit and for the kernel to process. > > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > > is unaware of anyways. > > > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > > only way to tell the kernel it should speed things up. Even with the > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > > or dropping the first frame is so far the best work around that works, > > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > > Having to rely on choosing between latency and frame drops should > > > > > ideally not have to be made. > > > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > > CPU, not the GPU. There are existing mechanisms for that. > > > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > > so that when the job queue arrives, it's already up to speed? > > > > Downstream we have input notifier that resumes the GPU so we can > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > > wait to boost freq until we have cmdstream to submit, since that > > doesn't take as long. What needs help initially after input is all > > the stuff that happens on the CPU before the GPU can start to do > > anything ;-) > > How do you deal with boosting CPU speeds downstream? Does the input > notifier do that too? Yes.. actually currently downstream (depending on device) we have 1 to 3 input notifiers, one for CPU boost, one for early-PSR-exit, and one to get a head start on booting up the GPU. > > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > > to help the steady-state situation, rather than the input-latency > > situation. It might take a frame or two of missed deadlines for > > gpufreq to arrive at a good steady-state freq. > > I'm just not sure it will help. Missed deadlines set at commit hasn't > been enough in the past to let the kernel understand it should speed > things up before the next frame (which will be a whole frame late > without any triple buffering which should be a last resort), so I don't > see how it will help by adding a userspace hook to do the same thing. So deadline is just a superset of "right now" and "sometime in the future".. and this has been useful enough for i915 that they have both forms, when waiting on GPU via i915 specific ioctls and when pageflip (assuming userspace isn't deferring composition decision and instead just pushing it all down to the kernel). But this breaks down in a few cases: 1) non pageflip (for ex. ping-ponging between cpu and gpu) use cases when you wait via polling on fence fd or wait via drm_syncobj instead of DRM_IOCTL_I915_GEM_WAIT 2) when userspace decides late in frame to not pageflip because app fence isn't signaled yet And this is all done in a way that doesn't help for situations where you have separate kms and render devices. Or the kms driver doesn't bypass atomic helpers (ie. uses drm_atomic_helper_wait_for_fences()). So the technique has already proven to be useful. This series just extends it beyond driver specific primitives (ie. dma_fence/drm_syncojb) > I think input latency and steady state target frequency here is tightly > linked; what we should aim for is to provide enough information at the > right time so that it does *not* take a frame or two to of missed > deadlines to arrive at the target frequency, as those missed deadlines > either means either stuttering and/or lag. If you have some magic way for a gl/vk driver to accurately predict how many cycles it will take to execute a sequence of draws, I'm all ears. Realistically, the best solution on sudden input is to overshoot and let freqs settle back down. But there is a lot more to input latency than GPU freq. In UI workloads, even fullscreen animation, I don't really see the GPU going above the 2nd lowest OPP even on relatively small things like a618. UI input latency (touch scrolling, on-screen stylus / low-latency-ink, animations) are a separate issue from what this series addresses, and aren't too much to do with GPU freq. > That it helps with the deliberately late commit I do understand, but we > don't do that yet, but intend to when there is kernel uapi to lets us do > so without negative consequences. > > > > > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > > boost (and early wake-up trigger for GPU).. we have something like > > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > > is that fence deadlines are just one of several things we need to > > > > optimize power/perf and responsiveness, rather than the single thing > > > > that solves every problem under the sun ;-) > > > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > > of the puzzle that has the information to know whether there is need > > > actually speed up is the compositor, not the kernel. > > > > > > For example, pressing 'p' while a terminal is focused does not need high > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > > the compositor to composite that update. Pressing <Super> may however > > > trigger a non-trivial animation moving a lot of stuff around on screen, > > > maybe triggering Wayland clients to draw and what not, and should most > > > arguably have the ability to "warn" the kernel about the upcoming flood > > > of work before it is already knocking on its door step. > > > > The super key is problematic, but not for the reason you think. It is > > because it is a case where we should boost on key-up instead of > > key-down.. and the second key-up event comes after the cpu-boost is > > already in it's cool-down period. But even if suboptimal in cases > > like this, it is still useful for touch/stylus cases where the > > slightest of lag is much more perceptible. > > Other keys are even more problematic. Alt, for example, does nothing, > Alt + Tab does some light rendering, but Alt + KeyAboveTab will, > depending on the current active applications, suddenly trigger N Wayland > surfaces to start rendering at the same time. > > > > > This is getting off topic but I kinda favor coming up with some sort > > of static definition that userspace could give the kernel to let the > > kernel know what input to boost on. Or maybe something could be done > > with BPF? > > I have hard time seeing any static information can be enough, it's > depends too much on context what is expected to happen. And can a BPF > program really help? Unless BPF programs that pulls some internal kernel > strings to speed things up whenever userspace wants I don't see how it > is that much better. > > I don't think userspace is necessarily too slow to actively particitpate > in providing direct scheduling hints either. Input processing can, for > example, be off loaded to a real time scheduled thread, and plumbing any > hints about future expectations from rendering, windowing and layout > subsystems will be significantly easier to plumb to a real time input > thread than translated into static informations or BPF programs. I mean, the kernel side input handler is called from irq context long before even the scheduler gets involved.. But I think you are over-thinking the Alt + SomeOtherKey case. The important thing isn't what the other key is, it is just to know that Alt is a modifier key (ie. handle it on key-up instead of key-down). No need to over-complicate things. It's probably enough to give the kernel a list of modifier+key combo's that do _something_.. And like I've said before, keyboard input is the least problematic in terms of latency. It is a _lot_ easier to notice lag with touch scrolling or stylus (on screen). (The latter case, I think wayland has some catching up to do compared to CrOS or android.. you really need a way to allow the app to do front buffer rendering to an overlay for the stylus case, because even just 16ms delay is _very_ noticeable.) BR, -R
On Fri, Mar 17, 2023 at 08:59:48AM -0700, Rob Clark wrote: > On Fri, Mar 17, 2023 at 3:23 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote: > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > > v6: More docs > > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > > --- > > > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > + > > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > + :doc: deadline hints > > > > > > > > > + > > > > > > > > > DMA Fences Functions Reference > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * DOC: deadline hints > > > > > > > > > + * > > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > > + * > > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > > + * result reduce frequency. > > > > > > > > > + * > > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > > + * needed. > > > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > > for a long time. > > > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > > > + * > > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > > + * > > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > > same device. > > > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > > display controller. For ex, block based (including bandwidth > > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > > time before you start sending pixel data to the display. But details > > > > > > > like this are heavily implementation dependent. I think the most > > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > > though, so that it can delay committing updates to late, but not too > > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > > isn't enough for scheduling the actual commit. > > > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > > are talking about something on the order of 100us? But that is a bit > > > > > of another topic. > > > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > > commits closer to the deadline has more complex behavior than that too, > > > > e.g. the need for real time scheduling, and knowing how long it usually > > > > takes to create and commit and for the kernel to process. > > > > > > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > > > is unaware of anyways. > > > > > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > > > only way to tell the kernel it should speed things up. Even with the > > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > > > or dropping the first frame is so far the best work around that works, > > > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > > > Having to rely on choosing between latency and frame drops should > > > > > > ideally not have to be made. > > > > > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > > > CPU, not the GPU. There are existing mechanisms for that. > > > > > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > > > so that when the job queue arrives, it's already up to speed? > > > > > > Downstream we have input notifier that resumes the GPU so we can > > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > > > wait to boost freq until we have cmdstream to submit, since that > > > doesn't take as long. What needs help initially after input is all > > > the stuff that happens on the CPU before the GPU can start to do > > > anything ;-) > > > > How do you deal with boosting CPU speeds downstream? Does the input > > notifier do that too? > > Yes.. actually currently downstream (depending on device) we have 1 to > 3 input notifiers, one for CPU boost, one for early-PSR-exit, and one > to get a head start on booting up the GPU. Would be really nice to upstream these, one way or the other, be it actually input event based, or via some uapi to just poke the kernel. I realize it's not related to this thread, so this is just me wishing things into the void. > > > > > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > > > to help the steady-state situation, rather than the input-latency > > > situation. It might take a frame or two of missed deadlines for > > > gpufreq to arrive at a good steady-state freq. > > > > I'm just not sure it will help. Missed deadlines set at commit hasn't > > been enough in the past to let the kernel understand it should speed > > things up before the next frame (which will be a whole frame late > > without any triple buffering which should be a last resort), so I don't > > see how it will help by adding a userspace hook to do the same thing. > > So deadline is just a superset of "right now" and "sometime in the > future".. and this has been useful enough for i915 that they have both > forms, when waiting on GPU via i915 specific ioctls and when pageflip > (assuming userspace isn't deferring composition decision and instead > just pushing it all down to the kernel). But this breaks down in a > few cases: > > 1) non pageflip (for ex. ping-ponging between cpu and gpu) use cases > when you wait via polling on fence fd or wait via drm_syncobj instead > of DRM_IOCTL_I915_GEM_WAIT > 2) when userspace decides late in frame to not pageflip because app > fence isn't signaled yet It breaks down in practice today, because we do entering the low-freq feedback loop that triple buffering today effectively works around. That is even with non-delayed page flipping, and a single pipeline source (compositor only rendering) or only using already signaled ready client buffers when compositing. Anyway, I don't doubt its usefulness, just a bit pessimistic. > > And this is all done in a way that doesn't help for situations where > you have separate kms and render devices. Or the kms driver doesn't > bypass atomic helpers (ie. uses drm_atomic_helper_wait_for_fences()). > So the technique has already proven to be useful. This series just > extends it beyond driver specific primitives (ie. > dma_fence/drm_syncojb) > > > I think input latency and steady state target frequency here is tightly > > linked; what we should aim for is to provide enough information at the > > right time so that it does *not* take a frame or two to of missed > > deadlines to arrive at the target frequency, as those missed deadlines > > either means either stuttering and/or lag. > > If you have some magic way for a gl/vk driver to accurately predict > how many cycles it will take to execute a sequence of draws, I'm all > ears. > > Realistically, the best solution on sudden input is to overshoot and > let freqs settle back down. > > But there is a lot more to input latency than GPU freq. In UI > workloads, even fullscreen animation, I don't really see the GPU going > above the 2nd lowest OPP even on relatively small things like a618. > UI input latency (touch scrolling, on-screen stylus / low-latency-ink, > animations) are a separate issue from what this series addresses, and > aren't too much to do with GPU freq. > > > That it helps with the deliberately late commit I do understand, but we > > don't do that yet, but intend to when there is kernel uapi to lets us do > > so without negative consequences. > > > > > > > > > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > > > boost (and early wake-up trigger for GPU).. we have something like > > > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > > > is that fence deadlines are just one of several things we need to > > > > > optimize power/perf and responsiveness, rather than the single thing > > > > > that solves every problem under the sun ;-) > > > > > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > > > of the puzzle that has the information to know whether there is need > > > > actually speed up is the compositor, not the kernel. > > > > > > > > For example, pressing 'p' while a terminal is focused does not need high > > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > > > the compositor to composite that update. Pressing <Super> may however > > > > trigger a non-trivial animation moving a lot of stuff around on screen, > > > > maybe triggering Wayland clients to draw and what not, and should most > > > > arguably have the ability to "warn" the kernel about the upcoming flood > > > > of work before it is already knocking on its door step. > > > > > > The super key is problematic, but not for the reason you think. It is > > > because it is a case where we should boost on key-up instead of > > > key-down.. and the second key-up event comes after the cpu-boost is > > > already in it's cool-down period. But even if suboptimal in cases > > > like this, it is still useful for touch/stylus cases where the > > > slightest of lag is much more perceptible. > > > > Other keys are even more problematic. Alt, for example, does nothing, > > Alt + Tab does some light rendering, but Alt + KeyAboveTab will, > > depending on the current active applications, suddenly trigger N Wayland > > surfaces to start rendering at the same time. > > > > > > > > This is getting off topic but I kinda favor coming up with some sort > > > of static definition that userspace could give the kernel to let the > > > kernel know what input to boost on. Or maybe something could be done > > > with BPF? > > > > I have hard time seeing any static information can be enough, it's > > depends too much on context what is expected to happen. And can a BPF > > program really help? Unless BPF programs that pulls some internal kernel > > strings to speed things up whenever userspace wants I don't see how it > > is that much better. > > > > I don't think userspace is necessarily too slow to actively particitpate > > in providing direct scheduling hints either. Input processing can, for > > example, be off loaded to a real time scheduled thread, and plumbing any > > hints about future expectations from rendering, windowing and layout > > subsystems will be significantly easier to plumb to a real time input > > thread than translated into static informations or BPF programs. > > I mean, the kernel side input handler is called from irq context long > before even the scheduler gets involved.. > > But I think you are over-thinking the Alt + SomeOtherKey case. The > important thing isn't what the other key is, it is just to know that > Alt is a modifier key (ie. handle it on key-up instead of key-down). > No need to over-complicate things. It's probably enough to give the > kernel a list of modifier+key combo's that do _something_.. Perhaps I'm over thinking it, it just seems all so unnecessary to complicate the kernel so that it's able to predict when GUI animations will happen instead of the GUI itself doing it when it is actually beneficial. All it'd take (naively) is uapi for the three kind of boosts downstream now does automatically from input events. > > And like I've said before, keyboard input is the least problematic in > terms of latency. It is a _lot_ easier to notice lag with touch > scrolling or stylus (on screen). (The latter case, I think wayland > has some catching up to do compared to CrOS or android.. you really > need a way to allow the app to do front buffer rendering to an overlay > for the stylus case, because even just 16ms delay is _very_ > noticeable.) Sure, but here too userpsace (rt thread in the compositor) is probably a good enough place to predict when to boost since it will be the one proxies e.g. the stylus input events to the application. Front buffering on the other hand is a very different topic ;) Jonas > > BR, > -R
On Tue, Mar 21, 2023 at 6:24 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > On Fri, Mar 17, 2023 at 08:59:48AM -0700, Rob Clark wrote: > > On Fri, Mar 17, 2023 at 3:23 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote: > > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote: > > > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote: > > > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote: > > > > > > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > > > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as > > > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid > > > > > > > > > > the fence signaler in making power management decisions, like boosting > > > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines > > > > > > > > > > so that can be factored in to the frequency scaling. > > > > > > > > > > > > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate > > > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context > > > > > > > > > > implementation will need similar logic to track deadlines of all > > > > > > > > > > the fences on the same timeline. [ckoenig] > > > > > > > > > > v3: Clarify locking wrt. set_deadline callback > > > > > > > > > > v4: Clarify in docs comment that this is a hint > > > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT. > > > > > > > > > > v6: More docs > > > > > > > > > > v7: Fix typo, clarify past deadlines > > > > > > > > > > > > > > > > > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > > > > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com> > > > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > Hi Rob! > > > > > > > > > > > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++ > > > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++ > > > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++ > > > > > > > > > > 3 files changed, 87 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst > > > > > > > > > > index 622b8156d212..183e480d8cea 100644 > > > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst > > > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst > > > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations > > > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > > :doc: fence signalling annotation > > > > > > > > > > > > > > > > > > > > +DMA Fence Deadline Hints > > > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > + > > > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c > > > > > > > > > > + :doc: deadline hints > > > > > > > > > > + > > > > > > > > > > DMA Fences Functions Reference > > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c > > > > > > > > > > index 0de0482cd36e..f177c56269bb 100644 > > > > > > > > > > --- a/drivers/dma-buf/dma-fence.c > > > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c > > > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > > + * DOC: deadline hints > > > > > > > > > > + * > > > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently > > > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum > > > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize > > > > > > > > > > + * power consumption. But in the real world there are many workloads which > > > > > > > > > > + * defy this ideal. For example, but not limited to: > > > > > > > > > > + * > > > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods > > > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in > > > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in > > > > > > > > > > + * result reduce frequency. > > > > > > > > > > + * > > > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double > > > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, > > > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU > > > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to > > > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is > > > > > > > > > > + * needed. > > > > > > > > > > > > > > > > > > This is the use case I'd like to get some better understanding about how > > > > > > > > > this series intends to work, as the problematic scheduling behavior > > > > > > > > > triggered by missed deadlines has plagued compositing display servers > > > > > > > > > for a long time. > > > > > > > > > > > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver > > > > > > > > > developer, so I will need some hand holding when it comes to > > > > > > > > > understanding exactly what piece of software is responsible for > > > > > > > > > communicating what piece of information. > > > > > > > > > > > > > > > > > > > + * > > > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. > > > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to > > > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver. > > > > > > > > > > + * > > > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace > > > > > > > > > > + * facing APIs). The time could either be some point in the future (such as > > > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's > > > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline > > > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled). > > > > > > > > > > > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the > > > > > > > > > vblank as the effective deadline? I have some memories of seing > > > > > > > > > something about vblank evasion browsing driver code, which I might have > > > > > > > > > misunderstood, but I have yet to find whether this is something > > > > > > > > > userspace can actually expect to be something it can rely on. > > > > > > > > > > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more > > > > > > > > clear if we talk about them separately even if they happen to be the > > > > > > > > same device. > > > > > > > > > > > > > > Sure, sorry about being unclear about that. > > > > > > > > > > > > > > > > > > > > > > > Assuming that is what you mean, nothing strongly defines what the > > > > > > > > deadline is. In practice there is probably some buffering in the > > > > > > > > display controller. For ex, block based (including bandwidth > > > > > > > > compressed) formats, you need to buffer up a row of blocks to > > > > > > > > efficiently linearize for scanout. So you probably need to latch some > > > > > > > > time before you start sending pixel data to the display. But details > > > > > > > > like this are heavily implementation dependent. I think the most > > > > > > > > reasonable thing to target is start of vblank. > > > > > > > > > > > > > > The driver exposing those details would be quite useful for userspace > > > > > > > though, so that it can delay committing updates to late, but not too > > > > > > > late. Setting a deadline to be the vblank seems easy enough, but it > > > > > > > isn't enough for scheduling the actual commit. > > > > > > > > > > > > I'm not entirely sure how that would even work.. but OTOH I think you > > > > > > are talking about something on the order of 100us? But that is a bit > > > > > > of another topic. > > > > > > > > > > Yes, something like that. But yea, it's not really related. Scheduling > > > > > commits closer to the deadline has more complex behavior than that too, > > > > > e.g. the need for real time scheduling, and knowing how long it usually > > > > > takes to create and commit and for the kernel to process. > > > > > > > > > > > > > > > > > > > > > 8-< *snip* 8-< > > > > > > > > > > > > > > > > > > > > > You need a fence to set the deadline, and for that work needs to be > > > > > > > > flushed. But you can't associate a deadline with work that the kernel > > > > > > > > is unaware of anyways. > > > > > > > > > > > > > > That makes sense, but it might also a bit inadequate to have it as the > > > > > > > only way to tell the kernel it should speed things up. Even with the > > > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback > > > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying > > > > > > > or dropping the first frame is so far the best work around that works, > > > > > > > except doing other tricks that makes the kernel to ramp up its clock. > > > > > > > Having to rely on choosing between latency and frame drops should > > > > > > > ideally not have to be made. > > > > > > > > > > > > Before you have a fence, the thing you want to be speeding up is the > > > > > > CPU, not the GPU. There are existing mechanisms for that. > > > > > > > > > > Is there no benefit to let the GPU know earlier that it should speed up, > > > > > so that when the job queue arrives, it's already up to speed? > > > > > > > > Downstream we have input notifier that resumes the GPU so we can > > > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we > > > > wait to boost freq until we have cmdstream to submit, since that > > > > doesn't take as long. What needs help initially after input is all > > > > the stuff that happens on the CPU before the GPU can start to do > > > > anything ;-) > > > > > > How do you deal with boosting CPU speeds downstream? Does the input > > > notifier do that too? > > > > Yes.. actually currently downstream (depending on device) we have 1 to > > 3 input notifiers, one for CPU boost, one for early-PSR-exit, and one > > to get a head start on booting up the GPU. > > Would be really nice to upstream these, one way or the other, be it > actually input event based, or via some uapi to just poke the kernel. I > realize it's not related to this thread, so this is just me wishing > things into the void. There was a drm/input_helper proposed maybe a year or so back, mainly for the early-PSR-exit but I was planning to build on that for early GPU wake-up. I guess we should revisit it. Might not be the right place for cpu boost, but it solves some problems so it's a start. As far as uapi, I think sysfs already gives you everything or at least most everything you need. For ex, /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq .. on the gpu side, for drivers using devfreq (ie. panfrost/msm/etc) there is similar sysfs. I'm not sure what sort of knobs are avail on intel/amd. BR, -R > > > > > > > > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying > > > > to help the steady-state situation, rather than the input-latency > > > > situation. It might take a frame or two of missed deadlines for > > > > gpufreq to arrive at a good steady-state freq. > > > > > > I'm just not sure it will help. Missed deadlines set at commit hasn't > > > been enough in the past to let the kernel understand it should speed > > > things up before the next frame (which will be a whole frame late > > > without any triple buffering which should be a last resort), so I don't > > > see how it will help by adding a userspace hook to do the same thing. > > > > So deadline is just a superset of "right now" and "sometime in the > > future".. and this has been useful enough for i915 that they have both > > forms, when waiting on GPU via i915 specific ioctls and when pageflip > > (assuming userspace isn't deferring composition decision and instead > > just pushing it all down to the kernel). But this breaks down in a > > few cases: > > > > 1) non pageflip (for ex. ping-ponging between cpu and gpu) use cases > > when you wait via polling on fence fd or wait via drm_syncobj instead > > of DRM_IOCTL_I915_GEM_WAIT > > 2) when userspace decides late in frame to not pageflip because app > > fence isn't signaled yet > > It breaks down in practice today, because we do entering the low-freq > feedback loop that triple buffering today effectively works around. > That is even with non-delayed page flipping, and a single pipeline > source (compositor only rendering) or only using already signaled ready > client buffers when compositing. > > Anyway, I don't doubt its usefulness, just a bit pessimistic. > > > > > And this is all done in a way that doesn't help for situations where > > you have separate kms and render devices. Or the kms driver doesn't > > bypass atomic helpers (ie. uses drm_atomic_helper_wait_for_fences()). > > So the technique has already proven to be useful. This series just > > extends it beyond driver specific primitives (ie. > > dma_fence/drm_syncojb) > > > > > I think input latency and steady state target frequency here is tightly > > > linked; what we should aim for is to provide enough information at the > > > right time so that it does *not* take a frame or two to of missed > > > deadlines to arrive at the target frequency, as those missed deadlines > > > either means either stuttering and/or lag. > > > > If you have some magic way for a gl/vk driver to accurately predict > > how many cycles it will take to execute a sequence of draws, I'm all > > ears. > > > > Realistically, the best solution on sudden input is to overshoot and > > let freqs settle back down. > > > > But there is a lot more to input latency than GPU freq. In UI > > workloads, even fullscreen animation, I don't really see the GPU going > > above the 2nd lowest OPP even on relatively small things like a618. > > UI input latency (touch scrolling, on-screen stylus / low-latency-ink, > > animations) are a separate issue from what this series addresses, and > > aren't too much to do with GPU freq. > > > > > That it helps with the deliberately late commit I do understand, but we > > > don't do that yet, but intend to when there is kernel uapi to lets us do > > > so without negative consequences. > > > > > > > > > > > > > > > > > > > TBF I'm of the belief that there is still a need for input based cpu > > > > > > boost (and early wake-up trigger for GPU).. we have something like > > > > > > this in CrOS kernel. That is a bit of a different topic, but my point > > > > > > is that fence deadlines are just one of several things we need to > > > > > > optimize power/perf and responsiveness, rather than the single thing > > > > > > that solves every problem under the sun ;-) > > > > > > > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece > > > > > of the puzzle that has the information to know whether there is need > > > > > actually speed up is the compositor, not the kernel. > > > > > > > > > > For example, pressing 'p' while a terminal is focused does not need high > > > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and > > > > > the compositor to composite that update. Pressing <Super> may however > > > > > trigger a non-trivial animation moving a lot of stuff around on screen, > > > > > maybe triggering Wayland clients to draw and what not, and should most > > > > > arguably have the ability to "warn" the kernel about the upcoming flood > > > > > of work before it is already knocking on its door step. > > > > > > > > The super key is problematic, but not for the reason you think. It is > > > > because it is a case where we should boost on key-up instead of > > > > key-down.. and the second key-up event comes after the cpu-boost is > > > > already in it's cool-down period. But even if suboptimal in cases > > > > like this, it is still useful for touch/stylus cases where the > > > > slightest of lag is much more perceptible. > > > > > > Other keys are even more problematic. Alt, for example, does nothing, > > > Alt + Tab does some light rendering, but Alt + KeyAboveTab will, > > > depending on the current active applications, suddenly trigger N Wayland > > > surfaces to start rendering at the same time. > > > > > > > > > > > This is getting off topic but I kinda favor coming up with some sort > > > > of static definition that userspace could give the kernel to let the > > > > kernel know what input to boost on. Or maybe something could be done > > > > with BPF? > > > > > > I have hard time seeing any static information can be enough, it's > > > depends too much on context what is expected to happen. And can a BPF > > > program really help? Unless BPF programs that pulls some internal kernel > > > strings to speed things up whenever userspace wants I don't see how it > > > is that much better. > > > > > > I don't think userspace is necessarily too slow to actively particitpate > > > in providing direct scheduling hints either. Input processing can, for > > > example, be off loaded to a real time scheduled thread, and plumbing any > > > hints about future expectations from rendering, windowing and layout > > > subsystems will be significantly easier to plumb to a real time input > > > thread than translated into static informations or BPF programs. > > > > I mean, the kernel side input handler is called from irq context long > > before even the scheduler gets involved.. > > > > But I think you are over-thinking the Alt + SomeOtherKey case. The > > important thing isn't what the other key is, it is just to know that > > Alt is a modifier key (ie. handle it on key-up instead of key-down). > > No need to over-complicate things. It's probably enough to give the > > kernel a list of modifier+key combo's that do _something_.. > > Perhaps I'm over thinking it, it just seems all so unnecessary to > complicate the kernel so that it's able to predict when GUI animations > will happen instead of the GUI itself doing it when it is actually > beneficial. All it'd take (naively) is uapi for the three kind of boosts > downstream now does automatically from input events. > > > > > And like I've said before, keyboard input is the least problematic in > > terms of latency. It is a _lot_ easier to notice lag with touch > > scrolling or stylus (on screen). (The latter case, I think wayland > > has some catching up to do compared to CrOS or android.. you really > > need a way to allow the app to do front buffer rendering to an overlay > > for the stylus case, because even just 16ms delay is _very_ > > noticeable.) > > Sure, but here too userpsace (rt thread in the compositor) is probably a > good enough place to predict when to boost since it will be the one > proxies e.g. the stylus input events to the application. > > Front buffering on the other hand is a very different topic ;) > > > Jonas > > > > > BR, > > -R
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 622b8156d212..183e480d8cea 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: fence signalling annotation +DMA Fence Deadline Hints +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: deadline hints + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 0de0482cd36e..f177c56269bb 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count, } EXPORT_SYMBOL(dma_fence_wait_any_timeout); +/** + * DOC: deadline hints + * + * In an ideal world, it would be possible to pipeline a workload sufficiently + * that a utilization based device frequency governor could arrive at a minimum + * frequency that meets the requirements of the use-case, in order to minimize + * power consumption. But in the real world there are many workloads which + * defy this ideal. For example, but not limited to: + * + * * Workloads that ping-pong between device and CPU, with alternating periods + * of CPU waiting for device, and device waiting on CPU. This can result in + * devfreq and cpufreq seeing idle time in their respective domains and in + * result reduce frequency. + * + * * Workloads that interact with a periodic time based deadline, such as double + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario, + * missing a vblank deadline results in an *increase* in idle time on the GPU + * (since it has to wait an additional vblank period), sending a signal to + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is + * needed. + * + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline. + * The deadline hint provides a way for the waiting driver, or userspace, to + * convey an appropriate sense of urgency to the signaling driver. + * + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace + * facing APIs). The time could either be some point in the future (such as + * the vblank based deadline for page-flipping, or the start of a compositor's + * composition cycle), or the current time to indicate an immediate deadline + * hint (Ie. forward progress cannot be made until this fence is signaled). + * + * Multiple deadlines may be set on a given fence, even in parallel. See the + * documentation for &dma_fence_ops.set_deadline. + * + * The deadline hint is just that, a hint. The driver that created the fence + * may react by increasing frequency, making different scheduling choices, etc. + * Or doing nothing at all. + */ + +/** + * dma_fence_set_deadline - set desired fence-wait deadline hint + * @fence: the fence that is to be waited on + * @deadline: the time by which the waiter hopes for the fence to be + * signaled + * + * Give the fence signaler a hint about an upcoming deadline, such as + * vblank, by which point the waiter would prefer the fence to be + * signaled by. This is intended to give feedback to the fence signaler + * to aid in power management decisions, such as boosting GPU frequency + * if a periodic vblank deadline is approaching but the fence is not + * yet signaled.. + */ +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) +{ + if (fence->ops->set_deadline && !dma_fence_is_signaled(fence)) + fence->ops->set_deadline(fence, deadline); +} +EXPORT_SYMBOL(dma_fence_set_deadline); + /** * dma_fence_describe - Dump fence describtion into seq_file * @fence: the 6fence to describe diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 775cdc0b4f24..d54b595a0fe0 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -257,6 +257,26 @@ struct dma_fence_ops { */ void (*timeline_value_str)(struct dma_fence *fence, char *str, int size); + + /** + * @set_deadline: + * + * Callback to allow a fence waiter to inform the fence signaler of + * an upcoming deadline, such as vblank, by which point the waiter + * would prefer the fence to be signaled by. This is intended to + * give feedback to the fence signaler to aid in power management + * decisions, such as boosting GPU frequency. + * + * This is called without &dma_fence.lock held, it can be called + * multiple times and from any context. Locking is up to the callee + * if it has some state to manage. If multiple deadlines are set, + * the expectation is to track the soonest one. If the deadline is + * before the current time, it should be interpreted as an immediate + * deadline. + * + * This callback is optional. + */ + void (*set_deadline)(struct dma_fence *fence, ktime_t deadline); }; void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops, @@ -583,6 +603,8 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr) return ret < 0 ? ret : 0; } +void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline); + struct dma_fence *dma_fence_get_stub(void); struct dma_fence *dma_fence_allocate_private_stub(void); u64 dma_fence_context_alloc(unsigned num);