[1/2] dma-buf.rst: Document why indefinite fences are a bad idea

Message ID 20200709123339.547390-1-daniel.vetter@ffwll.ch
State New

Commit Message

Daniel Vetter July 9, 2020, 12:33 p.m. UTC
Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

v2: Now with dot graph!

v3: Typo (Dave Airlie)

Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Cc: Jesse Natalie <jenatali@microsoft.com>
Cc: Steve Pronovost <spronovo@microsoft.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Comments

Maarten Lankhorst July 10, 2020, 12:30 p.m. UTC | #1
Op 09-07-2020 om 14:33 schreef Daniel Vetter:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.
>
> v2: Now with dot graph!
>
> v3: Typo (Dave Airlie)

For first 5 patches, and patch 16, 17:

Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

> Acked-by: Christian König <christian.koenig@amd.com>
> Acked-by: Daniel Stone <daniels@collabora.com>
> Cc: Jesse Natalie <jenatali@microsoft.com>
> Cc: Steve Pronovost <spronovo@microsoft.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index f8f6decde359..100bfd227265 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
>  
> +Indefinite DMA Fences
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +At various times &dma_fence with an indefinite time until dma_fence_wait()
> +finishes have been proposed. Examples include:
> +
> +* Future fences, used in HWC1 to signal when a buffer isn't used by the display
> +  any longer, and created with the screen update that makes the buffer visible.
> +  The time this fence completes is entirely under userspace's control.
> +
> +* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
> +  been set. Used to asynchronously delay command submission.
> +
> +* Userspace fences or gpu futexes, fine-grained locking within a command buffer
> +  that userspace uses for synchronization across engines or with the CPU, which
> +  are then imported as a DMA fence for integration into existing winsys
> +  protocols.
> +
> +* Long-running compute command buffers, while still using traditional end of
> +  batch DMA fences for memory management instead of context preemption DMA
> +  fences which get reattached when the compute job is rescheduled.
> +
> +Common to all these schemes is that userspace controls the dependencies of these
> +fences and controls when they fire. Mixing indefinite fences with normal
> +in-kernel DMA fences does not work, even when a fallback timeout is included to
> +protect against malicious userspace:
> +
> +* Only the kernel knows about all DMA fence dependencies, userspace is not aware
> +  of dependencies injected due to memory management or scheduler decisions.
> +
> +* Only userspace knows about all dependencies in indefinite fences and when
> +  exactly they will complete, the kernel has no visibility.
> +
> +Furthermore the kernel has to be able to hold up userspace command submission
> +for memory management needs, which means we must support indefinite fences being
> +dependent upon DMA fences. If the kernel also supports indefinite fences in the
> +kernel like a DMA fence, like any of the above proposals would, there is the
> +potential for deadlocks.
> +
> +.. kernel-render:: DOT
> +   :alt: Indefinite Fencing Dependency Cycle
> +   :caption: Indefinite Fencing Dependency Cycle
> +
> +   digraph "Fencing Cycle" {
> +      node [shape=box bgcolor=grey style=filled]
> +      kernel [label="Kernel DMA Fences"]
> +      userspace [label="userspace controlled fences"]
> +      kernel -> userspace [label="memory management"]
> +      userspace -> kernel [label="Future fence, fence proxy, ..."]
> +
> +      { rank=same; kernel userspace }
> +   }
> +
> +This means that the kernel might accidentally create deadlocks
> +through memory management dependencies which userspace is unaware of, which
> +randomly hangs workloads until the timeout kicks in. Workloads, which from
> +userspace's perspective, do not contain a deadlock.  In such a mixed fencing
> +architecture there is no single entity with knowledge of all dependencies.
> +Therefore preventing such deadlocks from within the kernel is not possible.
> +
> +The only solution to avoid dependency loops is to not allow indefinite
> +fences in the kernel. This means:
> +
> +* No future fences, proxy fences or userspace fences imported as DMA fences,
> +  with or without a timeout.
> +
> +* No DMA fences that signal end of batchbuffer for command submission where
> +  userspace is allowed to use userspace fencing or long running compute
> +  workloads. This also means no implicit fencing for shared buffers in these
> +  cases.
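
The dependency cycle from the dot graph can be reproduced with a small
user-level sketch. This is illustration only and not kernel code: the node
names and the find_cycle() helper are hypothetical, and an edge A -> B is
assumed here to mean "A may have to wait on B".

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, done
    color: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge: nxt is already on the stack, so we found a loop.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in list(deps):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Mixed model: userspace fences wait on DMA fences (memory management can
# hold up submission), and DMA fences are also allowed to wait on
# userspace-controlled fences (future fences, proxy fences, ...).
mixed = {
    "kernel_dma_fence": ["userspace_fence"],
    "userspace_fence": ["kernel_dma_fence"],
}

# Rule from the patch: DMA fences never wait on userspace-controlled fences.
kernel_only = {
    "kernel_dma_fence": [],
    "userspace_fence": ["kernel_dma_fence"],
}

print(find_cycle(mixed))
print(find_cycle(kernel_only))
```

In this toy model the cycle is visible only because one dictionary holds both
kinds of edges. In a real mixed-fencing setup the kernel and userspace each
know only their own half of the graph, which is the argument the text makes
for why such deadlocks cannot be prevented from within the kernel.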
Jason Ekstrand July 14, 2020, 5:46 p.m. UTC | #2
This matches my understanding for what it's worth.  In my little bit
of synchronization work in drm, I've gone out of my way to ensure we
can maintain this constraint.

Acked-by: Jason Ekstrand <jason@jlekstrand.net>

On Thu, Jul 9, 2020 at 7:33 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.
>
> v2: Now with dot graph!
>
> v3: Typo (Dave Airlie)
>
> Acked-by: Christian König <christian.koenig@amd.com>
> Acked-by: Daniel Stone <daniels@collabora.com>
> Cc: Jesse Natalie <jenatali@microsoft.com>
> Cc: Steve Pronovost <spronovo@microsoft.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index f8f6decde359..100bfd227265 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
>
> +Indefinite DMA Fences
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +At various times &dma_fence with an indefinite time until dma_fence_wait()
> +finishes have been proposed. Examples include:
> +
> +* Future fences, used in HWC1 to signal when a buffer isn't used by the display
> +  any longer, and created with the screen update that makes the buffer visible.
> +  The time this fence completes is entirely under userspace's control.
> +
> +* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
> +  been set. Used to asynchronously delay command submission.
> +
> +* Userspace fences or gpu futexes, fine-grained locking within a command buffer
> +  that userspace uses for synchronization across engines or with the CPU, which
> +  are then imported as a DMA fence for integration into existing winsys
> +  protocols.
> +
> +* Long-running compute command buffers, while still using traditional end of
> +  batch DMA fences for memory management instead of context preemption DMA
> +  fences which get reattached when the compute job is rescheduled.
> +
> +Common to all these schemes is that userspace controls the dependencies of these
> +fences and controls when they fire. Mixing indefinite fences with normal
> +in-kernel DMA fences does not work, even when a fallback timeout is included to
> +protect against malicious userspace:
> +
> +* Only the kernel knows about all DMA fence dependencies, userspace is not aware
> +  of dependencies injected due to memory management or scheduler decisions.
> +
> +* Only userspace knows about all dependencies in indefinite fences and when
> +  exactly they will complete, the kernel has no visibility.
> +
> +Furthermore the kernel has to be able to hold up userspace command submission
> +for memory management needs, which means we must support indefinite fences being
> +dependent upon DMA fences. If the kernel also supports indefinite fences in the
> +kernel like a DMA fence, like any of the above proposals would, there is the
> +potential for deadlocks.
> +
> +.. kernel-render:: DOT
> +   :alt: Indefinite Fencing Dependency Cycle
> +   :caption: Indefinite Fencing Dependency Cycle
> +
> +   digraph "Fencing Cycle" {
> +      node [shape=box bgcolor=grey style=filled]
> +      kernel [label="Kernel DMA Fences"]
> +      userspace [label="userspace controlled fences"]
> +      kernel -> userspace [label="memory management"]
> +      userspace -> kernel [label="Future fence, fence proxy, ..."]
> +
> +      { rank=same; kernel userspace }
> +   }
> +
> +This means that the kernel might accidentally create deadlocks
> +through memory management dependencies which userspace is unaware of, which
> +randomly hangs workloads until the timeout kicks in. Workloads, which from
> +userspace's perspective, do not contain a deadlock.  In such a mixed fencing
> +architecture there is no single entity with knowledge of all dependencies.
> +Therefore preventing such deadlocks from within the kernel is not possible.
> +
> +The only solution to avoid dependency loops is to not allow indefinite
> +fences in the kernel. This means:
> +
> +* No future fences, proxy fences or userspace fences imported as DMA fences,
> +  with or without a timeout.
> +
> +* No DMA fences that signal end of batchbuffer for command submission where
> +  userspace is allowed to use userspace fencing or long running compute
> +  workloads. This also means no implicit fencing for shared buffers in these
> +  cases.
> --
> 2.27.0
>
Thomas Hellström (Intel) July 20, 2020, 11:15 a.m. UTC | #3
Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely 
untangle kernel-introduced fences for resource management and dma-fences 
used for completion- and dependency tracking and lift a lot of 
restrictions for the dma-fences, including prohibiting infinite ones, I 
think this makes sense describing the current state.

Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>
Daniel Vetter July 21, 2020, 7:41 a.m. UTC | #4
On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote:
> Hi,
> 
> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> > Comes up every few years, gets somewhat tedious to discuss, let's
> > write this down once and for all.
> > 
> > What I'm not sure about is whether the text should be more explicit in
> > flat out mandating the amdkfd eviction fences for long running compute
> > workloads or workloads where userspace fencing is allowed.
> 
> Although (in my humble opinion) it might be possible to completely untangle
> kernel-introduced fences for resource management and dma-fences used for
> completion- and dependency tracking and lift a lot of restrictions for the
> dma-fences, including prohibiting infinite ones, I think this makes sense
> describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.
>
> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>

Thanks for taking a look, first 3 patches here with annotations and docs
merged to drm-misc-next. I'll ask Maarten/Dave whether another pull is ok
for 5.9 so that everyone can use this asap.
-Daniel
Christian König July 21, 2020, 7:45 a.m. UTC | #5
Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote:
>> Hi,
>>
>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>> write this down once and for all.
>>>
>>> What I'm not sure about is whether the text should be more explicit in
>>> flat out mandating the amdkfd eviction fences for long running compute
>>> workloads or workloads where userspace fencing is allowed.
>> Although (in my humble opinion) it might be possible to completely untangle
>> kernel-introduced fences for resource management and dma-fences used for
>> completion- and dependency tracking and lift a lot of restrictions for the
>> dma-fences, including prohibiting infinite ones, I think this makes sense
>> describing the current state.
> Yeah I think a future patch needs to type up how we want to make that
> happen (for some cross driver consistency) and what needs to be
> considered. Some of the necessary parts are already there (with like the
> preemption fences amdkfd has as an example), but I think some clear docs
> on what's required from both hw, drivers and userspace would be really
> good.

I'm currently writing that up, but probably still need a few days for this.

Christian.

>> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Thanks for taking a look, first 3 patches here with annotations and docs
> merged to drm-misc-next. I'll ask Maarten/Dave whether another pull is ok
> for 5.9 so that everyone can use this asap.
> -Daniel
Thomas Hellström (Intel) July 21, 2020, 8:47 a.m. UTC | #6
On 7/21/20 9:45 AM, Christian König wrote:
> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>> wrote:
>>> Hi,
>>>
>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>> write this down once and for all.
>>>>
>>>> What I'm not sure about is whether the text should be more explicit in
>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>> workloads or workloads where userspace fencing is allowed.
>>> Although (in my humble opinion) it might be possible to completely 
>>> untangle
>>> kernel-introduced fences for resource management and dma-fences used 
>>> for
>>> completion- and dependency tracking and lift a lot of restrictions 
>>> for the
>>> dma-fences, including prohibiting infinite ones, I think this makes 
>>> sense
>>> describing the current state.
>> Yeah I think a future patch needs to type up how we want to make that
>> happen (for some cross driver consistency) and what needs to be
>> considered. Some of the necessary parts are already there (with like the
>> preemption fences amdkfd has as an example), but I think some clear docs
>> on what's required from both hw, drivers and userspace would be really
>> good.
>
> I'm currently writing that up, but probably still need a few days for 
> this.

Great! I put down some (very) initial thoughts a couple of weeks ago 
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

/Thomas
Christian König July 21, 2020, 8:55 a.m. UTC | #7
Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>
> On 7/21/20 9:45 AM, Christian König wrote:
>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>>> wrote:
>>>> Hi,
>>>>
>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>> write this down once and for all.
>>>>>
>>>>> What I'm not sure about is whether the text should be more 
>>>>> explicit in
>>>>> flat out mandating the amdkfd eviction fences for long running 
>>>>> compute
>>>>> workloads or workloads where userspace fencing is allowed.
>>>> Although (in my humble opinion) it might be possible to completely 
>>>> untangle
>>>> kernel-introduced fences for resource management and dma-fences 
>>>> used for
>>>> completion- and dependency tracking and lift a lot of restrictions 
>>>> for the
>>>> dma-fences, including prohibiting infinite ones, I think this makes 
>>>> sense
>>>> describing the current state.
>>> Yeah I think a future patch needs to type up how we want to make that
>>> happen (for some cross driver consistency) and what needs to be
>>> considered. Some of the necessary parts are already there (with like 
>>> the
>>> preemption fences amdkfd has as an example), but I think some clear 
>>> docs
>>> on what's required from both hw, drivers and userspace would be really
>>> good.
>>
>> I'm currently writing that up, but probably still need a few days for 
>> this.
>
> Great! I put down some (very) initial thoughts a couple of weeks ago 
> building on eviction fences for various hardware complexity levels here:
>
> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>

I don't think that this will ever be possible.

See, what Daniel describes in his text is that indefinite fences are a
bad idea for memory management, and I think that this is a fixed fact.

In other words the whole concept of submitting work to the kernel which 
depends on some user space interaction doesn't work and never will.

What can be done is that dma_fences work with hardware schedulers. E.g. 
what the KFD tries to do with its preemption fences.

But for this you need a better concept and description of what the 
hardware scheduler is supposed to do and how that interacts with 
dma_fence objects.

Christian.

>
> /Thomas
>
>
Daniel Vetter July 21, 2020, 9:16 a.m. UTC | #8
On Tue, Jul 21, 2020 at 10:55 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> >
> > On 7/21/20 9:45 AM, Christian König wrote:
> >> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>> write this down once and for all.
> >>>>>
> >>>>> What I'm not sure about is whether the text should be more
> >>>>> explicit in
> >>>>> flat out mandating the amdkfd eviction fences for long running
> >>>>> compute
> >>>>> workloads or workloads where userspace fencing is allowed.
> >>>> Although (in my humble opinion) it might be possible to completely
> >>>> untangle
> >>>> kernel-introduced fences for resource management and dma-fences
> >>>> used for
> >>>> completion- and dependency tracking and lift a lot of restrictions
> >>>> for the
> >>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>> sense
> >>>> describing the current state.
> >>> Yeah I think a future patch needs to type up how we want to make that
> >>> happen (for some cross driver consistency) and what needs to be
> >>> considered. Some of the necessary parts are already there (with like
> >>> the
> >>> preemption fences amdkfd has as an example), but I think some clear
> >>> docs
> >>> on what's required from both hw, drivers and userspace would be really
> >>> good.
> >>
> >> I'm currently writing that up, but probably still need a few days for
> >> this.
> >
> > Great! I put down some (very) initial thoughts a couple of weeks ago
> > building on eviction fences for various hardware complexity levels here:
> >
> > https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >
>
> I don't think that this will ever be possible.
>
> See, what Daniel describes in his text is that indefinite fences are a
> bad idea for memory management, and I think that this is a fixed fact.
>
> In other words the whole concept of submitting work to the kernel which
> depends on some user space interaction doesn't work and never will.
>
> What can be done is that dma_fences work with hardware schedulers. E.g.
> what the KFD tries to do with its preemption fences.
>
> But for this you need a better concept and description of what the
> hardware scheduler is supposed to do and how that interacts with
> dma_fence objects.

Yeah I think trying to split dma_fence won't work, simply because of
inertia. Creating an entirely new thing for augmented userspace
controlled fencing, and then jotting down all the rules the
kernel/hw/userspace need to obey to not break dma_fence is what I had
in mind. And I guess that's also what Christian is working on. E.g.
just going through all the cases of how much your hw can preempt or
handle page faults on the gpu, and what that means in terms of
dma_fence_begin/end_signalling and other constraints would be really
good.
-Daniel

>
> Christian.
>
> >
> > /Thomas
> >
> >
>
Daniel Vetter July 21, 2020, 9:24 a.m. UTC | #9
On Tue, Jul 21, 2020 at 11:16 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 21, 2020 at 10:55 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> > >
> > > On 7/21/20 9:45 AM, Christian König wrote:
> > >> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> > >>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> > >>> wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> > >>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> > >>>>> write this down once and for all.
> > >>>>>
> > >>>>> What I'm not sure about is whether the text should be more
> > >>>>> explicit in
> > >>>>> flat out mandating the amdkfd eviction fences for long running
> > >>>>> compute
> > >>>>> workloads or workloads where userspace fencing is allowed.
> > >>>> Although (in my humble opinion) it might be possible to completely
> > >>>> untangle
> > >>>> kernel-introduced fences for resource management and dma-fences
> > >>>> used for
> > >>>> completion- and dependency tracking and lift a lot of restrictions
> > >>>> for the
> > >>>> dma-fences, including prohibiting infinite ones, I think this makes
> > >>>> sense
> > >>>> describing the current state.
> > >>> Yeah I think a future patch needs to type up how we want to make that
> > >>> happen (for some cross driver consistency) and what needs to be
> > >>> considered. Some of the necessary parts are already there (with like
> > >>> the
> > >>> preemption fences amdkfd has as an example), but I think some clear
> > >>> docs
> > >>> on what's required from both hw, drivers and userspace would be really
> > >>> good.
> > >>
> > >> I'm currently writing that up, but probably still need a few days for
> > >> this.
> > >
> > > Great! I put down some (very) initial thoughts a couple of weeks ago
> > > building on eviction fences for various hardware complexity levels here:
> > >
> > > https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > >
> >
> > I don't think that this will ever be possible.
> >
> > See, what Daniel describes in his text is that indefinite fences are a
> > bad idea for memory management, and I think that this is a fixed fact.
> >
> > In other words the whole concept of submitting work to the kernel which
> > depends on some user space interaction doesn't work and never will.
> >
> > What can be done is that dma_fences work with hardware schedulers. E.g.
> > what the KFD tries to do with its preemption fences.
> >
> > But for this you need a better concept and description of what the
> > hardware scheduler is supposed to do and how that interacts with
> > dma_fence objects.
>
> Yeah I think trying to split dma_fence won't work, simply because of
> inertia. Creating an entirely new thing for augmented userspace
> controlled fencing, and then jotting down all the rules the
> kernel/hw/userspace need to obey to not break dma_fence is what I had
> in mind. And I guess that's also what Christian is working on. E.g.
> just going through all the cases of how much your hw can preempt or
> handle page faults on the gpu, and what that means in terms of
> dma_fence_begin/end_signalling and other constraints would be really
> good.

Or rephrased in terms of Thomas' doc: dma-fence will stay the memory
fence, and also the sync fence for current userspace and winsys.

Then we create a new thing and complete protocol and driver revving of
the entire world. The really hard part is that running old stuff on a
new stack is possible (we'd be totally screwed otherwise, since it
would become a system wide flag day). But running new stuff on an old
stack (even if it's just something in userspace like the compositor)
doesn't work, because then you tie the new synchronization fences back
into the dma-fence memory fences, and game over.

So yeah around 5 years or so for anything that wants to use a winsys,
or at least that's what it usually takes us to do something like this
:-/ Entirely stand-alone compute workloads (irrespective of whether it's
cuda, cl, vk or whatever) don't have that problem ofc.
-Daniel

> -Daniel
>
> >
> > Christian.
> >
> > >
> > > /Thomas
> > >
> > >
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Thomas Hellström (Intel) July 21, 2020, 9:37 a.m. UTC | #10
On 7/21/20 10:55 AM, Christian König wrote:
> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>
>> On 7/21/20 9:45 AM, Christian König wrote:
>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>> write this down once and for all.
>>>>>>
>>>>>> What I'm not sure about is whether the text should be more 
>>>>>> explicit in
>>>>>> flat out mandating the amdkfd eviction fences for long running 
>>>>>> compute
>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>> Although (in my humble opinion) it might be possible to completely 
>>>>> untangle
>>>>> kernel-introduced fences for resource management and dma-fences 
>>>>> used for
>>>>> completion- and dependency tracking and lift a lot of restrictions 
>>>>> for the
>>>>> dma-fences, including prohibiting infinite ones, I think this 
>>>>> makes sense
>>>>> describing the current state.
>>>> Yeah I think a future patch needs to type up how we want to make that
>>>> happen (for some cross driver consistency) and what needs to be
>>>> considered. Some of the necessary parts are already there (with 
>>>> like the
>>>> preemption fences amdkfd has as an example), but I think some clear 
>>>> docs
>>>> on what's required from both hw, drivers and userspace would be really
>>>> good.
>>>
>>> I'm currently writing that up, but probably still need a few days 
>>> for this.
>>
>> Great! I put down some (very) initial thoughts a couple of weeks ago 
>> building on eviction fences for various hardware complexity levels here:
>>
>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>
>
> I don't think that this will ever be possible.
>
> See, what Daniel describes in his text is that indefinite fences are a
> bad idea for memory management, and I think that this is a fixed fact.
>
> In other words the whole concept of submitting work to the kernel 
> which depends on some user space interaction doesn't work and never will.

Well the idea here is that memory management will *never* depend on 
indefinite fences: As soon as someone waits on a memory manager fence 
(be it eviction, shrinker or mmu notifier) it breaks out of any 
dma-fence dependencies and /or user-space interaction. The text tries to 
describe what's required to be able to do that (save for non-preemptible 
gpus where someone submits a forever-running shader).

So while I think this is possible (until someone comes up with a case 
where it wouldn't work of course), I guess Daniel has a point in that it 
won't happen because of inertia and there might be better options.

/Thomas
Daniel Vetter July 21, 2020, 9:50 a.m. UTC | #11
On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 7/21/20 10:55 AM, Christian König wrote:
> > Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> >>
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>> write this down once and for all.
> >>>>>>
> >>>>>> What I'm not sure about is whether the text should be more
> >>>>>> explicit in
> >>>>>> flat out mandating the amdkfd eviction fences for long running
> >>>>>> compute
> >>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>> Although (in my humble opinion) it might be possible to completely
> >>>>> untangle
> >>>>> kernel-introduced fences for resource management and dma-fences
> >>>>> used for
> >>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>> for the
> >>>>> dma-fences, including prohibiting infinite ones, I think this
> >>>>> makes sense
> >>>>> describing the current state.
> >>>> Yeah I think a future patch needs to type up how we want to make that
> >>>> happen (for some cross driver consistency) and what needs to be
> >>>> considered. Some of the necessary parts are already there (with
> >>>> like the
> >>>> preemption fences amdkfd has as an example), but I think some clear
> >>>> docs
> >>>> on what's required from both hw, drivers and userspace would be really
> >>>> good.
> >>>
> >>> I'm currently writing that up, but probably still need a few days
> >>> for this.
> >>
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>
> >
> > I don't think that this will ever be possible.
> >
> > See that Daniel describes in his text is that indefinite fences are a
> > bad idea for memory management, and I think that this is a fixed fact.
> >
> > In other words the whole concept of submitting work to the kernel
> > which depends on some user space interaction doesn't work and never will.
>
> Well the idea here is that memory management will *never* depend on
> indefinite fences: As soon as someone waits on a memory manager fence
> (be it eviction, shrinker or mmu notifier) it breaks out of any
> dma-fence dependencies and /or user-space interaction. The text tries to
> describe what's required to be able to do that (save for non-preemptible
> gpus where someone submits a forever-running shader).

Yeah I think that part of your text is good to describe how to
untangle memory fences from synchronization fences given how much the
hw can do.

> So while I think this is possible (until someone comes up with a case
> where it wouldn't work of course), I guess Daniel has a point in that it
> won't happen because of inertia and there might be better options.

Yeah it's just I don't see much chance for splitting dma-fence itself.
That's also why I'm not positive on the "no hw preemption, only
scheduler" case: You still have a dma_fence for the batch itself,
which means still no userspace controlled synchronization or other
form of indefinite batches allowed. So not getting us any closer to
enabling the compute use cases people want. So minimally I think hw
needs to be able to preempt, and preempt fairly quickly (i.e. within
shaders if you have long running shaders as your use-case), or support
gpu page faults. And depending how it all works different parts of the
driver code end up in dma fence critical sections, with different
restrictions.
-Daniel
Thomas Hellström (Intel) July 21, 2020, 10:47 a.m. UTC | #12
On 7/21/20 11:50 AM, Daniel Vetter wrote:
> On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 7/21/20 10:55 AM, Christian König wrote:
>>> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>> write this down once and for all.
>>>>>>>>
>>>>>>>> What I'm not sure about is whether the text should be more
>>>>>>>> explicit in
>>>>>>>> flat out mandating the amdkfd eviction fences for long running
>>>>>>>> compute
>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>> untangle
>>>>>>> kernel-introduced fences for resource management and dma-fences
>>>>>>> used for
>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>> for the
>>>>>>> dma-fences, including prohibiting infinite ones, I think this
>>>>>>> makes sense
>>>>>>> describing the current state.
>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>> considered. Some of the necessary parts are already there (with
>>>>>> like the
>>>>>> preemption fences amdkfd has as an example), but I think some clear
>>>>>> docs
>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>> good.
>>>>> I'm currently writing that up, but probably still need a few days
>>>>> for this.
>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>> building on eviction fences for various hardware complexity levels here:
>>>>
>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>
>>> I don't think that this will ever be possible.
>>>
>>> See that Daniel describes in his text is that indefinite fences are a
>>> bad idea for memory management, and I think that this is a fixed fact.
>>>
>>> In other words the whole concept of submitting work to the kernel
>>> which depends on some user space interaction doesn't work and never will.
>> Well the idea here is that memory management will *never* depend on
>> indefinite fences: As soon as someone waits on a memory manager fence
>> (be it eviction, shrinker or mmu notifier) it breaks out of any
>> dma-fence dependencies and /or user-space interaction. The text tries to
>> describe what's required to be able to do that (save for non-preemptible
>> gpus where someone submits a forever-running shader).
> Yeah I think that part of your text is good to describe how to
> untangle memory fences from synchronization fences given how much the
> hw can do.
>
>> So while I think this is possible (until someone comes up with a case
>> where it wouldn't work of course), I guess Daniel has a point in that it
>> won't happen because of inertia and there might be better options.
> Yeah it's just I don't see much chance for splitting dma-fence itself.
> That's also why I'm not positive on the "no hw preemption, only
> scheduler" case: You still have a dma_fence for the batch itself,
> which means still no userspace controlled synchronization or other
> form of indefinite batches allowed. So not getting us any closer to
> enabling the compute use cases people want.

Yes, we can't do magic. As soon as an indefinite batch makes it to such 
hardware we've lost. But since we can break out while the batch is stuck 
in the scheduler waiting, what I believe we *can* do with this approach 
is to avoid deadlocks due to locally unknown dependencies, which has 
some bearing on this documentation patch, and also to allow memory 
allocation in dma-fence (not memory-fence) critical sections, like gpu 
fault- and error handlers without resorting to using memory pools.

But again. I'm not saying we should actually implement this. Better to 
consider it and reject it than not consider it at all.

/Thomas
Christian König July 21, 2020, 1:59 p.m. UTC | #13
Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
>
> On 7/21/20 11:50 AM, Daniel Vetter wrote:
>> On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
>> <thomas_os@shipmail.org> wrote:
>>>
>>> On 7/21/20 10:55 AM, Christian König wrote:
>>>> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>> write this down once and for all.
>>>>>>>>>
>>>>>>>>> What I'm not sure about is whether the text should be more
>>>>>>>>> explicit in
>>>>>>>>> flat out mandating the amdkfd eviction fences for long running
>>>>>>>>> compute
>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>> untangle
>>>>>>>> kernel-introduced fences for resource management and dma-fences
>>>>>>>> used for
>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>> for the
>>>>>>>> dma-fences, including prohibiting infinite ones, I think this
>>>>>>>> makes sense
>>>>>>>> describing the current state.
>>>>>>> Yeah I think a future patch needs to type up how we want to make 
>>>>>>> that
>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>> considered. Some of the necessary parts are already there (with
>>>>>>> like the
>>>>>>> preemption fences amdkfd has as an example), but I think some clear
>>>>>>> docs
>>>>>>> on what's required from both hw, drivers and userspace would be 
>>>>>>> really
>>>>>>> good.
>>>>>> I'm currently writing that up, but probably still need a few days
>>>>>> for this.
>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>> building on eviction fences for various hardware complexity levels 
>>>>> here:
>>>>>
>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>>
>>>>>
>>>> I don't think that this will ever be possible.
>>>>
>>>> See that Daniel describes in his text is that indefinite fences are a
>>>> bad idea for memory management, and I think that this is a fixed fact.
>>>>
>>>> In other words the whole concept of submitting work to the kernel
>>>> which depends on some user space interaction doesn't work and never 
>>>> will.
>>> Well the idea here is that memory management will *never* depend on
>>> indefinite fences: As soon as someone waits on a memory manager fence
>>> (be it eviction, shrinker or mmu notifier) it breaks out of any
>>> dma-fence dependencies and /or user-space interaction. The text 
>>> tries to
>>> describe what's required to be able to do that (save for 
>>> non-preemptible
>>> gpus where someone submits a forever-running shader).
>> Yeah I think that part of your text is good to describe how to
>> untangle memory fences from synchronization fences given how much the
>> hw can do.
>>
>>> So while I think this is possible (until someone comes up with a case
>>> where it wouldn't work of course), I guess Daniel has a point in 
>>> that it
>>> won't happen because of inertia and there might be better options.
>> Yeah it's just I don't see much chance for splitting dma-fence itself.

Well that's the whole idea with the timeline semaphores and waiting for 
a signal number to appear.

E.g. instead of doing the wait with the dma_fence we are separating that 
out into the timeline semaphore object.

This not only avoids the indefinite fence problem for the wait before 
signal case in Vulkan, but also prevents userspace from submitting stuff 
which can't be processed immediately.
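
As a rough illustration of that idea (a toy model only, not the drm_syncobj
code or any driver's implementation; the Timeline, UserspaceQueue and
flush_all names are made up for this sketch), a job that waits on a timeline
point is held back in userspace until some job that signals that point has
actually been handed over:

```python
from collections import deque


class Timeline:
    """Toy timeline: highest point that has a real signaling job behind it."""

    def __init__(self):
        self.materialized = 0


class UserspaceQueue:
    """Holds jobs in userspace until their wait point is materialized."""

    def __init__(self):
        self.pending = deque()  # jobs not yet handed to the (pretend) kernel
        self.submitted = []     # jobs handed over, in submission order

    def submit(self, name, wait=None, signal=None):
        # wait and signal are (timeline, point) tuples or None.
        self.pending.append((name, wait, signal))
        return self.flush()

    def flush(self):
        """Hand over jobs in order; stop at the first job that must wait."""
        handed_over = 0
        while self.pending:
            name, wait, signal = self.pending[0]
            if wait is not None and wait[0].materialized < wait[1]:
                break  # signaler not submitted yet: keep the job in userspace
            self.pending.popleft()
            self.submitted.append(name)
            if signal is not None:
                timeline, point = signal
                timeline.materialized = max(timeline.materialized, point)
            handed_over += 1
        return handed_over


def flush_all(queues):
    """Re-run flush on every queue until nothing makes progress any more."""
    progress = True
    while progress:
        progress = False
        for queue in queues:
            if queue.flush() > 0:
                progress = True
```

In this model a consumer submitted with wait=(tl, 1) stays in its pending
list until a producer with signal=(tl, 1) has been submitted on another
queue and flush_all() is called. The (pretend) kernel never sees a job whose
dependency it does not already know about.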

>> That's also why I'm not positive on the "no hw preemption, only
>> scheduler" case: You still have a dma_fence for the batch itself,
>> which means still no userspace controlled synchronization or other
>> form of indefinite batches allowed. So not getting us any closer to
>> enabling the compute use cases people want.

What compute use case are you talking about? I'm only aware of the 
wait before signal case from Vulkan, the page fault case and the KFD 
preemption fence case.

>
> Yes, we can't do magic. As soon as an indefinite batch makes it to 
> such hardware we've lost. But since we can break out while the batch 
> is stuck in the scheduler waiting, what I believe we *can* do with 
> this approach is to avoid deadlocks due to locally unknown 
> dependencies, which has some bearing on this documentation patch, and 
> also to allow memory allocation in dma-fence (not memory-fence) 
> critical sections, like gpu fault- and error handlers without 
> resorting to using memory pools.

Avoiding deadlocks is only the tip of the iceberg here.

When you allow the kernel to depend on user space to proceed with some 
operation there are a lot more things which need consideration.

E.g. what happens when a userspace process which has submitted stuff to 
the kernel is killed? Are the prepared commands sent to the hardware or 
aborted as well? What do we do with other processes waiting for that stuff?

How do we do resource accounting? When processes need to block when 
submitting to the hardware stuff which is not ready we have a process we 
can punish for blocking resources. But how is kernel memory used for a 
submission accounted? How do we avoid denial of service attacks here where 
somebody eats up all memory by doing submissions which can't finish?

> But again. I'm not saying we should actually implement this. Better to 
> consider it and reject it than not consider it at all.

Agreed.

Same thing as it turned out with the Wait before Signal for Vulkan: 
initially it looked simpler to do it in the kernel. But as far as I know 
the solution in userspace now works so well that we don't really want 
the pain of a kernel implementation any more.

Christian.

>
> /Thomas
>
>
Thomas Hellström (Intel) July 21, 2020, 5:46 p.m. UTC | #14
On 2020-07-21 15:59, Christian König wrote:
> Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
...
>> Yes, we can't do magic. As soon as an indefinite batch makes it to 
>> such hardware we've lost. But since we can break out while the batch 
>> is stuck in the scheduler waiting, what I believe we *can* do with 
>> this approach is to avoid deadlocks due to locally unknown 
>> dependencies, which has some bearing on this documentation patch, and 
>> also to allow memory allocation in dma-fence (not memory-fence) 
>> critical sections, like gpu fault- and error handlers without 
>> resorting to using memory pools.
>
> Avoiding deadlocks is only the tip of the iceberg here.
>
> When you allow the kernel to depend on user space to proceed with some 
> operation there are a lot more things which need consideration.
>
> E.g. what happens when an userspace process which has submitted stuff 
> to the kernel is killed? Are the prepared commands send to the 
> hardware or aborted as well? What do we do with other processes 
> waiting for that stuff?
>
> How to we do resource accounting? When processes need to block when 
> submitting to the hardware stuff which is not ready we have a process 
> we can punish for blocking resources. But how is kernel memory used 
> for a submission accounted? How do we avoid deny of service attacks 
> here were somebody eats up all memory by doing submissions which can't 
> finish?
>
Hmm. Are these problems really unique to user-space controlled 
dependencies? Couldn't you hit the same or similar problems with 
mis-behaving shaders blocking timeline progress?

/Thomas
Daniel Vetter July 21, 2020, 6:18 p.m. UTC | #15
On Tue, Jul 21, 2020 at 7:46 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-21 15:59, Christian König wrote:
> > Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
> ...
> >> Yes, we can't do magic. As soon as an indefinite batch makes it to
> >> such hardware we've lost. But since we can break out while the batch
> >> is stuck in the scheduler waiting, what I believe we *can* do with
> >> this approach is to avoid deadlocks due to locally unknown
> >> dependencies, which has some bearing on this documentation patch, and
> >> also to allow memory allocation in dma-fence (not memory-fence)
> >> critical sections, like gpu fault- and error handlers without
> >> resorting to using memory pools.
> >
> > Avoiding deadlocks is only the tip of the iceberg here.
> >
> > When you allow the kernel to depend on user space to proceed with some
> > operation there are a lot more things which need consideration.
> >
> > E.g. what happens when an userspace process which has submitted stuff
> > to the kernel is killed? Are the prepared commands send to the
> > hardware or aborted as well? What do we do with other processes
> > waiting for that stuff?
> >
> > How to we do resource accounting? When processes need to block when
> > submitting to the hardware stuff which is not ready we have a process
> > we can punish for blocking resources. But how is kernel memory used
> > for a submission accounted? How do we avoid deny of service attacks
> > here were somebody eats up all memory by doing submissions which can't
> > finish?
> >
> Hmm. Are these problems really unique to user-space controlled
> dependencies? Couldn't you hit the same or similar problems with
> mis-behaving shaders blocking timeline progress?

We just kill them, which we can because stuff needs to complete in a
timely fashion, and without any further intervention - all
prerequisite dependencies must be and are known by the kernel.

But with the long/endless running compute stuff with userspace sync
points and everything free-wheeling, including stuff like "hey I'll
submit this batch but the memory isn't even all allocated yet, so I'm
just going to hang it on this semaphore until that's done", things are
entirely different. There, just shooting the batch kills the programming
model, and arbitrarily holding up a batch for another one to first get its
memory also breaks it, because userspace might have issued them with
dependencies in the other order.

So with that execution model you don't run batches, but just an entire
context. It's up to userspace what it does with that, and like with cpu
threads, just running a busy loop doing nothing is a perfectly legit
workload (from the kernel's pov at least). Nothing in the kernel ever
waits on such a context to do anything, if the kernel needs something
you just preempt (or if it's memory and you have gpu page fault
handling, rip out the page). Accounting is all done on a specific gpu
context too. And probably we need a somewhat consistent approach on
how we handle these gpu context things (definitely needed for cgroups
and all that).
-Daniel
Dave Airlie July 21, 2020, 9:42 p.m. UTC | #16
>
> >> That's also why I'm not positive on the "no hw preemption, only
> >> scheduler" case: You still have a dma_fence for the batch itself,
> >> which means still no userspace controlled synchronization or other
> >> form of indefinite batches allowed. So not getting us any closer to
> >> enabling the compute use cases people want.
>
> What compute use case are you talking about? I'm only aware about the
> wait before signal case from Vulkan, the page fault case and the KFD
> preemption fence case.

So slight aside, but it does appear as if Intel's Level 0 API exposes
some of the same problems as Vulkan.

They have fences:
"A fence cannot be shared across processes."

They have events (userspace fences) like Vulkan but specify:
"Signaled from the host, and waited upon from within a device’s command list."

"There are no protections against events causing deadlocks, such as
circular waits scenarios.

These problems are left to the application to avoid."

https://spec.oneapi.com/level-zero/latest/core/PROG.html#synchronization-primitives
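
To make the circular wait scenario concrete, here is a minimal Python
sketch. It is not taken from the Level Zero spec or any driver; the
find_wait_cycle helper and the command list names are hypothetical. It only
shows that a set of "waits on" relations deadlocks exactly when the wait
graph contains a cycle:

```python
def find_wait_cycle(waits_on):
    """Return one cycle in the wait graph as a list of nodes, or None.

    waits_on maps each command list (hypothetical names) to the command
    lists whose events it waits on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for dep in waits_on.get(node, ()):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path: circular wait found
                return stack[stack.index(dep):] + [dep]
            if dep_state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for node in list(waits_on):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For example, find_wait_cycle({"A": ["B"], "B": ["A"]}) reports the cycle
A, B, A, while a simple chain such as {"A": ["B"], "B": ["C"]} yields None.
With events signaled from the host the kernel cannot build this graph at
all, since it does not know which waits userspace will eventually satisfy.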

Dave.
Dave Airlie July 21, 2020, 10:45 p.m. UTC | #17
On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 7/21/20 9:45 AM, Christian König wrote:
> > Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >> wrote:
> >>> Hi,
> >>>
> >>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>> write this down once and for all.
> >>>>
> >>>> What I'm not sure about is whether the text should be more explicit in
> >>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>> workloads or workloads where userspace fencing is allowed.
> >>> Although (in my humble opinion) it might be possible to completely
> >>> untangle
> >>> kernel-introduced fences for resource management and dma-fences used
> >>> for
> >>> completion- and dependency tracking and lift a lot of restrictions
> >>> for the
> >>> dma-fences, including prohibiting infinite ones, I think this makes
> >>> sense
> >>> describing the current state.
> >> Yeah I think a future patch needs to type up how we want to make that
> >> happen (for some cross driver consistency) and what needs to be
> >> considered. Some of the necessary parts are already there (with like the
> >> preemption fences amdkfd has as an example), but I think some clear docs
> >> on what's required from both hw, drivers and userspace would be really
> >> good.
> >
> > I'm currently writing that up, but probably still need a few days for
> > this.
>
> Great! I put down some (very) initial thoughts a couple of weeks ago
> building on eviction fences for various hardware complexity levels here:
>
> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace, which
also introduces the problem of how to interoperate between the two
models on one card.

Dave.
Thomas Hellström (Intel) July 22, 2020, 6:45 a.m. UTC | #18
On 2020-07-22 00:45, Dave Airlie wrote:
> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 7/21/20 9:45 AM, Christian König wrote:
>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>> write this down once and for all.
>>>>>>
>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>> Although (in my humble opinion) it might be possible to completely
>>>>> untangle
>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>> for
>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>> for the
>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>> sense
>>>>> describing the current state.
>>>> Yeah I think a future patch needs to type up how we want to make that
>>>> happen (for some cross driver consistency) and what needs to be
>>>> considered. Some of the necessary parts are already there (with like the
>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>> on what's required from both hw, drivers and userspace would be really
>>>> good.
>>> I'm currently writing that up, but probably still need a few days for
>>> this.
>> Great! I put down some (very) initial thoughts a couple of weeks ago
>> building on eviction fences for various hardware complexity levels here:
>>
>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> We are seeing HW that has recoverable GPU page faults but only for
> compute tasks, and scheduler without semaphores hw for graphics.
>
> So a single driver may have to expose both models to userspace and
> also introduces the problem of how to interoperate between the two
> models on one card.
>
> Dave.

Hmm, yes. To begin with, it's important to note that this is not a 
replacement for new programming models or APIs. This is something that 
takes place internally in drivers to mitigate many of the restrictions 
that are currently imposed on dma-fence and documented in this and 
previous series. It's basically the driver-private narrow completions 
Jason suggested in the lockdep patches discussions implemented the same 
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like 
TTM, and the corresponding drivers.  The only cross-driver-like 
visibility would be that the dma-buf move_notify() callback would not be 
allowed to wait on dma-fences or something that depends on a dma-fence.

So with that in mind, I don't foresee engines with different 
capabilities on the same card being a problem.

/Thomas
Daniel Vetter July 22, 2020, 7:11 a.m. UTC | #19
On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 00:45, Dave Airlie wrote:
> > On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>> write this down once and for all.
> >>>>>>
> >>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>> Although (in my humble opinion) it might be possible to completely
> >>>>> untangle
> >>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>> for
> >>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>> for the
> >>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>> sense
> >>>>> describing the current state.
> >>>> Yeah I think a future patch needs to type up how we want to make that
> >>>> happen (for some cross driver consistency) and what needs to be
> >>>> considered. Some of the necessary parts are already there (with like the
> >>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>> on what's required from both hw, drivers and userspace would be really
> >>>> good.
> >>> I'm currently writing that up, but probably still need a few days for
> >>> this.
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > We are seeing HW that has recoverable GPU page faults but only for
> > compute tasks, and scheduler without semaphores hw for graphics.
> >
> > So a single driver may have to expose both models to userspace and
> > also introduces the problem of how to interoperate between the two
> > models on one card.
> >
> > Dave.
>
> Hmm, yes to begin with it's important to note that this is not a
> replacement for new programming models or APIs, This is something that
> takes place internally in drivers to mitigate many of the restrictions
> that are currently imposed on dma-fence and documented in this and
> previous series. It's basically the driver-private narrow completions
> Jason suggested in the lockdep patches discussions implemented the same
> way as eviction-fences.
>
> The memory fence API would be local to helpers and middle-layers like
> TTM, and the corresponding drivers.  The only cross-driver-like
> visibility would be that the dma-buf move_notify() callback would not be
> allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

That's kinda why I think the only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new & entirely free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on Windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to container that completely).

Cheers, Daniel

> So with that in mind, I don't foresee engines with different
> capabilities on the same card being a problem.
>
> /Thomas
>
>
Thomas Hellström (Intel) July 22, 2020, 8:05 a.m. UTC | #20
On 2020-07-22 09:11, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 00:45, Dave Airlie wrote:
>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>> write this down once and for all.
>>>>>>>>
>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>> untangle
>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>> for
>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>> for the
>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>> sense
>>>>>>> describing the current state.
>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>> good.
>>>>> I'm currently writing that up, but probably still need a few days for
>>>>> this.
>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>> building on eviction fences for various hardware complexity levels here:
>>>>
>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>> We are seeing HW that has recoverable GPU page faults but only for
>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>
>>> So a single driver may have to expose both models to userspace and
>>> also introduces the problem of how to interoperate between the two
>>> models on one card.
>>>
>>> Dave.
>> Hmm, yes to begin with it's important to note that this is not a
>> replacement for new programming models or APIs, This is something that
>> takes place internally in drivers to mitigate many of the restrictions
>> that are currently imposed on dma-fence and documented in this and
>> previous series. It's basically the driver-private narrow completions
>> Jason suggested in the lockdep patches discussions implemented the same
>> way as eviction-fences.
>>
>> The memory fence API would be local to helpers and middle-layers like
>> TTM, and the corresponding drivers.  The only cross-driver-like
>> visibility would be that the dma-buf move_notify() callback would not be
>> allowed to wait on dma-fences or something that depends on a dma-fence.
> Because we can't preempt (on some engines at least) we already have
> the requirement that cross driver buffer management can get stuck on a
> dma-fence. Not even taking into account the horrors we do with
> userptr, which are cross driver no matter what. Limiting move_notify
> to memory fences only doesn't work, since the pte clearing might need
> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the 
end-of-batch fence. (Unless, as documented, there is a scheduler 
consuming sync-file dependencies in which case the memory fence wait 
needs to be able to break out of that). The key thing is not that we can 
break out of execution, but that we can break out of dependencies, since 
when we're executing, all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.

>
> That's kinda why I think only reasonable option is to toss in the
> towel and declare dma-fence to be the memory fence (and suck up all
> the consequences of that decision as uapi, which is kinda where we
> are), and construct something new&entirely free-wheeling for userspace
> fencing. But only for engines that allow enough preempt/gpu page
> faulting to make that possible. Free wheeling userspace fences/gpu
> semaphores or whatever you want to call them (on windows I think it's
> monitored fence) only work if you can preempt to decouple the memory
> fences from your gpu command execution.
>
> There's the in-between step of just decoupling the batchbuffer
> submission prep for hw without any preempt (but a scheduler), but that
> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> all the allocation/mapping work pulled out ahead. vk exposes that
> model directly to clients, GL drivers could use it internally too, so
> I see zero value in spending lots of time engineering very tricky
> kernel code just for old userspace. Much more reasonable to do that in
> userspace, where we have real debuggers and no panics about security
> bugs (or well, a lot less, webgl is still a thing, but at least
> browsers realized you need to container that completely).

Sure, it's definitely a big chunk of work. I think the big win would be 
allowing memory allocation in dma-fence critical sections. But I 
completely buy the above argument. I just wanted to point out that many 
of the dma-fence restrictions are IMHO fixable, should we need to do 
that for whatever reason.

/Thomas


>
> Cheers, Daniel
>
>> So with that in mind, I don't foresee engines with different
>> capabilities on the same card being a problem.
>>
>> /Thomas
>>
>>
>
Daniel Vetter July 22, 2020, 9:45 a.m. UTC | #21
On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 09:11, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 00:45, Dave Airlie wrote:
> >>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>> write this down once and for all.
> >>>>>>>>
> >>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>> untangle
> >>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>> for
> >>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>> for the
> >>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>> sense
> >>>>>>> describing the current state.
> >>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>> good.
> >>>>> I'm currently writing that up, but probably still need a few days for
> >>>>> this.
> >>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>> building on eviction fences for various hardware complexity levels here:
> >>>>
> >>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>> We are seeing HW that has recoverable GPU page faults but only for
> >>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>
> >>> So a single driver may have to expose both models to userspace and
> >>> also introduces the problem of how to interoperate between the two
> >>> models on one card.
> >>>
> >>> Dave.
> >> Hmm, yes to begin with it's important to note that this is not a
> >> replacement for new programming models or APIs, This is something that
> >> takes place internally in drivers to mitigate many of the restrictions
> >> that are currently imposed on dma-fence and documented in this and
> >> previous series. It's basically the driver-private narrow completions
> >> Jason suggested in the lockdep patches discussions implemented the same
> >> way as eviction-fences.
> >>
> >> The memory fence API would be local to helpers and middle-layers like
> >> TTM, and the corresponding drivers.  The only cross-driver-like
> >> visibility would be that the dma-buf move_notify() callback would not be
> >> allowed to wait on dma-fences or something that depends on a dma-fence.
> > Because we can't preempt (on some engines at least) we already have
> > the requirement that cross driver buffer management can get stuck on a
> > dma-fence. Not even taking into account the horrors we do with
> > userptr, which are cross driver no matter what. Limiting move_notify
> > to memory fences only doesn't work, since the pte clearing might need
> > to wait for a dma_fence first. Hence this becomes a full end-of-batch
> > fence, not just a limited kernel-internal memory fence.
>
> For non-preemptible hardware the memory fence typically *is* the
> end-of-batch fence. (Unless, as documented, there is a scheduler
> consuming sync-file dependencies in which case the memory fence wait
> needs to be able to break out of that). The key thing is not that we can
> break out of execution, but that we can break out of dependencies, since
> when we're executing all dependencies (modulo semaphores) are already
> fulfilled. That's what's eliminating the deadlocks.
>
> > That's kinda why I think only reasonable option is to toss in the
> > towel and declare dma-fence to be the memory fence (and suck up all
> > the consequences of that decision as uapi, which is kinda where we
> > are), and construct something new&entirely free-wheeling for userspace
> > fencing. But only for engines that allow enough preempt/gpu page
> > faulting to make that possible. Free wheeling userspace fences/gpu
> > semaphores or whatever you want to call them (on windows I think it's
> > monitored fence) only work if you can preempt to decouple the memory
> > fences from your gpu command execution.
> >
> > There's the in-between step of just decoupling the batchbuffer
> > submission prep for hw without any preempt (but a scheduler), but that
> > seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> > all the allocation/mapping work pulled out ahead. vk exposes that
> > model directly to clients, GL drivers could use it internally too, so
> > I see zero value in spending lots of time engineering very tricky
> > kernel code just for old userspace. Much more reasonable to do that in
> > userspace, where we have real debuggers and no panics about security
> > bugs (or well, a lot less, webgl is still a thing, but at least
> > browsers realized you need to container that completely).
>
> Sure, it's definitely a big chunk of work. I think the big win would be
> allowing memory allocation in dma-fence critical sections. But I
> completely buy the above argument. I just wanted to point out that many
> of the dma-fence restrictions are IMHO fixable, should we need to do
> that for whatever reason.

I'm still not sure that's possible, without preemption at least. We
have 4 edges:
- Kernel has internal dependencies among memory fences. We want that to
allow (mild) amounts of overcommit, since that simplifies life so
much.
- Memory fences can block gpu ctx execution (by nature of the memory
simply not being there yet due to our overcommit)
- gpu ctx have (if we allow this) userspace controlled semaphore
dependencies. Of course userspace is expected to not create deadlocks,
but that's only assuming the kernel doesn't inject additional
dependencies. Compute folks really want that.
- gpu ctx can hold up memory allocations if all we have is
end-of-batch fences. And end-of-batch fences are all we have without
preempt, plus if we want backwards compat with the entire current
winsys/compositor ecosystem we need them, which allows us to inject
stuff dependent upon them pretty much anywhere.

Fundamentally that's not fixable without throwing one of the edges
(and the corresponding feature that enables) out, since no entity has
full visibility into what's going on. E.g. forcing userspace to tell
the kernel about all semaphores just brings us back to the
drm_timeline_syncobj design we have merged right now. And that's imo
no better.
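
To make the four edges concrete, here is a minimal sketch of how they can
close into a wait cycle. The node names (ctx_A, ctx_B, mem_fence_A,
mem_fence_B) and the helper find_cycle() are purely illustrative assumptions
and not anything from the kernel; the model just has one waiter per edge
and looks for a loop with a depth-first search.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical model: an edge (a, b) means "a cannot complete until b does".
# One illustrative edge per item in the list above.
EDGES: Dict[str, Tuple[str, str]] = {
    "kernel-internal memory fence dependency": ("mem_fence_A", "mem_fence_B"),
    "memory fence blocks gpu ctx": ("ctx_A", "mem_fence_A"),
    "userspace semaphore": ("ctx_B", "ctx_A"),
    "end-of-batch fence holds up memory": ("mem_fence_B", "ctx_B"),
}


def find_cycle(edges: Iterable[Tuple[str, str]]) -> Optional[List[str]]:
    """Return one wait cycle as a list of nodes (first == last), or None."""
    graph: Dict[str, List[str]] = {}
    for waiter, signaller in edges:
        graph.setdefault(waiter, []).append(signaller)
        graph.setdefault(signaller, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge: we found a loop
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def edges_without(name: str) -> List[Tuple[str, str]]:
    """All modelled edges except the named one (i.e. drop that feature)."""
    return [edge for key, edge in EDGES.items() if key != name]
```

In this toy model find_cycle(EDGES.values()) reports a loop through all four
nodes, while find_cycle(edges_without(name)) finds none for any single edge
name, which matches the claim that one edge (and its feature) has to go.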

That's kinda why I'm not seeing much benefit in a half-way state:
Tons of work, and still not what userspace wants. And for the full
deal that userspace wants we might as well not change anything with
dma-fences. For that we need a) ctx preempt and b) new entirely
decoupled fences that never feed back into memory fences and c) are
controlled entirely by userspace. And c) is the really important thing
people want us to provide.

And once we're ok with dma_fence == memory fences, then enforcing the
strict and painful memory allocation limitations is actually what we
want.

Cheers, Daniel
Thomas Hellström (Intel) July 22, 2020, 10:31 a.m. UTC | #22
On 2020-07-22 11:45, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 09:11, Daniel Vetter wrote:
>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 00:45, Dave Airlie wrote:
>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>>>> <thomas_os@shipmail.org> wrote:
>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>>> write this down once and for all.
>>>>>>>>>>
>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>>> untangle
>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>>>> for
>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>>> for the
>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>>>> sense
>>>>>>>>> describing the current state.
>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>>>> good.
>>>>>>> I'm currently writing that up, but probably still need a few days for
>>>>>>> this.
>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>>> building on eviction fences for various hardware complexity levels here:
>>>>>>
>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>> We are seeing HW that has recoverable GPU page faults but only for
>>>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>>>
>>>>> So a single driver may have to expose both models to userspace and
>>>>> also introduces the problem of how to interoperate between the two
>>>>> models on one card.
>>>>>
>>>>> Dave.
>>>> Hmm, yes to begin with it's important to note that this is not a
>>>> replacement for new programming models or APIs, This is something that
>>>> takes place internally in drivers to mitigate many of the restrictions
>>>> that are currently imposed on dma-fence and documented in this and
>>>> previous series. It's basically the driver-private narrow completions
>>>> Jason suggested in the lockdep patches discussions implemented the same
>>>> way as eviction-fences.
>>>>
>>>> The memory fence API would be local to helpers and middle-layers like
>>>> TTM, and the corresponding drivers.  The only cross-driver-like
>>>> visibility would be that the dma-buf move_notify() callback would not be
>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
>>> Because we can't preempt (on some engines at least) we already have
>>> the requirement that cross driver buffer management can get stuck on a
>>> dma-fence. Not even taking into account the horrors we do with
>>> userptr, which are cross driver no matter what. Limiting move_notify
>>> to memory fences only doesn't work, since the pte clearing might need
>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
>>> fence, not just a limited kernel-internal memory fence.
>> For non-preemptible hardware the memory fence typically *is* the
>> end-of-batch fence. (Unless, as documented, there is a scheduler
>> consuming sync-file dependencies in which case the memory fence wait
>> needs to be able to break out of that). The key thing is not that we can
>> break out of execution, but that we can break out of dependencies, since
>> when we're executing all dependencies (modulo semaphores) are already
>> fulfilled. That's what's eliminating the deadlocks.
>>
>>> That's kinda why I think only reasonable option is to toss in the
>>> towel and declare dma-fence to be the memory fence (and suck up all
>>> the consequences of that decision as uapi, which is kinda where we
>>> are), and construct something new&entirely free-wheeling for userspace
>>> fencing. But only for engines that allow enough preempt/gpu page
>>> faulting to make that possible. Free wheeling userspace fences/gpu
>>> semaphores or whatever you want to call them (on windows I think it's
>>> monitored fence) only work if you can preempt to decouple the memory
>>> fences from your gpu command execution.
>>>
>>> There's the in-between step of just decoupling the batchbuffer
>>> submission prep for hw without any preempt (but a scheduler), but that
>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
>>> all the allocation/mapping work pulled out ahead. vk exposes that
>>> model directly to clients, GL drivers could use it internally too, so
>>> I see zero value in spending lots of time engineering very tricky
>>> kernel code just for old userspace. Much more reasonable to do that in
>>> userspace, where we have real debuggers and no panics about security
>>> bugs (or well, a lot less, webgl is still a thing, but at least
>>> browsers realized you need to container that completely).
>> Sure, it's definitely a big chunk of work. I think the big win would be
>> allowing memory allocation in dma-fence critical sections. But I
>> completely buy the above argument. I just wanted to point out that many
>> of the dma-fence restrictions are IMHO fixable, should we need to do
>> that for whatever reason.
> I'm still not sure that's possible, without preemption at least. We
> have 4 edges:
> - Kernel has internal dependencies among memory fences. We want that to
> allow (mild) amounts of overcommit, since that simplifies life so
> much.
> - Memory fences can block gpu ctx execution (by nature of the memory
> simply not being there yet due to our overcommit)
> - gpu ctx have (if we allow this) userspace controlled semaphore
> dependencies. Of course userspace is expected to not create deadlocks,
> but that's only assuming the kernel doesn't inject additional
> dependencies. Compute folks really want that.
> - gpu ctx can hold up memory allocations if all we have is
> end-of-batch fences. And end-of-batch fences are all we have without
> preempt, plus if we want backwards compat with the entire current
> winsys/compositor ecosystem we need them, which allows us to inject
> stuff dependent upon them pretty much anywhere.
>
> Fundamentally that's not fixable without throwing one of the edges
> (and the corresponding feature that enables) out, since no entity has
> full visibility into what's going on. E.g. forcing userspace to tell
> the kernel about all semaphores just brings us back to the
> drm_timeline_syncobj design we have merged right now. And that's imo
> no better.

Indeed, HW waiting for semaphores without being able to preempt that 
wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.

>
> That's kinda why I'm not seeing much benefits in a half-way state:
> Tons of work, and still not what userspace wants. And for the full
> deal that userspace wants we might as well not change anything with
> dma-fences. For that we need a) ctx preempt and b) new entirely
> decoupled fences that never feed back into a memory fences and c) are
> controlled entirely by userspace. And c) is the really important thing
> people want us to provide.
>
> And once we're ok with dma_fence == memory fences, then enforcing the
> strict and painful memory allocation limitations is actually what we
> want.

Let's hope you're right. My fear is that that might be pretty painful as 
well.

> Cheers, Daniel

/Thomas
Daniel Vetter July 22, 2020, 11:39 a.m. UTC | #23
On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 11:45, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 09:11, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 2020-07-22 00:45, Dave Airlie wrote:
> >>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>>> <thomas_os@shipmail.org> wrote:
> >>>>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>>>> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>>>> write this down once and for all.
> >>>>>>>>>>
> >>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>>>> untangle
> >>>>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>>>> for
> >>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>>>> for the
> >>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>>>> sense
> >>>>>>>>> describing the current state.
> >>>>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>>>> good.
> >>>>>>> I'm currently writing that up, but probably still need a few days for
> >>>>>>> this.
> >>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>>>> building on eviction fences for various hardware complexity levels here:
> >>>>>>
> >>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>>>> We are seeing HW that has recoverable GPU page faults but only for
> >>>>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>>>
> >>>>> So a single driver may have to expose both models to userspace and
> >>>>> also introduces the problem of how to interoperate between the two
> >>>>> models on one card.
> >>>>>
> >>>>> Dave.
> >>>> Hmm, yes to begin with it's important to note that this is not a
> >>>> replacement for new programming models or APIs, This is something that
> >>>> takes place internally in drivers to mitigate many of the restrictions
> >>>> that are currently imposed on dma-fence and documented in this and
> >>>> previous series. It's basically the driver-private narrow completions
> >>>> Jason suggested in the lockdep patches discussions implemented the same
> >>>> way as eviction-fences.
> >>>>
> >>>> The memory fence API would be local to helpers and middle-layers like
> >>>> TTM, and the corresponding drivers.  The only cross-driver-like
> >>>> visibility would be that the dma-buf move_notify() callback would not be
> >>>> allowed to wait on dma-fences or something that depends on a dma-fence.
> >>> Because we can't preempt (on some engines at least) we already have
> >>> the requirement that cross driver buffer management can get stuck on a
> >>> dma-fence. Not even taking into account the horrors we do with
> >>> userptr, which are cross driver no matter what. Limiting move_notify
> >>> to memory fences only doesn't work, since the pte clearing might need
> >>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> >>> fence, not just a limited kernel-internal memory fence.
> >> For non-preemptible hardware the memory fence typically *is* the
> >> end-of-batch fence. (Unless, as documented, there is a scheduler
> >> consuming sync-file dependencies in which case the memory fence wait
> >> needs to be able to break out of that). The key thing is not that we can
> >> break out of execution, but that we can break out of dependencies, since
> >> when we're executing all dependencies (modulo semaphores) are already
> >> fulfilled. That's what's eliminating the deadlocks.
> >>
> >>> That's kinda why I think only reasonable option is to toss in the
> >>> towel and declare dma-fence to be the memory fence (and suck up all
> >>> the consequences of that decision as uapi, which is kinda where we
> >>> are), and construct something new&entirely free-wheeling for userspace
> >>> fencing. But only for engines that allow enough preempt/gpu page
> >>> faulting to make that possible. Free wheeling userspace fences/gpu
> >>> semaphores or whatever you want to call them (on windows I think it's
> >>> monitored fence) only work if you can preempt to decouple the memory
> >>> fences from your gpu command execution.
> >>>
> >>> There's the in-between step of just decoupling the batchbuffer
> >>> submission prep for hw without any preempt (but a scheduler), but that
> >>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> >>> all the allocation/mapping work pulled out ahead. vk exposes that
> >>> model directly to clients, GL drivers could use it internally too, so
> >>> I see zero value in spending lots of time engineering very tricky
> >>> kernel code just for old userspace. Much more reasonable to do that in
> >>> userspace, where we have real debuggers and no panics about security
> >>> bugs (or well, a lot less, webgl is still a thing, but at least
> >>> browsers realized you need to container that completely).
> >> Sure, it's definitely a big chunk of work. I think the big win would be
> >> allowing memory allocation in dma-fence critical sections. But I
> >> completely buy the above argument. I just wanted to point out that many
> >> of the dma-fence restrictions are IMHO fixable, should we need to do
> >> that for whatever reason.
> > I'm still not sure that's possible, without preemption at least. We
> > have 4 edges:
> > - Kernel has internal dependencies among memory fences. We want that to
> > allow (mild) amounts of overcommit, since that simplifies life so
> > much.
> > - Memory fences can block gpu ctx execution (by nature of the memory
> > simply not being there yet due to our overcommit)
> > - gpu ctx have (if we allow this) userspace controlled semaphore
> > dependencies. Of course userspace is expected to not create deadlocks,
> > but that's only assuming the kernel doesn't inject additional
> > dependencies. Compute folks really want that.
> > - gpu ctx can hold up memory allocations if all we have is
> > end-of-batch fences. And end-of-batch fences are all we have without
> > preempt, plus if we want backwards compat with the entire current
> > winsys/compositor ecosystem we need them, which allows us to inject
> > stuff dependent upon them pretty much anywhere.
> >
> > Fundamentally that's not fixable without throwing one of the edges
> > (and the corresponding feature that enables) out, since no entity has
> > full visibility into what's going on. E.g. forcing userspace to tell
> > the kernel about all semaphores just brings us back to the
> > drm_timeline_syncobj design we have merged right now. And that's imo
> > no better.
>
> Indeed, HW waiting for semaphores without being able to preempt that
> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.

preempt is a necessary but not sufficient condition, you also must not
have end-of-batch memory fences. And i915 has semaphore support and
end-of-batch memory fences, e.g. one piece is:

commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 14:08:11 2020 +0100

    drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore

Sure it preempts, but that's not enough.

> > That's kinda why I'm not seeing much benefits in a half-way state:
> > Tons of work, and still not what userspace wants. And for the full
> > deal that userspace wants we might as well not change anything with
> > dma-fences. For that we need a) ctx preempt and b) new entirely
> > decoupled fences that never feed back into a memory fences and c) are
> > controlled entirely by userspace. And c) is the really important thing
> > people want us to provide.
> >
> > And once we're ok with dma_fence == memory fences, then enforcing the
> > strict and painful memory allocation limitations is actually what we
> > want.
>
> Let's hope you're right. My fear is that that might be pretty painful as
> well.

Oh it's very painful too:
- We need a separate uapi flavour for gpu ctx with preempt instead of
end-of-batch dma-fence.
- Which needs to be implemented without breaking stuff badly - e.g. we
need to make sure we don't probe-wait on fences unnecessarily since
that forces random unwanted preempts.
- If we want this with winsys integration we need full userspace
revisions since all the dma_fence based sync sharing is out (implicit
sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
only go the other way round).

Utter pain, but I think it's better since it can be done
driver-by-driver, and even userspace usecase by usecase. Which means
we can experiment in areas where the 10+ years of uapi guarantee isn't
so painful, and learn, until we do the big jump where new
zero-interaction-with-memory-management fences become baked in forever
into compositor/winsys/modeset protocols. With the other approach of
splitting dma-fence we need to do all the splitting first, make sure
we get it right, and only then can we enable the use-case for real.

That's just not going to happen, at least not in upstream across all
drivers. Within a single driver in some vendor tree hacking stuff up
is totally fine ofc.
-Daniel
Thomas Hellström (Intel) July 22, 2020, 12:22 p.m. UTC | #24
On 2020-07-22 13:39, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 11:45, Daniel Vetter wrote:
>>> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 09:11, Daniel Vetter wrote:
>>>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
>>>>> <thomas_os@shipmail.org> wrote:
>>>>>> On 2020-07-22 00:45, Dave Airlie wrote:
>>>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>>>>> write this down once and for all.
>>>>>>>>>>>>
>>>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>>>>> untangle
>>>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>>>>>> for
>>>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>>>>> for the
>>>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>>>>>> sense
>>>>>>>>>>> describing the current state.
>>>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>>>>>> good.
>>>>>>>>> I'm currently writing that up, but probably still need a few days for
>>>>>>>>> this.
>>>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>>>>> building on eviction fences for various hardware complexity levels here:
>>>>>>>>
>>>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>>>> We are seeing HW that has recoverable GPU page faults but only for
>>>>>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>>>>>
>>>>>>> So a single driver may have to expose both models to userspace and
>>>>>>> also introduces the problem of how to interoperate between the two
>>>>>>> models on one card.
>>>>>>>
>>>>>>> Dave.
>>>>>> Hmm, yes to begin with it's important to note that this is not a
>>>>>> replacement for new programming models or APIs, This is something that
>>>>>> takes place internally in drivers to mitigate many of the restrictions
>>>>>> that are currently imposed on dma-fence and documented in this and
>>>>>> previous series. It's basically the driver-private narrow completions
>>>>>> Jason suggested in the lockdep patches discussions implemented the same
>>>>>> way as eviction-fences.
>>>>>>
>>>>>> The memory fence API would be local to helpers and middle-layers like
>>>>>> TTM, and the corresponding drivers.  The only cross-driver-like
>>>>>> visibility would be that the dma-buf move_notify() callback would not be
>>>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
>>>>> Because we can't preempt (on some engines at least) we already have
>>>>> the requirement that cross driver buffer management can get stuck on a
>>>>> dma-fence. Not even taking into account the horrors we do with
>>>>> userptr, which are cross driver no matter what. Limiting move_notify
>>>>> to memory fences only doesn't work, since the pte clearing might need
>>>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
>>>>> fence, not just a limited kernel-internal memory fence.
>>>> For non-preemptible hardware the memory fence typically *is* the
>>>> end-of-batch fence. (Unless, as documented, there is a scheduler
>>>> consuming sync-file dependencies in which case the memory fence wait
>>>> needs to be able to break out of that). The key thing is not that we can
>>>> break out of execution, but that we can break out of dependencies, since
>>>> when we're executing all dependecies (modulo semaphores) are already
>>>> fulfilled. That's what's eliminating the deadlocks.
>>>>
>>>>> That's kinda why I think only reasonable option is to toss in the
>>>>> towel and declare dma-fence to be the memory fence (and suck up all
>>>>> the consequences of that decision as uapi, which is kinda where we
>>>>> are), and construct something new&entirely free-wheeling for userspace
>>>>> fencing. But only for engines that allow enough preempt/gpu page
>>>>> faulting to make that possible. Free wheeling userspace fences/gpu
>>>>> semaphores or whatever you want to call them (on windows I think it's
>>>>> monitored fence) only work if you can preempt to decouple the memory
>>>>> fences from your gpu command execution.
>>>>>
>>>>> There's the in-between step of just decoupling the batchbuffer
>>>>> submission prep for hw without any preempt (but a scheduler), but that
>>>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
>>>>> all the allocation/mapping work pulled out ahead. vk exposes that
>>>>> model directly to clients, GL drivers could use it internally too, so
>>>>> I see zero value in spending lots of time engineering very tricky
>>>>> kernel code just for old userspace. Much more reasonable to do that in
>>>>> userspace, where we have real debuggers and no panics about security
>>>>> bugs (or well, a lot less, webgl is still a thing, but at least
>>>>> browsers realized you need to container that completely).
>>>> Sure, it's definitely a big chunk of work. I think the big win would be
>>>> allowing memory allocation in dma-fence critical sections. But I
>>>> completely buy the above argument. I just wanted to point out that many
>>>> of the dma-fence restrictions are IMHO fixable, should we need to do
>>>> that for whatever reason.
>>> I'm still not sure that's possible, without preemption at least. We
>>> have 4 edges:
>>> - Kernel has internal depencies among memory fences. We want that to
>>> allow (mild) amounts of overcommit, since that simplifies live so
>>> much.
>>> - Memory fences can block gpu ctx execution (by nature of the memory
>>> simply not being there yet due to our overcommit)
>>> - gpu ctx have (if we allow this) userspace controlled semaphore
>>> dependencies. Of course userspace is expected to not create deadlocks,
>>> but that's only assuming the kernel doesn't inject additional
>>> dependencies. Compute folks really want that.
>>> - gpu ctx can hold up memory allocations if all we have is
>>> end-of-batch fences. And end-of-batch fences are all we have without
>>> preempt, plus if we want backwards compat with the entire current
>>> winsys/compositor ecosystem we need them, which allows us to inject
>>> stuff dependent upon them pretty much anywhere.
>>>
>>> Fundamentally that's not fixable without throwing one of the edges
>>> (and the corresponding feature that enables) out, since no entity has
>>> full visibility into what's going on. E.g. forcing userspace to tell
>>> the kernel about all semaphores just brings up back to the
>>> drm_timeline_syncobj design we have merged right now. And that's imo
>>> no better.
>> Indeed, HW waiting for semaphores without being able to preempt that
>> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.
> preempt is a necessary but not sufficient condition, you also must not
> have end-of-batch memory fences. And i915 has semaphore support and
> end-of-batch memory fences, e.g. one piece is:
>
> commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Apr 7 14:08:11 2020 +0100
>
>      drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore
>
> Sure it preempts, but that's not enough.

Yes, i915 would fall in the "hardware with semaphores" category and 
implement memory fences different from the end-of-batch fences.

>
>>> That's kinda why I'm not seeing much benefits in a half-way state:
>>> Tons of work, and still not what userspace wants. And for the full
>>> deal that userspace wants we might as well not change anything with
>>> dma-fences. For that we need a) ctx preempt and b) new entirely
>>> decoupled fences that never feed back into a memory fences and c) are
>>> controlled entirely by userspace. And c) is the really important thing
>>> people want us to provide.
>>>
>>> And once we're ok with dma_fence == memory fences, then enforcing the
>>> strict and painful memory allocation limitations is actually what we
>>> want.
>> Let's hope you're right. My fear is that that might be pretty painful as
>> well.
> Oh it's very painful too:
> - We need a separate uapi flavour for gpu ctx with preempt instead of
> end-of-batch dma-fence.
> - Which needs to be implemented without breaking stuff badly - e.g. we
> need to make sure we don't probe-wait on fences unnecessarily since
> that forces random unwanted preempts.
> - If we want this with winsys integration we need full userspace
> revisions since all the dma_fence based sync sharing is out (implicit
> sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
> only go the other way round).
> Utter pain, but I think it's better since it can be done
> driver-by-driver, and even userspace usecase by usecase. Which means
> we can experiment in areas where the 10+ years of uapi guarantee isn't
> so painful, learn, until we do the big jump of new
> zero-interaction-with-memory-management fences become baked in forever
> into compositor/winsys/modeset protocols.
>   With the other approach of
> splitting dma-fence we need to do all the splitting first, make sure
> we get it right, and only then can we enable the use-case for real.

Again, let me stress, I'm not advocating for splitting the dma-fence in 
favour of the preempt ctx approach. My question is rather: Do we see the
need for fixing dma-fence as well, given that fixing all
drivers to adhere to the dma-fence restrictions might be just as
painful? So far the clear answer is no, it's not worth it, and I'm fine
with that.

>
> That's just not going to happen, at least not in upstream across all
> drivers. Within a single driver in some vendor tree hacking stuff up
> is totally fine ofc.

Actually, due to the asynchronous restart, that's not really possible 
either. It's all or none.

> -Daniel

/Thomas
Daniel Vetter July 22, 2020, 12:41 p.m. UTC | #25
On Wed, Jul 22, 2020 at 2:22 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 13:39, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 11:45, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 2020-07-22 09:11, Daniel Vetter wrote:
> >>>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >>>>> <thomas_os@shipmail.org> wrote:
> >>>>>> On 2020-07-22 00:45, Dave Airlie wrote:
> >>>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>>>>> <thomas_os@shipmail.org> wrote:
> >>>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>>>>>> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>>>>>> write this down once and for all.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>>>>>> untangle
> >>>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>>>>>> for
> >>>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>>>>>> for the
> >>>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>>>>>> sense
> >>>>>>>>>>> describing the current state.
> >>>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>>>>>> good.
> >>>>>>>>> I'm currently writing that up, but probably still need a few days for
> >>>>>>>>> this.
> >>>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>>>>>> building on eviction fences for various hardware complexity levels here:
> >>>>>>>>
> >>>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>>>>>> We are seeing HW that has recoverable GPU page faults but only for
> >>>>>>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>>>>>
> >>>>>>> So a single driver may have to expose both models to userspace and
> >>>>>>> also introduces the problem of how to interoperate between the two
> >>>>>>> models on one card.
> >>>>>>>
> >>>>>>> Dave.
> >>>>>> Hmm, yes to begin with it's important to note that this is not a
> >>>>>> replacement for new programming models or APIs, This is something that
> >>>>>> takes place internally in drivers to mitigate many of the restrictions
> >>>>>> that are currently imposed on dma-fence and documented in this and
> >>>>>> previous series. It's basically the driver-private narrow completions
> >>>>>> Jason suggested in the lockdep patches discussions implemented the same
> >>>>>> way as eviction-fences.
> >>>>>>
> >>>>>> The memory fence API would be local to helpers and middle-layers like
> >>>>>> TTM, and the corresponding drivers.  The only cross-driver-like
> >>>>>> visibility would be that the dma-buf move_notify() callback would not be
> >>>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
> >>>>> Because we can't preempt (on some engines at least) we already have
> >>>>> the requirement that cross driver buffer management can get stuck on a
> >>>>> dma-fence. Not even taking into account the horrors we do with
> >>>>> userptr, which are cross driver no matter what. Limiting move_notify
> >>>>> to memory fences only doesn't work, since the pte clearing might need
> >>>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> >>>>> fence, not just a limited kernel-internal memory fence.
> >>>> For non-preemptible hardware the memory fence typically *is* the
> >>>> end-of-batch fence. (Unless, as documented, there is a scheduler
> >>>> consuming sync-file dependencies in which case the memory fence wait
> >>>> needs to be able to break out of that). The key thing is not that we can
> >>>> break out of execution, but that we can break out of dependencies, since
> >>>> when we're executing all dependecies (modulo semaphores) are already
> >>>> fulfilled. That's what's eliminating the deadlocks.
> >>>>
> >>>>> That's kinda why I think only reasonable option is to toss in the
> >>>>> towel and declare dma-fence to be the memory fence (and suck up all
> >>>>> the consequences of that decision as uapi, which is kinda where we
> >>>>> are), and construct something new&entirely free-wheeling for userspace
> >>>>> fencing. But only for engines that allow enough preempt/gpu page
> >>>>> faulting to make that possible. Free wheeling userspace fences/gpu
> >>>>> semaphores or whatever you want to call them (on windows I think it's
> >>>>> monitored fence) only work if you can preempt to decouple the memory
> >>>>> fences from your gpu command execution.
> >>>>>
> >>>>> There's the in-between step of just decoupling the batchbuffer
> >>>>> submission prep for hw without any preempt (but a scheduler), but that
> >>>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> >>>>> all the allocation/mapping work pulled out ahead. vk exposes that
> >>>>> model directly to clients, GL drivers could use it internally too, so
> >>>>> I see zero value in spending lots of time engineering very tricky
> >>>>> kernel code just for old userspace. Much more reasonable to do that in
> >>>>> userspace, where we have real debuggers and no panics about security
> >>>>> bugs (or well, a lot less, webgl is still a thing, but at least
> >>>>> browsers realized you need to container that completely).
> >>>> Sure, it's definitely a big chunk of work. I think the big win would be
> >>>> allowing memory allocation in dma-fence critical sections. But I
> >>>> completely buy the above argument. I just wanted to point out that many
> >>>> of the dma-fence restrictions are IMHO fixable, should we need to do
> >>>> that for whatever reason.
> >>> I'm still not sure that's possible, without preemption at least. We
> >>> have 4 edges:
> >>> - Kernel has internal depencies among memory fences. We want that to
> >>> allow (mild) amounts of overcommit, since that simplifies live so
> >>> much.
> >>> - Memory fences can block gpu ctx execution (by nature of the memory
> >>> simply not being there yet due to our overcommit)
> >>> - gpu ctx have (if we allow this) userspace controlled semaphore
> >>> dependencies. Of course userspace is expected to not create deadlocks,
> >>> but that's only assuming the kernel doesn't inject additional
> >>> dependencies. Compute folks really want that.
> >>> - gpu ctx can hold up memory allocations if all we have is
> >>> end-of-batch fences. And end-of-batch fences are all we have without
> >>> preempt, plus if we want backwards compat with the entire current
> >>> winsys/compositor ecosystem we need them, which allows us to inject
> >>> stuff dependent upon them pretty much anywhere.
> >>>
> >>> Fundamentally that's not fixable without throwing one of the edges
> >>> (and the corresponding feature that enables) out, since no entity has
> >>> full visibility into what's going on. E.g. forcing userspace to tell
> >>> the kernel about all semaphores just brings up back to the
> >>> drm_timeline_syncobj design we have merged right now. And that's imo
> >>> no better.
> >> Indeed, HW waiting for semaphores without being able to preempt that
> >> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.
> > preempt is a necessary but not sufficient condition, you also must not
> > have end-of-batch memory fences. And i915 has semaphore support and
> > end-of-batch memory fences, e.g. one piece is:
> >
> > commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Apr 7 14:08:11 2020 +0100
> >
> >      drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore
> >
> > Sure it preempts, but that's not enough.
>
> Yes, i915 would fall in the "hardware with semaphores" category and
> implement memory fences different from the end-of-batch fences.
>
> >
> >>> That's kinda why I'm not seeing much benefits in a half-way state:
> >>> Tons of work, and still not what userspace wants. And for the full
> >>> deal that userspace wants we might as well not change anything with
> >>> dma-fences. For that we need a) ctx preempt and b) new entirely
> >>> decoupled fences that never feed back into a memory fences and c) are
> >>> controlled entirely by userspace. And c) is the really important thing
> >>> people want us to provide.
> >>>
> >>> And once we're ok with dma_fence == memory fences, then enforcing the
> >>> strict and painful memory allocation limitations is actually what we
> >>> want.
> >> Let's hope you're right. My fear is that that might be pretty painful as
> >> well.
> > Oh it's very painful too:
> > - We need a separate uapi flavour for gpu ctx with preempt instead of
> > end-of-batch dma-fence.
> > - Which needs to be implemented without breaking stuff badly - e.g. we
> > need to make sure we don't probe-wait on fences unnecessarily since
> > that forces random unwanted preempts.
> > - If we want this with winsys integration we need full userspace
> > revisions since all the dma_fence based sync sharing is out (implicit
> > sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
> > only go the other way round).
> > Utter pain, but I think it's better since it can be done
> > driver-by-driver, and even userspace usecase by usecase. Which means
> > we can experiment in areas where the 10+ years of uapi guarantee isn't
> > so painful, learn, until we do the big jump of new
> > zero-interaction-with-memory-management fences become baked in forever
> > into compositor/winsys/modeset protocols.
> >   With the other approach of
> > splitting dma-fence we need to do all the splitting first, make sure
> > we get it right, and only then can we enable the use-case for real.
>
> Again, let me stress, I'm not advocating for splitting the dma-fence in
> favour of the preempt ctx approach. My question is rather: Do we see the
> need for fixing dma-fence as well, with the motivation that fixing all
> drivers to adhere to the dma-fence restrictions might be just as
> painful. So far the clear answer is no, it's not worth it, and I'm fine
> with that.

Ah I think I misunderstood which options you want to compare here. I'm
not sure how much pain fixing up "dma-fence as memory fence" really
is. That's kinda why I want a lot more testing on my annotation
patches, to figure that out. Not much feedback aside from amdgpu and
intel, and those two drivers pretty much need to sort out their memory
fence issues anyway (because of userptr and stuff like that).

The only other issues outside of these two drivers I'm aware of:
- various scheduler drivers doing allocations in the drm/scheduler
critical section. Since all arm-soc drivers have a mildly shoddy
memory model of "we just pin everything" they don't really have to
deal with this. So we might just declare arm as a platform broken and
not taint the dma-fence critical sections with fs_reclaim. Otoh we
need to fix this for drm/scheduler anyway, I think best option would
be to have a mempool for hw fences in the scheduler itself, and at
that point fixing the other drivers shouldn't be too onerous.

- vmwgfx doing a dma_resv in the atomic commit tail. Entirely
orthogonal to the entire memory fence discussion.
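
To make the mempool idea above a bit more concrete, here is a minimal
userspace sketch of the pattern: reserve the hw fences up front, outside
the critical section, and only hand out preallocated objects inside it.
This is plain C for illustration, not the kernel mempool API and not
drm/scheduler code; struct hw_fence, struct fence_pool and all the
fence_pool_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver's hardware fence object. */
struct hw_fence {
	unsigned long seqno;
	struct hw_fence *next_free;	/* link on the pool's free list */
};

/* Hypothetical pool: filled up front, drained without allocating. */
struct fence_pool {
	struct hw_fence *free_list;
	size_t available;
};

/*
 * Fill the pool. This is the only function that calls malloc(), so it
 * has to run outside the fence signalling critical section, e.g. at
 * init or job submission time. Returns 0 on success, -1 on failure.
 */
int fence_pool_reserve(struct fence_pool *pool, size_t count)
{
	while (count--) {
		struct hw_fence *f = malloc(sizeof(*f));

		if (!f)
			return -1;
		f->seqno = 0;
		f->next_free = pool->free_list;
		pool->free_list = f;
		pool->available++;
	}
	return 0;
}

/*
 * Take a fence inside the critical section. Never allocates; returns
 * NULL if the reservation was too small.
 */
struct hw_fence *fence_pool_get(struct fence_pool *pool, unsigned long seqno)
{
	struct hw_fence *f = pool->free_list;

	if (!f)
		return NULL;
	pool->free_list = f->next_free;
	pool->available--;
	f->next_free = NULL;
	f->seqno = seqno;
	return f;
}

/* Return a fence to the pool once it has signalled. */
void fence_pool_put(struct fence_pool *pool, struct hw_fence *f)
{
	f->next_free = pool->free_list;
	pool->free_list = f;
	pool->available++;
}

/* Free everything still sitting in the pool. */
void fence_pool_destroy(struct fence_pool *pool)
{
	while (pool->free_list) {
		struct hw_fence *f = pool->free_list;

		pool->free_list = f->next_free;
		free(f);
	}
	pool->available = 0;
}
```

The point of the split is that fence_pool_get() can fail but can never
end up in reclaim, so the sizing of the reservation becomes the
driver's problem to get right at submission time.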

I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.

So thus far I think fixing up the various small bugs the annotations
turn up is the smallest problem we have here. Much, much smaller than
either of "split dma-fence in two" or "add entire new fence
model/uapi/winsys protocol set on top of dma-fence". I think a big
reason we didn't screw up a lot worse on this is the atomic framework,
which was designed very much with a) no allocations in the wrong spot
and b) no lock taking in the wrong spot in mind from the start. Some
of the early atomic prototypes were real horrors in that regard, but
with the helper framework we have now drivers have to go the extra
mile to screw this up. And there's a lot more atomic drivers than
render drivers nowadays merged in upstream.

> > That's just not going to happen, at least not in upstream across all
> > drivers. Within a single driver in some vendor tree hacking stuff up
> > is totally fine ofc.
>
> Actually, due to the asynchronous restart, that's not really possible
> either. It's all or none.
>
> > -Daniel
>
> /Thomas
>
>
Thomas Hellström (Intel) July 22, 2020, 1:12 p.m. UTC | #26
On 2020-07-22 14:41, Daniel Vetter wrote:
>
> Ah I think I misunderstood which options you want to compare here. I'm
> not sure how much pain fixing up "dma-fence as memory fence" really
> is. That's kinda why I want a lot more testing on my annotation
> patches, to figure that out. Not much feedback aside from amdgpu and
> intel, and those two drivers pretty much need to sort out their memory
> fence issues anyway (because of userptr and stuff like that).
>
> The only other issues outside of these two drivers I'm aware of:
> - various scheduler drivers doing allocations in the drm/scheduler
> critical section. Since all arm-soc drivers have a mildly shoddy
> memory model of "we just pin everything" they don't really have to
> deal with this. So we might just declare arm as a platform broken and
> not taint the dma-fence critical sections with fs_reclaim. Otoh we
> need to fix this for drm/scheduler anyway, I think best option would
> be to have a mempool for hw fences in the scheduler itself, and at
> that point fixing the other drivers shouldn't be too onerous.
>
> - vmwgfx doing a dma_resv in the atomic commit tail. Entirely
> orthogonal to the entire memory fence discussion.

With vmwgfx there is another issue that is hit when the gpu signals an 
error. At that point the batch might be restarted with a new meta 
command buffer that needs to be allocated out of a dma pool, in the
fence critical section. That's probably a bit nasty to fix, but not 
impossible.

>
> I'm pretty sure there's more bugs, I just haven't heard from them yet.
> Also due to the opt-in nature of dma-fence we can limit the scope of
> what we fix fairly naturally, just don't put them where no one cares
> :-) Of course that also hides general locking issues in dma_fence
> signalling code, but well *shrug*.
Hmm, yes. Another potential big problem would be drivers that want to 
use gpu page faults in the dma-fence critical sections with the 
batch-based programming model.

/Thomas
Daniel Vetter July 22, 2020, 2:07 p.m. UTC | #27
On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
> On 2020-07-22 14:41, Daniel Vetter wrote:
> > Ah I think I misunderstood which options you want to compare here. I'm
> > not sure how much pain fixing up "dma-fence as memory fence" really
> > is. That's kinda why I want a lot more testing on my annotation
> > patches, to figure that out. Not much feedback aside from amdgpu and
> > intel, and those two drivers pretty much need to sort out their memory
> > fence issues anyway (because of userptr and stuff like that).
> >
> > The only other issues outside of these two drivers I'm aware of:
> > - various scheduler drivers doing allocations in the drm/scheduler
> > critical section. Since all arm-soc drivers have a mildly shoddy
> > memory model of "we just pin everything" they don't really have to
> > deal with this. So we might just declare arm as a platform broken and
> > not taint the dma-fence critical sections with fs_reclaim. Otoh we
> > need to fix this for drm/scheduler anyway, I think best option would
> > be to have a mempool for hw fences in the scheduler itself, and at
> > that point fixing the other drivers shouldn't be too onerous.
> >
> > - vmwgfx doing a dma_resv in the atomic commit tail. Entirely
> > orthogonal to the entire memory fence discussion.
>
> With vmwgfx there is another issue that is hit when the gpu signals an
> error. At that point the batch might be restarted with a new meta
> command buffer that needs to be allocated out of a dma pool, in the
> fence critical section. That's probably a bit nasty to fix, but not
> impossible.

Yeah reset is fun. From what I've seen this isn't any worse than the
hw allocation issue for drm/scheduler drivers, they just allocate
another hw fence with all that drags along. So the same mempool should
be sufficient.

The really nasty thing around reset is display interactions, because
you just can't take drm_modeset_lock. amdgpu fixed that now (at least
the modeset_lock side, not yet the memory allocations that brings
along). i915 has the same problem for gen2/3 (so really old stuff),
and we've solved that by breaking&restarting all i915 fence waits, but
that predates multi-gpu and won't work for shared fences ofc. But it's
so old and predates all multi-gpu laptops that I think wontfix is the
right take.

Other drm/scheduler drivers don't have that problem since they're all
render-only, so no display driver interaction.

> > I'm pretty sure there's more bugs, I just haven't heard from them yet.
> > Also due to the opt-in nature of dma-fence we can limit the scope of
> > what we fix fairly naturally, just don't put them where no one cares
> > :-) Of course that also hides general locking issues in dma_fence
> > signalling code, but well *shrug*.
> Hmm, yes. Another potential big problem would be drivers that want to
> use gpu page faults in the dma-fence critical sections with the
> batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate&enforce this before we land any
code and have big regrets.
-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Christian König July 22, 2020, 2:23 p.m. UTC | #28
Am 22.07.20 um 16:07 schrieb Daniel Vetter:
> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>> I'm pretty sure there's more bugs, I just haven't heard from them yet.
>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>> what we fix fairly naturally, just don't put them where no one cares
>>> :-) Of course that also hides general locking issues in dma_fence
>>> signalling code, but well *shrug*.
>> Hmm, yes. Another potential big problem would be drivers that want to
>> use gpu page faults in the dma-fence critical sections with the
>> batch-based programming model.
> Yeah that's a massive can of worms. But luckily there's no such driver
> merged in upstream, so hopefully we can think about all the
> constraints and how to best annotate&enforce this before we land any
> code and have big regrets.

Do you want some bad news? I once made a prototype for that when Vega10
came out.

But we abandoned this approach for the batch-based approach because
of the horrible performance.

KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.

Christian.

> -Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Thomas Hellström (Intel) July 22, 2020, 2:30 p.m. UTC | #29
On 2020-07-22 16:23, Christian König wrote:
> Am 22.07.20 um 16:07 schrieb Daniel Vetter:
>> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
>> <thomas_os@shipmail.org> wrote:
>>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>>> I'm pretty sure there's more bugs, I just haven't heard from them yet.
>>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>>> what we fix fairly naturally, just don't put them where no one cares
>>>> :-) Of course that also hides general locking issues in dma_fence
>>>> signalling code, but well *shrug*.
>>> Hmm, yes. Another potential big problem would be drivers that want to
>>> use gpu page faults in the dma-fence critical sections with the
>>> batch-based programming model.
>> Yeah that's a massive can of worms. But luckily there's no such driver
>> merged in upstream, so hopefully we can think about all the
>> constraints and how to best annotate&enforce this before we land any
>> code and have big regrets.
>
> Do you want some bad news? I once made a prototype for that when Vega10
> came out.
>
> But we abandoned this approach for the batch-based approach
> because of the horrible performance.

In the context of the previous discussion I'd consider the fact that it's 
not performant in the batch-based model good news :)

Thomas


>
> KFD is going to see that, but this is only with user queues and no 
> dma_fence involved whatsoever.
>
> Christian.
>
>> -Daniel
Christian König July 22, 2020, 2:35 p.m. UTC | #30
Am 22.07.20 um 16:30 schrieb Thomas Hellström (Intel):
>
> On 2020-07-22 16:23, Christian König wrote:
>> Am 22.07.20 um 16:07 schrieb Daniel Vetter:
>>> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>>>> I'm pretty sure there's more bugs, I just haven't heard from them 
>>>>> yet.
>>>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>>>> what we fix fairly naturally, just don't put them where no one cares
>>>>> :-) Of course that also hides general locking issues in dma_fence
>>>>> signalling code, but well *shrug*.
>>>> Hmm, yes. Another potential big problem would be drivers that want to
>>>> use gpu page faults in the dma-fence critical sections with the
>>>> batch-based programming model.
>>> Yeah that's a massive can of worms. But luckily there's no such driver
>>> merged in upstream, so hopefully we can think about all the
>>> constraints and how to best annotate&enforce this before we land any
>>> code and have big regrets.
>>
>> Do you want some bad news? I once made a prototype for that when Vega10 
>> came out.
>>
>> But we abandoned this approach for the batch based approach 
>> because of the horrible performance.
>
> In the context of the previous discussion I'd consider the fact that it's 
> not performant in the batch-based model good news :)

Well, Vega10 had such horrible page fault performance because it 
was the first generation to enable it.

Later hardware versions are much better, but we just didn't push for 
this feature on them any more.

But yeah, now that you mention it, we did discuss this locking problem on 
tons of team calls as well.

Our solution at that time was to just not allow waiting if we do any 
allocation in the page fault handler. But this is of course not 
practical for a production environment.

Christian.

>
> Thomas
>
>
>>
>> KFD is going to see that, but this is only with user queues and no 
>> dma_fence involved whatsoever.
>>
>> Christian.
>>
>>> -Daniel

Patch
diff mbox series

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index f8f6decde359..100bfd227265 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -178,3 +178,73 @@  DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+Indefinite DMA Fences
+~~~~~~~~~~~~~~~~~~~~~
+
+At various times, &dma_fence variants with an indefinite time until
+dma_fence_wait() finishes have been proposed. Examples include:
+
+* Future fences, used in HWC1 to signal when a buffer isn't used by the display
+  any longer, and created with the screen update that makes the buffer visible.
+  The time this fence completes is entirely under userspace's control.
+
+* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
+  been set. Used to asynchronously delay command submission.
+
+* Userspace fences or gpu futexes, fine-grained locking within a command buffer
+  that userspace uses for synchronization across engines or with the CPU, which
+  are then imported as a DMA fence for integration into existing winsys
+  protocols.
+
+* Long-running compute command buffers, while still using traditional end of
+  batch DMA fences for memory management instead of context preemption DMA
+  fences which get reattached when the compute job is rescheduled.
+
+Common to all these schemes is that userspace controls the dependencies of these
+fences and controls when they fire. Mixing indefinite fences with normal
+in-kernel DMA fences does not work, even when a fallback timeout is included to
+protect against malicious userspace:
+
+* Only the kernel knows about all DMA fence dependencies; userspace is not aware
+  of dependencies injected due to memory management or scheduler decisions.
+
+* Only userspace knows about all dependencies in indefinite fences and when
+  exactly they will complete; the kernel has no visibility.
+
+Furthermore the kernel has to be able to hold up userspace command submission
+for memory management needs, which means we must support indefinite fences being
+dependent upon DMA fences. If the kernel also supports indefinite fences in the
+kernel like a DMA fence, as any of the above proposals would, there is the
+potential for deadlocks.
+
+.. kernel-render:: DOT
+   :alt: Indefinite Fencing Dependency Cycle
+   :caption: Indefinite Fencing Dependency Cycle
+
+   digraph "Fencing Cycle" {
+      node [shape=box bgcolor=grey style=filled]
+      kernel [label="Kernel DMA Fences"]
+      userspace [label="userspace controlled fences"]
+      kernel -> userspace [label="memory management"]
+      userspace -> kernel [label="Future fence, fence proxy, ..."]
+
+      { rank=same; kernel userspace }
+   }
+
+This means that the kernel might accidentally create deadlocks through memory
+management dependencies which userspace is unaware of, randomly hanging
+workloads until the timeout kicks in, even though these workloads do not contain
+a deadlock from userspace's perspective. In such a mixed fencing architecture
+there is no single entity with knowledge of all dependencies. Therefore
+preventing such deadlocks from within the kernel is not possible.
+
+The only way to avoid dependency loops is to not allow indefinite fences in the
+kernel. This means:
+
+* No future fences, proxy fences or userspace fences imported as DMA fences,
+  with or without a timeout.
+
+* No DMA fences that signal end of batchbuffer for command submission where
+  userspace is allowed to use userspace fencing or long running compute
+  workloads. This also means no implicit fencing for shared buffers in these
+  cases.
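
Editorial illustration (not part of the patch): the dependency cycle described
in the documentation text above can be sketched as a toy wait-for graph. This is
only a model under stated assumptions, not kernel code. The node names
"kernel_dma_fence" and "userspace_fence", the helper find_cycle(), and the
reading of an edge as "X waits on Y" are all hypothetical and chosen purely for
illustration. The sketch shows that each side's view of the dependencies is
acyclic on its own, while the combined graph, which neither side can see in
full, contains a loop.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    colour: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in waits_on.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: we found a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(waits_on):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical model: an edge "X: [Y]" means "X cannot complete before Y".

# Dependencies only the kernel knows about: userspace command submission is
# held up by kernel DMA fences for memory management.
kernel_view = {"userspace_fence": ["kernel_dma_fence"]}

# Dependencies only userspace knows about: an indefinite fence imported as a
# DMA fence makes that DMA fence wait on something userspace controls.
userspace_view = {"kernel_dma_fence": ["userspace_fence"]}

# The real system contains both sets of edges at once.
combined = {**kernel_view, **userspace_view}
```

Calling find_cycle() on kernel_view or on userspace_view alone finds nothing,
while calling it on combined returns the loop between the two nodes. No single
entity in a mixed fencing architecture holds the combined graph, which is why
the check cannot be performed from within the kernel.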