
[RFC,00/10] drm/panthor: Add user submission

Message ID 20240828172605.19176-1-mihail.atanassov@arm.com (mailing list archive)

Message

Mihail Atanassov Aug. 28, 2024, 5:25 p.m. UTC
Hello all,

This series implements a mechanism to expose Mali CSF GPUs' queue
ringbuffers directly to userspace, along with paraphernalia to allow
userspace to control job synchronisation between the CPU and GPU.

The goal of these changes is to allow userspace to control work
submission to the FW/HW directly without kernel intervention in the
common case, thereby reducing context switching overhead. It also allows
for greater flexibility in the way work is enqueued in the ringbufs.
For example, the current kernel submit path only supports indirect
calls, which is inefficient for small command buffers. Userspace can
also skip unnecessary sync operations.

This is still a work in progress; there's an outstanding issue with
multiple processes using different submission flows triggering
scheduling bugs (e.g. the same group getting scheduled twice), but we'd
love to gather some feedback on the suitability of the approach in
general and see if there's a clear path to merging something like this
eventually.

I've also CC'd AMD maintainers because they have in the past done
something similar[1], in case they want to chime in.

There are two uses of this new uAPI in Mesa, one in gallium/panfrost
(link TBD), and one in panvk [2].

The Gallium implementation is a naïve change just to switch the
submission model and exercise the new kernel code, and we don't plan
on pursuing this at this time.

The panvk driver changes are, however, a better representation of the
intent behind this new uAPI, so please consider that as the reference
userspace. It is also still very much a work in progress.

 * patch 1 adds all the uAPI changes;
 * patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
   the required objects to userspace;
 * patch 3 maps the doorbell pages, similarly to how the user I/O page is
   mapped;
 * patch 4 implements GROUP_KICK, which lets userspace request an
   inactive group to be scheduled on the GPU;
 * patches 5 & 6 implement XGS queues, a way for userspace to
   synchronise GPU queue progress with DRM syncobjs;
 * patches 7 & 8 add notification mechanisms for user & kernel to signal
   changes to native GPU syncobjs.

[1] https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
[2] https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
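For illustration, a minimal sketch of what the userspace fast path could
look like once the ringbuffer, user I/O page and doorbell page are mapped;
every name here (panthor_user_queue, the field layout, the doorbell write)
is hypothetical and stands in for the real uAPI in patch 1 and the panvk
code in [2]:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace view of one CSF queue. */
struct panthor_user_queue {
        uint64_t *ringbuf;           /* command stream instructions (RW mapping) */
        _Atomic uint64_t *insert;    /* insert pointer in the user I/O page      */
        volatile uint32_t *doorbell; /* doorbell page (patch 3)                  */
        uint64_t slots;              /* number of 64-bit slots in the ringbuf    */
};

/* Enqueue a batch of CS instructions and kick the firmware directly,
 * with no ioctl in the common case. */
static void user_queue_submit(struct panthor_user_queue *q,
                              const uint64_t *instrs, uint64_t count)
{
        uint64_t insert = atomic_load_explicit(q->insert, memory_order_relaxed);

        for (uint64_t i = 0; i < count; i++)
                q->ringbuf[(insert + i) % q->slots] = instrs[i];

        /* Publish the new insert pointer before ringing the doorbell. */
        atomic_store_explicit(q->insert, insert + count, memory_order_release);
        *q->doorbell = 1;            /* any write wakes the FW for this queue    */
}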

Ketil Johnsen (7):
  drm/panthor: Add uAPI to submit from user space
  drm/panthor: Extend GROUP_CREATE for user submission
  drm/panthor: Map doorbell pages
  drm/panthor: Add GROUP_KICK ioctl
  drm/panthor: Factor out syncobj handling
  drm/panthor: Implement XGS queues
  drm/panthor: Add SYNC_UPDATE ioctl

Mihail Atanassov (1):
  drm/panthor: Add sync_update eventfd handling

 drivers/gpu/drm/panthor/Makefile          |   4 +-
 drivers/gpu/drm/panthor/panthor_device.c  |  66 ++-
 drivers/gpu/drm/panthor/panthor_device.h  |  35 +-
 drivers/gpu/drm/panthor/panthor_drv.c     | 233 +++++++-
 drivers/gpu/drm/panthor/panthor_fw.c      |   2 +-
 drivers/gpu/drm/panthor/panthor_sched.c   | 408 +++++++++-----
 drivers/gpu/drm/panthor/panthor_sched.h   |   8 +-
 drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
 drivers/gpu/drm/panthor/panthor_syncobj.h |  27 +
 drivers/gpu/drm/panthor/panthor_xgs.c     | 638 ++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_xgs.h     |  42 ++
 include/uapi/drm/panthor_drm.h            | 243 +++++++-
 12 files changed, 1696 insertions(+), 177 deletions(-)
 create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h

Comments

Christian König Aug. 29, 2024, 9:40 a.m. UTC | #1
Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> Hello all,
>
> This series implements a mechanism to expose Mali CSF GPUs' queue
> ringbuffers directly to userspace, along with paraphernalia to allow
> userspace to control job synchronisation between the CPU and GPU.
>
> The goal of these changes is to allow userspace to control work
> submission to the FW/HW directly without kernel intervention in the
> common case, thereby reducing context switching overhead. It also allows
> for greater flexibility in the way work is enqueued in the ringbufs.
> For example, the current kernel submit path only supports indirect
> calls, which is inefficient for small command buffers. Userspace can
> also skip unnecessary sync operations.

Question is how do you guarantee forward progress for fence signaling?

E.g. when are fences created and published? How do they signal?

How are dependencies handled? How can the kernel suspend a userspace queue?

How does memory management work in this case?

Regards,
Christian.

> [...]
Steven Price Aug. 29, 2024, 1:37 p.m. UTC | #2
Hi Christian,

Mihail should be able to give more definitive answers, but I think I can
answer your questions.

On 29/08/2024 10:40, Christian König wrote:
> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>> Hello all,
>>
>> This series implements a mechanism to expose Mali CSF GPUs' queue
>> ringbuffers directly to userspace, along with paraphernalia to allow
>> userspace to control job synchronisation between the CPU and GPU.
>>
>> The goal of these changes is to allow userspace to control work
>> submission to the FW/HW directly without kernel intervention in the
>> common case, thereby reducing context switching overhead. It also allows
>> for greater flexibility in the way work is enqueued in the ringbufs.
>> For example, the current kernel submit path only supports indirect
>> calls, which is inefficient for small command buffers. Userspace can
>> also skip unnecessary sync operations.

> Question is how do you guarantee forward progress for fence signaling?

A timeout. Although looking at it I think it's probably set too high
currently:

> +#define JOB_TIMEOUT_MS				5000

But basically the XGS queue is a DRM scheduler just like a normal GPU
queue and the jobs have a timeout. If the timeout is hit then any fences
will be signalled (with an error).
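
As a rough sketch of what that looks like in drm_sched terms (the
xgs_job_sketch wrapper and its fields are illustrative names, not
necessarily the patch 6 structures):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Illustrative wrapper; the real XGS job struct lives in patch 6. */
struct xgs_job_sketch {
        struct drm_sched_job base;
        struct dma_fence *done_fence;
};

static enum drm_gpu_sched_stat
xgs_timedout_job_sketch(struct drm_sched_job *sched_job)
{
        struct xgs_job_sketch *job =
                container_of(sched_job, struct xgs_job_sketch, base);

        /* Fail and signal the fence so everything waiting on it (other
         * queues, VM_BIND ops, CPU waiters) is released instead of
         * blocking forever. */
        dma_fence_set_error(job->done_fence, -ETIMEDOUT);
        dma_fence_signal(job->done_fence);

        /* Tell the scheduler to keep running. */
        return DRM_GPU_SCHED_STAT_NOMINAL;
}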

> E.g. when are fences created and published? How do they signal?
> 
> How are dependencies handled? How can the kernel suspend an userspace
> queue?

The actual userspace queue can be suspended. This is actually a
combination of firmware and kernel driver, and this functionality is
already present without the user submission. The firmware will multiplex
multiple 'groups' onto the hardware, and if there are too many for the
firmware then the kernel multiplexes the extra groups onto the ones the
firmware supports.

I haven't studied Mihail's series in detail yet, but if I understand
correctly, the XGS queues are handled separately and are not suspended
when the hardware queues are suspended. I guess this might be an area
for improvement and might explain the currently very high timeout (to
deal with the case where the actual GPU work has been suspended).

> How does memory management work in this case?

I'm not entirely sure what you mean here. If you are referring to the
potential memory issues with the signalling path then this should be handled
by the timeout - although I haven't studied the code to check for bugs here.

The actual new XGS queues don't allocate/free memory during the queue
execution - so it's just the memory usage related to fences (and the
other work which could be blocked on the fence).

In terms of memory management for the GPU work itself, this is handled
the same as before. The VM_BIND mechanism allows dependencies to be
created between syncobjs and VM operations, with XGS these can then be
tied to GPU HW events.


Fundamentally (modulo bugs) there is little change compared to kernel
submission - it's already fairly trivial to write a GPU job which will
make no forward progress (a 'while (1)' equivalent job). The only
difference here is that XGS makes this 'easy' and doesn't involve the
GPU spinning. Either way we rely on a timeout to recover from these
situations.

Thanks,
Steve

> [...]
Ketil Johnsen Aug. 29, 2024, 1:48 p.m. UTC | #3
On 29/08/2024 11:40, Christian König wrote:
> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>> Hello all,
>>
>> This series implements a mechanism to expose Mali CSF GPUs' queue
>> ringbuffers directly to userspace, along with paraphernalia to allow
>> userspace to control job synchronisation between the CPU and GPU.
>>
>> The goal of these changes is to allow userspace to control work
>> submission to the FW/HW directly without kernel intervention in the
>> common case, thereby reducing context switching overhead. It also allows
>> for greater flexibility in the way work is enqueued in the ringbufs.
>> For example, the current kernel submit path only supports indirect
>> calls, which is inefficient for small command buffers. Userspace can
>> also skip unnecessary sync operations.
> 
> Question is how do you guarantee forward progress for fence signaling?
> 
> E.g. when are fences created and published? How do they signal?

The current XGS queue is built upon an instance of the DRM scheduler, and
XGS jobs which cannot complete immediately are assigned a fence (as one
would expect). This proposal relies on the DRM scheduler timeout to ensure
forward progress if user space has encoded a "bad" stream of commands.

PS: We have tried to consider error propagation in case of timeouts, but
the implementation in this area is most likely somewhat incomplete ATM (not
tested).
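
To illustrate the fence handling described above (hypothetical names only;
xgs_conditions_met() is a made-up helper and xgs_job_sketch mirrors the
earlier timeout sketch), the run_job callback either completes immediately
or hands back the fence that the sync_update path or the timeout will
eventually signal:

static bool xgs_conditions_met(struct xgs_job_sketch *job); /* made-up helper */

static struct dma_fence *xgs_run_job_sketch(struct drm_sched_job *sched_job)
{
        struct xgs_job_sketch *job =
                container_of(sched_job, struct xgs_job_sketch, base);

        if (xgs_conditions_met(job))    /* e.g. GPU syncobj already at value */
                return NULL;            /* completes immediately, no fence   */

        /* Hand a reference to the scheduler; the fence is signalled later
         * from the sync_update notification path or by the timeout. */
        return dma_fence_get(job->done_fence);
}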

> How are dependencies handled? How can the kernel suspend an userspace 
> queue?

The Mali FW will send IDLE interrupts when a group has become idle. This
is already (mostly) handled in Panthor today.
There is of course a race to handle here, between the GPU going IDLE and
user space submitting new work.

I'm actually working on this part right now. As this patchset stands, it
doesn't check in the RTPM suspend callback whether user space has managed
to submit more work in the timeframe between "IDLE" and the RTPM suspend
callback. I just need to correctly "unbind" the group/queues, unmap the
IO pages used, and abort the RTPM suspend if I detect that user space
has managed to submit more work.
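
A sketch of that check, with invented structures and helpers
(user_group_sketch, group_still_idle()); only the -EBUSY return convention
is standard runtime PM behaviour:

#include <linux/list.h>
#include <linux/pm_runtime.h>

/* Invented for illustration; the real state lives in panthor_sched.c. */
struct user_group_sketch {
        struct list_head node;
        /* ... ringbuf insert/extract pointers, doorbell state, ... */
};

static bool group_still_idle(struct user_group_sketch *group);

static int rtpm_suspend_sketch(struct list_head *user_groups)
{
        struct user_group_sketch *group;

        list_for_each_entry(group, user_groups, node) {
                /* Re-check after tearing down the doorbell mapping: if
                 * user space slipped in more work between the IDLE
                 * interrupt and this callback, refuse to suspend. */
                if (!group_still_idle(group))
                        return -EBUSY;  /* aborts the runtime suspend */
        }

        /* ... unbind groups, unmap I/O pages, suspend the FW ... */
        return 0;
}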

> How does memory management work in this case?

Not sure exactly what you refer to here. There has basically been no 
change to how we handle memory.

If you think of how GPU jobs and VM_BIND interact, then the change here
is that it is now the XGS job and VM_BIND which interact. XGS takes on
the same duties as the GPU jobs with kernel submission (in this regard).

Actually, if you look at the submission flow for GPU jobs and XGS jobs,
you will find that they are virtually identical when it comes to setting
up fences and dependencies. The same goes for VM_BIND jobs.
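
For reference, a sketch of that common flow using the stock
drm_sched/drm_syncobj helpers; the job wrapper (from the earlier sketch)
and the handle parameters are illustrative, and the exact
drm_sched_job_init() signature depends on the kernel version:

#include <drm/drm_file.h>
#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

static int push_job_sketch(struct drm_file *file,
                           struct drm_sched_entity *entity,
                           struct xgs_job_sketch *job,
                           u32 in_handle, u64 in_point,
                           struct drm_syncobj *out_syncobj)
{
        struct dma_fence *in_fence;
        int ret;

        ret = drm_sched_job_init(&job->base, entity, 1, file);
        if (ret)
                return ret;

        /* Input dependency: a (timeline) syncobj point to wait for. */
        ret = drm_syncobj_find_fence(file, in_handle, in_point, 0, &in_fence);
        if (ret)
                goto err_cleanup;
        ret = drm_sched_job_add_dependency(&job->base, in_fence);
        if (ret)
                goto err_cleanup;

        /* Publish the finished fence as the output, then push. */
        drm_sched_job_arm(&job->base);
        drm_syncobj_replace_fence(out_syncobj, &job->base.s_fence->finished);
        drm_sched_entity_push_job(&job->base);
        return 0;

err_cleanup:
        drm_sched_job_cleanup(&job->base);
        return ret;
}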

--
Regards,
Ketil

> [...]
Ketil Johnsen Sept. 3, 2024, 12:27 p.m. UTC | #4
On 28.08.2024 19:25, Mihail Atanassov wrote:
> Hello all,
> 
> This series implements a mechanism to expose Mali CSF GPUs' queue
> ringbuffers directly to userspace, along with paraphernalia to allow
> userspace to control job synchronisation between the CPU and GPU.
> 
> The goal of these changes is to allow userspace to control work
> submission to the FW/HW directly without kernel intervention in the
> common case, thereby reducing context switching overhead. It also allows
> for greater flexibility in the way work is enqueued in the ringbufs.
> For example, the current kernel submit path only supports indirect
> calls, which is inefficient for small command buffers. Userspace can
> also skip unnecessary sync operations.
> 
> This is still a work-in-progress, there's an outstanding issue with
> multiple processes using different submission flows triggering
> scheduling bugs (e.g. the same group getting scheduled twice), but we'd
> love to gather some feedback on the suitability of the approach in
> general and see if there's a clear path to merging something like this
> eventually.
> 
> I've also CCd AMD maintainers because they have in the past done
> something similar[1], in case they want to chime in.
> 
> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
> (link TBD), and one in panvk [2].

Gallium/Panfrost changes to make use of this new user submission API can 
now be found here:
https://gitlab.freedesktop.org/ketil.johnsen/mesa/-/commits/panthor_usersubmit/?ref_type=heads

It is worth repeating that this is just a dumb switch from kernel
submission to user submission for the Panfrost Gallium driver, with no
optimizations attempted. We have no concrete plans to pursue upstreaming
of this for the time being; panvk is our focus in that regard.

--
Regards,
Ketil Johnsen

> [...]
Christian König Sept. 3, 2024, 1:46 p.m. UTC | #5
Hi Steven,

Am 29.08.24 um 15:37 schrieb Steven Price:
> Hi Christian,
>
> Mihail should be able to give more definitive answers, but I think I can
> answer your questions.
>
> On 29/08/2024 10:40, Christian König wrote:
>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>>> Hello all,
>>>
>>> This series implements a mechanism to expose Mali CSF GPUs' queue
>>> ringbuffers directly to userspace, along with paraphernalia to allow
>>> userspace to control job synchronisation between the CPU and GPU.
>>>
>>> The goal of these changes is to allow userspace to control work
>>> submission to the FW/HW directly without kernel intervention in the
>>> common case, thereby reducing context switching overhead. It also allows
>>> for greater flexibility in the way work is enqueued in the ringbufs.
>>> For example, the current kernel submit path only supports indirect
>>> calls, which is inefficient for small command buffers. Userspace can
>>> also skip unnecessary sync operations.
>> Question is how do you guarantee forward progress for fence signaling?
> A timeout. Although looking at it I think it's probably set too high
> currently:
>
>> +#define JOB_TIMEOUT_MS				5000
> But basically the XGS queue is a DRM scheduler just like a normal GPU
> queue and the jobs have a timeout. If the timeout is hit then any fences
> will be signalled (with an error).

Mhm, that is unfortunately exactly what I feared.

>> E.g. when are fences created and published? How do they signal?
>>
>> How are dependencies handled? How can the kernel suspend an userspace
>> queue?
> The actual userspace queue can be suspended. This is actually a
> combination of firmware and kernel driver, and this functionality is
> already present without the user submission. The firmware will multiplex
> multiple 'groups' onto the hardware, and if there are too many for the
> firmware then the kernel multiplexes the extra groups onto the ones the
> firmware supports.

How do you guarantee forward progress and that resuming of suspended
queues doesn't end up in a circular dependency?

> I haven't studied Mihail's series in detail yet, but if I understand
> correctly, the XGS queues are handled separately and are not suspended
> when the hardware queues are suspended. I guess this might be an area
> for improvement and might explain the currently very high timeout (to
> deal with the case where the actual GPU work has been suspended).
>
>> How does memory management work in this case?
> I'm not entirely sure what you mean here. If you are referring to the
> potential memory issues with signalling path then this should be handled
> by the timeout - although I haven't studied the code to check for bugs here.

You might have misunderstood my question (and I might misunderstand the
code), but at first glance it strongly sounds like the current approach
will be NAKed.

> The actual new XGS queues don't allocate/free memory during the queue
> execution - so it's just the memory usage related to fences (and the
> other work which could be blocked on the fence).

But the kernel and the hardware could suspend the queues, right?

> In terms of memory management for the GPU work itself, this is handled
> the same as before. The VM_BIND mechanism allows dependencies to be
> created between syncobjs and VM operations, with XGS these can then be
> tied to GPU HW events.

I don't know the details, but that again strongly sounds like that won't 
work.

What you need is to somehow guarantee that work doesn't run into memory 
management deadlocks which are resolved by timeouts.

Please read up here on why that stuff isn't allowed: 
https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
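
To make the concern concrete, a deliberately simplified illustration (not
code from this series) of the pattern that document forbids: once a fence
has been published, its signalling path must not wait on memory reclaim,
because reclaim may itself be waiting on that very fence:

#include <linux/dma-fence.h>
#include <linux/sizes.h>
#include <linux/slab.h>

/* BAD: the published fence only signals after a GFP_KERNEL allocation,
 * which may enter reclaim, and reclaim may be blocked on this fence;
 * the resulting deadlock is then "resolved" only by a timeout. */
static void bad_signalling_path(struct dma_fence *published_fence)
{
        void *tmp = kmalloc(SZ_4K, GFP_KERNEL);

        if (tmp) {
                /* ... bookkeeping that should have been done, or at least
                 * allocated, before the fence was ever published ... */
                kfree(tmp);
        }
        dma_fence_signal(published_fence);
}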

Regards,
Christian.

> [...]
Simona Vetter Sept. 3, 2024, 9:11 p.m. UTC | #6
On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> Hi Steven,
> 
> Am 29.08.24 um 15:37 schrieb Steven Price:
> > Hi Christian,
> > 
> > Mihail should be able to give more definitive answers, but I think I can
> > answer your questions.
> > 
> > On 29/08/2024 10:40, Christian König wrote:
> > > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> > > > Hello all,
> > > > 
> > > > This series implements a mechanism to expose Mali CSF GPUs' queue
> > > > ringbuffers directly to userspace, along with paraphernalia to allow
> > > > userspace to control job synchronisation between the CPU and GPU.
> > > > 
> > > > The goal of these changes is to allow userspace to control work
> > > > submission to the FW/HW directly without kernel intervention in the
> > > > common case, thereby reducing context switching overhead. It also allows
> > > > for greater flexibility in the way work is enqueued in the ringbufs.
> > > > For example, the current kernel submit path only supports indirect
> > > > calls, which is inefficient for small command buffers. Userspace can
> > > > also skip unnecessary sync operations.
> > > Question is how do you guarantee forward progress for fence signaling?
> > A timeout. Although looking at it I think it's probably set too high
> > currently:
> > 
> > > +#define JOB_TIMEOUT_MS				5000
> > But basically the XGS queue is a DRM scheduler just like a normal GPU
> > queue and the jobs have a timeout. If the timeout is hit then any fences
> > will be signalled (with an error).
> 
> Mhm, that is unfortunately exactly what I feared.
> 
> > > E.g. when are fences created and published? How do they signal?
> > > 
> > > How are dependencies handled? How can the kernel suspend an userspace
> > > queue?
> > The actual userspace queue can be suspended. This is actually a
> > combination of firmware and kernel driver, and this functionality is
> > already present without the user submission. The firmware will multiplex
> > multiple 'groups' onto the hardware, and if there are too many for the
> > firmware then the kernel multiplexes the extra groups onto the ones the
> > firmware supports.
> 
> How do you guarantee forward progress and that resuming of suspended queues
> doesn't end up in a circle dependency?
> 
> > I haven't studied Mihail's series in detail yet, but if I understand
> > correctly, the XGS queues are handled separately and are not suspended
> > when the hardware queues are suspended. I guess this might be an area
> > for improvement and might explain the currently very high timeout (to
> > deal with the case where the actual GPU work has been suspended).
> > 
> > > How does memory management work in this case?
> > I'm not entirely sure what you mean here. If you are referring to the
> > potential memory issues with signalling path then this should be handled
> > by the timeout - although I haven't studied the code to check for bugs here.
> 
> You might have misunderstood my question (and I might misunderstand the
> code), but on first glance it strongly sounds like the current approach will
> be NAKed.
> 
> > The actual new XGS queues don't allocate/free memory during the queue
> > execution - so it's just the memory usage related to fences (and the
> > other work which could be blocked on the fence).
> 
> But the kernel and the hardware could suspend the queues, right?
> 
> > In terms of memory management for the GPU work itself, this is handled
> > the same as before. The VM_BIND mechanism allows dependencies to be
> > created between syncobjs and VM operations, with XGS these can then be
> > tied to GPU HW events.
> 
> I don't know the details, but that again strongly sounds like that won't
> work.
> 
> What you need is to somehow guarantee that work doesn't run into memory
> management deadlocks which are resolved by timeouts.
> 
> Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences

panthor doesn't yet have a shrinker, so all memory is pinned, which means
memory management easy mode.

But also this means there might be an uapi design bug in here, and we
really don't want to commit to that. My stance is that panthor should gain
a proper shrinker first, which means there will be some changes here too.
And then we can properly evaluate this. As-is it's a bit too much on the
toy end of things.

That aside, I've thought this all through with tons of people, and I do
think it's all possible.
-Sima

> [...]
Christian König Sept. 4, 2024, 7:49 a.m. UTC | #7
Am 03.09.24 um 23:11 schrieb Simona Vetter:
> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
>> Hi Steven,
>>
>> Am 29.08.24 um 15:37 schrieb Steven Price:
>>> Hi Christian,
>>>
>>> Mihail should be able to give more definitive answers, but I think I can
>>> answer your questions.
>>>
>>> On 29/08/2024 10:40, Christian König wrote:
>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>>>>> Hello all,
>>>>>
>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
>>>>> userspace to control job synchronisation between the CPU and GPU.
>>>>>
>>>>> The goal of these changes is to allow userspace to control work
>>>>> submission to the FW/HW directly without kernel intervention in the
>>>>> common case, thereby reducing context switching overhead. It also allows
>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
>>>>> For example, the current kernel submit path only supports indirect
>>>>> calls, which is inefficient for small command buffers. Userspace can
>>>>> also skip unnecessary sync operations.
>>>> Question is how do you guarantee forward progress for fence signaling?
>>> A timeout. Although looking at it I think it's probably set too high
>>> currently:
>>>
>>>> +#define JOB_TIMEOUT_MS				5000
>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
>>> queue and the jobs have a timeout. If the timeout is hit then any fences
>>> will be signalled (with an error).
>> Mhm, that is unfortunately exactly what I feared.
>>
>>>> E.g. when are fences created and published? How do they signal?
>>>>
>>>> How are dependencies handled? How can the kernel suspend an userspace
>>>> queue?
>>> The actual userspace queue can be suspended. This is actually a
>>> combination of firmware and kernel driver, and this functionality is
>>> already present without the user submission. The firmware will multiplex
>>> multiple 'groups' onto the hardware, and if there are too many for the
>>> firmware then the kernel multiplexes the extra groups onto the ones the
>>> firmware supports.
>> How do you guarantee forward progress and that resuming of suspended queues
>> doesn't end up in a circle dependency?
>>
>>> I haven't studied Mihail's series in detail yet, but if I understand
>>> correctly, the XGS queues are handled separately and are not suspended
>>> when the hardware queues are suspended. I guess this might be an area
>>> for improvement and might explain the currently very high timeout (to
>>> deal with the case where the actual GPU work has been suspended).
>>>
>>>> How does memory management work in this case?
>>> I'm not entirely sure what you mean here. If you are referring to the
>>> potential memory issues with signalling path then this should be handled
>>> by the timeout - although I haven't studied the code to check for bugs here.
>> You might have misunderstood my question (and I might misunderstand the
>> code), but on first glance it strongly sounds like the current approach will
>> be NAKed.
>>
>>> The actual new XGS queues don't allocate/free memory during the queue
>>> execution - so it's just the memory usage related to fences (and the
>>> other work which could be blocked on the fence).
>> But the kernel and the hardware could suspend the queues, right?
>>
>>> In terms of memory management for the GPU work itself, this is handled
>>> the same as before. The VM_BIND mechanism allows dependencies to be
>>> created between syncobjs and VM operations, with XGS these can then be
>>> tied to GPU HW events.
>> I don't know the details, but that again strongly sounds like that won't
>> work.
>>
>> What you need is to somehow guarantee that work doesn't run into memory
>> management deadlocks which are resolved by timeouts.
>>
>> Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> memory management easy mode.

Ok, that at least makes things work for the moment.

> But also this means there might be an uapi design bug in here, and we
> really don't want to commit to that. My stance is that panthor should gain
> a proper shrinker first, which means there will be some changes here too.
> And then we can properly evaluate this. As-is it's a bit too much on the
> toy end of things.

I wouldn't say toy end; it looks rather fleshed out to me.

But I agree that the people who design the UAPI need to be aware of the
restrictions.

>
> That aside, I've thought this all through with tons of people, and I do
> think it's all possible.

It's certainly possible; we have user queue patches for amdgpu in the
pipeline as well.

It's just really really really hard to get right without creating some
circular dependencies and deadlocks in between.

If I got free beer every time somebody came up with a broken dma_fence
design, I would probably end up an alcoholic without spending a single
penny.

Christian.

> [...]
Steven Price Sept. 4, 2024, 9:31 a.m. UTC | #8
On 04/09/2024 08:49, Christian König wrote:
> Am 03.09.24 um 23:11 schrieb Simona Vetter:
>> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
>>> Hi Steven,
>>>
>>> Am 29.08.24 um 15:37 schrieb Steven Price:
>>>> Hi Christian,
>>>>
>>>> Mihail should be able to give more definitive answers, but I think I
>>>> can
>>>> answer your questions.
>>>>
>>>> On 29/08/2024 10:40, Christian König wrote:
>>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>>>>>> Hello all,
>>>>>>
>>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
>>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
>>>>>> userspace to control job synchronisation between the CPU and GPU.
>>>>>>
>>>>>> The goal of these changes is to allow userspace to control work
>>>>>> submission to the FW/HW directly without kernel intervention in the
>>>>>> common case, thereby reducing context switching overhead. It also
>>>>>> allows
>>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
>>>>>> For example, the current kernel submit path only supports indirect
>>>>>> calls, which is inefficient for small command buffers. Userspace can
>>>>>> also skip unnecessary sync operations.
>>>>> Question is how do you guarantee forward progress for fence signaling?
>>>> A timeout. Although looking at it I think it's probably set too high
>>>> currently:
>>>>
>>>>> +#define JOB_TIMEOUT_MS                5000
>>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
>>>> queue and the jobs have a timeout. If the timeout is hit then any
>>>> fences
>>>> will be signalled (with an error).
>>> Mhm, that is unfortunately exactly what I feared.
>>>
>>>>> E.g. when are fences created and published? How do they signal?
>>>>>
>>>>> How are dependencies handled? How can the kernel suspend an userspace
>>>>> queue?
>>>> The actual userspace queue can be suspended. This is actually a
>>>> combination of firmware and kernel driver, and this functionality is
>>>> already present without the user submission. The firmware will
>>>> multiplex
>>>> multiple 'groups' onto the hardware, and if there are too many for the
>>>> firmware then the kernel multiplexes the extra groups onto the ones the
>>>> firmware supports.
>>> How do you guarantee forward progress and that resuming of suspended
>>> queues
>>> doesn't end up in a circle dependency?

I'm not entirely sure what you mean by "guarantee" here - the kernel by
itself only guarantees forward progress by means of timeouts. User space
can 'easily' shoot itself in the foot by using an XGS queue to block
waiting on a GPU event which will never happen.

However, dependencies between applications (and/or other device drivers)
will only occur via dma fences, and an unsignalled fence will only be
returned when there is a path forward to signal it. So it shouldn't be
possible to create a dependency loop between contexts (or command stream
groups, to use the Mali jargon).

Because the groups can't have dependency cycles it should be possible to
suspend/resume them without deadlocks.

>>>> I haven't studied Mihail's series in detail yet, but if I understand
>>>> correctly, the XGS queues are handled separately and are not suspended
>>>> when the hardware queues are suspended. I guess this might be an area
>>>> for improvement and might explain the currently very high timeout (to
>>>> deal with the case where the actual GPU work has been suspended).
>>>>
>>>>> How does memory management work in this case?
>>>> I'm not entirely sure what you mean here. If you are referring to the
>>>> potential memory issues with signalling path then this should be
>>>> handled
>>>> by the timeout - although I haven't studied the code to check for
>>>> bugs here.
>>> You might have misunderstood my question (and I might misunderstand the
>>> code), but on first glance it strongly sounds like the current
>>> approach will
>>> be NAKed.
>>>
>>>> The actual new XGS queues don't allocate/free memory during the queue
>>>> execution - so it's just the memory usage related to fences (and the
>>>> other work which could be blocked on the fence).
>>> But the kernel and the hardware could suspend the queues, right?
>>>
>>>> In terms of memory management for the GPU work itself, this is handled
>>>> the same as before. The VM_BIND mechanism allows dependencies to be
>>>> created between syncobjs and VM operations, with XGS these can then be
>>>> tied to GPU HW events.
>>> I don't know the details, but that again strongly sounds like that won't
>>> work.
>>>
>>> What you need is to somehow guarantee that work doesn't run into memory
>>> management deadlocks which are resolved by timeouts.
>>>
>>> Please read up here on why that stuff isn't allowed:
>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>> memory management easy mode.
> 
> Ok, that at least makes things work for the moment.

Ah, perhaps this should have been spelt out more clearly ;)

The VM_BIND mechanism that's already in place jumps through some hoops
to ensure that memory is preallocated when the memory operations are
enqueued. So any memory required should have been allocated before any
sync object is returned. We're aware of the issue with memory
allocations on the signalling path and are trying to ensure that we don't
have that problem.

I'm hoping that with our design we don't need a shrinker which deals with
(active) GPU memory. Memory which user space thinks the GPU might
need should be pinned before the GPU work is submitted. APIs which
require any form of 'paging in' of data would need to be implemented by
the GPU work completing and being resubmitted by user space after the
memory changes (i.e. there could be a DMA fence pending on the GPU work).

>> But also this means there might be an uapi design bug in here, and we
>> really don't want to commit to that. My stance is that panthor should
>> gain
>> a proper shrinker first, which means there will be some changes here too.
>> And then we can properly evaluate this. As-is it's a bit too much on the
>> toy end of things.
> 
> I wouldn't say toy end, it looks rather fleshed out to me.
> 
> But I agree that the people who design the UAPI needs to be aware of the
> restrictions.

Yes, I'm aware this could restrict the design in the future. This is of
course why it's an RFC - we welcome discussion on what problems we
could be setting ourselves up for!

>>
>> That aside, I've thought this all through with tons of people, and I do
>> think it's all possible.
> 
> It's certainly possible, we have user queue patches for amdgpu in the
> pipeline as well.
> 
> It's just really really really hard to get right without creating some
> circle dependencies and deadlocks in between.
> 
> If I would get free beer every time somebody came up with a broken
> dma_fence design I would probably end up as alcoholic without spending a
> single penny.

I'll try hard to restrict your beer intake ;)

Thanks,
Steve

> Christian.
> 
>> -Sima
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Fundamentally (modulo bugs) there is little change compared to kernel
>>>> submission - it's already fairly trivial to write GPU job which will
>>>> make no forward progress (a 'while (1)' equivalent job). The only
>>>> difference here is that XGS makes this 'easy' and doesn't involve the
>>>> GPU spinning. Either way we rely on a timeout to recover from these
>>>> situations.
>>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> This is still a work-in-progress, there's an outstanding issue with
>>>>>> multiple processes using different submission flows triggering
>>>>>> scheduling bugs (e.g. the same group getting scheduled twice), but
>>>>>> we'd
>>>>>> love to gather some feedback on the suitability of the approach in
>>>>>> general and see if there's a clear path to merging something like
>>>>>> this
>>>>>> eventually.
>>>>>>
>>>>>> I've also CCd AMD maintainers because they have in the past done
>>>>>> something similar[1], in case they want to chime in.
>>>>>>
>>>>>> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
>>>>>> (link TBD), and one in panvk [2].
>>>>>>
>>>>>> The Gallium implementation is a naïve change just to switch the
>>>>>> submission model and exercise the new kernel code, and we don't plan
>>>>>> on pursuing this at this time.
>>>>>>
>>>>>> The panvk driver changes are, however, a better representation of the
>>>>>> intent behind this new uAPI, so please consider that as the reference
>>>>>> userspace. It is still very much also a work in progress.
>>>>>>
>>>>>>     * patch 1 adds all the uAPI changes;
>>>>>>     * patch 2 implements the GROUP_CREATE ioctl changes necessary
>>>>>> to expose
>>>>>>       the required objects to userspace;
>>>>>>     * patch 3 maps the doorbell pages, similarly to how the user I/O
>>>>>> page is
>>>>>>       mapped;
>>>>>>     * patch 4 implements GROUP_KICK, which lets userspace request an
>>>>>>       inactive group to be scheduled on the GPU;
>>>>>>     * patches 5 & 6 implement XGS queues, a way for userspace to
>>>>>>       synchronise GPU queue progress with DRM syncobjs;
>>>>>>     * patches 7 & 8 add notification mechanisms for user & kernel
>>>>>> to signal
>>>>>>       changes to native GPU syncobjs.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
>>>>>> [2]
>>>>>> https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
>>>>>>
>>>>>> Ketil Johnsen (7):
>>>>>>      drm/panthor: Add uAPI to submit from user space
>>>>>>      drm/panthor: Extend GROUP_CREATE for user submission
>>>>>>      drm/panthor: Map doorbell pages
>>>>>>      drm/panthor: Add GROUP_KICK ioctl
>>>>>>      drm/panthor: Factor out syncobj handling
>>>>>>      drm/panthor: Implement XGS queues
>>>>>>      drm/panthor: Add SYNC_UPDATE ioctl
>>>>>>
>>>>>> Mihail Atanassov (1):
>>>>>>      drm/panthor: Add sync_update eventfd handling
>>>>>>
>>>>>>     drivers/gpu/drm/panthor/Makefile          |   4 +-
>>>>>>     drivers/gpu/drm/panthor/panthor_device.c  |  66 ++-
>>>>>>     drivers/gpu/drm/panthor/panthor_device.h  |  35 +-
>>>>>>     drivers/gpu/drm/panthor/panthor_drv.c     | 233 +++++++-
>>>>>>     drivers/gpu/drm/panthor/panthor_fw.c      |   2 +-
>>>>>>     drivers/gpu/drm/panthor/panthor_sched.c   | 408 +++++++++-----
>>>>>>     drivers/gpu/drm/panthor/panthor_sched.h   |   8 +-
>>>>>>     drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
>>>>>>     drivers/gpu/drm/panthor/panthor_syncobj.h |  27 +
>>>>>>     drivers/gpu/drm/panthor/panthor_xgs.c     | 638
>>>>>> ++++++++++++++++++++++
>>>>>>     drivers/gpu/drm/panthor/panthor_xgs.h     |  42 ++
>>>>>>     include/uapi/drm/panthor_drm.h            | 243 +++++++-
>>>>>>     12 files changed, 1696 insertions(+), 177 deletions(-)
>>>>>>     create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
>>>>>>     create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
>>>>>>     create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
>>>>>>     create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
>>>>>>
>
Simona Vetter Sept. 4, 2024, 9:43 a.m. UTC | #9
On Wed, Sep 04, 2024 at 09:49:44AM +0200, Christian König wrote:
> Am 03.09.24 um 23:11 schrieb Simona Vetter:
> > On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> > > Hi Steven,
> > > 
> > > Am 29.08.24 um 15:37 schrieb Steven Price:
> > > > Hi Christian,
> > > > 
> > > > Mihail should be able to give more definitive answers, but I think I can
> > > > answer your questions.
> > > > 
> > > > On 29/08/2024 10:40, Christian König wrote:
> > > > > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> > > > > > Hello all,
> > > > > > 
> > > > > > This series implements a mechanism to expose Mali CSF GPUs' queue
> > > > > > ringbuffers directly to userspace, along with paraphernalia to allow
> > > > > > userspace to control job synchronisation between the CPU and GPU.
> > > > > > 
> > > > > > The goal of these changes is to allow userspace to control work
> > > > > > submission to the FW/HW directly without kernel intervention in the
> > > > > > common case, thereby reducing context switching overhead. It also allows
> > > > > > for greater flexibility in the way work is enqueued in the ringbufs.
> > > > > > For example, the current kernel submit path only supports indirect
> > > > > > calls, which is inefficient for small command buffers. Userspace can
> > > > > > also skip unnecessary sync operations.
> > > > > Question is how do you guarantee forward progress for fence signaling?
> > > > A timeout. Although looking at it I think it's probably set too high
> > > > currently:
> > > > 
> > > > > +#define JOB_TIMEOUT_MS				5000
> > > > But basically the XGS queue is a DRM scheduler just like a normal GPU
> > > > queue and the jobs have a timeout. If the timeout is hit then any fences
> > > > will be signalled (with an error).
> > > Mhm, that is unfortunately exactly what I feared.
> > > 
> > > > > E.g. when are fences created and published? How do they signal?
> > > > > 
> > > > > How are dependencies handled? How can the kernel suspend an userspace
> > > > > queue?
> > > > The actual userspace queue can be suspended. This is actually a
> > > > combination of firmware and kernel driver, and this functionality is
> > > > already present without the user submission. The firmware will multiplex
> > > > multiple 'groups' onto the hardware, and if there are too many for the
> > > > firmware then the kernel multiplexes the extra groups onto the ones the
> > > > firmware supports.
> > > How do you guarantee forward progress and that resuming of suspended queues
> > > doesn't end up in a circle dependency?
> > > 
> > > > I haven't studied Mihail's series in detail yet, but if I understand
> > > > correctly, the XGS queues are handled separately and are not suspended
> > > > when the hardware queues are suspended. I guess this might be an area
> > > > for improvement and might explain the currently very high timeout (to
> > > > deal with the case where the actual GPU work has been suspended).
> > > > 
> > > > > How does memory management work in this case?
> > > > I'm not entirely sure what you mean here. If you are referring to the
> > > > potential memory issues with signalling path then this should be handled
> > > > by the timeout - although I haven't studied the code to check for bugs here.
> > > You might have misunderstood my question (and I might misunderstand the
> > > code), but on first glance it strongly sounds like the current approach will
> > > be NAKed.
> > > 
> > > > The actual new XGS queues don't allocate/free memory during the queue
> > > > execution - so it's just the memory usage related to fences (and the
> > > > other work which could be blocked on the fence).
> > > But the kernel and the hardware could suspend the queues, right?
> > > 
> > > > In terms of memory management for the GPU work itself, this is handled
> > > > the same as before. The VM_BIND mechanism allows dependencies to be
> > > > created between syncobjs and VM operations, with XGS these can then be
> > > > tied to GPU HW events.
> > > I don't know the details, but that again strongly sounds like that won't
> > > work.
> > > 
> > > What you need is to somehow guarantee that work doesn't run into memory
> > > management deadlocks which are resolved by timeouts.
> > > 
> > > Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > memory management easy mode.
> 
> Ok, that at least makes things work for the moment.
> 
> > But also this means there might be an uapi design bug in here, and we
> > really don't want to commit to that. My stance is that panthor should gain
> > a proper shrinker first, which means there will be some changes here too.
> > And then we can properly evaluate this. As-is it's a bit too much on the
> > toy end of things.
> 
> I wouldn't say toy end, it looks rather fleshed out to me.
> 
> But I agree that the people who design the UAPI needs to be aware of the
> restrictions.

The thing is, if you have pinned memory there are no restrictions on
userspace sync: you can do whatever you want. And you can still bake
that into a dma_fence and it will not deadlock (as long as there's a
timeout and a forward-progress guarantee), because it can never feed
back into a memory management fence - there are none.

If we merge that uapi we're forever condemned to pinning all memory, which
I don't think is good because sooner or later every serious gpu driver
seems to gain some shrinker support.

> > That aside, I've thought this all through with tons of people, and I do
> > think it's all possible.
> 
> It's certainly possible, we have user queue patches for amdgpu in the
> pipeline as well.
> 
> It's just really really really hard to get right without creating some
> circle dependencies and deadlocks in between.
> 
> If I would get free beer every time somebody came up with a broken dma_fence
> design I would probably end up as alcoholic without spending a single penny.

Yeah, that's exactly my fear. It's possible, but even if you understand
the abstract rules and implement the kernel correctly it's really hard to
get the userspace right: there are way too many broken ways, and barely
one correct way to implement vulkan winsys dma_fence with userspace
submit.

This is why I called userspace direct submit without a shrinker a toy,
because it makes the userspace trivially easy, and trivially easy to get
wrong. Even if the kernel side looks all great and polished and solid.

If we do already have a shrinker that exposes these issues, then userspace
getting stuff wrong is a userspace bug. If we start out without a
shrinker, and userspace gets it wrong, then ever adding a shrinker is a
kernel regression.

And "you cannot ever add dynamic memory management to this driver" is a
uapi regression situation I don't think we want to get into, because
there's no way out. i915-gem folks managed to pull that trick off too, and
Dave&me where left with no other choice than to flat out revert that uapi
and break userspace.

This is not something I want to make a repeat experience.

So what I think should be done is pretty much a carbon copy of the xe plan:

- Prerequisite: Have a shrinker or other dynamic memory management, or
  you just won't hit all the interesting bugs.

- Implement support for long running contexts that do not have dma_fence
  end-of-batch semantics, with the preempt context fence trick. Since this
  would be the 3rd driver after amdkfd and xe, I think it's time for a
  little helper, which would be very little code and lots of documentation
  explaining the concept.

  But if that helper ties in with drm_exec and drm_gpuvm then I think it
  would still be really good as a tiny little library itself and not just
  as a good place to put documentation. Extracting that from xe_lr.c
  should be fairly simple. amdkfd could also be used as inspiration at
  least.

- With this you can start using userspace memory fences (see the rough
  sketch after this plan), as long as you never need to bake them into a
  dma_fence or syncobj that actually waits (instead of waiting until
  rendering finishes and then handing over a signalled dma_fence or
  something like that).

- Implement userspace submit with this infrastructure.

- Implement support for baking dma_fences with all this, relying on the
  vulkan winsys fence guarantee that all the indefinite fences are
  resolved and queued up, or the application will get a
  VK_ERROR_DEVICE_LOST thrown its way.

  This is really hard, and I don't think anyone's made it happen yet in a
  real prototype with a full-fledged userspace memory fence vulkan driver.

- Bonus for adding memory fences to drm_syncobj as future fences, for
  sharing. Could be done anytime you have long running context support.

Alternative plan:
- Implement userspace submit, but _every_ job still goes through a kernel
  drm_sched submission and the context doesn't run without a job. That way
  you can drive userspace-submit-only hardware, but with the old
  end-of-batch dma_fence kernel submission model, and also the userspace
  submit thread for future fences that haven't materialized yet in a
  drm_syncobj. This might be a good interim stop-gap for gl/vulkan winsys
  drivers, but people seem a lot more interested in going full userspace
  submit with userspace memory fences for compute use cases. But we could
  do both in parallel.

  Unless you absolutely know what you're doing you cannot use userspace
  memory fences with this. I think the only case I've seen is where you
  submit jobs on multiple engines as an atomic unit (like intel media
  workloads split across engines), or one after the other where sync is
  guaranteed to go only one way (like the run-ahead cs job radeonsi iirc
  has).
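
To make the 'userspace memory fences' mentioned in the plan above
concrete, here's a minimal userspace-only sketch - a monotonically
increasing 64-bit seqno in memory shared with the GPU or another process.
Purely illustrative, not part of any proposed uapi:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct umem_fence {
        _Atomic uint64_t seqno;         /* advanced by the signaller */
};

/* Signal the fence by publishing a new seqno. */
static inline void umem_fence_signal(struct umem_fence *f, uint64_t seqno)
{
        atomic_store_explicit(&f->seqno, seqno, memory_order_release);
}

/* A wait point is "signalled" once the published seqno has reached it. */
static inline bool umem_fence_signaled(struct umem_fence *f, uint64_t wait)
{
        return atomic_load_explicit(&f->seqno, memory_order_acquire) >= wait;
}

/* Busy-wait variant; a real implementation would futex-wait or poll. */
static inline void umem_fence_wait(struct umem_fence *f, uint64_t wait)
{
        while (!umem_fence_signaled(f, wait))
                ;
}

The key property is that nothing in kernel memory management ever waits
on such a fence, which is what makes it safe for long running contexts -
and also why it must never leak into a dma_fence that actually gets
waited on.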

tldr; expect pain, I'm sorry.

Cheers, Sima
Boris Brezillon Sept. 4, 2024, 11:23 a.m. UTC | #10
On Wed, 4 Sep 2024 10:31:36 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/09/2024 08:49, Christian König wrote:
> > Am 03.09.24 um 23:11 schrieb Simona Vetter:  
> >> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:  
> >>> Hi Steven,
> >>>
> >>> Am 29.08.24 um 15:37 schrieb Steven Price:  
> >>>> Hi Christian,
> >>>>
> >>>> Mihail should be able to give more definitive answers, but I think I
> >>>> can
> >>>> answer your questions.
> >>>>
> >>>> On 29/08/2024 10:40, Christian König wrote:  
> >>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:  
> >>>>>> Hello all,
> >>>>>>
> >>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
> >>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
> >>>>>> userspace to control job synchronisation between the CPU and GPU.
> >>>>>>
> >>>>>> The goal of these changes is to allow userspace to control work
> >>>>>> submission to the FW/HW directly without kernel intervention in the
> >>>>>> common case, thereby reducing context switching overhead. It also
> >>>>>> allows
> >>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
> >>>>>> For example, the current kernel submit path only supports indirect
> >>>>>> calls, which is inefficient for small command buffers. Userspace can
> >>>>>> also skip unnecessary sync operations.  
> >>>>> Question is how do you guarantee forward progress for fence signaling?  
> >>>> A timeout. Although looking at it I think it's probably set too high
> >>>> currently:
> >>>>  
> >>>>> +#define JOB_TIMEOUT_MS                5000  
> >>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
> >>>> queue and the jobs have a timeout. If the timeout is hit then any
> >>>> fences
> >>>> will be signalled (with an error).  
> >>> Mhm, that is unfortunately exactly what I feared.
> >>>  
> >>>>> E.g. when are fences created and published? How do they signal?
> >>>>>
> >>>>> How are dependencies handled? How can the kernel suspend an userspace
> >>>>> queue?  
> >>>> The actual userspace queue can be suspended. This is actually a
> >>>> combination of firmware and kernel driver, and this functionality is
> >>>> already present without the user submission. The firmware will
> >>>> multiplex
> >>>> multiple 'groups' onto the hardware, and if there are too many for the
> >>>> firmware then the kernel multiplexes the extra groups onto the ones the
> >>>> firmware supports.  
> >>> How do you guarantee forward progress and that resuming of suspended
> >>> queues
> >>> doesn't end up in a circle dependency?  
> 
> I'm not entirely sure what you mean by "guarantee" here - the kernel by
> itself only guarantees forward progress by the means of timeouts. User
> space can 'easily' shoot itself in the foot by using a XGS queue to
> block waiting on a GPU event which will never happen.
> 
> However dependencies between applications (and/or other device drivers)
> will only occur via dma fences and an unsignalled fence will only be
> returned when there is a path forward to signal it. So it shouldn't be
> possible to create a dependency loop between contexts (or command stream
> groups to use the Mali jargon).
> 
> Because the groups can't have dependency cycles it should be possible to
> suspend/resume them without deadlocks.
> 
> >>>> I haven't studied Mihail's series in detail yet, but if I understand
> >>>> correctly, the XGS queues are handled separately and are not suspended
> >>>> when the hardware queues are suspended. I guess this might be an area
> >>>> for improvement and might explain the currently very high timeout (to
> >>>> deal with the case where the actual GPU work has been suspended).
> >>>>  
> >>>>> How does memory management work in this case?  
> >>>> I'm not entirely sure what you mean here. If you are referring to the
> >>>> potential memory issues with signalling path then this should be
> >>>> handled
> >>>> by the timeout - although I haven't studied the code to check for
> >>>> bugs here.  
> >>> You might have misunderstood my question (and I might misunderstand the
> >>> code), but on first glance it strongly sounds like the current
> >>> approach will
> >>> be NAKed.
> >>>  
> >>>> The actual new XGS queues don't allocate/free memory during the queue
> >>>> execution - so it's just the memory usage related to fences (and the
> >>>> other work which could be blocked on the fence).  
> >>> But the kernel and the hardware could suspend the queues, right?
> >>>  
> >>>> In terms of memory management for the GPU work itself, this is handled
> >>>> the same as before. The VM_BIND mechanism allows dependencies to be
> >>>> created between syncobjs and VM operations, with XGS these can then be
> >>>> tied to GPU HW events.  
> >>> I don't know the details, but that again strongly sounds like that won't
> >>> work.
> >>>
> >>> What you need is to somehow guarantee that work doesn't run into memory
> >>> management deadlocks which are resolved by timeouts.
> >>>
> >>> Please read up here on why that stuff isn't allowed:
> >>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences  
> >> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >> memory management easy mode.  
> > 
> > Ok, that at least makes things work for the moment.  
> 
> Ah, perhaps this should have been spelt out more clearly ;)
> 
> The VM_BIND mechanism that's already in place jumps through some hoops
> to ensure that memory is preallocated when the memory operations are
> enqueued. So any memory required should have been allocated before any
> sync object is returned. We're aware of the issue with memory
> allocations on the signalling path and trying to ensure that we don't
> have that.
> 
> I'm hoping that we don't need a shrinker which deals with (active) GPU
> memory with our design.

That's actually what we were planning to do: the panthor shrinker was
about to rely on fences attached to GEM objects to know if it can
reclaim the memory. This design relies on each job attaching its fence
to the GEM mapped to the VM at the time the job is submitted, such that
memory that's in-use or about-to-be-used doesn't vanish before the GPU
is done.

> Memory which user space thinks the GPU might
> need should be pinned before the GPU work is submitted. APIs which
> require any form of 'paging in' of data would need to be implemented by
> the GPU work completing and being resubmitted by user space after the
> memory changes (i.e. there could be a DMA fence pending on the GPU work).

Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
that means we can't really transparently swap out GPU memory, or we
have to constantly pin/unpin around each job, which means even more
ioctl()s than we have now. Another option would be to add the XGS fence
to the BOs attached to the VM, assuming it's created before the job
submission itself, but you're no longer reducing the number of user <->
kernel round trips if you do that, because you now have to create an
XGS job for each submission, so you basically get back to one ioctl()
per submission.
Christian König Sept. 4, 2024, 11:34 a.m. UTC | #11
Hi Boris,

Am 04.09.24 um 13:23 schrieb Boris Brezillon:
>>>>> Please read up here on why that stuff isn't allowed:
>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences   
>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>> memory management easy mode.
>>> Ok, that at least makes things work for the moment.
>> Ah, perhaps this should have been spelt out more clearly ;)
>>
>> The VM_BIND mechanism that's already in place jumps through some hoops
>> to ensure that memory is preallocated when the memory operations are
>> enqueued. So any memory required should have been allocated before any
>> sync object is returned. We're aware of the issue with memory
>> allocations on the signalling path and trying to ensure that we don't
>> have that.
>>
>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>> memory with our design.
> That's actually what we were planning to do: the panthor shrinker was
> about to rely on fences attached to GEM objects to know if it can
> reclaim the memory. This design relies on each job attaching its fence
> to the GEM mapped to the VM at the time the job is submitted, such that
> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> is done.

Yeah and exactly that doesn't work any more when you are using user 
queues, because the kernel has no opportunity to attach a fence for each 
submission.

>> Memory which user space thinks the GPU might
>> need should be pinned before the GPU work is submitted. APIs which
>> require any form of 'paging in' of data would need to be implemented by
>> the GPU work completing and being resubmitted by user space after the
>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> that means we can't really transparently swap out GPU memory, or we
> have to constantly pin/unpin around each job, which means even more
> ioctl()s than we have now. Another option would be to add the XGS fence
> to the BOs attached to the VM, assuming it's created before the job
> submission itself, but you're no longer reducing the number of user <->
> kernel round trips if you do that, because you now have to create an
> XSG job for each submission, so you basically get back to one ioctl()
> per submission.

For AMDGPU we are currently working on the following solution for 
memory management with user queues:

1. User queues are created through a kernel IOCTL; submissions work by 
writing into a ring buffer and ringing a doorbell.

2. Each queue can request the kernel to create fences for the currently 
pushed work on that queue, which can then be attached to BOs, syncobjs, 
syncfiles etc...

3. In addition to that we have an eviction/preemption fence attached to 
all BOs, page tables and whatever other resources we need.

4. When these eviction fences are requested to signal, they first wait 
for all submission fences and then suspend the user queues and block 
creation of new submission fences until the queues are restarted again.

This way you can still do your memory management inside the kernel (e.g. 
move BOs from local to system memory) or even completely suspend and 
resume applications without their interaction, but as Sima said it is 
just horribly complicated to get right.
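
As a very rough sketch of steps 3-4 (the real amdgpu code is far more
involved, and every name below is invented purely for illustration):

#include <linux/dma-fence.h>

struct user_queue {
        struct dma_fence *last_submission_fence;
        bool fences_blocked;    /* no new submission fences while suspended */
};

/* Provided elsewhere by the driver in this sketch. */
void user_queue_suspend_hw(struct user_queue *q);

/* Called when the eviction/preemption fence is asked to signal. */
static void evict_user_queue(struct user_queue *q)
{
        /* Wait for work that was already published as a dma_fence... */
        dma_fence_wait(q->last_submission_fence, false);

        /* ...then suspend the queue so nothing unpublished keeps running... */
        user_queue_suspend_hw(q);

        /* ...and refuse to mint new submission fences until restart. */
        q->fences_blocked = true;
}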

We have been working on this for like two years now, and it could still 
be that we've missed something since it is not in production testing yet.

Regards,
Christian.
Steven Price Sept. 4, 2024, 12:46 p.m. UTC | #12
On 04/09/2024 12:34, Christian König wrote:
> Hi Boris,
> 
> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences  
>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>> memory management easy mode.  
>>>> Ok, that at least makes things work for the moment.  
>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>
>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>> to ensure that memory is preallocated when the memory operations are
>>> enqueued. So any memory required should have been allocated before any
>>> sync object is returned. We're aware of the issue with memory
>>> allocations on the signalling path and trying to ensure that we don't
>>> have that.
>>>
>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>> memory with our design.
>> That's actually what we were planning to do: the panthor shrinker was
>> about to rely on fences attached to GEM objects to know if it can
>> reclaim the memory. This design relies on each job attaching its fence
>> to the GEM mapped to the VM at the time the job is submitted, such that
>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>> is done.

How far along is this shrinker? It would be good to have an RFC so that
we can see how user submission could fit in with it.

> Yeah and exactly that doesn't work any more when you are using user
> queues, because the kernel has no opportunity to attach a fence for each
> submission.

User submission requires a cooperating user space[1]. So obviously user
space would need to ensure that any BOs it expects to be accessed are
pinned in some way. Since the expectation of user space submission is
that we're reducing kernel involvement, I'd also expect these to be
fairly long-term pins.

[1] Obviously with a timer to kill things from a malicious user space.

The (closed) 'kbase' driver has a shrinker, but it is only used on a
subset of memory and it's up to user space to ensure that it keeps the
relevant parts pinned (or, more specifically, that it doesn't mark them
to be discarded if there's memory pressure). Not that I think we should
be taking its model as a reference here.

>>> Memory which user space thinks the GPU might
>>> need should be pinned before the GPU work is submitted. APIs which
>>> require any form of 'paging in' of data would need to be implemented by
>>> the GPU work completing and being resubmitted by user space after the
>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>> that means we can't really transparently swap out GPU memory, or we
>> have to constantly pin/unpin around each job, which means even more
>> ioctl()s than we have now. Another option would be to add the XGS fence
>> to the BOs attached to the VM, assuming it's created before the job
>> submission itself, but you're no longer reducing the number of user <->
>> kernel round trips if you do that, because you now have to create an
>> XSG job for each submission, so you basically get back to one ioctl()
>> per submission.

As you say the granularity of pinning has to be fairly coarse for user
space submission to make sense. My assumption (could be wildly wrong)
was that most memory would be pinned whenever a context is rendering.

> For AMDGPU we are currently working on the following solution with
> memory management and user queues:
> 
> 1. User queues are created through an kernel IOCTL, submissions work by
> writing into a ring buffer and ringing a doorbell.
> 
> 2. Each queue can request the kernel to create fences for the currently
> pushed work for a queues which can then be attached to BOs, syncobjs,
> syncfiles etc...
> 
> 3. Additional to that we have and eviction/preemption fence attached to
> all BOs, page tables, whatever resources we need.
> 
> 4. When this eviction fences are requested to signal they first wait for
> all submission fences and then suspend the user queues and block
> creating new submission fences until the queues are restarted again.
> 
> This way you can still do your memory management inside the kernel (e.g.
> move BOs from local to system memory) or even completely suspend and
> resume applications without their interaction, but as Sima said it is
> just horrible complicated to get right.
> 
> We have been working on this for like two years now and it still could
> be that we missed something since it is not in production testing yet.

I'm not entirely sure I follow how this doesn't create a dependency
cycle. From your description it sounds like you create a fence from the
user space queue which is then used to prevent eviction of the BOs needed.

So to me it sounds like:

1. Attach fence to BOs to prevent eviction.

2. User space submits work to the ring buffer, rings doorbell.

3. Call into the kernel to create the fence for step 1.

Which is obviously broken. What am I missing?

One other thing to note is that Mali doesn't have local memory - so the
only benefit of unpinning is if we want to swap to disk (or zram etc).

Steve
Boris Brezillon Sept. 4, 2024, 1:20 p.m. UTC | #13
+ Adrian, who has been looking at the shrinker stuff for Panthor

On Wed, 4 Sep 2024 13:46:12 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/09/2024 12:34, Christian König wrote:
> > Hi Boris,
> > 
> > Am 04.09.24 um 13:23 schrieb Boris Brezillon:  
> >>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences    
> >>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >>>>> memory management easy mode.    
> >>>> Ok, that at least makes things work for the moment.    
> >>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>
> >>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>> to ensure that memory is preallocated when the memory operations are
> >>> enqueued. So any memory required should have been allocated before any
> >>> sync object is returned. We're aware of the issue with memory
> >>> allocations on the signalling path and trying to ensure that we don't
> >>> have that.
> >>>
> >>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>> memory with our design.  
> >> That's actually what we were planning to do: the panthor shrinker was
> >> about to rely on fences attached to GEM objects to know if it can
> >> reclaim the memory. This design relies on each job attaching its fence
> >> to the GEM mapped to the VM at the time the job is submitted, such that
> >> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >> is done.  
> 
> How progressed is this shrinker?

We don't have code yet. All we know is that we want to re-use Dmitry's
generic GEM-SHMEM shrinker implementation [1], and adjust it to match
the VM model, which means not tracking things at the BO granularity,
but at the VM granularity. Actually it has to be a hybrid model, where
shared BOs (those imported/exported) are tracked individually, while
all private BOs are checked simultaneously (since they all share the VM
resv object).

> It would be good to have an RFC so that
> we can look to see how user submission could fit in with it.

Unfortunately, we don't have that yet :-(. All we have is a rough idea
of how things will work, which is basically how TTM reclaim works, but
adapted to GEM.

> 
> > Yeah and exactly that doesn't work any more when you are using user
> > queues, because the kernel has no opportunity to attach a fence for each
> > submission.  
> 
> User submission requires a cooperating user space[1]. So obviously user
> space would need to ensure any BOs that it expects will be accessed to
> be in some way pinned. Since the expectation of user space submission is
> that we're reducing kernel involvement, I'd also expect these to be
> fairly long-term pins.
> 
> [1] Obviously with a timer to kill things from a malicious user space.
> 
> The (closed) 'kbase' driver has a shrinker but is only used on a subset
> of memory and it's up to user space to ensure that it keeps the relevant
> parts pinned (or more specifically not marking them to be discarded if
> there's memory pressure). Not that I think we should be taking it's
> model as a reference here.
> 
> >>> Memory which user space thinks the GPU might
> >>> need should be pinned before the GPU work is submitted. APIs which
> >>> require any form of 'paging in' of data would need to be implemented by
> >>> the GPU work completing and being resubmitted by user space after the
> >>> memory changes (i.e. there could be a DMA fence pending on the GPU work).  
> >> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >> that means we can't really transparently swap out GPU memory, or we
> >> have to constantly pin/unpin around each job, which means even more
> >> ioctl()s than we have now. Another option would be to add the XGS fence
> >> to the BOs attached to the VM, assuming it's created before the job
> >> submission itself, but you're no longer reducing the number of user <->
> >> kernel round trips if you do that, because you now have to create an
> >> XSG job for each submission, so you basically get back to one ioctl()
> >> per submission.  
> 
> As you say the granularity of pinning has to be fairly coarse for user
> space submission to make sense. My assumption (could be wildly wrong)
> was that most memory would be pinned whenever a context is rendering.

The granularity of pinning (in terms of which regions are pinned) is not
really the problem: we can just assume anything that's mapped to the VM
will be used by the GPU (which is what we're planning to do for kernel
submission BTW). The problem is making the timeslice during
which VM memory is considered unreclaimable as short as possible, such
that the system can reclaim memory under mem pressure. Ideally, you want
to pin memory as long as you have jobs queued/running, and allow for
reclaim when the GPU context is idle.

We might be able to involve the panthor_scheduler for usermode queues,
such that a context that's eligible for scheduling first gets its VM
mappings pinned (fence creation + assignment to the VM/BO resvs), and
things get reclaimable again when the group is evicted from the CSG
slot. That implies evicting idle groups more aggressively than we do
now, but there's probably a way around it.
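
The pinning half of that could look roughly like the sketch below - the
dma_resv_*() helpers are real, but the flow around them is just an
assumption of how it might be wired up: attach a residency fence to the
VM resv when the group gets a CSG slot, and signal that fence when the
group is evicted so the shrinker can consider the VM again.

#include <linux/dma-resv.h>

/* Sketch only: mark a VM busy for as long as its group holds a slot. */
static int vm_pin_for_scheduling(struct dma_resv *vm_resv,
                                 struct dma_fence *resident_fence)
{
        int ret;

        ret = dma_resv_lock(vm_resv, NULL);
        if (ret)
                return ret;

        ret = dma_resv_reserve_fences(vm_resv, 1);
        if (!ret)
                dma_resv_add_fence(vm_resv, resident_fence,
                                   DMA_RESV_USAGE_BOOKKEEP);

        dma_resv_unlock(vm_resv);
        return ret;
}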

> 
> > For AMDGPU we are currently working on the following solution with
> > memory management and user queues:
> > 
> > 1. User queues are created through an kernel IOCTL, submissions work by
> > writing into a ring buffer and ringing a doorbell.
> > 
> > 2. Each queue can request the kernel to create fences for the currently
> > pushed work for a queues which can then be attached to BOs, syncobjs,
> > syncfiles etc...
> > 
> > 3. Additional to that we have and eviction/preemption fence attached to
> > all BOs, page tables, whatever resources we need.
> > 
> > 4. When this eviction fences are requested to signal they first wait for
> > all submission fences and then suspend the user queues and block
> > creating new submission fences until the queues are restarted again.
> > 
> > This way you can still do your memory management inside the kernel (e.g.
> > move BOs from local to system memory) or even completely suspend and
> > resume applications without their interaction, but as Sima said it is
> > just horrible complicated to get right.
> > 
> > We have been working on this for like two years now and it still could
> > be that we missed something since it is not in production testing yet.  
> 
> I'm not entirely sure I follow how this doesn't create a dependency
> cycle. From your description it sounds like you create a fence from the
> user space queue which is then used to prevent eviction of the BOs needed.
> 
> So to me it sounds like:
> 
> 1. Attach fence to BOs to prevent eviction.
> 
> 2. User space submits work to the ring buffer, rings doorbell.
> 
> 3. Call into the kernel to create the fence for step 1.
> 
> Which is obviously broken. What am I missing?
> 
> One other thing to note is that Mali doesn't have local memory - so the
> only benefit of unpinning is if we want to swap to disk (or zram etc).

Which would be good to have, IMHO. If we don't do the implicit swapout
based on some sort of least-recently-used-VM, we have to rely on
userspace freeing its buffer (or flagging them reclaimable) as soon as
they are no longer used by the GPU.

[1] https://patchwork.kernel.org/project/dri-devel/cover/20240105184624.508603-1-dmitry.osipenko@collabora.com/
Steven Price Sept. 4, 2024, 1:35 p.m. UTC | #14
On 04/09/2024 14:20, Boris Brezillon wrote:
> + Adrian, who has been looking at the shrinker stuff for Panthor
> 
> On Wed, 4 Sep 2024 13:46:12 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 04/09/2024 12:34, Christian König wrote:
>>> Hi Boris,
>>>
>>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:  
>>>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences    
>>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>>>> memory management easy mode.    
>>>>>> Ok, that at least makes things work for the moment.    
>>>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>>>
>>>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>>>> to ensure that memory is preallocated when the memory operations are
>>>>> enqueued. So any memory required should have been allocated before any
>>>>> sync object is returned. We're aware of the issue with memory
>>>>> allocations on the signalling path and trying to ensure that we don't
>>>>> have that.
>>>>>
>>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>>>> memory with our design.  
>>>> That's actually what we were planning to do: the panthor shrinker was
>>>> about to rely on fences attached to GEM objects to know if it can
>>>> reclaim the memory. This design relies on each job attaching its fence
>>>> to the GEM mapped to the VM at the time the job is submitted, such that
>>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>>>> is done.  
>>
>> How progressed is this shrinker?
> 
> We don't have code yet. All we know is that we want to re-use Dmitry's
> generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> the VM model, which means not tracking things at the BO granularity,
> but at the VM granularity. Actually it has to be an hybrid model, where
> shared BOs (those imported/exported) are tracked individually, while
> all private BOs are checked simultaneously (since they all share the VM
> resv object).
> 
>> It would be good to have an RFC so that
>> we can look to see how user submission could fit in with it.
> 
> Unfortunately, we don't have that yet :-(. All we have is a rough idea
> of how things will work, which is basically how TTM reclaim works, but
> adapted to GEM.

Fair enough, thanks for the description. We might need to coordinate to
get this looked at sooner if it's going to be blocking user submission.

>>
>>> Yeah and exactly that doesn't work any more when you are using user
>>> queues, because the kernel has no opportunity to attach a fence for each
>>> submission.  
>>
>> User submission requires a cooperating user space[1]. So obviously user
>> space would need to ensure any BOs that it expects will be accessed to
>> be in some way pinned. Since the expectation of user space submission is
>> that we're reducing kernel involvement, I'd also expect these to be
>> fairly long-term pins.
>>
>> [1] Obviously with a timer to kill things from a malicious user space.
>>
>> The (closed) 'kbase' driver has a shrinker but is only used on a subset
>> of memory and it's up to user space to ensure that it keeps the relevant
>> parts pinned (or more specifically not marking them to be discarded if
>> there's memory pressure). Not that I think we should be taking it's
>> model as a reference here.
>>
>>>>> Memory which user space thinks the GPU might
>>>>> need should be pinned before the GPU work is submitted. APIs which
>>>>> require any form of 'paging in' of data would need to be implemented by
>>>>> the GPU work completing and being resubmitted by user space after the
>>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).  
>>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>>>> that means we can't really transparently swap out GPU memory, or we
>>>> have to constantly pin/unpin around each job, which means even more
>>>> ioctl()s than we have now. Another option would be to add the XGS fence
>>>> to the BOs attached to the VM, assuming it's created before the job
>>>> submission itself, but you're no longer reducing the number of user <->
>>>> kernel round trips if you do that, because you now have to create an
>>>> XSG job for each submission, so you basically get back to one ioctl()
>>>> per submission.  
>>
>> As you say the granularity of pinning has to be fairly coarse for user
>> space submission to make sense. My assumption (could be wildly wrong)
>> was that most memory would be pinned whenever a context is rendering.
> 
> The granularity of pinning (in term of which regions are pinned) is not
> really the problem, we can just assume anything that's mapped to the VM
> will be used by the GPU (which is what we're planning to do for kernel
> submission BTW). The problem is making the timeslice during
> which VM memory is considered unreclaimable as short as possible, such
> that the system can reclaim memory under mem pressure. Ideally, you want
> to pin memory as long as you have jobs queued/running, and allow for
> reclaim when the GPU context is idle.
> 
> We might be able to involve the panthor_scheduler for usermode queues,
> such that a context that's eligible for scheduling first gets its VM
> mappings pinned (fence creation + assignment to the VM/BO resvs), and
> things get reclaimable again when the group is evicted from the CSG
> slot. That implies evicting idle groups more aggressively than we do
> know, but there's probably a way around it.

Can we evict idle groups from the shrinker? User space already has to do
a little dance to work out whether it needs to "kick" the kernel to do
the submission. So it would be quite reasonable to extend the kick to
also pin the VM(s). The shrinker could then proactively evict idle
groups, and at that point it would be possible to unpin the memory.

I'm probably missing something, but that seems like a fairly solid
solution to me.
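
Roughly what I have in mind, as a hypothetical scan callback (the group
structure and helpers are invented, registration and locking omitted):

#include <linux/list.h>
#include <linux/shrinker.h>

struct idle_group {
        struct list_head node;
        /* ... */
};

static LIST_HEAD(idle_groups);

/* Provided elsewhere in this sketch. */
void group_evict(struct idle_group *grp);
unsigned long group_unpin_vm(struct idle_group *grp);

static unsigned long
idle_group_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        struct idle_group *grp, *tmp;
        unsigned long freed = 0;

        list_for_each_entry_safe(grp, tmp, &idle_groups, node) {
                /* Take the group off its CSG slot... */
                group_evict(grp);
                /* ...at which point its VM mappings can be unpinned. */
                freed += group_unpin_vm(grp);

                if (freed >= sc->nr_to_scan)
                        break;
        }

        return freed;
}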

>>
>>> For AMDGPU we are currently working on the following solution with
>>> memory management and user queues:
>>>
>>> 1. User queues are created through an kernel IOCTL, submissions work by
>>> writing into a ring buffer and ringing a doorbell.
>>>
>>> 2. Each queue can request the kernel to create fences for the currently
>>> pushed work for a queues which can then be attached to BOs, syncobjs,
>>> syncfiles etc...
>>>
>>> 3. Additional to that we have and eviction/preemption fence attached to
>>> all BOs, page tables, whatever resources we need.
>>>
>>> 4. When this eviction fences are requested to signal they first wait for
>>> all submission fences and then suspend the user queues and block
>>> creating new submission fences until the queues are restarted again.
>>>
>>> This way you can still do your memory management inside the kernel (e.g.
>>> move BOs from local to system memory) or even completely suspend and
>>> resume applications without their interaction, but as Sima said it is
>>> just horrible complicated to get right.
>>>
>>> We have been working on this for like two years now and it still could
>>> be that we missed something since it is not in production testing yet.  
>>
>> I'm not entirely sure I follow how this doesn't create a dependency
>> cycle. From your description it sounds like you create a fence from the
>> user space queue which is then used to prevent eviction of the BOs needed.
>>
>> So to me it sounds like:
>>
>> 1. Attach fence to BOs to prevent eviction.
>>
>> 2. User space submits work to the ring buffer, rings doorbell.
>>
>> 3. Call into the kernel to create the fence for step 1.
>>
>> Which is obviously broken. What am I missing?
>>
>> One other thing to note is that Mali doesn't have local memory - so the
>> only benefit of unpinning is if we want to swap to disk (or zram etc).
> 
> Which would be good to have, IMHO. If we don't do the implicit swapout
> based on some sort of least-recently-used-VM, we have to rely on
> userspace freeing its buffer (or flagging them reclaimable) as soon as
> they are no longer used by the GPU.

It's an awkward one because really you do want user space to free
buffers - generally the user space driver has a fairly large number of
buffers which are used for temporary storage (and kept around to avoid
the overhead of freeing/allocating them each frame). I know we've
resorted to idle timers in user space in an attempt to do this in the
past - but it's ugly.

Actual swapout can work, but it incurs a heavy penalty when the
application becomes active again. But I agree we should probably be
aiming to add support for this.

Steve
Christian König Sept. 4, 2024, 1:36 p.m. UTC | #15
Am 04.09.24 um 14:46 schrieb Steven Price:
> On 04/09/2024 12:34, Christian König wrote:
>> Hi Boris,
>>
>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
>>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>>> memory management easy mode.
>>>>> Ok, that at least makes things work for the moment.
>>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>>
>>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>>> to ensure that memory is preallocated when the memory operations are
>>>> enqueued. So any memory required should have been allocated before any
>>>> sync object is returned. We're aware of the issue with memory
>>>> allocations on the signalling path and trying to ensure that we don't
>>>> have that.
>>>>
>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>>> memory with our design.
>>> That's actually what we were planning to do: the panthor shrinker was
>>> about to rely on fences attached to GEM objects to know if it can
>>> reclaim the memory. This design relies on each job attaching its fence
>>> to the GEM mapped to the VM at the time the job is submitted, such that
>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>>> is done.
> How progressed is this shrinker? It would be good to have an RFC so that
> we can look to see how user submission could fit in with it.
>
>> Yeah and exactly that doesn't work any more when you are using user
>> queues, because the kernel has no opportunity to attach a fence for each
>> submission.
> User submission requires a cooperating user space[1]. So obviously user
> space would need to ensure any BOs that it expects will be accessed to
> be in some way pinned. Since the expectation of user space submission is
> that we're reducing kernel involvement, I'd also expect these to be
> fairly long-term pins.
>
> [1] Obviously with a timer to kill things from a malicious user space.
>
> The (closed) 'kbase' driver has a shrinker but is only used on a subset
> of memory and it's up to user space to ensure that it keeps the relevant
> parts pinned (or more specifically not marking them to be discarded if
> there's memory pressure). Not that I think we should be taking it's
> model as a reference here.
>
>>>> Memory which user space thinks the GPU might
>>>> need should be pinned before the GPU work is submitted. APIs which
>>>> require any form of 'paging in' of data would need to be implemented by
>>>> the GPU work completing and being resubmitted by user space after the
>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>>> that means we can't really transparently swap out GPU memory, or we
>>> have to constantly pin/unpin around each job, which means even more
>>> ioctl()s than we have now. Another option would be to add the XGS fence
>>> to the BOs attached to the VM, assuming it's created before the job
>>> submission itself, but you're no longer reducing the number of user <->
>>> kernel round trips if you do that, because you now have to create an
>>> XSG job for each submission, so you basically get back to one ioctl()
>>> per submission.
> As you say the granularity of pinning has to be fairly coarse for user
> space submission to make sense. My assumption (could be wildly wrong)
> was that most memory would be pinned whenever a context is rendering.
>
>> For AMDGPU we are currently working on the following solution with
>> memory management and user queues:
>>
>> 1. User queues are created through an kernel IOCTL, submissions work by
>> writing into a ring buffer and ringing a doorbell.
>>
>> 2. Each queue can request the kernel to create fences for the currently
>> pushed work for a queues which can then be attached to BOs, syncobjs,
>> syncfiles etc...
>>
>> 3. Additional to that we have and eviction/preemption fence attached to
>> all BOs, page tables, whatever resources we need.
>>
>> 4. When this eviction fences are requested to signal they first wait for
>> all submission fences and then suspend the user queues and block
>> creating new submission fences until the queues are restarted again.
>>
>> This way you can still do your memory management inside the kernel (e.g.
>> move BOs from local to system memory) or even completely suspend and
>> resume applications without their interaction, but as Sima said it is
>> just horrible complicated to get right.
>>
>> We have been working on this for like two years now and it still could
>> be that we missed something since it is not in production testing yet.
> I'm not entirely sure I follow how this doesn't create a dependency
> cycle. From your description it sounds like you create a fence from the
> user space queue which is then used to prevent eviction of the BOs needed.
>
> So to me it sounds like:
>
> 1. Attach fence to BOs to prevent eviction.
>
> 2. User space submits work to the ring buffer, rings doorbell.
>
> 3. Call into the kernel to create the fence for step 1.
>
> Which is obviously broken. What am I missing?

Those are two different fences: we have an eviction fence and a 
submission fence.

The eviction fence takes care of making sure the memory is not moved 
while submissions are running.

And the submission fence is the actual signal to completion.

> One other thing to note is that Mali doesn't have local memory - so the
> only benefit of unpinning is if we want to swap to disk (or zram etc).

Yeah, that makes sense, but the problem remains that with a userspace 
submission model you completely nail down that pinning.

Regards,
Christian.

>
> Steve
>
Boris Brezillon Sept. 4, 2024, 1:58 p.m. UTC | #16
On Wed, 4 Sep 2024 14:35:12 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/09/2024 14:20, Boris Brezillon wrote:
> > + Adrian, who has been looking at the shrinker stuff for Panthor
> > 
> > On Wed, 4 Sep 2024 13:46:12 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 04/09/2024 12:34, Christian König wrote:  
> >>> Hi Boris,
> >>>
> >>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:    
> >>>>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences      
> >>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >>>>>>> memory management easy mode.      
> >>>>>> Ok, that at least makes things work for the moment.      
> >>>>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>>>
> >>>>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>>>> to ensure that memory is preallocated when the memory operations are
> >>>>> enqueued. So any memory required should have been allocated before any
> >>>>> sync object is returned. We're aware of the issue with memory
> >>>>> allocations on the signalling path and trying to ensure that we don't
> >>>>> have that.
> >>>>>
> >>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>>>> memory with our design.    
> >>>> That's actually what we were planning to do: the panthor shrinker was
> >>>> about to rely on fences attached to GEM objects to know if it can
> >>>> reclaim the memory. This design relies on each job attaching its fence
> >>>> to the GEM mapped to the VM at the time the job is submitted, such that
> >>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >>>> is done.    
> >>
> >> How progressed is this shrinker?  
> > 
> > We don't have code yet. All we know is that we want to re-use Dmitry's
> > generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> > the VM model, which means not tracking things at the BO granularity,
> > but at the VM granularity. Actually it has to be an hybrid model, where
> > shared BOs (those imported/exported) are tracked individually, while
> > all private BOs are checked simultaneously (since they all share the VM
> > resv object).
> >   
> >> It would be good to have an RFC so that
> >> we can look to see how user submission could fit in with it.  
> > 
> > Unfortunately, we don't have that yet :-(. All we have is a rough idea
> > of how things will work, which is basically how TTM reclaim works, but
> > adapted to GEM.  
> 
> Fair enough, thanks for the description. We might need to coordinate to
> get this looked at sooner if it's going to be blocking user submission.
> 
> >>  
> >>> Yeah and exactly that doesn't work any more when you are using user
> >>> queues, because the kernel has no opportunity to attach a fence for each
> >>> submission.    
> >>
> >> User submission requires a cooperating user space[1]. So obviously user
> >> space would need to ensure any BOs that it expects will be accessed are
> >> in some way pinned. Since the expectation of user space submission is
> >> that we're reducing kernel involvement, I'd also expect these to be
> >> fairly long-term pins.
> >>
> >> [1] Obviously with a timer to kill things from a malicious user space.
> >>
> >> The (closed) 'kbase' driver has a shrinker, but it is only used on a subset
> >> of memory and it's up to user space to ensure that it keeps the relevant
> >> parts pinned (or, more specifically, not marking them to be discarded if
> >> there's memory pressure). Not that I think we should be taking its
> >> model as a reference here.
> >>  
> >>>>> Memory which user space thinks the GPU might
> >>>>> need should be pinned before the GPU work is submitted. APIs which
> >>>>> require any form of 'paging in' of data would need to be implemented by
> >>>>> the GPU work completing and being resubmitted by user space after the
> >>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).    
> >>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >>>> that means we can't really transparently swap out GPU memory, or we
> >>>> have to constantly pin/unpin around each job, which means even more
> >>>> ioctl()s than we have now. Another option would be to add the XGS fence
> >>>> to the BOs attached to the VM, assuming it's created before the job
> >>>> submission itself, but you're no longer reducing the number of user <->
> >>>> kernel round trips if you do that, because you now have to create an
> >>>> XGS job for each submission, so you basically get back to one ioctl()
> >>>> per submission.    
> >>
> >> As you say the granularity of pinning has to be fairly coarse for user
> >> space submission to make sense. My assumption (could be wildly wrong)
> >> was that most memory would be pinned whenever a context is rendering.  
> > 
> > The granularity of pinning (in terms of which regions are pinned) is not
> > really the problem, we can just assume anything that's mapped to the VM
> > will be used by the GPU (which is what we're planning to do for kernel
> > submission BTW). The problem is making the timeslice during
> > which VM memory is considered unreclaimable as short as possible, such
> > that the system can reclaim memory under mem pressure. Ideally, you want
> > to pin memory as long as you have jobs queued/running, and allow for
> > reclaim when the GPU context is idle.
> > 
> > We might be able to involve the panthor_scheduler for usermode queues,
> > such that a context that's eligible for scheduling first gets its VM
> > mappings pinned (fence creation + assignment to the VM/BO resvs), and
> > things get reclaimable again when the group is evicted from the CSG
> > slot. That implies evicting idle groups more aggressively than we do
> > now, but there's probably a way around it.
> 
> Can we evict idle groups from the shrinker? User space already has to do
> a little dance to work out whether it needs to "kick" the kernel to do
> the submission. So it would be quite reasonable to extend the kick to
> also pin the VM(s). The shrinker could then proactively evict idle
> groups, and at that point it would be possible to unpin the memory.
> 
> I'm probably missing something, but that seems like a fairly solid
> solution to me.

The locking might be tricky to get right, but other than that, it looks
like an option that could work. It's definitely worth investigating.
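
To make that option a bit more concrete, here is a hedged sketch of the flow
Steven describes: the kick path pins the VM by attaching a fence to the VM resv
(shared by all private BOs), and a shrinker pass proactively evicts idle groups so
their VMs become reclaimable again. Every panthor_* helper and struct field below
is a made-up placeholder rather than existing driver code, and the locking concerns
mentioned above are deliberately glossed over.

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/err.h>
#include <linux/list.h>

/* Hypothetical sketch of "kick pins the VM, shrinker evicts idle groups". */
static int panthor_group_kick(struct panthor_group *group)
{
        struct dma_resv *resv = panthor_vm_resv(group->vm);     /* hypothetical */
        struct dma_fence *pin_fence;
        int ret;

        /* Fence that stays unsignalled while the group is on a CSG slot. */
        pin_fence = panthor_vm_create_pin_fence(group->vm);     /* hypothetical */
        if (IS_ERR(pin_fence))
                return PTR_ERR(pin_fence);

        ret = dma_resv_lock(resv, NULL);
        if (ret)
                goto err_put_fence;

        ret = dma_resv_reserve_fences(resv, 1);
        if (!ret)
                dma_resv_add_fence(resv, pin_fence, DMA_RESV_USAGE_BOOKKEEP);
        dma_resv_unlock(resv);
        if (ret)
                goto err_put_fence;

        group->vm_pin_fence = pin_fence;        /* signalled again on eviction */
        return panthor_sched_queue_group(group);                /* hypothetical */

err_put_fence:
        dma_fence_put(pin_fence);
        return ret;
}

/* Called from the shrinker: make idle groups' VMs reclaimable again. */
static unsigned long panthor_evict_idle_groups(struct panthor_device *ptdev)
{
        struct panthor_group *group, *tmp;
        unsigned long freed = 0;

        /* Locking of the idle list is omitted here. */
        list_for_each_entry_safe(group, tmp, &ptdev->idle_groups, idle_node) {
                panthor_sched_evict_group(group);       /* off the CSG slot */
                dma_fence_signal(group->vm_pin_fence);  /* VM reclaimable again */
                freed += panthor_vm_swapout(group->vm); /* hypothetical */
        }

        return freed;
}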

> 
> >>  
> >>> For AMDGPU we are currently working on the following solution for
> >>> memory management with user queues:
> >>>
> >>> 1. User queues are created through a kernel IOCTL; submissions work by
> >>> writing into a ring buffer and ringing a doorbell.
> >>>
> >>> 2. Each queue can request the kernel to create fences for the currently
> >>> pushed work for a queue, which can then be attached to BOs, syncobjs,
> >>> syncfiles, etc.
> >>>
> >>> 3. In addition to that we have an eviction/preemption fence attached to
> >>> all BOs, page tables, and whatever resources we need.
> >>>
> >>> 4. When these eviction fences are requested to signal, they first wait
> >>> for all submission fences, then suspend the user queues and block
> >>> creation of new submission fences until the queues are restarted again.
> >>>
> >>> This way you can still do your memory management inside the kernel (e.g.
> >>> move BOs from local to system memory) or even completely suspend and
> >>> resume applications without their interaction, but as Sima said it is
> >>> just horribly complicated to get right.
> >>>
> >>> We have been working on this for like two years now, and it could still
> >>> be that we missed something since it is not in production testing yet.
> >>
> >> I'm not entirely sure I follow how this doesn't create a dependency
> >> cycle. From your description it sounds like you create a fence from the
> >> user space queue which is then used to prevent eviction of the BOs needed.
> >>
> >> So to me it sounds like:
> >>
> >> 1. Attach fence to BOs to prevent eviction.
> >>
> >> 2. User space submits work to the ring buffer, rings doorbell.
> >>
> >> 3. Call into the kernel to create the fence for step 1.
> >>
> >> Which is obviously broken. What am I missing?
> >>
> >> One other thing to note is that Mali doesn't have local memory - so the
> >> only benefit of unpinning is if we want to swap to disk (or zram etc).  
> > 
> > Which would be good to have, IMHO. If we don't do the implicit swapout
> > based on some sort of least-recently-used-VM, we have to rely on
> > userspace freeing its buffers (or flagging them reclaimable) as soon as
> > they are no longer used by the GPU.  
> 
> It's an awkward one because really you do want user space to free
> buffers - generally the user space driver has a fairly large number of
> buffers which are for temporary storage (and kept around to avoid the
> overhead of freeing/allocating each frame). I know we've resorted to
> idle timers in user space in an attempt to do this in the past - but
> it's ugly.

Fair enough.

> 
> Actual swapout can work, but it incurs a heavy penalty when the
> application becomes active again. But I agree we should probably be
> aiming to add support for this.

I guess the other case is a gazillion GPU contexts, each keeping
buffers around for good reason. If most of those contexts are idle,
swapping out the mostly inactive ones is better than having the OOM
killer kick in to reclaim memory. And yes, there's a hit any time an
inactive context becomes active again, but only if the system was under
memory pressure in the meantime, so the slowdown seems acceptable in
that case.

Just to mention that, IIRC, MSM already moved away from explicit BO
reclaim (BOs flagged reclaimable by userspace when they're back in the
userspace BO cache) in favor of an LRU-based swapout. Dmitry's work is
also centered around this LRU-based swapout model, even though it also
supports the model we have in panfrost, lima, vc4, etc. So it looks
like the new trend for embedded GPU drivers is to defer this memory
reclaim responsibility entirely to the kernel rather than letting
userspace decide which bits are reclaimable.
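
For what it's worth, here is a very rough sketch of what VM-granularity LRU
tracking could look like on the panthor side. All names below are hypothetical, and
this intentionally ignores shrinker registration and the drm_gem_lru helpers MSM
builds on: the kernel bumps a VM to the tail of the LRU whenever a context using it
is kicked, and a reclaim pass swaps out unpinned VMs starting from the head, i.e.
the least recently used ones.

#include <linux/list.h>
#include <linux/mutex.h>

/* Hypothetical VM-granularity LRU; real code would likely build on drm_gem_lru. */
static LIST_HEAD(panthor_vm_lru);
static DEFINE_MUTEX(panthor_vm_lru_lock);

struct panthor_vm_lru_entry {
        struct list_head lru_node;
        bool pinned;    /* true while a group using this VM is scheduled */
};

/* Called whenever a context using this VM is kicked/scheduled. */
static void panthor_vm_lru_touch(struct panthor_vm_lru_entry *vm)
{
        mutex_lock(&panthor_vm_lru_lock);
        list_move_tail(&vm->lru_node, &panthor_vm_lru);
        mutex_unlock(&panthor_vm_lru_lock);
}

/* Reclaim pass: swap out the least recently used, unpinned VMs first. */
static unsigned long panthor_vm_lru_reclaim(unsigned long nr_pages)
{
        struct panthor_vm_lru_entry *vm, *tmp;
        unsigned long freed = 0;

        mutex_lock(&panthor_vm_lru_lock);
        list_for_each_entry_safe(vm, tmp, &panthor_vm_lru, lru_node) {
                if (vm->pinned)
                        continue;
                freed += panthor_vm_swapout(vm);        /* hypothetical helper */
                if (freed >= nr_pages)
                        break;
        }
        mutex_unlock(&panthor_vm_lru_lock);

        return freed;
}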
Simona Vetter Sept. 24, 2024, 9:30 a.m. UTC | #17
Apologies for the late reply ...

On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> Hi Boris,
> 
> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > > > > memory management easy mode.
> > > > Ok, that at least makes things work for the moment.
> > > Ah, perhaps this should have been spelt out more clearly ;)
> > > 
> > > The VM_BIND mechanism that's already in place jumps through some hoops
> > > to ensure that memory is preallocated when the memory operations are
> > > enqueued. So any memory required should have been allocated before any
> > > sync object is returned. We're aware of the issue with memory
> > > allocations on the signalling path and trying to ensure that we don't
> > > have that.
> > > 
> > > I'm hoping that we don't need a shrinker which deals with (active) GPU
> > > memory with our design.
> > That's actually what we were planning to do: the panthor shrinker was
> > about to rely on fences attached to GEM objects to know if it can
> > reclaim the memory. This design relies on each job attaching its fence
> > to the GEM mapped to the VM at the time the job is submitted, such that
> > memory that's in-use or about-to-be-used doesn't vanish before the GPU
> > is done.
> 
> Yeah and exactly that doesn't work any more when you are using user queues,
> because the kernel has no opportunity to attach a fence for each submission.
> 
> > > Memory which user space thinks the GPU might
> > > need should be pinned before the GPU work is submitted. APIs which
> > > require any form of 'paging in' of data would need to be implemented by
> > > the GPU work completing and being resubmitted by user space after the
> > > memory changes (i.e. there could be a DMA fence pending on the GPU work).
> > Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> > that means we can't really transparently swap out GPU memory, or we
> > have to constantly pin/unpin around each job, which means even more
> > ioctl()s than we have now. Another option would be to add the XGS fence
> > to the BOs attached to the VM, assuming it's created before the job
> > submission itself, but you're no longer reducing the number of user <->
> > kernel round trips if you do that, because you now have to create an
> > XGS job for each submission, so you basically get back to one ioctl()
> > per submission.
> 
> For AMDGPU we are currently working on the following solution for memory
> management with user queues:
> 
> 1. User queues are created through a kernel IOCTL; submissions work by
> writing into a ring buffer and ringing a doorbell.
> 
> 2. Each queue can request the kernel to create fences for the currently
> pushed work for a queue, which can then be attached to BOs, syncobjs,
> syncfiles, etc.
> 
> 3. In addition to that we have an eviction/preemption fence attached to all
> BOs, page tables, and whatever resources we need.
> 
> 4. When these eviction fences are requested to signal, they first wait for
> all submission fences, then suspend the user queues and block creation of
> new submission fences until the queues are restarted again.

Yup this works, at least when I play it out in my head.

Note that the completion fence is only deadlock-free if userspace is
really, really careful, which in practice means you need the very
carefully constructed rules for e.g. vulkan winsys fences; otherwise you
do indeed deadlock.

But if you keep that promise in mind, then it works, and step 2 is
entirely optional, which means we can start userspace in a pure
long-running compute mode where there are only eviction/preemption
fences. And then if userspace needs a vulkan winsys fence, we can create
that with step 2 as needed.

But the important part is that you need really strict rules on userspace
for when step 2 is ok and won't result in deadlocks. And those rules are
uapi, which is why I think doing this in panthor without the shrinker and
eviction fences (i.e. steps 3&4 above) is a very bad mistake.
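
To make step 2 slightly more concrete for the panthor case, here is a hedged
sketch of what an on-demand "give me a fence for the work I've already pushed"
ioctl could look like. The uAPI struct and the panthor_user_queue_* helpers are
invented purely for illustration (refcounting and error paths are trimmed); the one
real constraint it tries to show is that fence creation is refused (or blocked)
while the queue is suspended by an eviction fence, so the rules above can be
enforced.

#include <linux/dma-fence.h>
#include <linux/err.h>
#include <drm/drm_device.h>
#include <drm/drm_file.h>
#include <drm/drm_syncobj.h>

/*
 * Hypothetical "step 2" ioctl: create a completion fence covering all work
 * currently pushed to a user queue (up to its ring-buffer write pointer) and
 * attach it to a DRM syncobj provided by userspace.
 */
static int panthor_ioctl_queue_get_fence(struct drm_device *ddev, void *data,
                                         struct drm_file *file)
{
        struct drm_panthor_queue_get_fence *args = data; /* made-up uAPI struct */
        struct panthor_user_queue *queue;
        struct drm_syncobj *syncobj;
        struct dma_fence *fence;

        queue = panthor_user_queue_lookup(file, args->queue_handle); /* hypothetical */
        if (!queue)
                return -EINVAL;

        /* Refuse while the queue is suspended by an eviction fence (step 4). */
        if (panthor_user_queue_suspended(queue))                /* hypothetical */
                return -EBUSY;

        /* Fence signals when the GPU reaches the current write pointer. */
        fence = panthor_user_queue_fence_create(queue, queue->ringbuf_wptr);
        if (IS_ERR(fence))
                return PTR_ERR(fence);

        syncobj = drm_syncobj_find(file, args->syncobj_handle);
        if (!syncobj) {
                dma_fence_put(fence);
                return -ENOENT;
        }

        drm_syncobj_replace_fence(syncobj, fence);
        drm_syncobj_put(syncobj);
        dma_fence_put(fence);
        return 0;
}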

> This way you can still do your memory management inside the kernel (e.g.
> move BOs from local to system memory) or even completely suspend and resume
> applications without their interaction, but as Sima said it is just horribly
> complicated to get right.
> 
> We have been working on this for like two years now, and it could still be
> that we missed something since it is not in production testing yet.

Ack.
-Sima