[v2] drm/doc: device hot-unplug for userspace

Message ID 20200525124614.16339-1-ppaalanen@gmail.com (mailing list archive)
State New, archived

Commit Message

Pekka Paalanen May 25, 2020, 12:46 p.m. UTC
From: Pekka Paalanen <pekka.paalanen@collabora.com>

Set up the expectations on what hot-unplugging a DRM device should look like to
userspace.

Written at Daniel Vetter's request and largely based on his comments on IRC and
from https://lists.freedesktop.org/archives/dri-devel/2020-May/265484.html .

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Simon Ser <contact@emersion.fr>

---

v2:
- mmap reads/writes undefined (danvet)
- make render ioctl behaviour driver-specific (danvet)
- restructure the mmap paragraphs (danvet)
- chardev minor notes (Simon)
- open behaviour (danvet)
- DRM leasing behaviour (danvet)
- added links

Disclaimer: I am a userspace developer writing for other userspace developers.
I took some liberties in defining what should happen without knowing what is
actually possible or what existing drivers already implement.
---
 Documentation/gpu/drm-uapi.rst | 102 +++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)

Comments

Andrey Grodzovsky May 25, 2020, 1:51 p.m. UTC | #1
On 5/25/20 8:46 AM, Pekka Paalanen wrote:

> From: Pekka Paalanen <pekka.paalanen@collabora.com>
>
> Set up the expectations on what hot-unplugging a DRM device should look like to
> userspace.
>
> Written at Daniel Vetter's request and largely based on his comments on IRC and
> from https://lists.freedesktop.org/archives/dri-devel/2020-May/265484.html .
>
> Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Sean Paul <sean@poorly.run>
> Cc: Simon Ser <contact@emersion.fr>
>
> ---
>
> v2:
> - mmap reads/writes undefined (danvet)
> - make render ioctl behaviour driver-specific (danvet)
> - restructure the mmap paragraphs (danvet)
> - chardev minor notes (Simon)
> - open behaviour (danvet)
> - DRM leasing behaviour (danvet)
> - added links
>
> Disclaimer: I am a userspace developer writing for other userspace developers.
> I took some liberties in defining what should happen without knowing what is
> actually possible or what existing drivers already implement.
> ---
>   Documentation/gpu/drm-uapi.rst | 102 +++++++++++++++++++++++++++++++++
>   1 file changed, 102 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 56fec6ed1ad8..520b8e640ad1 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -1,3 +1,5 @@
> +.. Copyright 2020 DisplayLink (UK) Ltd.
> +
>   ===================
>   Userland interfaces
>   ===================
> @@ -162,6 +164,106 @@ other hand, a driver requires shared state between clients which is
>   visible to user-space and accessible beyond open-file boundaries, they
>   cannot support render nodes.
>   
> +Device Hot-Unplug
> +=================
> +
> +.. note::
> +   The following is the plan. Implementation is not there yet
> +   (2020 May).
> +
> +Graphics devices (display and/or render) may be connected via USB (e.g.
> +display adapters or docking stations) or Thunderbolt (e.g. eGPU). An end
> +user is able to hot-unplug this kind of device while it is being
> +used, and expects that at the very least the machine does not crash. Any
> +damage from hot-unplugging a DRM device needs to be limited as much as
> +possible and userspace must be given the chance to handle it if it wants
> +to. Ideally, unplugging a DRM device still lets a desktop continue
> +running, but that is going to need explicit support throughout the whole
> +graphics stack: from kernel and userspace drivers, through display
> +servers, via window system protocols, and in applications and libraries.

So, to support all the requirements in this document, kernel changes
alone should be enough, and no changes are required from the user-mode
part of the stack?

> +
> +Other scenarios that should lead to the same are: unrecoverable GPU
> +crash, PCI device disappearing off the bus, or forced unbind of a driver
> +from the physical device.
> +
> +In other words, from userspace perspective everything needs to keep on
> +working more or less, until userspace stops using the disappeared DRM
> +device and closes it completely. Userspace will learn of the device
> +disappearance from the device removed uevent


Is this uevent already implemented? Can you point me to the code?


> or in some cases
> +driver-specific ioctls returning EIO.
> +
> +Only after userspace has closed all relevant DRM device and dmabuf file
> +descriptors and removed all mmaps can the DRM driver tear down its
> +instance for the device that no longer exists. If the same physical
> +device somehow comes back in the meantime, it shall be a new DRM
> +device.
> +
> +Similar to PIDs, chardev minor numbers are not recycled immediately. A
> +new DRM device always picks the next free minor number after the
> +previously allocated one, and wraps around when minor numbers are
> +exhausted.
> +
> +Requirements for UAPI
> +---------------------
> +
> +The goal raises at least the following requirements for the kernel and
> +drivers:
> +
> +- The kernel must not hang, crash or oops, no matter what userspace was
> +  in the middle of doing when the device disappeared.
> +
> +- All GPU jobs that can no longer run must have their fences
> +  force-signalled to avoid inflicting hangs on userspace.
> +
> +- KMS connectors must change their status to disconnected.
> +
> +- Legacy modesets and pageflips fake success.
> +
> +- Atomic commits, both real and TEST_ONLY, fake success.
> +
> +- Pending non-blocking KMS operations deliver the DRM events userspace
> +  is expecting.


The 4 points above refer to cards with mode setting or an attached
display, and are irrelevant for a secondary GPU (e.g. the DRI-PRIME
scenario) or for display-less systems in general. Maybe we can
highlight this in the document somehow, so that on the implementing
side I can decide to concentrate on the non-display case as a first,
or the only, step. In general, and correct me if I am wrong,
render-only (or compute-only) GPUs are the majority of cases where you
would want to be able to detach/attach a GPU on the fly (e.g.
attaching a stronger secondary graphics card to a laptop to get high
performance in a game, or adding/removing a GPU to/from a compute
cluster).

Andrey


> +
> +- dmabufs which point to memory that has disappeared will continue to
> +  be successfully imported if the import would have succeeded before
> +  the disappearance.
> +
> +- Attempting to import a dmabuf to a disappeared device will succeed if
> +  it would have succeeded without the disappearance.
> +
> +- Some userspace APIs already define what should happen when the device
> +  disappears (OpenGL, GL ES: `GL_KHR_robustness`_; `Vulkan`_:
> +  VK_ERROR_DEVICE_LOST; etc.). DRM drivers are free to implement this
> +  behaviour the way they see best, e.g. returning failures in
> +  driver-specific ioctls and handling those in userspace drivers, or
> +  relying on uevents, and so on.
> +
> +- open() on a device node whose underlying device has disappeared will
> +  fail.
> +
> +- Attempting to create a DRM lease on a disappeared DRM device will
> +  fail. Existing DRM leases remain.
> +
> +.. _GL_KHR_robustness: https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_robustness.txt
> +.. _Vulkan: https://www.khronos.org/vulkan/
> +
> +Requirements for memory maps
> +----------------------------
> +
> +Memory maps have further requirements. If the underlying memory
> +disappears, the mmap is modified such that reads and writes will still
> +complete successfully but the result is undefined. This applies to both
> +userspace mmap()'d memory and memory pointed to by dmabuf which might be
> +mapped to other devices.
> +
> +Raising SIGBUS is not an option, because userspace cannot realistically
> +handle it.  Signal handlers are global, which makes them extremely
> +difficult to use correctly from libraries like those that Mesa produces.
> +Signal handlers are not composable: you can't have different handlers
> +for GPU1 and GPU2 from different vendors, and a third handler for
> +mmapped regular files.  Threads cause additional pain with signal
> +handling as well.
> +
>   .. _drm_driver_ioctl:
>   
>   IOCTL Support on Device Nodes
Daniel Vetter May 25, 2020, 2:30 p.m. UTC | #2
On Mon, May 25, 2020 at 09:51:30AM -0400, Andrey Grodzovsky wrote:
> On 5/25/20 8:46 AM, Pekka Paalanen wrote:
> 
> > +- KMS connectors must change their status to disconnected.
> > +
> > +- Legacy modesets and pageflips fake success.
> > +
> > +- Atomic commits, both real and TEST_ONLY, fake success.
> > +
> > +- Pending non-blocking KMS operations deliver the DRM events userspace
> > +  is expecting.
> 
> 
> The 4 points above refer to cards with mode setting or an attached
> display, and are irrelevant for a secondary GPU (e.g. the DRI-PRIME
> scenario) or for display-less systems in general. Maybe we can
> highlight this in the document somehow, so that on the implementing
> side I can decide to concentrate on the non-display case as a first,
> or the only, step. In general, and correct me if I am wrong,
> render-only (or compute-only) GPUs are the majority of cases where you
> would want to be able to detach/attach a GPU on the fly (e.g.
> attaching a stronger secondary graphics card to a laptop to get high
> performance in a game, or adding/removing a GPU to/from a compute
> cluster).

Yeah, maybe splitting this up into a KMS section and a
rendering/cross-driver section (the dma-buf/fence stuff is relevant for
both display and rendering) would make some sense.
-Daniel

Pekka Paalanen May 25, 2020, 2:41 p.m. UTC | #3
On Mon, 25 May 2020 09:51:30 -0400
Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> wrote:

> On 5/25/20 8:46 AM, Pekka Paalanen wrote:
> 
> > +Device Hot-Unplug
> > +=================
> > +
> > +.. note::
> > +   The following is the plan. Implementation is not there yet
> > +   (2020 May).
> > +
> > +Graphics devices (display and/or render) may be connected via USB (e.g.
> > +display adapters or docking stations) or Thunderbolt (e.g. eGPU). An end
> > +user is able to hot-unplug this kind of device while it is being
> > +used, and expects that at the very least the machine does not crash. Any
> > +damage from hot-unplugging a DRM device needs to be limited as much as
> > +possible and userspace must be given the chance to handle it if it wants
> > +to. Ideally, unplugging a DRM device still lets a desktop continue
> > +running, but that is going to need explicit support throughout the whole
> > +graphics stack: from kernel and userspace drivers, through display
> > +servers, via window system protocols, and in applications and libraries.
> 
> So, to support all the requirements in this document, kernel changes
> alone should be enough, and no changes are required from the user-mode
> part of the stack?

Hi,

my intention is that this document describes what the kernel delivers,
or should deliver, to allow userspace to cope with hot-unplug if
userspace wishes to do so. "Userspace" here includes the userspace part
of GPU drivers.

Userspace has a lot to develop to actually recover instead of just
sitting in the dark after the device disappears. Handling the uevent for
DRM device removal or errors from GL/Vulkan is just the beginning of it.
I would assume that userspace drivers have things to implement as well,
before GL or Vulkan apps can recover instead of getting stuck or
crashing.

Unplugging "secondary" DRM devices (mostly used for KMS to have more
monitors lit) should be relatively easy to implement in display
servers. Unplugging the GPU that a display server is using for
rendering is going to be really difficult and will need client
(application toolkit) support at the very least, and perhaps even a new
window system protocol.

I imagine this will be incremental development: first the kernel stops
crashing. Then display servers stop crashing. At some point userspace
GPU drivers stop crashing. Then display servers learn to recover
instead of sitting in the dark, but disconnect most of their clients.
Then maybe with the help of window system protocol additions, some major
toolkits learn to not get killed. And so on.

Once all that works, the follow-up step could be some protocol to
switch applications from one GPU to another in flight. That's off-topic
here, but being able to handle GPU unplug is half of the switch.

> > +
> > +Other scenarios that should lead to the same are: unrecoverable GPU
> > +crash, PCI device disappearing off the bus, or forced unbind of a driver
> > +from the physical device.
> > +
> > +In other words, from userspace perspective everything needs to keep on
> > +working more or less, until userspace stops using the disappeared DRM
> > +device and closes it completely. Userspace will learn of the device
> > +disappearance from the device removed uevent  
> 
> 
> Is this uevent already implemented? Can you point me to the code?

I can't point to any kernel code, I'm just not familiar with it. But
it's the same uevent all Linux devices emit. You unplug a USB mouse,
this is the event that gets sent.

You can emulate it with 'udevadm trigger -c remove' IIRC, and it is the
"remove" event you can match in udev rules.

The KMS hotplug event is also a uevent, but I think it is "change"
rather than "remove". Otherwise it is the same mechanism. Display
servers already watch for uevents to learn about monitor hotplug, and
some watch for DRM device added events too. But I don't think any really
watch for DRM device removed events, because usually everything explodes
first. I don't know, maybe X.org handles UDL unplugs?
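
To make the mechanism above a bit more concrete, here is a minimal
Python sketch of how a userspace program might pick out a DRM "remove"
uevent. It is only an illustration under stated assumptions, not what
any display server actually does: it reads raw kernel uevents from a
NETLINK_KOBJECT_UEVENT socket (multicast group 1) instead of going
through libudev, and the function names parse_uevent(),
is_drm_removal() and watch_drm_removals() are made up for this example.

```python
import socket

# Value from <linux/netlink.h>; defined here because the socket module
# may not export this constant.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(datagram: bytes) -> dict:
    """Split a raw kernel uevent datagram into its KEY=VALUE properties.

    Kernel uevents are NUL-separated: the first field is an
    "action@devpath" summary, the remaining fields are KEY=VALUE pairs.
    """
    props = {}
    for field in datagram.split(b"\0")[1:]:
        key, sep, value = field.partition(b"=")
        if sep:  # skips the empty field after the trailing NUL
            props[key.decode()] = value.decode()
    return props


def is_drm_removal(props: dict) -> bool:
    """True if the properties describe a removed DRM device node."""
    return props.get("ACTION") == "remove" and props.get("SUBSYSTEM") == "drm"


def watch_drm_removals():
    """Block forever, printing every DRM device removal (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    # Address is (pid, groups): pid 0 lets the kernel pick one, and
    # group 1 is the kernel's own uevent multicast group.
    sock.bind((0, 1))
    while True:
        props = parse_uevent(sock.recv(8192))
        if is_drm_removal(props):
            # DEVNAME is assumed to be present (e.g. "dri/card1");
            # fall back to DEVPATH if it is not.
            print("DRM device removed:",
                  props.get("DEVNAME", props.get("DEVPATH")))
```

Calling watch_drm_removals() starts the loop; the two helper functions
can be exercised on their own with a hand-written datagram. A real
display server would more likely use libudev and integrate the socket
into its event loop.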

> > or in some cases
> > +driver-specific ioctls returning EIO.
> > +
> > +- KMS connectors must change their status to disconnected.
> > +
> > +- Legacy modesets and pageflips fake success.
> > +
> > +- Atomic commits, both real and TEST_ONLY, fake success.
> > +
> > +- Pending non-blocking KMS operations deliver the DRM events userspace
> > +  is expecting.  
> 
> 
> The 4 points above refer to cards with mode setting or an attached
> display, and are irrelevant for a secondary GPU (e.g. the DRI-PRIME
> scenario) or for display-less systems in general. Maybe we can
> highlight this in the document somehow, so that on the implementing
> side I can decide to concentrate on the non-display case as a first,
> or the only, step. In general, and correct me if I am wrong,
> render-only (or compute-only) GPUs are the majority of cases where you
> would want to be able to detach/attach a GPU on the fly (e.g.
> attaching a stronger secondary graphics card to a laptop to get high
> performance in a game, or adding/removing a GPU to/from a compute
> cluster).

I do think KMS-only (not rendering) devices are a major use case for
hot-unplug: docks, USB display adapters etc. I wrote this patch on
behalf of DisplayLink after all.

Render-only GPUs are another important use case like you describe. And
a dock might perhaps have both: a powerful GPU and a big screen
connected.

Personally, I have no expectations other than a hope that some day at
least the drivers that support hot-unpluggable hardware would implement
all of this. :-)

I would assume it's fine to work piece by piece towards the goal at
your own pace. This patch here is just for setting up the goal without
a deadline. I'm no DRM maintainer or even a DRM developer.


Thanks,
pq
Pekka Paalanen May 25, 2020, 3:02 p.m. UTC | #4
On Mon, 25 May 2020 16:30:17 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Mon, May 25, 2020 at 09:51:30AM -0400, Andrey Grodzovsky wrote:
> > On 5/25/20 8:46 AM, Pekka Paalanen wrote:
> >   
> > > From: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > 
> > > Set up the expectations on how hot-unplugging a DRM device should look like to
> > > userspace.
> > > 
> > > Written by Daniel Vetter's request and largely based on his comments in IRC and
> > > from https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2020-May%2F265484.html&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Cc9676f35bbdf4d5a052808d800a9b517%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637260076178891269&amp;sdata=tbOTr7TfESohEgWspomM1sbMq4U4n7bOvdS6JlYifmM%3D&amp;reserved=0 .
> > > 
> > > Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Sean Paul <sean@poorly.run>
> > > Cc: Simon Ser <contact@emersion.fr>
> > > 
> > > ---
> > > 
> > > v2:
> > > - mmap reads/writes undefined (danvet)
> > > - make render ioctl behaviour driver-specific (danvet)
> > > - restructure the mmap paragraphs (danvet)
> > > - chardev minor notes (Simon)
> > > - open behaviour (danvet)
> > > - DRM leasing behaviour (danvet)
> > > - added links
> > > 
> > > Disclaimer: I am a userspace developer writing for other userspace developers.
> > > I took some liberties in defining what should happen without knowing what is
> > > actually possible or what existing drivers already implement.
> > > ---
> > >   Documentation/gpu/drm-uapi.rst | 102 +++++++++++++++++++++++++++++++++
> > >   1 file changed, 102 insertions(+)
> > > 
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 56fec6ed1ad8..520b8e640ad1 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -1,3 +1,5 @@
> > > +.. Copyright 2020 DisplayLink (UK) Ltd.
> > > +
> > >   ===================
> > >   Userland interfaces
> > >   ===================
> > > @@ -162,6 +164,106 @@ other hand, a driver requires shared state between clients which is
> > >   visible to user-space and accessible beyond open-file boundaries, they
> > >   cannot support render nodes.
> > > +Device Hot-Unplug
> > > +=================
> > > +
> > > +.. note::
> > > +   The following is the plan. Implementation is not there yet
> > > +   (2020 May).
> > > +
> > > +Graphics devices (display and/or render) may be connected via USB (e.g.
> > > +display adapters or docking stations) or Thunderbolt (e.g. eGPU). An end
> > > +user is able to hot-unplug this kind of devices while they are being
> > > +used, and expects that the very least the machine does not crash. Any
> > > +damage from hot-unplugging a DRM device needs to be limited as much as
> > > +possible and userspace must be given the chance to handle it if it wants
> > > +to. Ideally, unplugging a DRM device still lets a desktop to continue
> > > +running, but that is going to need explicit support throughout the whole
> > > +graphics stack: from kernel and userspace drivers, through display
> > > +servers, via window system protocols, and in applications and libraries.  
> > 
> > So to support all the requirements in this document only kernel changes
> > should be enough and no changes are required from user mode part of the
> > stack ?
> >   
> > > +
> > > +Other scenarios that should lead to the same are: unrecoverable GPU
> > > +crash, PCI device disappearing off the bus, or forced unbind of a driver
> > > +from the physical device.
> > > +
> > > +In other words, from userspace perspective everything needs to keep on
> > > +working more or less, until userspace stops using the disappeared DRM
> > > +device and closes it completely. Userspace will learn of the device
> > > +disappearance from the device removed uevent  
> > 
> > 
> > Is this uevent already implemented ? Can you point me to the code ?
> > 
> >   
> > > or in some cases
> > > +driver-specific ioctls returning EIO.
> > > +
> > > +Only after userspace has closed all relevant DRM device and dmabuf file
> > > +descriptors and removed all mmaps, the DRM driver can tear down its
> > > +instance for the device that no longer exists. If the same physical
> > > +device somehow comes back in the mean time, it shall be a new DRM
> > > +device.
> > > +
> > > +Similar to PIDs, chardev minor numbers are not recycled immediately. A
> > > +new DRM device always picks the next free minor number after the
> > > +previously allocated one, and wraps around when minor numbers are
> > > +exhausted.
> > > +
> > > +Requirements for UAPI
> > > +---------------------
> > > +
> > > +The goal raises at least the following requirements for the kernel and
> > > +drivers:
> > > +
> > > +- The kernel must not hang, crash or oops, no matter what userspace was
> > > +  in the middle of doing when the device disappeared.
> > > +
> > > +- All GPU jobs that can no longer run must have their fences
> > > +  force-signalled to avoid inflicting hangs on userspace.
> > > +
> > > +- KMS connectors must change their status to disconnected.
> > > +
> > > +- Legacy modesets and pageflips fake success.
> > > +
> > > +- Atomic commits, both real and TEST_ONLY, fake success.
> > > +
> > > +- Pending non-blocking KMS operations deliver the DRM events userspace
> > > +  is expecting.  
> > 
> > 
> > > The 4 points above apply to a card with mode setting/an attached display
> > > and are irrelevant for a secondary GPU (e.g. the DRI-PRIME scenario) or
> > > for a system with no display at all. Maybe we can highlight this in the
> > > document, so that on the implementing side I can decide to concentrate on
> > > the non-display case as a first step, or as the only step. In general, and
> > > correct me if I am wrong, render-only (or compute-only) GPUs are the
> > > majority of cases where you would want to detach/attach a GPU on the fly
> > > (e.g. attach a stronger secondary graphics card to a laptop to get high
> > > performance in a game, or add/remove a GPU to/from a compute cluster).
> 
> > Yeah, maybe splitting this up into a KMS section and a rendering/cross-driver
> > section (the dma-buf/fence stuff is relevant for both display and
> > rendering) would make some sense.

Is that really something that needs spelling out?

Hmm. I guess the unwritten assumption behind every "fake success" is
that the operation would have succeeded had the device not been unplugged.

Is the problem here that one might read this as needing to fake success
for things that could never have worked at all? Like KMS on a render-only
device?

The dmabuf items below have the wording.
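
To make that condition concrete, here is a rough sketch in C. Everything
in it is hypothetical (struct sketch_dev, sketch_modeset(), and the errno
values picked for the failure cases); it is not taken from any existing
driver and only illustrates "fake success only if the call would have
succeeded with the device still present":

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DRM device; not a real kernel structure. */
struct sketch_dev {
	bool unplugged; /* the underlying device has disappeared */
	bool has_kms;   /* false for a render-only device */
};

/*
 * Hypothetical modeset entry point. Returns 0 on (possibly faked)
 * success or a negative errno. The errno values are illustrative only.
 */
int sketch_modeset(const struct sketch_dev *dev, bool request_valid)
{
	assert(dev != NULL);

	/* Would have failed even with the device present: still fails. */
	if (!dev->has_kms)
		return -EOPNOTSUPP;
	if (!request_valid)
		return -EINVAL;

	/* Would have succeeded: fake success without touching hardware. */
	if (dev->unplugged)
		return 0;

	/* Device present: real hardware programming would happen here. */
	return 0;
}
```

With this reading, a render-only device never has to fake KMS success,
because the KMS call is rejected before the unplug check is reached.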

I think splitting stuff into KMS stuff, render stuff, KMS-and-render
stuff, cross-device stuff, and mmaps goes a bit far. Or do you expect a
lot more text in here? Maybe expanding each bullet point to a paragraph?


Thanks,
pq


> -Daniel
> 
> > 
> > Andrey
> > 
> >   
> > > +
> > > +- dmabufs which point to memory that has disappeared will continue to
> > > +  be successfully imported if the import would have succeeded before the
> > > +  disappearance.
> > > +
> > > +- Attempting to import a dmabuf to a disappeared device will succeed if
> > > +  it would have succeeded without the disappearance.
> > > +
> > > +- Some userspace APIs already define what should happen when the device
> > > +  disappears (OpenGL, GL ES: `GL_KHR_robustness`_; `Vulkan`_:
> > > +  VK_ERROR_DEVICE_LOST; etc.). DRM drivers are free to implement this
> > > +  behaviour the way they see best, e.g. returning failures in
> > > +  driver-specific ioctls and handling those in userspace drivers, or
> > > +  relying on uevents, and so on.
> > > +
> > > +- open() on a device node whose underlying device has disappeared will
> > > +  fail.
> > > +
> > > +- Attempting to create a DRM lease on a disappeared DRM device will
> > > +  fail. Existing DRM leases remain.
> > > +
> > > +.. _GL_KHR_robustness: https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_robustness.txt
> > > +.. _Vulkan: https://www.khronos.org/vulkan/
> > > +
> > > +Requirements for memory maps
> > > +----------------------------
> > > +
> > > +Memory maps have further requirements. If the underlying memory
> > > +disappears, the mmap is modified such that reads and writes will still
> > > +complete successfully but the result is undefined. This applies to both
> > > +userspace mmap()'d memory and memory pointed to by dmabuf which might be
> > > +mapped to other devices.
> > > +
> > > +Raising SIGBUS is not an option, because userspace cannot realistically
> > > +handle it.  Signal handlers are global, which makes them extremely
> > > +difficult to use correctly from libraries like those that Mesa produces.
> > > +Signal handlers are not composable: you can't have different handlers
> > > +for GPU1 and GPU2 from different vendors, and a third handler for
> > > +mmapped regular files.  Threads cause additional pain with signal
> > > +handling as well.
> > > +
> > >   .. _drm_driver_ioctl:
> > >   IOCTL Support on Device Nodes  
>

Patch

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 56fec6ed1ad8..520b8e640ad1 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -1,3 +1,5 @@ 
+.. Copyright 2020 DisplayLink (UK) Ltd.
+
 ===================
 Userland interfaces
 ===================
@@ -162,6 +164,106 @@  other hand, a driver requires shared state between clients which is
 visible to user-space and accessible beyond open-file boundaries, they
 cannot support render nodes.
 
+Device Hot-Unplug
+=================
+
+.. note::
+   The following is the plan. Implementation is not there yet
+   (2020 May).
+
+Graphics devices (display and/or render) may be connected via USB (e.g.
+display adapters or docking stations) or Thunderbolt (e.g. eGPU). An end
+user is able to hot-unplug this kind of device while it is being
+used, and expects that at the very least the machine does not crash. Any
+damage from hot-unplugging a DRM device needs to be limited as much as
+possible and userspace must be given the chance to handle it if it wants
+to. Ideally, unplugging a DRM device still lets a desktop continue
+running, but that is going to need explicit support throughout the whole
+graphics stack: from kernel and userspace drivers, through display
+servers, via window system protocols, and in applications and libraries.
+
+Other scenarios that should lead to the same are: unrecoverable GPU
+crash, PCI device disappearing off the bus, or forced unbind of a driver
+from the physical device.
+
+In other words, from the userspace perspective everything needs to keep
+on working more or less, until userspace stops using the disappeared DRM
+device and closes it completely. Userspace will learn of the device
+disappearance from the device removed uevent or, in some cases, from
+driver-specific ioctls returning EIO.
+
+Only after userspace has closed all relevant DRM device and dmabuf file
+descriptors and removed all mmaps can the DRM driver tear down its
+instance for the device that no longer exists. If the same physical
+device somehow comes back in the meantime, it shall be a new DRM
+device.
+
+Similar to PIDs, chardev minor numbers are not recycled immediately. A
+new DRM device always picks the next free minor number after the
+previously allocated one, and wraps around when minor numbers are
+exhausted.
+
+Requirements for UAPI
+---------------------
+
+The goal raises at least the following requirements for the kernel and
+drivers:
+
+- The kernel must not hang, crash or oops, no matter what userspace was
+  in the middle of doing when the device disappeared.
+
+- All GPU jobs that can no longer run must have their fences
+  force-signalled to avoid inflicting hangs on userspace.
+
+- KMS connectors must change their status to disconnected.
+
+- Legacy modesets and pageflips fake success.
+
+- Atomic commits, both real and TEST_ONLY, fake success.
+
+- Pending non-blocking KMS operations deliver the DRM events userspace
+  is expecting.
+
+- dmabufs which point to memory that has disappeared will continue to
+  be successfully imported if the import would have succeeded before the
+  disappearance.
+
+- Attempting to import a dmabuf to a disappeared device will succeed if
+  it would have succeeded without the disappearance.
+
+- Some userspace APIs already define what should happen when the device
+  disappears (OpenGL, GL ES: `GL_KHR_robustness`_; `Vulkan`_:
+  VK_ERROR_DEVICE_LOST; etc.). DRM drivers are free to implement this
+  behaviour the way they see best, e.g. returning failures in
+  driver-specific ioctls and handling those in userspace drivers, or
+  relying on uevents, and so on.
+
+- open() on a device node whose underlying device has disappeared will
+  fail.
+
+- Attempting to create a DRM lease on a disappeared DRM device will
+  fail. Existing DRM leases remain.
+
+.. _GL_KHR_robustness: https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_robustness.txt
+.. _Vulkan: https://www.khronos.org/vulkan/
+
+Requirements for memory maps
+----------------------------
+
+Memory maps have further requirements. If the underlying memory
+disappears, the mmap is modified such that reads and writes will still
+complete successfully but the result is undefined. This applies to both
+userspace mmap()'d memory and memory pointed to by dmabuf which might be
+mapped to other devices.
+
+Raising SIGBUS is not an option, because userspace cannot realistically
+handle it.  Signal handlers are global, which makes them extremely
+difficult to use correctly from libraries like those that Mesa produces.
+Signal handlers are not composable: you can't have different handlers
+for GPU1 and GPU2 from different vendors, and a third handler for
+mmapped regular files.  Threads cause additional pain with signal
+handling as well.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes