diff mbox series

drm: support up to 128 drm devices

Message ID 20230630115651.354849-1-James.Zhu@amd.com (mailing list archive)
State New, archived
Headers show
Series drm: support up to 128 drm devices | expand

Commit Message

James Zhu June 30, 2023, 11:56 a.m. UTC
From: Christian König <ckoenig.leichtzumerken@gmail.com>

This makes room for up to 128 DRM devices.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: James Zhu <James.Zhu@amd.com>
---
 drivers/gpu/drm/drm_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Simon Ser July 14, 2023, 10:31 a.m. UTC | #1
(cc Daniel Vetter and Pekka because this change has uAPI repercussions)

On Friday, June 30th, 2023 at 13:56, James Zhu <James.Zhu@amd.com> wrote:

> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> 
> This makes room for up to 128 DRM devices.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: James Zhu <James.Zhu@amd.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 73b845a75d52..0d55c64444f5 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -137,7 +137,7 @@ static int drm_minor_alloc(struct drm_device *dev, unsigned int type)
>  		r = idr_alloc(&drm_minors_idr,
>  			NULL,
>  			64 * type,
> -			64 * (type + 1),
> +			64 * (type + 2),

The type comes from this enum:

    enum drm_minor_type {
    	DRM_MINOR_PRIMARY,
    	DRM_MINOR_CONTROL,
    	DRM_MINOR_RENDER,
    	DRM_MINOR_ACCEL = 32,
    };

Before this patch, 0..63 are for primary, 64..127 are for control (never
exposed by the kernel), 128..191 are for render, 2048..2112 are for accel.
After this patch, 0..127 are for primary, 64..191 are for control (never
exposed by the kernel), 128..255 are for render, 2048..2176 are for accel.

I'm worried what might happen with old user-space, especially old libdrm.
When there are a lot of primary nodes, some would get detected as control
nodes, even if they aren't. Is this fine? This would at least be confusing
for information-gathering tools like drm_info. I'm not sure about other
consequences. Do we want to forever forbid the 64..127 range instead, so
that we have the guarantee that old libdrm never sees it?

I'm also worried about someone adding a new entry to the enum after
DRM_MINOR_RENDER (DRM_MINOR_ACCEL was specifically set to 32 so that new
node types could be added before). drm_minor_alloc() would blow up in this
case, because it'd use the 192..319 range, which overlaps with render.
I think a switch with explicit ranges (and WARN when an unknown type is
passed in) would be both easier to read and less risky.

>  			GFP_NOWAIT);
>  		spin_unlock_irqrestore(&drm_minor_lock, flags);
>  	}
> --
> 2.34.1
Simon Ser July 14, 2023, 10:49 a.m. UTC | #2
On Friday, July 14th, 2023 at 12:31, Simon Ser <contact@emersion.fr> wrote:

> Before this patch, 0..63 are for primary, 64..127 are for control (never
> exposed by the kernel), 128..191 are for render, 2048..2112 are for accel.
> After this patch, 0..127 are for primary, 64..191 are for control (never
> exposed by the kernel), 128..255 are for render, 2048..2176 are for accel.

Correction: reading the code, accel is handled separately.

Additional find: the kernel creates sysfs control nodes because old
user-space reads them to figure out whether a device is DRIVER_MODESET.
Emil Velikov July 17, 2023, 7:30 a.m. UTC | #3
On Fri, 14 Jul 2023 at 11:32, Simon Ser <contact@emersion.fr> wrote:
>
> (cc Daniel Vetter and Pekka because this change has uAPI repercussions)
>
> On Friday, June 30th, 2023 at 13:56, James Zhu <James.Zhu@amd.com> wrote:
>
> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >
> > This makes room for up to 128 DRM devices.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> > Signed-off-by: James Zhu <James.Zhu@amd.com>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 73b845a75d52..0d55c64444f5 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -137,7 +137,7 @@ static int drm_minor_alloc(struct drm_device *dev, unsigned int type)
> >               r = idr_alloc(&drm_minors_idr,
> >                       NULL,
> >                       64 * type,
> > -                     64 * (type + 1),
> > +                     64 * (type + 2),
>
> The type comes from this enum:
>
>     enum drm_minor_type {
>         DRM_MINOR_PRIMARY,
>         DRM_MINOR_CONTROL,
>         DRM_MINOR_RENDER,
>         DRM_MINOR_ACCEL = 32,
>     };
>
> Before this patch, 0..63 are for primary, 64..127 are for control (never
> exposed by the kernel), 128..191 are for render, 2048..2112 are for accel.
> After this patch, 0..127 are for primary, 64..191 are for control (never
> exposed by the kernel), 128..255 are for render, 2048..2176 are for accel.
>
> I'm worried what might happen with old user-space, especially old libdrm.

I also share the same concern. Although the bigger issue is not libdrm
- since we can update it and prod distributions to update it.
The biggest concern is software that cannot be rebuild/updated -
closed source and various open-source that has been abandoned.

As mentioned in the gitlab ticket - the current style of embedding the
uABI/uAPI in each and every application is not great IMHO. That is why
I've introduced the `drmGetDevices2` API, to consolidate most of the
chaos in a single place.

For going forward, here is one way we can shave this yak:
 - update libdrm to max 64 nodes
 - roll libdrm release, nag distributions to update to it // could be
folded with the next release below
 - update libdrm to use name driven type detection
 - roll libdrm release, nag distributions to update to it
 - once ^^ release hits most distributions, merge the above proposed
kernel patch
   - the commit message should explain the caveats and fixed libdrm version
   - we should be prepared to revert the commit, if it causes user
space regression - fix (see below) and re-introduce the kernel patch
1-2 releases later

In parallel to all the above, as a community we should collectively
audit and update open-source applications to the `drmDevices2` API.

As with other legacy (DRI1), this one will take some time but we can get there.

HTH
Emil
Simon Ser July 17, 2023, 9:45 a.m. UTC | #4
On Monday, July 17th, 2023 at 09:30, Emil Velikov <emil.l.velikov@gmail.com> wrote:

> > I'm worried what might happen with old user-space, especially old libdrm.
> 
> I also share the same concern. Although the bigger issue is not libdrm
> - since we can update it and prod distributions to update it.
> The biggest concern is software that cannot be rebuild/updated -
> closed source and various open-source that has been abandoned.

Well. Now that we have Flatpak and AppImage and friends, we're really likely
to have old libdrm copies vendored all over the place, and these will stick
around essentially forever.

> For going forward, here is one way we can shave this yak:
>  - update libdrm to max 64 nodes
>  - roll libdrm release, nag distributions to update to it // could be
> folded with the next release below
>  - update libdrm to use name driven type detection
>  - roll libdrm release, nag distributions to update to it
>  - once ^^ release hits most distributions, merge the above proposed
> kernel patch
>    - the commit message should explain the caveats and fixed libdrm version
>    - we should be prepared to revert the commit, if it causes user
> space regression - fix (see below) and re-introduce the kernel patch
> 1-2 releases later

That sounds really scary to me. I'd really prefer to try not to break the
kernel uAPI here.

The kernel rule is "do not break user-space".
Emil Velikov July 17, 2023, 1:24 p.m. UTC | #5
On Mon, 17 Jul 2023 at 10:45, Simon Ser <contact@emersion.fr> wrote:
>
> On Monday, July 17th, 2023 at 09:30, Emil Velikov <emil.l.velikov@gmail.com> wrote:
>
> > > I'm worried what might happen with old user-space, especially old libdrm.
> >
> > I also share the same concern. Although the bigger issue is not libdrm
> > - since we can update it and prod distributions to update it.
> > The biggest concern is software that cannot be rebuild/updated -
> > closed source and various open-source that has been abandoned.
>
> Well. Now that we have Flatpak and AppImage and friends, we're really likely
> to have old libdrm copies vendored all over the place, and these will stick
> around essentially forever.
>

The flatpak devs have been very helpful. So I'm pretty sure we can get
those updated - even for older flatpaks.
For AppImage, I have no experience.

> > For going forward, here is one way we can shave this yak:
> >  - update libdrm to max 64 nodes
> >  - roll libdrm release, nag distributions to update to it // could be
> > folded with the next release below
> >  - update libdrm to use name driven type detection
> >  - roll libdrm release, nag distributions to update to it
> >  - once ^^ release hits most distributions, merge the above proposed
> > kernel patch
> >    - the commit message should explain the caveats and fixed libdrm version
> >    - we should be prepared to revert the commit, if it causes user
> > space regression - fix (see below) and re-introduce the kernel patch
> > 1-2 releases later
>
> That sounds really scary to me. I'd really prefer to try not to break the
> kernel uAPI here.
>

With part in particular? Mind you I'm not particularly happy either,
since in essence it's like a controlled explosion.

> The kernel rule is "do not break user-space".

Yes, in a perfect world. In practice, there have been multiple kernel
changes breaking user-space. Some got reverted, some remained.
AFAICT the above will get us out of the sticky situation we're in with
the least amount of explosion.

If there is a concrete proposal, please go ahead and sorry if I've
missed it. I'm supposed to be off, having fun with family when I saw
this whole thing explode.

Small note: literally all the users I've seen will stop on a missing
node (card or render) aka if the kernel creates card0...63 and then
card200... then (hard wavy estimate) 80% of the apps will be broken.

HTH
Emil
Simon Ser July 17, 2023, 1:54 p.m. UTC | #6
On Monday, July 17th, 2023 at 15:24, Emil Velikov <emil.l.velikov@gmail.com> wrote:

> > > For going forward, here is one way we can shave this yak:
> > >  - update libdrm to max 64 nodes
> > >  - roll libdrm release, nag distributions to update to it // could be
> > > folded with the next release below
> > >  - update libdrm to use name driven type detection
> > >  - roll libdrm release, nag distributions to update to it
> > >  - once ^^ release hits most distributions, merge the above proposed
> > > kernel patch
> > >    - the commit message should explain the caveats and fixed libdrm version
> > >    - we should be prepared to revert the commit, if it causes user
> > > space regression - fix (see below) and re-introduce the kernel patch
> > > 1-2 releases later
> >
> > That sounds really scary to me. I'd really prefer to try not to break the
> > kernel uAPI here.
> >
> 
> With part in particular? Mind you I'm not particularly happy either,
> since in essence it's like a controlled explosion.

I believe there are ways to extend the uAPI to support more devices without
breaking the uAPI. Michał Winiarski's patch for instance tried something to
this effect.

> > The kernel rule is "do not break user-space".
> 
> Yes, in a perfect world. In practice, there have been multiple kernel
> changes breaking user-space. Some got reverted, some remained.
> AFAICT the above will get us out of the sticky situation we're in with
> the least amount of explosion.
> 
> If there is a concrete proposal, please go ahead and sorry if I've
> missed it. I'm supposed to be off, having fun with family when I saw
> this whole thing explode.
> 
> Small note: literally all the users I've seen will stop on a missing
> node (card or render) aka if the kernel creates card0...63 and then
> card200... then (hard wavy estimate) 80% of the apps will be broken.

That's fine, because that's not a kernel regression. Supporting more than 64
devices is a new kernel feature, and if some user-space ignores the new nodes,
that's not a kernel regression. A regression only happens when a use-case which
works with an older kernel is broken by a newer kernel.
Emil Velikov July 25, 2023, 9:21 a.m. UTC | #7
On Mon, 17 Jul 2023 at 14:54, Simon Ser <contact@emersion.fr> wrote:
>
> On Monday, July 17th, 2023 at 15:24, Emil Velikov <emil.l.velikov@gmail.com> wrote:
>
> > > > For going forward, here is one way we can shave this yak:
> > > >  - update libdrm to max 64 nodes
> > > >  - roll libdrm release, nag distributions to update to it // could be
> > > > folded with the next release below
> > > >  - update libdrm to use name driven type detection
> > > >  - roll libdrm release, nag distributions to update to it
> > > >  - once ^^ release hits most distributions, merge the above proposed
> > > > kernel patch
> > > >    - the commit message should explain the caveats and fixed libdrm version
> > > >    - we should be prepared to revert the commit, if it causes user
> > > > space regression - fix (see below) and re-introduce the kernel patch
> > > > 1-2 releases later
> > >
> > > That sounds really scary to me. I'd really prefer to try not to break the
> > > kernel uAPI here.
> > >
> >
> > With part in particular? Mind you I'm not particularly happy either,
> > since in essence it's like a controlled explosion.
>
> I believe there are ways to extend the uAPI to support more devices without
> breaking the uAPI. Michał Winiarski's patch for instance tried something to
> this effect.
>
> > > The kernel rule is "do not break user-space".
> >
> > Yes, in a perfect world. In practice, there have been multiple kernel
> > changes breaking user-space. Some got reverted, some remained.
> > AFAICT the above will get us out of the sticky situation we're in with
> > the least amount of explosion.
> >
> > If there is a concrete proposal, please go ahead and sorry if I've
> > missed it. I'm supposed to be off, having fun with family when I saw
> > this whole thing explode.
> >
> > Small note: literally all the users I've seen will stop on a missing
> > node (card or render) aka if the kernel creates card0...63 and then
> > card200... then (hard wavy estimate) 80% of the apps will be broken.
>
> That's fine, because that's not a kernel regression. Supporting more than 64
> devices is a new kernel feature, and if some user-space ignores the new nodes,
> that's not a kernel regression. A regression only happens when a use-case which
> works with an older kernel is broken by a newer kernel.

Won't this approach effectively hard-code/leak even more kernel uABI
into dozens of not hundreds of userspace projects? This does not sound
like a scalable solution IMHO.

I am 100% behind the "don't break userspace rule", alas very few
things in life are as black/white as your comments seem to suggest.
Thus I would suggest doing a bit of both or a compromise if you will.
Namely:
 - try the initial route outlined above
 - if there are (m)any fires, revert the kernel patch and opt for the
work by Michał

This has the benefit of fixing a bunch of the uABI abuses out there,
and leaking more uABI only on as-needed basis.

Side note: KDE folks have their own flatpak runtime and have been
quite open to backport libdrm/other fixes.

HTH
Emil
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 73b845a75d52..0d55c64444f5 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -137,7 +137,7 @@  static int drm_minor_alloc(struct drm_device *dev, unsigned int type)
 		r = idr_alloc(&drm_minors_idr,
 			NULL,
 			64 * type,
-			64 * (type + 1),
+			64 * (type + 2),
 			GFP_NOWAIT);
 		spin_unlock_irqrestore(&drm_minor_lock, flags);
 	}