diff mbox series

[01/13] drm/amdgpu: introduce and honour DRM_FORCE_AUTH workaround

Message ID 20190527081741.14235-1-emil.l.velikov@gmail.com (mailing list archive)
State New, archived
Headers show
Series [01/13] drm/amdgpu: introduce and honour DRM_FORCE_AUTH workaround | expand

Commit Message

Emil Velikov May 27, 2019, 8:17 a.m. UTC
From: Emil Velikov <emil.velikov@collabora.com>

Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
render node. A seemingly deliberate design decision.

Hence we can drop the DRM_AUTH all together (details in follow-up patch)
yet not all userspace checks if it's authenticated, but instead uses
uncommon assumptions.

After days of digging through git log and testing, only a single (ab)use
was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
assuming that failure implies lack of authentication.

Affected versions are:
 - the whole 18.2.x series, which is EOL
 - the whole 18.3.x series, which is EOL
 - the 19.0.x series, prior to 19.0.4

Add a special quirk for that case, thus we can drop DRM_AUTH bits as
mentioned earlier.

Since all the affected userspace is EOL, we also add a kconfig option
to disable this quirk.

The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
---
 drivers/gpu/drm/amd/amdgpu/Kconfig      | 16 ++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 12 +++++++++++-
 drivers/gpu/drm/drm_ioctl.c             |  5 +++++
 include/drm/drm_ioctl.h                 | 17 +++++++++++++++++
 4 files changed, 49 insertions(+), 1 deletion(-)

Comments

Christian König May 27, 2019, 10:47 a.m. UTC | #1
Am 27.05.19 um 10:17 schrieb Emil Velikov:
> From: Emil Velikov <emil.velikov@collabora.com>
>
> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> render node. A seemingly deliberate design decision.
>
> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> yet not all userspace checks if it's authenticated, but instead uses
> uncommon assumptions.
>
> After days of digging through git log and testing, only a single (ab)use
> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> assuming that failure implies lack of authentication.
>
> Affected versions are:
>   - the whole 18.2.x series, which is EOL
>   - the whole 18.3.x series, which is EOL
>   - the 19.0.x series, prior to 19.0.4

Well you could have saved your time, cause this is still a NAK.

For the record: I strongly think that we don't want to expose any render 
functionality on the primary node.

To even go a step further I would say that at least on AMD hardware we 
should completely disable DRI2 for one of the next generations.

As a follow up I would then completely disallow the DRM authentication 
for amdgpu, so that the command submission interface on the primary node 
can only be used by the display server.

Regards,
Christian.

>
> Add a special quirk for that case, thus we can drop DRM_AUTH bits as
> mentioned earlier.
>
> Since all the affected userspace is EOL, we also add a kconfig option
> to disable this quirk.
>
> The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/Kconfig      | 16 ++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 12 +++++++++++-
>   drivers/gpu/drm/drm_ioctl.c             |  5 +++++
>   include/drm/drm_ioctl.h                 | 17 +++++++++++++++++
>   4 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
> index 9221e5489069..da415f445187 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Kconfig
> +++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
> @@ -40,6 +40,22 @@ config DRM_AMDGPU_GART_DEBUGFS
>   	  Selecting this option creates a debugfs file to inspect the mapped
>   	  pages. Uses more memory for housekeeping, enable only for debugging.
>   
> +config DRM_AMDGPU_FORCE_AUTH
> +	bool "Force authentication check on AMDGPU_INFO ioctl"
> +	default y
> +	help
> +	  There were some version of the Mesa RADV drivers, which relied on
> +	  the ioctl failing, if the client is not authenticated.
> +
> +	  Namely, the following versions are affected:
> +	    - the whole 18.2.x series, which is EOL
> +	    - the whole 18.3.x series, which is EOL
> +	    - the 19.0.x series, prior to 19.0.4
> +
> +	  Modern distributions, should disable this. That will allow various
> +	  other clients to work, that would otherwise require root privileges.
> +
> +
>   source "drivers/gpu/drm/amd/acp/Kconfig"
>   source "drivers/gpu/drm/amd/display/Kconfig"
>   source "drivers/gpu/drm/amd/amdkfd/Kconfig"
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index b17d0545728e..b8076929440b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -1214,7 +1214,17 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
>   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_MMAP, amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_WAIT_IDLE, amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>   	DRM_IOCTL_DEF_DRV(AMDGPU_CS, amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> -	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> +	/* The DRM_FORCE_AUTH is effectively a workaround for the RADV Mesa driver.
> +	 * This is required for Mesa:
> +	 *  - the whole 18.2.x series, which is EOL
> +	 *  - the whole 18.3.x series, which is EOL
> +	 *  - the 19.0.x series, prior to 19.0.4
> +	 */
> +	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl,
> +#if defined(DRM_AMDGPU_FORCE_AUTH)
> +		DRM_FORCE_AUTH|
> +#endif
> +		DRM_AUTH|DRM_RENDER_ALLOW),
>   	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_CS, amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>   	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_FENCES, amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_METADATA, amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> index 2263e3ddd822..9841c0076f02 100644
> --- a/drivers/gpu/drm/drm_ioctl.c
> +++ b/drivers/gpu/drm/drm_ioctl.c
> @@ -544,6 +544,11 @@ int drm_ioctl_permit(u32 flags, struct drm_file *file_priv)
>   		     drm_is_render_client(file_priv)))
>   		return -EACCES;
>   
> +	/* FORCE_AUTH is only for authenticated or render client */
> +	if (unlikely((flags & DRM_FORCE_AUTH) && !drm_is_render_client(file_priv) &&
> +		     !file_priv->authenticated))
> +		return -EACCES;
> +
>   	return 0;
>   }
>   EXPORT_SYMBOL(drm_ioctl_permit);
> diff --git a/include/drm/drm_ioctl.h b/include/drm/drm_ioctl.h
> index fafb6f592c4b..6084ee32043d 100644
> --- a/include/drm/drm_ioctl.h
> +++ b/include/drm/drm_ioctl.h
> @@ -126,6 +126,23 @@ enum drm_ioctl_flags {
>   	 * not set DRM_AUTH because they do not require authentication.
>   	 */
>   	DRM_RENDER_ALLOW	= BIT(5),
> +	/**
> +	 * @DRM_FORCE_AUTH:
> +	 *
> +	 * Authentication of the primary node is mandatory. Regardless that the
> +	 * user can usually circumvent that by using the render node with exact
> +	 * same ioctl.
> +	 *
> +	 * Note: this is effectively a workaround for AMDGPU AMDGPU_INFO ioctl
> +	 * and the RADV Mesa driver. This is required for Mesa:
> +	 *  - the whole 18.2.x series, which is EOL
> +	 *  - the whole 18.3.x series, which is EOL
> +	 *  - the 19.0.x series, prior to 19.0.4
> +	 *
> +	 * Note: later patch will effectively drop the DRM_AUTH for ioctls
> +	 * annotated as DRM_AUTH | DRM_RENDER_ALLOW.
> +	 */
> +	DRM_FORCE_AUTH          = BIT(6),
>   };
>   
>   /**
Emil Velikov May 27, 2019, 12:05 p.m. UTC | #2
On 2019/05/27, Koenig, Christian wrote:
> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> > From: Emil Velikov <emil.velikov@collabora.com>
> >
> > Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > render node. A seemingly deliberate design decision.
> >
> > Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > yet not all userspace checks if it's authenticated, but instead uses
> > uncommon assumptions.
> >
> > After days of digging through git log and testing, only a single (ab)use
> > was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > assuming that failure implies lack of authentication.
> >
> > Affected versions are:
> >   - the whole 18.2.x series, which is EOL
> >   - the whole 18.3.x series, which is EOL
> >   - the 19.0.x series, prior to 19.0.4
> 
> Well you could have saved your time, cause this is still a NAK.
> 
Did I mention that I've spent quite a bit of time in AMDVLK? Even fixed 
a bug while I was there :-)

> For the record: I strongly think that we don't want to expose any render 
> functionality on the primary node.
> 
> To even go a step further I would say that at least on AMD hardware we 
> should completely disable DRI2 for one of the next generations.
>
It's doable and overall pretty neat idea.

There is a consern that this approach may cause far more regressions
that this series. We are talking about years worth of Mesa drivers (et
al) that depend on render functionality exposed via the primary node.

I'm OK with writing the patches, but it'll be up-to the AMDGPU team to
follow-up with any regressions. Are you ok with that?

Fwiw I could also move the FORCE_AUTH hack to core drm, if you prefer.

Thanks
Emil
Christian König May 27, 2019, 12:20 p.m. UTC | #3
Am 27.05.19 um 14:05 schrieb Emil Velikov:
> On 2019/05/27, Koenig, Christian wrote:
>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
>>> From: Emil Velikov <emil.velikov@collabora.com>
>>>
>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
>>> render node. A seemingly deliberate design decision.
>>>
>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
>>> yet not all userspace checks if it's authenticated, but instead uses
>>> uncommon assumptions.
>>>
>>> After days of digging through git log and testing, only a single (ab)use
>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
>>> assuming that failure implies lack of authentication.
>>>
>>> Affected versions are:
>>>    - the whole 18.2.x series, which is EOL
>>>    - the whole 18.3.x series, which is EOL
>>>    - the 19.0.x series, prior to 19.0.4
>> Well you could have saved your time, cause this is still a NAK.
>>
> Did I mention that I've spent quite a bit of time in AMDVLK? Even fixed
> a bug while I was there :-)

Yeah, thanks for doing this.

But we have done so much stuff with customers which can't be audited 
this way, that I still have a really bad feeling about this :/

>> For the record: I strongly think that we don't want to expose any render
>> functionality on the primary node.
>>
>> To even go a step further I would say that at least on AMD hardware we
>> should completely disable DRI2 for one of the next generations.
>>
> It's doable and overall pretty neat idea.
>
> There is a consern that this approach may cause far more regressions
> that this series. We are talking about years worth of Mesa drivers (et
> al) that depend on render functionality exposed via the primary node.

Yeah, that's I'm perfectly aware of. It's the reason why I said we 
should only do it for new hardware generations.

But at some point I think we should do this and get rid of 
authentication/flink/DRI2/....

> I'm OK with writing the patches, but it'll be up-to the AMDGPU team to
> follow-up with any regressions. Are you ok with that?

As long as we have a check like adev->family_id > 
WHATEVER_IS_THE_CURRENT_LATEST_UPSTREAM_HW, I think we are fine with that.

If I understood Michel correctly xf86-video-amdgpu should work, but 
modeset might break and needs a patch.

> Fwiw I could also move the FORCE_AUTH hack to core drm, if you prefer.

Well, the hack is the least of my concerns.

Thanks,
Christian.

>
> Thanks
> Emil
Emil Velikov May 27, 2019, 12:52 p.m. UTC | #4
On 2019/05/27, Koenig, Christian wrote:
> Am 27.05.19 um 14:05 schrieb Emil Velikov:
> > On 2019/05/27, Koenig, Christian wrote:
> >> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> >>> From: Emil Velikov <emil.velikov@collabora.com>
> >>>
> >>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> >>> render node. A seemingly deliberate design decision.
> >>>
> >>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> >>> yet not all userspace checks if it's authenticated, but instead uses
> >>> uncommon assumptions.
> >>>
> >>> After days of digging through git log and testing, only a eingle (ab)use
> >>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> >>> assuming that failure implies lack of authentication.
> >>>
> >>> Affected versions are:
> >>>    - the whole 18.2.x series, which is EOL
> >>>    - the whole 18.3.x series, which is EOL
> >>>    - the 19.0.x series, prior to 19.0.4
> >> Well you could have saved your time, cause this is still a NAK.
> >>
> > Did I mention that I've spent quite a bit of time in AMDVLK? Even fixed
> > a bug while I was there :-)
> 
> Yeah, thanks for doing this.
> 
> But we have done so much stuff with customers which can't be audited 
> this way, that I still have a really bad feeling about this :/
> 
Fair point, I've already thought about those.

Example A:  customer X, using closed source/private software Y.
Ioctls A, B and C require the old behaviour - simply add FORCE_AUTH to
the A B C and carry on happily.

Example B: the team, as a whole thinks that this will be problematic for
customer X - add FORCE_AUTH to all ioctls and carry on.

I do see and understand why anyone can be hesitant about the series.

IMHO the above examples, illustrate quite reasonable compromise between
open-source and closed-source/private solutions.

Don't you agree?

> >> For the record: I strongly think that we don't want to expose any render
> >> functionality on the primary node.
> >>
> >> To even go a step further I would say that at least on AMD hardware we
> >> should completely disable DRI2 for one of the next generations.
> >>
> > It's doable and overall pretty neat idea.
> >
> > There is a consern that this approach may cause far more regressions
> > that this series. We are talking about years worth of Mesa drivers (et
> > al) that depend on render functionality exposed via the primary node.
> 
> Yeah, that's I'm perfectly aware of. It's the reason why I said we 
> should only do it for new hardware generations.
> 
> But at some point I think we should do this and get rid of 
> authentication/flink/DRI2/....
> 
Fwiw I share a similar sentiment. If you've looked through the series,
this is effectively step 1, towards nuking DRM_AUTH :-)

> > I'm OK with writing the patches, but it'll be up-to the AMDGPU team to
> > follow-up with any regressions. Are you ok with that?
> 
> As long as we have a check like adev->family_id > 
> WHATEVER_IS_THE_CURRENT_LATEST_UPSTREAM_HW, I think we are fine with that.
> 
> If I understood Michel correctly xf86-video-amdgpu should work, but 
> modeset might break and needs a patch.
> 
Unless I have concrete WHATEVER_IS_THE... I cannot do much here :-(


Thanks
Emil
Daniel Vetter May 27, 2019, 1:11 p.m. UTC | #5
On Mon, May 27, 2019 at 09:17:29AM +0100, Emil Velikov wrote:
> From: Emil Velikov <emil.velikov@collabora.com>
> 
> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> render node. A seemingly deliberate design decision.
> 
> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> yet not all userspace checks if it's authenticated, but instead uses
> uncommon assumptions.
> 
> After days of digging through git log and testing, only a single (ab)use
> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> assuming that failure implies lack of authentication.

Maybe correction here: Id does not care about authentication at all. It
wants to figure out whether it has access to modeset resources, which is
something entirely different, but happened to match in amdgpu's case.

> Affected versions are:
>  - the whole 18.2.x series, which is EOL
>  - the whole 18.3.x series, which is EOL
>  - the 19.0.x series, prior to 19.0.4

Hm I think I'm blind, I'm not seeing the fix merged to mesa master. Still
there on gitlab:

https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/vulkan/radv_device.c#L291

What am I missing?

> Add a special quirk for that case, thus we can drop DRM_AUTH bits as
> mentioned earlier.
> 
> Since all the affected userspace is EOL, we also add a kconfig option
> to disable this quirk.
> 
> The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Signed-off-by: Emil Velikov <emil.velikov@collabora.com>

Aside from the nits I think reasonable approach.
-Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/Kconfig      | 16 ++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 12 +++++++++++-
>  drivers/gpu/drm/drm_ioctl.c             |  5 +++++
>  include/drm/drm_ioctl.h                 | 17 +++++++++++++++++
>  4 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
> index 9221e5489069..da415f445187 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Kconfig
> +++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
> @@ -40,6 +40,22 @@ config DRM_AMDGPU_GART_DEBUGFS
>  	  Selecting this option creates a debugfs file to inspect the mapped
>  	  pages. Uses more memory for housekeeping, enable only for debugging.
>  
> +config DRM_AMDGPU_FORCE_AUTH
> +	bool "Force authentication check on AMDGPU_INFO ioctl"
> +	default y
> +	help
> +	  There were some version of the Mesa RADV drivers, which relied on
> +	  the ioctl failing, if the client is not authenticated.
> +
> +	  Namely, the following versions are affected:
> +	    - the whole 18.2.x series, which is EOL
> +	    - the whole 18.3.x series, which is EOL
> +	    - the 19.0.x series, prior to 19.0.4
> +
> +	  Modern distributions, should disable this. That will allow various
> +	  other clients to work, that would otherwise require root privileges.
> +
> +
>  source "drivers/gpu/drm/amd/acp/Kconfig"
>  source "drivers/gpu/drm/amd/display/Kconfig"
>  source "drivers/gpu/drm/amd/amdkfd/Kconfig"
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index b17d0545728e..b8076929440b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -1214,7 +1214,17 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
>  	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_MMAP, amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>  	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_WAIT_IDLE, amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>  	DRM_IOCTL_DEF_DRV(AMDGPU_CS, amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> -	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> +	/* The DRM_FORCE_AUTH is effectively a workaround for the RADV Mesa driver.
> +	 * This is required for Mesa:
> +	 *  - the whole 18.2.x series, which is EOL
> +	 *  - the whole 18.3.x series, which is EOL
> +	 *  - the 19.0.x series, prior to 19.0.4
> +	 */
> +	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl,
> +#if defined(DRM_AMDGPU_FORCE_AUTH)
> +		DRM_FORCE_AUTH|
> +#endif
> +		DRM_AUTH|DRM_RENDER_ALLOW),
>  	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_CS, amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>  	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_FENCES, amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
>  	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_METADATA, amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> index 2263e3ddd822..9841c0076f02 100644
> --- a/drivers/gpu/drm/drm_ioctl.c
> +++ b/drivers/gpu/drm/drm_ioctl.c
> @@ -544,6 +544,11 @@ int drm_ioctl_permit(u32 flags, struct drm_file *file_priv)
>  		     drm_is_render_client(file_priv)))
>  		return -EACCES;
>  
> +	/* FORCE_AUTH is only for authenticated or render client */
> +	if (unlikely((flags & DRM_FORCE_AUTH) && !drm_is_render_client(file_priv) &&
> +		     !file_priv->authenticated))
> +		return -EACCES;
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(drm_ioctl_permit);
> diff --git a/include/drm/drm_ioctl.h b/include/drm/drm_ioctl.h
> index fafb6f592c4b..6084ee32043d 100644
> --- a/include/drm/drm_ioctl.h
> +++ b/include/drm/drm_ioctl.h
> @@ -126,6 +126,23 @@ enum drm_ioctl_flags {
>  	 * not set DRM_AUTH because they do not require authentication.
>  	 */
>  	DRM_RENDER_ALLOW	= BIT(5),
> +	/**
> +	 * @DRM_FORCE_AUTH:
> +	 *
> +	 * Authentication of the primary node is mandatory. Regardless that the
> +	 * user can usually circumvent that by using the render node with exact
> +	 * same ioctl.
> +	 *
> +	 * Note: this is effectively a workaround for AMDGPU AMDGPU_INFO ioctl
> +	 * and the RADV Mesa driver. This is required for Mesa:
> +	 *  - the whole 18.2.x series, which is EOL
> +	 *  - the whole 18.3.x series, which is EOL
> +	 *  - the 19.0.x series, prior to 19.0.4
> +	 *
> +	 * Note: later patch will effectively drop the DRM_AUTH for ioctls
> +	 * annotated as DRM_AUTH | DRM_RENDER_ALLOW.
> +	 */
> +	DRM_FORCE_AUTH          = BIT(6),
>  };
>  
>  /**
> -- 
> 2.21.0
>
Daniel Vetter May 27, 2019, 1:20 p.m. UTC | #6
On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> > From: Emil Velikov <emil.velikov@collabora.com>
> >
> > Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > render node. A seemingly deliberate design decision.
> >
> > Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > yet not all userspace checks if it's authenticated, but instead uses
> > uncommon assumptions.
> >
> > After days of digging through git log and testing, only a single (ab)use
> > was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > assuming that failure implies lack of authentication.
> >
> > Affected versions are:
> >   - the whole 18.2.x series, which is EOL
> >   - the whole 18.3.x series, which is EOL
> >   - the 19.0.x series, prior to 19.0.4
> 
> Well you could have saved your time, cause this is still a NAK.
> 
> For the record: I strongly think that we don't want to expose any render 
> functionality on the primary node.
> 
> To even go a step further I would say that at least on AMD hardware we 
> should completely disable DRI2 for one of the next generations.
> 
> As a follow up I would then completely disallow the DRM authentication 
> for amdgpu, so that the command submission interface on the primary node 
> can only be used by the display server.

So amdgpu is running in one direction, while everyone else is running in
the other direction? Please note that your patch to change i915 landed
already, so that ship is sailing (but we could ofc revert that back
again).

Imo really not a great idea if we do a amdgpu vs. everyone else split
here. If we want to deprecate dri2/flink/rendering on primary nodes across
the stack, then that should be a stack wide decision, not an amdgpu one.

Same for the other one, i.e. this stuff here.
-Daniel

> 
> Regards,
> Christian.
> 
> >
> > Add a special quirk for that case, thus we can drop DRM_AUTH bits as
> > mentioned earlier.
> >
> > Since all the affected userspace is EOL, we also add a kconfig option
> > to disable this quirk.
> >
> > The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT
> >
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: David Airlie <airlied@linux.ie>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/Kconfig      | 16 ++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 12 +++++++++++-
> >   drivers/gpu/drm/drm_ioctl.c             |  5 +++++
> >   include/drm/drm_ioctl.h                 | 17 +++++++++++++++++
> >   4 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
> > index 9221e5489069..da415f445187 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/Kconfig
> > +++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
> > @@ -40,6 +40,22 @@ config DRM_AMDGPU_GART_DEBUGFS
> >   	  Selecting this option creates a debugfs file to inspect the mapped
> >   	  pages. Uses more memory for housekeeping, enable only for debugging.
> >   
> > +config DRM_AMDGPU_FORCE_AUTH
> > +	bool "Force authentication check on AMDGPU_INFO ioctl"
> > +	default y
> > +	help
> > +	  There were some version of the Mesa RADV drivers, which relied on
> > +	  the ioctl failing, if the client is not authenticated.
> > +
> > +	  Namely, the following versions are affected:
> > +	    - the whole 18.2.x series, which is EOL
> > +	    - the whole 18.3.x series, which is EOL
> > +	    - the 19.0.x series, prior to 19.0.4
> > +
> > +	  Modern distributions, should disable this. That will allow various
> > +	  other clients to work, that would otherwise require root privileges.
> > +
> > +
> >   source "drivers/gpu/drm/amd/acp/Kconfig"
> >   source "drivers/gpu/drm/amd/display/Kconfig"
> >   source "drivers/gpu/drm/amd/amdkfd/Kconfig"
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > index b17d0545728e..b8076929440b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > @@ -1214,7 +1214,17 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_MMAP, amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_WAIT_IDLE, amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_CS, amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> > -	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> > +	/* The DRM_FORCE_AUTH is effectively a workaround for the RADV Mesa driver.
> > +	 * This is required for Mesa:
> > +	 *  - the whole 18.2.x series, which is EOL
> > +	 *  - the whole 18.3.x series, which is EOL
> > +	 *  - the 19.0.x series, prior to 19.0.4
> > +	 */
> > +	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl,
> > +#if defined(DRM_AMDGPU_FORCE_AUTH)
> > +		DRM_FORCE_AUTH|
> > +#endif
> > +		DRM_AUTH|DRM_RENDER_ALLOW),
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_CS, amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_FENCES, amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >   	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_METADATA, amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> > diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> > index 2263e3ddd822..9841c0076f02 100644
> > --- a/drivers/gpu/drm/drm_ioctl.c
> > +++ b/drivers/gpu/drm/drm_ioctl.c
> > @@ -544,6 +544,11 @@ int drm_ioctl_permit(u32 flags, struct drm_file *file_priv)
> >   		     drm_is_render_client(file_priv)))
> >   		return -EACCES;
> >   
> > +	/* FORCE_AUTH is only for authenticated or render client */
> > +	if (unlikely((flags & DRM_FORCE_AUTH) && !drm_is_render_client(file_priv) &&
> > +		     !file_priv->authenticated))
> > +		return -EACCES;
> > +
> >   	return 0;
> >   }
> >   EXPORT_SYMBOL(drm_ioctl_permit);
> > diff --git a/include/drm/drm_ioctl.h b/include/drm/drm_ioctl.h
> > index fafb6f592c4b..6084ee32043d 100644
> > --- a/include/drm/drm_ioctl.h
> > +++ b/include/drm/drm_ioctl.h
> > @@ -126,6 +126,23 @@ enum drm_ioctl_flags {
> >   	 * not set DRM_AUTH because they do not require authentication.
> >   	 */
> >   	DRM_RENDER_ALLOW	= BIT(5),
> > +	/**
> > +	 * @DRM_FORCE_AUTH:
> > +	 *
> > +	 * Authentication of the primary node is mandatory. Regardless that the
> > +	 * user can usually circumvent that by using the render node with exact
> > +	 * same ioctl.
> > +	 *
> > +	 * Note: this is effectively a workaround for AMDGPU AMDGPU_INFO ioctl
> > +	 * and the RADV Mesa driver. This is required for Mesa:
> > +	 *  - the whole 18.2.x series, which is EOL
> > +	 *  - the whole 18.3.x series, which is EOL
> > +	 *  - the 19.0.x series, prior to 19.0.4
> > +	 *
> > +	 * Note: later patch will effectively drop the DRM_AUTH for ioctls
> > +	 * annotated as DRM_AUTH | DRM_RENDER_ALLOW.
> > +	 */
> > +	DRM_FORCE_AUTH          = BIT(6),
> >   };
> >   
> >   /**
>
Emil Velikov May 27, 2019, 1:26 p.m. UTC | #7
On 2019/05/27, Daniel Vetter wrote:
> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
> > Am 27.05.19 um 10:17 schrieb Emil Velikov:
> > > From: Emil Velikov <emil.velikov@collabora.com>
> > >
> > > Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > > render node. A seemingly deliberate design decision.
> > >
> > > Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > > yet not all userspace checks if it's authenticated, but instead uses
> > > uncommon assumptions.
> > >
> > > After days of digging through git log and testing, only a single (ab)use
> > > was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > > assuming that failure implies lack of authentication.
> > >
> > > Affected versions are:
> > >   - the whole 18.2.x series, which is EOL
> > >   - the whole 18.3.x series, which is EOL
> > >   - the 19.0.x series, prior to 19.0.4
> > 
> > Well you could have saved your time, cause this is still a NAK.
> > 
> > For the record: I strongly think that we don't want to expose any render 
> > functionality on the primary node.
> > 
> > To even go a step further I would say that at least on AMD hardware we 
> > should completely disable DRI2 for one of the next generations.
> > 
> > As a follow up I would then completely disallow the DRM authentication 
> > for amdgpu, so that the command submission interface on the primary node 
> > can only be used by the display server.
> 
> So amdgpu is running in one direction, while everyone else is running in
> the other direction? Please note that your patch to change i915 landed
> already, so that ship is sailing (but we could ofc revert that back
> again).
> 
> Imo really not a great idea if we do a amdgpu vs. everyone else split
> here. If we want to deprecate dri2/flink/rendering on primary nodes across
> the stack, then that should be a stack wide decision, not an amdgpu one.
> 
Cannot agree more - I would love to see drivers stay consistent.

Fwiw, this series consistently paves the way toward nuking DRM_AUTH ;-)

Thanks
Emil
Daniel Vetter May 27, 2019, 1:26 p.m. UTC | #8
On Mon, May 27, 2019 at 01:52:05PM +0100, Emil Velikov wrote:
> On 2019/05/27, Koenig, Christian wrote:
> > Am 27.05.19 um 14:05 schrieb Emil Velikov:
> > > On 2019/05/27, Koenig, Christian wrote:
> > >> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> > >>> From: Emil Velikov <emil.velikov@collabora.com>
> > >>>
> > >>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > >>> render node. A seemingly deliberate design decision.
> > >>>
> > >>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > >>> yet not all userspace checks if it's authenticated, but instead uses
> > >>> uncommon assumptions.
> > >>>
> > >>> After days of digging through git log and testing, only a eingle (ab)use
> > >>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > >>> assuming that failure implies lack of authentication.
> > >>>
> > >>> Affected versions are:
> > >>>    - the whole 18.2.x series, which is EOL
> > >>>    - the whole 18.3.x series, which is EOL
> > >>>    - the 19.0.x series, prior to 19.0.4
> > >> Well you could have saved your time, cause this is still a NAK.
> > >>
> > > Did I mention that I've spent quite a bit of time in AMDVLK? Even fixed
> > > a bug while I was there :-)
> > 
> > Yeah, thanks for doing this.
> > 
> > But we have done so much stuff with customers which can't be audited 
> > this way, that I still have a really bad feeling about this :/
> > 
> Fair point, I've already thought about those.
> 
> Example A:  customer X, using closed source/private software Y.
> Ioctls A, B and C require the old behaviour - simply add FORCE_AUTH to
> the A B C and carry on happily.

Hm, if the entire concern here is all the bazillion different versions of
blobs shipped in the past few years, why can't we just have a revert of
this in the amdgpu DKMS? Not like one more patch among the hundres will
matter. I'd suspect that the overlap of people wanting to run the blobs
and people who don't run the DKMS or similar is roughly 0. Always been the
case here at Intel at least.

If there's other stuff that we need to audit (like rocm or whatever), then
we should look into those ofc.
-Daniel


> 
> Example B: the team, as a whole thinks that this will be problematic for
> customer X - add FORCE_AUTH to all ioctls and carry on.
> 
> I do see and understand why anyone can be hesitant about the series.
> 
> IMHO the above examples, illustrate quite reasonable compromise between
> open-source and closed-source/private solutions.
> 
> Don't you agree?
> 
> > >> For the record: I strongly think that we don't want to expose any render
> > >> functionality on the primary node.
> > >>
> > >> To even go a step further I would say that at least on AMD hardware we
> > >> should completely disable DRI2 for one of the next generations.
> > >>
> > > It's doable and overall pretty neat idea.
> > >
> > > There is a consern that this approach may cause far more regressions
> > > that this series. We are talking about years worth of Mesa drivers (et
> > > al) that depend on render functionality exposed via the primary node.
> > 
> > Yeah, that's I'm perfectly aware of. It's the reason why I said we 
> > should only do it for new hardware generations.
> > 
> > But at some point I think we should do this and get rid of 
> > authentication/flink/DRI2/....
> > 
> Fwiw I share a similar sentiment. If you've looked through the series,
> this is effectively step 1, towards nuking DRM_AUTH :-)
> 
> > > I'm OK with writing the patches, but it'll be up-to the AMDGPU team to
> > > follow-up with any regressions. Are you ok with that?
> > 
> > As long as we have a check like adev->family_id > 
> > WHATEVER_IS_THE_CURRENT_LATEST_UPSTREAM_HW, I think we are fine with that.
> > 
> > If I understood Michel correctly xf86-video-amdgpu should work, but 
> > modeset might break and needs a patch.
> > 
> Unless I have concrete WHATEVER_IS_THE... I cannot do much here :-(
> 
> 
> Thanks
> Emil
Daniel Vetter May 27, 2019, 1:34 p.m. UTC | #9
On Mon, May 27, 2019 at 3:26 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, May 27, 2019 at 01:52:05PM +0100, Emil Velikov wrote:
> > On 2019/05/27, Koenig, Christian wrote:
> > > Am 27.05.19 um 14:05 schrieb Emil Velikov:
> > > > On 2019/05/27, Koenig, Christian wrote:
> > > >> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> > > >>> From: Emil Velikov <emil.velikov@collabora.com>
> > > >>>
> > > >>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > > >>> render node. A seemingly deliberate design decision.
> > > >>>
> > > >>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > > >>> yet not all userspace checks if it's authenticated, but instead uses
> > > >>> uncommon assumptions.
> > > >>>
> > > >>> After days of digging through git log and testing, only a eingle (ab)use
> > > >>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > > >>> assuming that failure implies lack of authentication.
> > > >>>
> > > >>> Affected versions are:
> > > >>>    - the whole 18.2.x series, which is EOL
> > > >>>    - the whole 18.3.x series, which is EOL
> > > >>>    - the 19.0.x series, prior to 19.0.4
> > > >> Well you could have saved your time, cause this is still a NAK.
> > > >>
> > > > Did I mention that I've spent quite a bit of time in AMDVLK? Even fixed
> > > > a bug while I was there :-)
> > >
> > > Yeah, thanks for doing this.
> > >
> > > But we have done so much stuff with customers which can't be audited
> > > this way, that I still have a really bad feeling about this :/
> > >
> > Fair point, I've already thought about those.
> >
> > Example A:  customer X, using closed source/private software Y.
> > Ioctls A, B and C require the old behaviour - simply add FORCE_AUTH to
> > the A B C and carry on happily.
>
> Hm, if the entire concern here is all the bazillion different versions of
> blobs shipped in the past few years, why can't we just have a revert of
> this in the amdgpu DKMS? Not like one more patch among the hundres will
> matter. I'd suspect that the overlap of people wanting to run the blobs
> and people who don't run the DKMS or similar is roughly 0. Always been the
> case here at Intel at least.
>
> If there's other stuff that we need to audit (like rocm or whatever), then
> we should look into those ofc.

Also note that Emil's patch actually doesn't break anything, since
default y. So you don't even need the revert (except maybe in 10 years
or so when we throw that option out).
-Daniel


> > Example B: the team, as a whole thinks that this will be problematic for
> > customer X - add FORCE_AUTH to all ioctls and carry on.
> >
> > I do see and understand why anyone can be hesitant about the series.
> >
> > IMHO the above examples, illustrate quite reasonable compromise between
> > open-source and closed-source/private solutions.
> >
> > Don't you agree?
> >
> > > >> For the record: I strongly think that we don't want to expose any render
> > > >> functionality on the primary node.
> > > >>
> > > >> To even go a step further I would say that at least on AMD hardware we
> > > >> should completely disable DRI2 for one of the next generations.
> > > >>
> > > > It's doable and overall pretty neat idea.
> > > >
> > > > There is a consern that this approach may cause far more regressions
> > > > that this series. We are talking about years worth of Mesa drivers (et
> > > > al) that depend on render functionality exposed via the primary node.
> > >
> > > Yeah, that's I'm perfectly aware of. It's the reason why I said we
> > > should only do it for new hardware generations.
> > >
> > > But at some point I think we should do this and get rid of
> > > authentication/flink/DRI2/....
> > >
> > Fwiw I share a similar sentiment. If you've looked through the series,
> > this is effectively step 1, towards nuking DRM_AUTH :-)
> >
> > > > I'm OK with writing the patches, but it'll be up-to the AMDGPU team to
> > > > follow-up with any regressions. Are you ok with that?
> > >
> > > As long as we have a check like adev->family_id >
> > > WHATEVER_IS_THE_CURRENT_LATEST_UPSTREAM_HW, I think we are fine with that.
> > >
> > > If I understood Michel correctly xf86-video-amdgpu should work, but
> > > modeset might break and needs a patch.
> > >
> > Unless I have concrete WHATEVER_IS_THE... I cannot do much here :-(
> >
> >
> > Thanks
> > Emil
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Christian König May 27, 2019, 1:42 p.m. UTC | #10
Am 27.05.19 um 15:26 schrieb Emil Velikov:
> On 2019/05/27, Daniel Vetter wrote:
>> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
>>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
>>>> From: Emil Velikov <emil.velikov@collabora.com>
>>>>
>>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
>>>> render node. A seemingly deliberate design decision.
>>>>
>>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
>>>> yet not all userspace checks if it's authenticated, but instead uses
>>>> uncommon assumptions.
>>>>
>>>> After days of digging through git log and testing, only a single (ab)use
>>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
>>>> assuming that failure implies lack of authentication.
>>>>
>>>> Affected versions are:
>>>>    - the whole 18.2.x series, which is EOL
>>>>    - the whole 18.3.x series, which is EOL
>>>>    - the 19.0.x series, prior to 19.0.4
>>> Well you could have saved your time, cause this is still a NAK.
>>>
>>> For the record: I strongly think that we don't want to expose any render
>>> functionality on the primary node.
>>>
>>> To even go a step further I would say that at least on AMD hardware we
>>> should completely disable DRI2 for one of the next generations.
>>>
>>> As a follow up I would then completely disallow the DRM authentication
>>> for amdgpu, so that the command submission interface on the primary node
>>> can only be used by the display server.
>> So amdgpu is running in one direction, while everyone else is running in
>> the other direction? Please note that your patch to change i915 landed
>> already, so that ship is sailing (but we could ofc revert that back
>> again).
>>
>> Imo really not a great idea if we do a amdgpu vs. everyone else split
>> here. If we want to deprecate dri2/flink/rendering on primary nodes across
>> the stack, then that should be a stack wide decision, not an amdgpu one.
>>
> Cannot agree more - I would love to see drivers stay consistent.

Yeah, completely agree to that. That's why I think we should not do this 
at all and just let Intel fix it's userspace bugs :P

Anyway my concern is really that we should stop extending functionality 
on the primary node.

E.g. the render node is for use by the clients and the primary node for 
mode setting and use by the display server only.

Regards,
Christian.

> Fwiw, this series consistently paves the way toward nuking DRM_AUTH ;-)
>
> Thanks
> Emil
Emil Velikov May 27, 2019, 1:47 p.m. UTC | #11
On 2019/05/27, Daniel Vetter wrote:
> On Mon, May 27, 2019 at 09:17:29AM +0100, Emil Velikov wrote:
> > From: Emil Velikov <emil.velikov@collabora.com>
> > 
> > Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> > render node. A seemingly deliberate design decision.
> > 
> > Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> > yet not all userspace checks if it's authenticated, but instead uses
> > uncommon assumptions.
> > 
> > After days of digging through git log and testing, only a single (ab)use
> > was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> > assuming that failure implies lack of authentication.
> 
> Maybe correction here: Id does not care about authentication at all. It
> wants to figure out whether it has access to modeset resources, which is
> something entirely different, but happened to match in amdgpu's case.
> 
It feels like we have conflated the two (perhaps due to historical
reasons) and I'm not 100% sure which one the AMDGPU code inspects.

How about:

Hence we can drop the DRM_AUTH all together (details in follow-up patch)
yet that cause a regression in some userspace, when it conflates the
authentication status with access to modeset resources.

After days of digging through git log and testing, only a single (ab)use
was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl.

> > Affected versions are:
> >  - the whole 18.2.x series, which is EOL
> >  - the whole 18.3.x series, which is EOL
> >  - the 19.0.x series, prior to 19.0.4
> 
> Hm I think I'm blind, I'm not seeing the fix merged to mesa master. Still
> there on gitlab:
> 
> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/vulkan/radv_device.c#L291
> 
> What am I missing?
> 
Opted for a simple, more generic solution see commit
c962a78f18284e2971301bf68c9c60500ca398e4

This way, the same check is:
 - enforced where it's used
 - present for all Mesa Vulkan drivers


Will include the sha + commit title for v2.

> > Add a special quirk for that case, thus we can drop DRM_AUTH bits as
> > mentioned earlier.
> > 
> > Since all the affected userspace is EOL, we also add a kconfig option
> > to disable this quirk.
> > 
> > The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT
> > 
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: David Airlie <airlied@linux.ie>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
> 
> Aside from the nits I think reasonable approach.

Great, glad to hear. Now all we need is to bribe^Wconvince Christian.

Thanks
Emil
Daniel Vetter May 27, 2019, 3:26 p.m. UTC | #12
On Mon, May 27, 2019 at 3:42 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 27.05.19 um 15:26 schrieb Emil Velikov:
> > On 2019/05/27, Daniel Vetter wrote:
> >> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
> >>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> >>>> From: Emil Velikov <emil.velikov@collabora.com>
> >>>>
> >>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> >>>> render node. A seemingly deliberate design decision.
> >>>>
> >>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> >>>> yet not all userspace checks if it's authenticated, but instead uses
> >>>> uncommon assumptions.
> >>>>
> >>>> After days of digging through git log and testing, only a single (ab)use
> >>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> >>>> assuming that failure implies lack of authentication.
> >>>>
> >>>> Affected versions are:
> >>>>    - the whole 18.2.x series, which is EOL
> >>>>    - the whole 18.3.x series, which is EOL
> >>>>    - the 19.0.x series, prior to 19.0.4
> >>> Well you could have saved your time, cause this is still a NAK.
> >>>
> >>> For the record: I strongly think that we don't want to expose any render
> >>> functionality on the primary node.
> >>>
> >>> To even go a step further I would say that at least on AMD hardware we
> >>> should completely disable DRI2 for one of the next generations.
> >>>
> >>> As a follow up I would then completely disallow the DRM authentication
> >>> for amdgpu, so that the command submission interface on the primary node
> >>> can only be used by the display server.
> >> So amdgpu is running in one direction, while everyone else is running in
> >> the other direction? Please note that your patch to change i915 landed
> >> already, so that ship is sailing (but we could ofc revert that back
> >> again).
> >>
> >> Imo really not a great idea if we do a amdgpu vs. everyone else split
> >> here. If we want to deprecate dri2/flink/rendering on primary nodes across
> >> the stack, then that should be a stack wide decision, not an amdgpu one.
> >>
> > Cannot agree more - I would love to see drivers stay consistent.
>
> Yeah, completely agree to that. That's why I think we should not do this
> at all and just let Intel fix it's userspace bugs :P

So you're planning to submit that revert? You did jump the gun with
sending out that patch after all too ... (aside from it got merged
because of some other mixup with r-b tags and what not).
-Daniel

> Anyway my concern is really that we should stop extending functionality
> on the primary node.
>
> E.g. the render node is for use by the clients and the primary node for
> mode setting and use by the display server only.
>
> Regards,
> Christian.
>
> > Fwiw, this series consistently paves the way toward nuking DRM_AUTH ;-)
> >
> > Thanks
> > Emil
>
Emil Velikov May 27, 2019, 3:32 p.m. UTC | #13
On 2019/05/27, Koenig, Christian wrote:
> Am 27.05.19 um 15:26 schrieb Emil Velikov:
> > On 2019/05/27, Daniel Vetter wrote:
> >> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
> >>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> >>>> From: Emil Velikov <emil.velikov@collabora.com>
> >>>>
> >>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> >>>> render node. A seemingly deliberate design decision.
> >>>>
> >>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> >>>> yet not all userspace checks if it's authenticated, but instead uses
> >>>> uncommon assumptions.
> >>>>
> >>>> After days of digging through git log and testing, only a single (ab)use
> >>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> >>>> assuming that failure implies lack of authentication.
> >>>>
> >>>> Affected versions are:
> >>>>    - the whole 18.2.x series, which is EOL
> >>>>    - the whole 18.3.x series, which is EOL
> >>>>    - the 19.0.x series, prior to 19.0.4
> >>> Well you could have saved your time, cause this is still a NAK.
> >>>
> >>> For the record: I strongly think that we don't want to expose any render
> >>> functionality on the primary node.
> >>>
> >>> To even go a step further I would say that at least on AMD hardware we
> >>> should completely disable DRI2 for one of the next generations.
> >>>
> >>> As a follow up I would then completely disallow the DRM authentication
> >>> for amdgpu, so that the command submission interface on the primary node
> >>> can only be used by the display server.
> >> So amdgpu is running in one direction, while everyone else is running in
> >> the other direction? Please note that your patch to change i915 landed
> >> already, so that ship is sailing (but we could ofc revert that back
> >> again).
> >>
> >> Imo really not a great idea if we do a amdgpu vs. everyone else split
> >> here. If we want to deprecate dri2/flink/rendering on primary nodes across
> >> the stack, then that should be a stack wide decision, not an amdgpu one.
> >>
> > Cannot agree more - I would love to see drivers stay consistent.
> 
> Yeah, completely agree to that. That's why I think we should not do this 
> at all and just let Intel fix it's userspace bugs :P
> 
Pretty sure I mentioned it before - might have been too subtle:

The problem is _neither_ Intel nor libva specific.


> Anyway my concern is really that we should stop extending functionality 
> on the primary node.
> 
> E.g. the render node is for use by the clients and the primary node for 
> mode setting and use by the display server only.
> 
Fully agreed. I'm not extending anything really. If anything - code is
removed from drivers and core :-)

Thanks
Emil
Christian König May 28, 2019, 6:58 a.m. UTC | #14
Am 27.05.19 um 17:26 schrieb Daniel Vetter:
> On Mon, May 27, 2019 at 3:42 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 27.05.19 um 15:26 schrieb Emil Velikov:
>>> On 2019/05/27, Daniel Vetter wrote:
>>>> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
>>>>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
>>>>>> From: Emil Velikov <emil.velikov@collabora.com>
>>>>>>
>>>>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
>>>>>> render node. A seemingly deliberate design decision.
>>>>>>
>>>>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
>>>>>> yet not all userspace checks if it's authenticated, but instead uses
>>>>>> uncommon assumptions.
>>>>>>
>>>>>> After days of digging through git log and testing, only a single (ab)use
>>>>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
>>>>>> assuming that failure implies lack of authentication.
>>>>>>
>>>>>> Affected versions are:
>>>>>>     - the whole 18.2.x series, which is EOL
>>>>>>     - the whole 18.3.x series, which is EOL
>>>>>>     - the 19.0.x series, prior to 19.0.4
>>>>> Well you could have saved your time, cause this is still a NAK.
>>>>>
>>>>> For the record: I strongly think that we don't want to expose any render
>>>>> functionality on the primary node.
>>>>>
>>>>> To even go a step further I would say that at least on AMD hardware we
>>>>> should completely disable DRI2 for one of the next generations.
>>>>>
>>>>> As a follow up I would then completely disallow the DRM authentication
>>>>> for amdgpu, so that the command submission interface on the primary node
>>>>> can only be used by the display server.
>>>> So amdgpu is running in one direction, while everyone else is running in
>>>> the other direction? Please note that your patch to change i915 landed
>>>> already, so that ship is sailing (but we could ofc revert that back
>>>> again).
>>>>
>>>> Imo really not a great idea if we do a amdgpu vs. everyone else split
>>>> here. If we want to deprecate dri2/flink/rendering on primary nodes across
>>>> the stack, then that should be a stack wide decision, not an amdgpu one.
>>>>
>>> Cannot agree more - I would love to see drivers stay consistent.
>> Yeah, completely agree to that. That's why I think we should not do this
>> at all and just let Intel fix it's userspace bugs :P
> So you're planning to submit that revert? You did jump the gun with
> sending out that patch after all too ... (aside from it got merged
> because of some other mixup with r-b tags and what not).

Well already regretting submitting that. On the other hand what is the 
minimum IOCTLs we need to get working to fix the vainfo issue?

Might be a good idea looking into reverting it partially, so that at 
least command submission and buffer allocation is still blocked.

Christian.

> -Daniel
>
>> Anyway my concern is really that we should stop extending functionality
>> on the primary node.
>>
>> E.g. the render node is for use by the clients and the primary node for
>> mode setting and use by the display server only.
>>
>> Regards,
>> Christian.
>>
>>> Fwiw, this series consistently paves the way toward nuking DRM_AUTH ;-)
>>>
>>> Thanks
>>> Emil
>
Daniel Vetter May 28, 2019, 7:38 a.m. UTC | #15
On Tue, May 28, 2019 at 8:58 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 27.05.19 um 17:26 schrieb Daniel Vetter:
> > On Mon, May 27, 2019 at 3:42 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 27.05.19 um 15:26 schrieb Emil Velikov:
> >>> On 2019/05/27, Daniel Vetter wrote:
> >>>> On Mon, May 27, 2019 at 10:47:39AM +0000, Koenig, Christian wrote:
> >>>>> Am 27.05.19 um 10:17 schrieb Emil Velikov:
> >>>>>> From: Emil Velikov <emil.velikov@collabora.com>
> >>>>>>
> >>>>>> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> >>>>>> render node. A seemingly deliberate design decision.
> >>>>>>
> >>>>>> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> >>>>>> yet not all userspace checks if it's authenticated, but instead uses
> >>>>>> uncommon assumptions.
> >>>>>>
> >>>>>> After days of digging through git log and testing, only a single (ab)use
> >>>>>> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> >>>>>> assuming that failure implies lack of authentication.
> >>>>>>
> >>>>>> Affected versions are:
> >>>>>>     - the whole 18.2.x series, which is EOL
> >>>>>>     - the whole 18.3.x series, which is EOL
> >>>>>>     - the 19.0.x series, prior to 19.0.4
> >>>>> Well you could have saved your time, cause this is still a NAK.
> >>>>>
> >>>>> For the record: I strongly think that we don't want to expose any render
> >>>>> functionality on the primary node.
> >>>>>
> >>>>> To even go a step further I would say that at least on AMD hardware we
> >>>>> should completely disable DRI2 for one of the next generations.
> >>>>>
> >>>>> As a follow up I would then completely disallow the DRM authentication
> >>>>> for amdgpu, so that the command submission interface on the primary node
> >>>>> can only be used by the display server.
> >>>> So amdgpu is running in one direction, while everyone else is running in
> >>>> the other direction? Please note that your patch to change i915 landed
> >>>> already, so that ship is sailing (but we could ofc revert that back
> >>>> again).
> >>>>
> >>>> Imo really not a great idea if we do a amdgpu vs. everyone else split
> >>>> here. If we want to deprecate dri2/flink/rendering on primary nodes across
> >>>> the stack, then that should be a stack wide decision, not an amdgpu one.
> >>>>
> >>> Cannot agree more - I would love to see drivers stay consistent.
> >> Yeah, completely agree to that. That's why I think we should not do this
> >> at all and just let Intel fix it's userspace bugs :P
> > So you're planning to submit that revert? You did jump the gun with
> > sending out that patch after all too ... (aside from it got merged
> > because of some other mixup with r-b tags and what not).
>
> Well already regretting submitting that. On the other hand what is the
> minimum IOCTLs we need to get working to fix the vainfo issue?

We have a bit more time, it's only going to be in 5.3. But yeah if we
don't bottom out on any of the options here I think revert it has to
be.

> Might be a good idea looking into reverting it partially, so that at
> least command submission and buffer allocation is still blocked.

I thought the issue is a lot more than vainfo, it's pretty much every
hacked up compositor under the sun getting this wrong one way or
another. Thinking about this some more, I also have no idea how you'd
want to deprecate rendering on primary nodes in general. Apparently
that breaks -modesetting already, and probably lots more compositors.
And it looks like we're finally achieve the goal kms set out to 10
years ago, and new compositors are sprouting up all the time. I guess
we could just break them all (on new hardware) and tell them to all
suck it up. But I don't think that's a great option. And just
deprecating this on amdgpu is going to be even harder, since then
everywhere else it'll keep working, and it's just amdgpu.ko that looks
broken.

Aside: I'm not supporting Emil's idea here because it fixes any issues
Intel has - Intel doesn't care. I support it because reality sucks,
people get this render vs. primary vs. multi-gpu prime wrong all the
time (that's also why we have hardcoded display+gpu pairs in mesa for
the various soc combinations out there), and this looks like a
pragmatic solution. It'd be nice if every compositor and everything
else would perfectly support multi gpu and only use render nodes for
rendering, and only primary nodes for display. But reality is that
people hack on stuff until gears on screen and then move on to more
interesting things (to them). So I don't think we'll ever win this :-/
-Daniel

> Christian.
>
> > -Daniel
> >
> >> Anyway my concern is really that we should stop extending functionality
> >> on the primary node.
> >>
> >> E.g. the render node is for use by the clients and the primary node for
> >> mode setting and use by the display server only.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Fwiw, this series consistently paves the way toward nuking DRM_AUTH ;-)
> >>>
> >>> Thanks
> >>> Emil
> >
>
Christian König May 28, 2019, 8:03 a.m. UTC | #16
Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> [SNIP]
>> Might be a good idea looking into reverting it partially, so that at
>> least command submission and buffer allocation is still blocked.
> I thought the issue is a lot more than vainfo, it's pretty much every
> hacked up compositor under the sun getting this wrong one way or
> another. Thinking about this some more, I also have no idea how you'd
> want to deprecate rendering on primary nodes in general. Apparently
> that breaks -modesetting already, and probably lots more compositors.
> And it looks like we're finally achieve the goal kms set out to 10
> years ago, and new compositors are sprouting up all the time. I guess
> we could just break them all (on new hardware) and tell them to all
> suck it up. But I don't think that's a great option. And just
> deprecating this on amdgpu is going to be even harder, since then
> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> broken.
>
> Aside: I'm not supporting Emil's idea here because it fixes any issues
> Intel has - Intel doesn't care. I support it because reality sucks,
> people get this render vs. primary vs. multi-gpu prime wrong all the
> time (that's also why we have hardcoded display+gpu pairs in mesa for
> the various soc combinations out there), and this looks like a
> pragmatic solution. It'd be nice if every compositor and everything
> else would perfectly support multi gpu and only use render nodes for
> rendering, and only primary nodes for display. But reality is that
> people hack on stuff until gears on screen and then move on to more
> interesting things (to them). So I don't think we'll ever win this :-/

Yeah, but this is a classic case of working around user space issues by 
making kernel changes instead of fixing user space.

Having privileged (output control) and unprivileged (rendering control) 
functionality behind the same node is a mistake we have made a long time 
ago and render nodes finally seemed to be a way to fix that.

I mean why are compositors using the primary node in the first place? 
Because they want to have access to privileged resources I think and in 
this case it is perfectly ok to do so.

Now extending unprivileged access to the primary node actually sounds 
like a step into the wrong direction to me.

I rather think that we should go down the route of completely dropping 
command submission and buffer allocation through the primary node for 
non master clients. And then as next step at some point drop support for 
authentication/flink.

I mean we have done this with UMS as well and I don't see much other way 
to move forward and get rid of those ancient interface in the long term.

Regards,
Christian.

> -Daniel
Daniel Vetter May 28, 2019, 8:18 a.m. UTC | #17
On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> > [SNIP]
> >> Might be a good idea looking into reverting it partially, so that at
> >> least command submission and buffer allocation is still blocked.
> > I thought the issue is a lot more than vainfo, it's pretty much every
> > hacked up compositor under the sun getting this wrong one way or
> > another. Thinking about this some more, I also have no idea how you'd
> > want to deprecate rendering on primary nodes in general. Apparently
> > that breaks -modesetting already, and probably lots more compositors.
> > And it looks like we're finally achieve the goal kms set out to 10
> > years ago, and new compositors are sprouting up all the time. I guess
> > we could just break them all (on new hardware) and tell them to all
> > suck it up. But I don't think that's a great option. And just
> > deprecating this on amdgpu is going to be even harder, since then
> > everywhere else it'll keep working, and it's just amdgpu.ko that looks
> > broken.
> >
> > Aside: I'm not supporting Emil's idea here because it fixes any issues
> > Intel has - Intel doesn't care. I support it because reality sucks,
> > people get this render vs. primary vs. multi-gpu prime wrong all the
> > time (that's also why we have hardcoded display+gpu pairs in mesa for
> > the various soc combinations out there), and this looks like a
> > pragmatic solution. It'd be nice if every compositor and everything
> > else would perfectly support multi gpu and only use render nodes for
> > rendering, and only primary nodes for display. But reality is that
> > people hack on stuff until gears on screen and then move on to more
> > interesting things (to them). So I don't think we'll ever win this :-/
>
> Yeah, but this is a classic case of working around user space issues by
> making kernel changes instead of fixing user space.
>
> Having privileged (output control) and unprivileged (rendering control)
> functionality behind the same node is a mistake we have made a long time
> ago and render nodes finally seemed to be a way to fix that.
>
> I mean why are compositors using the primary node in the first place?
> Because they want to have access to privileged resources I think and in
> this case it is perfectly ok to do so.
>
> Now extending unprivileged access to the primary node actually sounds
> like a step into the wrong direction to me.
>
> I rather think that we should go down the route of completely dropping
> command submission and buffer allocation through the primary node for
> non master clients. And then as next step at some point drop support for
> authentication/flink.
>
> I mean we have done this with UMS as well and I don't see much other way
> to move forward and get rid of those ancient interface in the long term.

Well kms had some really good benefits that drove quick adoption, like
"suspend/resume actually has a chance of working" or "comes with
buffer management so you can run multiple gears".

The render node thing is a lot more niche use case (prime, better priv
separation), plus "it's cleaner design". And the "cleaner design" part
is something that empirically doesn't seem to matter :-/ Just two
examples:
- KHR_display/leases just iterated display resources on the fd needed
for rendering (and iirc there was even a patch to expose that for
render nodes too so it works with DRI3), because implementing
protocols is too hard. Barely managed to stop that one before it
happened.
- Various video players use the vblank ioctl on directly to schedule
frames, without telling the compositor. I discovered that when I
wanted to limite the vblank ioctl to master clients only. Again,
apparently too hard to use the existing extensions, or fix the bugs in
there, or whatever. One userspace got fixed last year, but it'll
probably get copypasted around forever :-/

So I don't think we'll ever manage to roll a clean split out, and best
we can do is give in and just hand userspace what it wants. As much as
that's misguided and unclean and all that. Maybe it'll result in a
least fewer stuff getting run as root to hack around this, because
fixing properly seems not to be on the table.

The beauty of kms is that we've achieved the mission, everyone's
writing their own thing. Which is also terrible, and I don't think
it'll get better.
-Daniel
Emil Velikov May 28, 2019, 4:10 p.m. UTC | #18
On 2019/05/28, Daniel Vetter wrote:
> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> >
> > Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> > > [SNIP]
> > >> Might be a good idea looking into reverting it partially, so that at
> > >> least command submission and buffer allocation is still blocked.
> > > I thought the issue is a lot more than vainfo, it's pretty much every
> > > hacked up compositor under the sun getting this wrong one way or
> > > another. Thinking about this some more, I also have no idea how you'd
> > > want to deprecate rendering on primary nodes in general. Apparently
> > > that breaks -modesetting already, and probably lots more compositors.
> > > And it looks like we're finally achieve the goal kms set out to 10
> > > years ago, and new compositors are sprouting up all the time. I guess
> > > we could just break them all (on new hardware) and tell them to all
> > > suck it up. But I don't think that's a great option. And just
> > > deprecating this on amdgpu is going to be even harder, since then
> > > everywhere else it'll keep working, and it's just amdgpu.ko that looks
> > > broken.
> > >
> > > Aside: I'm not supporting Emil's idea here because it fixes any issues
> > > Intel has - Intel doesn't care. I support it because reality sucks,
> > > people get this render vs. primary vs. multi-gpu prime wrong all the
> > > time (that's also why we have hardcoded display+gpu pairs in mesa for
> > > the various soc combinations out there), and this looks like a
> > > pragmatic solution. It'd be nice if every compositor and everything
> > > else would perfectly support multi gpu and only use render nodes for
> > > rendering, and only primary nodes for display. But reality is that
> > > people hack on stuff until gears on screen and then move on to more
> > > interesting things (to them). So I don't think we'll ever win this :-/
> >
> > Yeah, but this is a classic case of working around user space issues by
> > making kernel changes instead of fixing user space.
> >
> > Having privileged (output control) and unprivileged (rendering control)
> > functionality behind the same node is a mistake we have made a long time
> > ago and render nodes finally seemed to be a way to fix that.
> >
> > I mean why are compositors using the primary node in the first place?
> > Because they want to have access to privileged resources I think and in
> > this case it is perfectly ok to do so.
> >
> > Now extending unprivileged access to the primary node actually sounds
> > like a step into the wrong direction to me.
> >
> > I rather think that we should go down the route of completely dropping
> > command submission and buffer allocation through the primary node for
> > non master clients. And then as next step at some point drop support for
> > authentication/flink.
> >
> > I mean we have done this with UMS as well and I don't see much other way
> > to move forward and get rid of those ancient interface in the long term.
> 
> Well kms had some really good benefits that drove quick adoption, like
> "suspend/resume actually has a chance of working" or "comes with
> buffer management so you can run multiple gears".
> 
> The render node thing is a lot more niche use case (prime, better priv
> separation), plus "it's cleaner design". And the "cleaner design" part
> is something that empirically doesn't seem to matter :-/ Just two
> examples:
> - KHR_display/leases just iterated display resources on the fd needed
> for rendering (and iirc there was even a patch to expose that for
> render nodes too so it works with DRI3), because implementing
> protocols is too hard. Barely managed to stop that one before it
> happened.
> - Various video players use the vblank ioctl on directly to schedule
> frames, without telling the compositor. I discovered that when I
> wanted to limite the vblank ioctl to master clients only. Again,
> apparently too hard to use the existing extensions, or fix the bugs in
> there, or whatever. One userspace got fixed last year, but it'll
> probably get copypasted around forever :-/
> 
> So I don't think we'll ever manage to roll a clean split out, and best
> we can do is give in and just hand userspace what it wants. As much as
> that's misguided and unclean and all that. Maybe it'll result in a
> least fewer stuff getting run as root to hack around this, because
> fixing properly seems not to be on the table.
> 
> The beauty of kms is that we've achieved the mission, everyone's
> writing their own thing. Which is also terrible, and I don't think
> it'll get better.


With the risk of coming rude I will repeat my earlier comment:

The problem is _neither_ Intel nor libva specific.



That said, let's step back for a moment and consider:

 - the "block everything but KMS via the primary node" idea is great but
orthogonal

 - the series does address issues that are vendor-agnostic

 - by default this series does _not_ cause any regression be that for
new or old userspace

 - there are two trivial solutions, if the AMD team has concerns about
closed-source/private stack depending on the old behaviour
If they want I can even write the patches ;-)


That said, the notable comments received so far are:
 - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
handle. I'm OK but this will change the return code - from EACCES to
ENOSYS

 - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
planning to drop nearly all DRM_AUTH instances in their driver.


Christian, as mentioned before - this series does _not_ add
functionality to render nodes. It effectively paves a way towards
removing DRM_AUTH.

I understand the series may feel a bit dirty. Yet I would gladly address
any technical concerns you have.

Thanks
Emil
Christian König May 28, 2019, 4:22 p.m. UTC | #19
Am 28.05.19 um 18:10 schrieb Emil Velikov:
> On 2019/05/28, Daniel Vetter wrote:
>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
>> <Christian.Koenig@amd.com> wrote:
>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
>>>> [SNIP]
>>>>> Might be a good idea looking into reverting it partially, so that at
>>>>> least command submission and buffer allocation is still blocked.
>>>> I thought the issue is a lot more than vainfo, it's pretty much every
>>>> hacked up compositor under the sun getting this wrong one way or
>>>> another. Thinking about this some more, I also have no idea how you'd
>>>> want to deprecate rendering on primary nodes in general. Apparently
>>>> that breaks -modesetting already, and probably lots more compositors.
>>>> And it looks like we're finally achieve the goal kms set out to 10
>>>> years ago, and new compositors are sprouting up all the time. I guess
>>>> we could just break them all (on new hardware) and tell them to all
>>>> suck it up. But I don't think that's a great option. And just
>>>> deprecating this on amdgpu is going to be even harder, since then
>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
>>>> broken.
>>>>
>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
>>>> Intel has - Intel doesn't care. I support it because reality sucks,
>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
>>>> the various soc combinations out there), and this looks like a
>>>> pragmatic solution. It'd be nice if every compositor and everything
>>>> else would perfectly support multi gpu and only use render nodes for
>>>> rendering, and only primary nodes for display. But reality is that
>>>> people hack on stuff until gears on screen and then move on to more
>>>> interesting things (to them). So I don't think we'll ever win this :-/
>>> Yeah, but this is a classic case of working around user space issues by
>>> making kernel changes instead of fixing user space.
>>>
>>> Having privileged (output control) and unprivileged (rendering control)
>>> functionality behind the same node is a mistake we have made a long time
>>> ago and render nodes finally seemed to be a way to fix that.
>>>
>>> I mean why are compositors using the primary node in the first place?
>>> Because they want to have access to privileged resources I think and in
>>> this case it is perfectly ok to do so.
>>>
>>> Now extending unprivileged access to the primary node actually sounds
>>> like a step into the wrong direction to me.
>>>
>>> I rather think that we should go down the route of completely dropping
>>> command submission and buffer allocation through the primary node for
>>> non master clients. And then as next step at some point drop support for
>>> authentication/flink.
>>>
>>> I mean we have done this with UMS as well and I don't see much other way
>>> to move forward and get rid of those ancient interface in the long term.
>> Well kms had some really good benefits that drove quick adoption, like
>> "suspend/resume actually has a chance of working" or "comes with
>> buffer management so you can run multiple gears".
>>
>> The render node thing is a lot more niche use case (prime, better priv
>> separation), plus "it's cleaner design". And the "cleaner design" part
>> is something that empirically doesn't seem to matter :-/ Just two
>> examples:
>> - KHR_display/leases just iterated display resources on the fd needed
>> for rendering (and iirc there was even a patch to expose that for
>> render nodes too so it works with DRI3), because implementing
>> protocols is too hard. Barely managed to stop that one before it
>> happened.
>> - Various video players use the vblank ioctl on directly to schedule
>> frames, without telling the compositor. I discovered that when I
>> wanted to limite the vblank ioctl to master clients only. Again,
>> apparently too hard to use the existing extensions, or fix the bugs in
>> there, or whatever. One userspace got fixed last year, but it'll
>> probably get copypasted around forever :-/
>>
>> So I don't think we'll ever manage to roll a clean split out, and best
>> we can do is give in and just hand userspace what it wants. As much as
>> that's misguided and unclean and all that. Maybe it'll result in a
>> least fewer stuff getting run as root to hack around this, because
>> fixing properly seems not to be on the table.
>>
>> The beauty of kms is that we've achieved the mission, everyone's
>> writing their own thing. Which is also terrible, and I don't think
>> it'll get better.
>
> With the risk of coming rude I will repeat my earlier comment:
>
> The problem is _neither_ Intel nor libva specific.
>
>
>
> That said, let's step back for a moment and consider:
>
>   - the "block everything but KMS via the primary node" idea is great but
> orthogonal
>
>   - the series does address issues that are vendor-agnostic
>
>   - by default this series does _not_ cause any regression be that for
> new or old userspace
>
>   - there are two trivial solutions, if the AMD team has concerns about
> closed-source/private stack depending on the old behaviour
> If they want I can even write the patches ;-)
>
>
> That said, the notable comments received so far are:
>   - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> handle. I'm OK but this will change the return code - from EACCES to
> ENOSYS
>
>   - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> planning to drop nearly all DRM_AUTH instances in their driver.
>
>
> Christian, as mentioned before - this series does _not_ add
> functionality to render nodes. It effectively paves a way towards
> removing DRM_AUTH.

But it adds functionality to the primary node.

> I understand the series may feel a bit dirty. Yet I would gladly address
> any technical concerns you have.

Well putting compatibility issues aside my concern is that this is 
simply a bad design decision which we can't revert later on.

Because of this I will still reject any changes trying to remove 
DRM_AUTH from AMD drivers.

If that means that AMD drivers will have a behavior difference to other 
drivers I think we are going to live with that.

Regards,
Christian.

>
> Thanks
> Emil
Emil Velikov May 28, 2019, 4:46 p.m. UTC | #20
On 2019/05/28, Koenig, Christian wrote:
> Am 28.05.19 um 18:10 schrieb Emil Velikov:
> > On 2019/05/28, Daniel Vetter wrote:
> >> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> >> <Christian.Koenig@amd.com> wrote:
> >>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> >>>> [SNIP]
> >>>>> Might be a good idea looking into reverting it partially, so that at
> >>>>> least command submission and buffer allocation is still blocked.
> >>>> I thought the issue is a lot more than vainfo, it's pretty much every
> >>>> hacked up compositor under the sun getting this wrong one way or
> >>>> another. Thinking about this some more, I also have no idea how you'd
> >>>> want to deprecate rendering on primary nodes in general. Apparently
> >>>> that breaks -modesetting already, and probably lots more compositors.
> >>>> And it looks like we're finally achieve the goal kms set out to 10
> >>>> years ago, and new compositors are sprouting up all the time. I guess
> >>>> we could just break them all (on new hardware) and tell them to all
> >>>> suck it up. But I don't think that's a great option. And just
> >>>> deprecating this on amdgpu is going to be even harder, since then
> >>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> >>>> broken.
> >>>>
> >>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> >>>> Intel has - Intel doesn't care. I support it because reality sucks,
> >>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> >>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> >>>> the various soc combinations out there), and this looks like a
> >>>> pragmatic solution. It'd be nice if every compositor and everything
> >>>> else would perfectly support multi gpu and only use render nodes for
> >>>> rendering, and only primary nodes for display. But reality is that
> >>>> people hack on stuff until gears on screen and then move on to more
> >>>> interesting things (to them). So I don't think we'll ever win this :-/
> >>> Yeah, but this is a classic case of working around user space issues by
> >>> making kernel changes instead of fixing user space.
> >>>
> >>> Having privileged (output control) and unprivileged (rendering control)
> >>> functionality behind the same node is a mistake we have made a long time
> >>> ago and render nodes finally seemed to be a way to fix that.
> >>>
> >>> I mean why are compositors using the primary node in the first place?
> >>> Because they want to have access to privileged resources I think and in
> >>> this case it is perfectly ok to do so.
> >>>
> >>> Now extending unprivileged access to the primary node actually sounds
> >>> like a step into the wrong direction to me.
> >>>
> >>> I rather think that we should go down the route of completely dropping
> >>> command submission and buffer allocation through the primary node for
> >>> non master clients. And then as next step at some point drop support for
> >>> authentication/flink.
> >>>
> >>> I mean we have done this with UMS as well and I don't see much other way
> >>> to move forward and get rid of those ancient interface in the long term.
> >> Well kms had some really good benefits that drove quick adoption, like
> >> "suspend/resume actually has a chance of working" or "comes with
> >> buffer management so you can run multiple gears".
> >>
> >> The render node thing is a lot more niche use case (prime, better priv
> >> separation), plus "it's cleaner design". And the "cleaner design" part
> >> is something that empirically doesn't seem to matter :-/ Just two
> >> examples:
> >> - KHR_display/leases just iterated display resources on the fd needed
> >> for rendering (and iirc there was even a patch to expose that for
> >> render nodes too so it works with DRI3), because implementing
> >> protocols is too hard. Barely managed to stop that one before it
> >> happened.
> >> - Various video players use the vblank ioctl on directly to schedule
> >> frames, without telling the compositor. I discovered that when I
> >> wanted to limite the vblank ioctl to master clients only. Again,
> >> apparently too hard to use the existing extensions, or fix the bugs in
> >> there, or whatever. One userspace got fixed last year, but it'll
> >> probably get copypasted around forever :-/
> >>
> >> So I don't think we'll ever manage to roll a clean split out, and best
> >> we can do is give in and just hand userspace what it wants. As much as
> >> that's misguided and unclean and all that. Maybe it'll result in a
> >> least fewer stuff getting run as root to hack around this, because
> >> fixing properly seems not to be on the table.
> >>
> >> The beauty of kms is that we've achieved the mission, everyone's
> >> writing their own thing. Which is also terrible, and I don't think
> >> it'll get better.
> >
> > With the risk of coming rude I will repeat my earlier comment:
> >
> > The problem is _neither_ Intel nor libva specific.
> >
> >
> >
> > That said, let's step back for a moment and consider:
> >
> >   - the "block everything but KMS via the primary node" idea is great but
> > orthogonal
> >
> >   - the series does address issues that are vendor-agnostic
> >
> >   - by default this series does _not_ cause any regression be that for
> > new or old userspace
> >
> >   - there are two trivial solutions, if the AMD team has concerns about
> > closed-source/private stack depending on the old behaviour
> > If they want I can even write the patches ;-)
> >
> >
> > That said, the notable comments received so far are:
> >   - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> > handle. I'm OK but this will change the return code - from EACCES to
> > ENOSYS
> >
> >   - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> > planning to drop nearly all DRM_AUTH instances in their driver.
> >
> >
> > Christian, as mentioned before - this series does _not_ add
> > functionality to render nodes. It effectively paves a way towards
> > removing DRM_AUTH.
> 
> But it adds functionality to the primary node.
> 
Behaviour is adjusted - functionality was there since day 1.

> > I understand the series may feel a bit dirty. Yet I would gladly address
> > any technical concerns you have.
> 
> Well putting compatibility issues aside my concern is that this is 
> simply a bad design decision which we can't revert later on.
> 
As sad above - any concerns (theoretical or actual regressions) can be
trivially fixed _without_ reverting any of this.

I am more than happy to step up and address any regressions in timely
manner.


As a reminder without this series, some of your customers are forced to
run their applications as root.


Thanks
Emil
Dave Airlie May 28, 2019, 8:05 p.m. UTC | #21
On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
>
> On 2019/05/28, Koenig, Christian wrote:
> > Am 28.05.19 um 18:10 schrieb Emil Velikov:
> > > On 2019/05/28, Daniel Vetter wrote:
> > >> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> > >> <Christian.Koenig@amd.com> wrote:
> > >>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> > >>>> [SNIP]
> > >>>>> Might be a good idea looking into reverting it partially, so that at
> > >>>>> least command submission and buffer allocation is still blocked.
> > >>>> I thought the issue is a lot more than vainfo, it's pretty much every
> > >>>> hacked up compositor under the sun getting this wrong one way or
> > >>>> another. Thinking about this some more, I also have no idea how you'd
> > >>>> want to deprecate rendering on primary nodes in general. Apparently
> > >>>> that breaks -modesetting already, and probably lots more compositors.
> > >>>> And it looks like we're finally achieve the goal kms set out to 10
> > >>>> years ago, and new compositors are sprouting up all the time. I guess
> > >>>> we could just break them all (on new hardware) and tell them to all
> > >>>> suck it up. But I don't think that's a great option. And just
> > >>>> deprecating this on amdgpu is going to be even harder, since then
> > >>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> > >>>> broken.
> > >>>>
> > >>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> > >>>> Intel has - Intel doesn't care. I support it because reality sucks,
> > >>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> > >>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> > >>>> the various soc combinations out there), and this looks like a
> > >>>> pragmatic solution. It'd be nice if every compositor and everything
> > >>>> else would perfectly support multi gpu and only use render nodes for
> > >>>> rendering, and only primary nodes for display. But reality is that
> > >>>> people hack on stuff until gears on screen and then move on to more
> > >>>> interesting things (to them). So I don't think we'll ever win this :-/
> > >>> Yeah, but this is a classic case of working around user space issues by
> > >>> making kernel changes instead of fixing user space.
> > >>>
> > >>> Having privileged (output control) and unprivileged (rendering control)
> > >>> functionality behind the same node is a mistake we have made a long time
> > >>> ago and render nodes finally seemed to be a way to fix that.
> > >>>
> > >>> I mean why are compositors using the primary node in the first place?
> > >>> Because they want to have access to privileged resources I think and in
> > >>> this case it is perfectly ok to do so.
> > >>>
> > >>> Now extending unprivileged access to the primary node actually sounds
> > >>> like a step into the wrong direction to me.
> > >>>
> > >>> I rather think that we should go down the route of completely dropping
> > >>> command submission and buffer allocation through the primary node for
> > >>> non master clients. And then as next step at some point drop support for
> > >>> authentication/flink.
> > >>>
> > >>> I mean we have done this with UMS as well and I don't see much other way
> > >>> to move forward and get rid of those ancient interface in the long term.
> > >> Well kms had some really good benefits that drove quick adoption, like
> > >> "suspend/resume actually has a chance of working" or "comes with
> > >> buffer management so you can run multiple gears".
> > >>
> > >> The render node thing is a lot more niche use case (prime, better priv
> > >> separation), plus "it's cleaner design". And the "cleaner design" part
> > >> is something that empirically doesn't seem to matter :-/ Just two
> > >> examples:
> > >> - KHR_display/leases just iterated display resources on the fd needed
> > >> for rendering (and iirc there was even a patch to expose that for
> > >> render nodes too so it works with DRI3), because implementing
> > >> protocols is too hard. Barely managed to stop that one before it
> > >> happened.
> > >> - Various video players use the vblank ioctl on directly to schedule
> > >> frames, without telling the compositor. I discovered that when I
> > >> wanted to limite the vblank ioctl to master clients only. Again,
> > >> apparently too hard to use the existing extensions, or fix the bugs in
> > >> there, or whatever. One userspace got fixed last year, but it'll
> > >> probably get copypasted around forever :-/
> > >>
> > >> So I don't think we'll ever manage to roll a clean split out, and best
> > >> we can do is give in and just hand userspace what it wants. As much as
> > >> that's misguided and unclean and all that. Maybe it'll result in a
> > >> least fewer stuff getting run as root to hack around this, because
> > >> fixing properly seems not to be on the table.
> > >>
> > >> The beauty of kms is that we've achieved the mission, everyone's
> > >> writing their own thing. Which is also terrible, and I don't think
> > >> it'll get better.
> > >
> > > With the risk of coming rude I will repeat my earlier comment:
> > >
> > > The problem is _neither_ Intel nor libva specific.
> > >
> > >
> > >
> > > That said, let's step back for a moment and consider:
> > >
> > >   - the "block everything but KMS via the primary node" idea is great but
> > > orthogonal
> > >
> > >   - the series does address issues that are vendor-agnostic
> > >
> > >   - by default this series does _not_ cause any regression be that for
> > > new or old userspace
> > >
> > >   - there are two trivial solutions, if the AMD team has concerns about
> > > closed-source/private stack depending on the old behaviour
> > > If they want I can even write the patches ;-)
> > >
> > >
> > > That said, the notable comments received so far are:
> > >   - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> > > handle. I'm OK but this will change the return code - from EACCES to
> > > ENOSYS
> > >
> > >   - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> > > planning to drop nearly all DRM_AUTH instances in their driver.
> > >
> > >
> > > Christian, as mentioned before - this series does _not_ add
> > > functionality to render nodes. It effectively paves a way towards
> > > removing DRM_AUTH.
> >
> > But it adds functionality to the primary node.
> >
> Behaviour is adjusted - functionality was there since day 1.
>
> > > I understand the series may feel a bit dirty. Yet I would gladly address
> > > any technical concerns you have.
> >
> > Well putting compatibility issues aside my concern is that this is
> > simply a bad design decision which we can't revert later on.
> >
> As sad above - any concerns (theoretical or actual regressions) can be
> trivially fixed _without_ reverting any of this.
>
> I am more than happy to step up and address any regressions in timely
> manner.
>
>
> As a reminder without this series, some of your customers are forced to
> run their applications as root.

I'm torn here on whether this is worth it. Have we got more use cases
to justify it?

I'm wary of opening this up just because we can.

Dave.
Emil Velikov May 29, 2019, 1:03 p.m. UTC | #22
On 2019/05/29, Dave Airlie wrote:
> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
> >
> > On 2019/05/28, Koenig, Christian wrote:
> > > Am 28.05.19 um 18:10 schrieb Emil Velikov:
> > > > On 2019/05/28, Daniel Vetter wrote:
> > > >> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> > > >> <Christian.Koenig@amd.com> wrote:
> > > >>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> > > >>>> [SNIP]
> > > >>>>> Might be a good idea looking into reverting it partially, so that at
> > > >>>>> least command submission and buffer allocation is still blocked.
> > > >>>> I thought the issue is a lot more than vainfo, it's pretty much every
> > > >>>> hacked up compositor under the sun getting this wrong one way or
> > > >>>> another. Thinking about this some more, I also have no idea how you'd
> > > >>>> want to deprecate rendering on primary nodes in general. Apparently
> > > >>>> that breaks -modesetting already, and probably lots more compositors.
> > > >>>> And it looks like we're finally achieve the goal kms set out to 10
> > > >>>> years ago, and new compositors are sprouting up all the time. I guess
> > > >>>> we could just break them all (on new hardware) and tell them to all
> > > >>>> suck it up. But I don't think that's a great option. And just
> > > >>>> deprecating this on amdgpu is going to be even harder, since then
> > > >>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> > > >>>> broken.
> > > >>>>
> > > >>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> > > >>>> Intel has - Intel doesn't care. I support it because reality sucks,
> > > >>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> > > >>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> > > >>>> the various soc combinations out there), and this looks like a
> > > >>>> pragmatic solution. It'd be nice if every compositor and everything
> > > >>>> else would perfectly support multi gpu and only use render nodes for
> > > >>>> rendering, and only primary nodes for display. But reality is that
> > > >>>> people hack on stuff until gears on screen and then move on to more
> > > >>>> interesting things (to them). So I don't think we'll ever win this :-/
> > > >>> Yeah, but this is a classic case of working around user space issues by
> > > >>> making kernel changes instead of fixing user space.
> > > >>>
> > > >>> Having privileged (output control) and unprivileged (rendering control)
> > > >>> functionality behind the same node is a mistake we have made a long time
> > > >>> ago and render nodes finally seemed to be a way to fix that.
> > > >>>
> > > >>> I mean why are compositors using the primary node in the first place?
> > > >>> Because they want to have access to privileged resources I think and in
> > > >>> this case it is perfectly ok to do so.
> > > >>>
> > > >>> Now extending unprivileged access to the primary node actually sounds
> > > >>> like a step into the wrong direction to me.
> > > >>>
> > > >>> I rather think that we should go down the route of completely dropping
> > > >>> command submission and buffer allocation through the primary node for
> > > >>> non master clients. And then as next step at some point drop support for
> > > >>> authentication/flink.
> > > >>>
> > > >>> I mean we have done this with UMS as well and I don't see much other way
> > > >>> to move forward and get rid of those ancient interface in the long term.
> > > >> Well kms had some really good benefits that drove quick adoption, like
> > > >> "suspend/resume actually has a chance of working" or "comes with
> > > >> buffer management so you can run multiple gears".
> > > >>
> > > >> The render node thing is a lot more niche use case (prime, better priv
> > > >> separation), plus "it's cleaner design". And the "cleaner design" part
> > > >> is something that empirically doesn't seem to matter :-/ Just two
> > > >> examples:
> > > >> - KHR_display/leases just iterated display resources on the fd needed
> > > >> for rendering (and iirc there was even a patch to expose that for
> > > >> render nodes too so it works with DRI3), because implementing
> > > >> protocols is too hard. Barely managed to stop that one before it
> > > >> happened.
> > > >> - Various video players use the vblank ioctl on directly to schedule
> > > >> frames, without telling the compositor. I discovered that when I
> > > >> wanted to limite the vblank ioctl to master clients only. Again,
> > > >> apparently too hard to use the existing extensions, or fix the bugs in
> > > >> there, or whatever. One userspace got fixed last year, but it'll
> > > >> probably get copypasted around forever :-/
> > > >>
> > > >> So I don't think we'll ever manage to roll a clean split out, and best
> > > >> we can do is give in and just hand userspace what it wants. As much as
> > > >> that's misguided and unclean and all that. Maybe it'll result in a
> > > >> least fewer stuff getting run as root to hack around this, because
> > > >> fixing properly seems not to be on the table.
> > > >>
> > > >> The beauty of kms is that we've achieved the mission, everyone's
> > > >> writing their own thing. Which is also terrible, and I don't think
> > > >> it'll get better.
> > > >
> > > > With the risk of coming rude I will repeat my earlier comment:
> > > >
> > > > The problem is _neither_ Intel nor libva specific.
> > > >
> > > >
> > > >
> > > > That said, let's step back for a moment and consider:
> > > >
> > > >   - the "block everything but KMS via the primary node" idea is great but
> > > > orthogonal
> > > >
> > > >   - the series does address issues that are vendor-agnostic
> > > >
> > > >   - by default this series does _not_ cause any regression be that for
> > > > new or old userspace
> > > >
> > > >   - there are two trivial solutions, if the AMD team has concerns about
> > > > closed-source/private stack depending on the old behaviour
> > > > If they want I can even write the patches ;-)
> > > >
> > > >
> > > > That said, the notable comments received so far are:
> > > >   - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> > > > handle. I'm OK but this will change the return code - from EACCES to
> > > > ENOSYS
> > > >
> > > >   - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> > > > planning to drop nearly all DRM_AUTH instances in their driver.
> > > >
> > > >
> > > > Christian, as mentioned before - this series does _not_ add
> > > > functionality to render nodes. It effectively paves a way towards
> > > > removing DRM_AUTH.
> > >
> > > But it adds functionality to the primary node.
> > >
> > Behaviour is adjusted - functionality was there since day 1.
> >
> > > > I understand the series may feel a bit dirty. Yet I would gladly address
> > > > any technical concerns you have.
> > >
> > > Well putting compatibility issues aside my concern is that this is
> > > simply a bad design decision which we can't revert later on.
> > >
> > As sad above - any concerns (theoretical or actual regressions) can be
> > trivially fixed _without_ reverting any of this.
> >
> > I am more than happy to step up and address any regressions in timely
> > manner.
> >
> >
> > As a reminder without this series, some of your customers are forced to
> > run their applications as root.
> 
> I'm torn here on whether this is worth it. Have we got more use cases
> to justify it?
> 

Should have mentioned: three DRM drivers (not counting i915) have
dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.

Apart from the libva, kmscube + gst and mesa, I'm expecting other
projects to make the same mistake. Since the former three define the
norm of using DRM.

The "fix" for all of these being "run as root" :-\

> I'm wary of opening this up just because we can.
> 

What can I do to alleviate that worry? I have spent over a week auditing
code and designed so that we can reinstate the authentication only where
needed.

Thanks
Emil
Christian König May 29, 2019, 1:14 p.m. UTC | #23
Am 29.05.19 um 15:03 schrieb Emil Velikov:
> On 2019/05/29, Dave Airlie wrote:
>> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
>>> On 2019/05/28, Koenig, Christian wrote:
>>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
>>>>> On 2019/05/28, Daniel Vetter wrote:
>>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
>>>>>>>> [SNIP]
>>>>>>>>> Might be a good idea looking into reverting it partially, so that at
>>>>>>>>> least command submission and buffer allocation is still blocked.
>>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
>>>>>>>> hacked up compositor under the sun getting this wrong one way or
>>>>>>>> another. Thinking about this some more, I also have no idea how you'd
>>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
>>>>>>>> that breaks -modesetting already, and probably lots more compositors.
>>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
>>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
>>>>>>>> we could just break them all (on new hardware) and tell them to all
>>>>>>>> suck it up. But I don't think that's a great option. And just
>>>>>>>> deprecating this on amdgpu is going to be even harder, since then
>>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
>>>>>>>> broken.
>>>>>>>>
>>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
>>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
>>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
>>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
>>>>>>>> the various soc combinations out there), and this looks like a
>>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
>>>>>>>> else would perfectly support multi gpu and only use render nodes for
>>>>>>>> rendering, and only primary nodes for display. But reality is that
>>>>>>>> people hack on stuff until gears on screen and then move on to more
>>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
>>>>>>> Yeah, but this is a classic case of working around user space issues by
>>>>>>> making kernel changes instead of fixing user space.
>>>>>>>
>>>>>>> Having privileged (output control) and unprivileged (rendering control)
>>>>>>> functionality behind the same node is a mistake we have made a long time
>>>>>>> ago and render nodes finally seemed to be a way to fix that.
>>>>>>>
>>>>>>> I mean why are compositors using the primary node in the first place?
>>>>>>> Because they want to have access to privileged resources I think and in
>>>>>>> this case it is perfectly ok to do so.
>>>>>>>
>>>>>>> Now extending unprivileged access to the primary node actually sounds
>>>>>>> like a step into the wrong direction to me.
>>>>>>>
>>>>>>> I rather think that we should go down the route of completely dropping
>>>>>>> command submission and buffer allocation through the primary node for
>>>>>>> non master clients. And then as next step at some point drop support for
>>>>>>> authentication/flink.
>>>>>>>
>>>>>>> I mean we have done this with UMS as well and I don't see much other way
>>>>>>> to move forward and get rid of those ancient interface in the long term.
>>>>>> Well kms had some really good benefits that drove quick adoption, like
>>>>>> "suspend/resume actually has a chance of working" or "comes with
>>>>>> buffer management so you can run multiple gears".
>>>>>>
>>>>>> The render node thing is a lot more niche use case (prime, better priv
>>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
>>>>>> is something that empirically doesn't seem to matter :-/ Just two
>>>>>> examples:
>>>>>> - KHR_display/leases just iterated display resources on the fd needed
>>>>>> for rendering (and iirc there was even a patch to expose that for
>>>>>> render nodes too so it works with DRI3), because implementing
>>>>>> protocols is too hard. Barely managed to stop that one before it
>>>>>> happened.
>>>>>> - Various video players use the vblank ioctl on directly to schedule
>>>>>> frames, without telling the compositor. I discovered that when I
>>>>>> wanted to limite the vblank ioctl to master clients only. Again,
>>>>>> apparently too hard to use the existing extensions, or fix the bugs in
>>>>>> there, or whatever. One userspace got fixed last year, but it'll
>>>>>> probably get copypasted around forever :-/
>>>>>>
>>>>>> So I don't think we'll ever manage to roll a clean split out, and best
>>>>>> we can do is give in and just hand userspace what it wants. As much as
>>>>>> that's misguided and unclean and all that. Maybe it'll result in a
>>>>>> least fewer stuff getting run as root to hack around this, because
>>>>>> fixing properly seems not to be on the table.
>>>>>>
>>>>>> The beauty of kms is that we've achieved the mission, everyone's
>>>>>> writing their own thing. Which is also terrible, and I don't think
>>>>>> it'll get better.
>>>>> With the risk of coming rude I will repeat my earlier comment:
>>>>>
>>>>> The problem is _neither_ Intel nor libva specific.
>>>>>
>>>>>
>>>>>
>>>>> That said, let's step back for a moment and consider:
>>>>>
>>>>>    - the "block everything but KMS via the primary node" idea is great but
>>>>> orthogonal
>>>>>
>>>>>    - the series does address issues that are vendor-agnostic
>>>>>
>>>>>    - by default this series does _not_ cause any regression be that for
>>>>> new or old userspace
>>>>>
>>>>>    - there are two trivial solutions, if the AMD team has concerns about
>>>>> closed-source/private stack depending on the old behaviour
>>>>> If they want I can even write the patches ;-)
>>>>>
>>>>>
>>>>> That said, the notable comments received so far are:
>>>>>    - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
>>>>> handle. I'm OK but this will change the return code - from EACCES to
>>>>> ENOSYS
>>>>>
>>>>>    - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
>>>>> planning to drop nearly all DRM_AUTH instances in their driver.
>>>>>
>>>>>
>>>>> Christian, as mentioned before - this series does _not_ add
>>>>> functionality to render nodes. It effectively paves a way towards
>>>>> removing DRM_AUTH.
>>>> But it adds functionality to the primary node.
>>>>
>>> Behaviour is adjusted - functionality was there since day 1.
>>>
>>>>> I understand the series may feel a bit dirty. Yet I would gladly address
>>>>> any technical concerns you have.
>>>> Well putting compatibility issues aside my concern is that this is
>>>> simply a bad design decision which we can't revert later on.
>>>>
>>> As sad above - any concerns (theoretical or actual regressions) can be
>>> trivially fixed _without_ reverting any of this.
>>>
>>> I am more than happy to step up and address any regressions in timely
>>> manner.
>>>
>>>
>>> As a reminder without this series, some of your customers are forced to
>>> run their applications as root.
>> I'm torn here on whether this is worth it. Have we got more use cases
>> to justify it?
>>
> Should have mentioned: three DRM drivers (not counting i915) have
> dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
>
> Apart from the libva, kmscube + gst and mesa, I'm expecting other
> projects to make the same mistake. Since the former three define the
> norm of using DRM.
>
> The "fix" for all of these being "run as root" :-\
>
>> I'm wary of opening this up just because we can.
>>
> What can I do to alleviate that worry? I have spent over a week auditing
> code and designed so that we can reinstate the authentication only where
> needed.

Well I don't think the worry here is about regressions, but rather about 
a design decision we will never be able to revert.

So the question we have to ask is rather if it's a good design decision 
to resurrect the primary node with all its related compability burdens 
to work around an issue which is essentially an userspace coding error.

And from my point of view the answer is relatively clear that this is 
not going to worth it.

Regards,
Christian.

>
> Thanks
> Emil
Emil Velikov May 29, 2019, 4:29 p.m. UTC | #24
On 2019/05/29, Koenig, Christian wrote:
> Am 29.05.19 um 15:03 schrieb Emil Velikov:
> > On 2019/05/29, Dave Airlie wrote:
> >> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
> >>> On 2019/05/28, Koenig, Christian wrote:
> >>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
> >>>>> On 2019/05/28, Daniel Vetter wrote:
> >>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> >>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> >>>>>>>> [SNIP]
> >>>>>>>>> Might be a good idea looking into reverting it partially, so that at
> >>>>>>>>> least command submission and buffer allocation is still blocked.
> >>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
> >>>>>>>> hacked up compositor under the sun getting this wrong one way or
> >>>>>>>> another. Thinking about this some more, I also have no idea how you'd
> >>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
> >>>>>>>> that breaks -modesetting already, and probably lots more compositors.
> >>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
> >>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
> >>>>>>>> we could just break them all (on new hardware) and tell them to all
> >>>>>>>> suck it up. But I don't think that's a great option. And just
> >>>>>>>> deprecating this on amdgpu is going to be even harder, since then
> >>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> >>>>>>>> broken.
> >>>>>>>>
> >>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> >>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
> >>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> >>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> >>>>>>>> the various soc combinations out there), and this looks like a
> >>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
> >>>>>>>> else would perfectly support multi gpu and only use render nodes for
> >>>>>>>> rendering, and only primary nodes for display. But reality is that
> >>>>>>>> people hack on stuff until gears on screen and then move on to more
> >>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
> >>>>>>> Yeah, but this is a classic case of working around user space issues by
> >>>>>>> making kernel changes instead of fixing user space.
> >>>>>>>
> >>>>>>> Having privileged (output control) and unprivileged (rendering control)
> >>>>>>> functionality behind the same node is a mistake we have made a long time
> >>>>>>> ago and render nodes finally seemed to be a way to fix that.
> >>>>>>>
> >>>>>>> I mean why are compositors using the primary node in the first place?
> >>>>>>> Because they want to have access to privileged resources I think and in
> >>>>>>> this case it is perfectly ok to do so.
> >>>>>>>
> >>>>>>> Now extending unprivileged access to the primary node actually sounds
> >>>>>>> like a step into the wrong direction to me.
> >>>>>>>
> >>>>>>> I rather think that we should go down the route of completely dropping
> >>>>>>> command submission and buffer allocation through the primary node for
> >>>>>>> non master clients. And then as next step at some point drop support for
> >>>>>>> authentication/flink.
> >>>>>>>
> >>>>>>> I mean we have done this with UMS as well and I don't see much other way
> >>>>>>> to move forward and get rid of those ancient interface in the long term.
> >>>>>> Well kms had some really good benefits that drove quick adoption, like
> >>>>>> "suspend/resume actually has a chance of working" or "comes with
> >>>>>> buffer management so you can run multiple gears".
> >>>>>>
> >>>>>> The render node thing is a lot more niche use case (prime, better priv
> >>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
> >>>>>> is something that empirically doesn't seem to matter :-/ Just two
> >>>>>> examples:
> >>>>>> - KHR_display/leases just iterated display resources on the fd needed
> >>>>>> for rendering (and iirc there was even a patch to expose that for
> >>>>>> render nodes too so it works with DRI3), because implementing
> >>>>>> protocols is too hard. Barely managed to stop that one before it
> >>>>>> happened.
> >>>>>> - Various video players use the vblank ioctl on directly to schedule
> >>>>>> frames, without telling the compositor. I discovered that when I
> >>>>>> wanted to limite the vblank ioctl to master clients only. Again,
> >>>>>> apparently too hard to use the existing extensions, or fix the bugs in
> >>>>>> there, or whatever. One userspace got fixed last year, but it'll
> >>>>>> probably get copypasted around forever :-/
> >>>>>>
> >>>>>> So I don't think we'll ever manage to roll a clean split out, and best
> >>>>>> we can do is give in and just hand userspace what it wants. As much as
> >>>>>> that's misguided and unclean and all that. Maybe it'll result in a
> >>>>>> least fewer stuff getting run as root to hack around this, because
> >>>>>> fixing properly seems not to be on the table.
> >>>>>>
> >>>>>> The beauty of kms is that we've achieved the mission, everyone's
> >>>>>> writing their own thing. Which is also terrible, and I don't think
> >>>>>> it'll get better.
> >>>>> With the risk of coming rude I will repeat my earlier comment:
> >>>>>
> >>>>> The problem is _neither_ Intel nor libva specific.
> >>>>>
> >>>>>
> >>>>>
> >>>>> That said, let's step back for a moment and consider:
> >>>>>
> >>>>>    - the "block everything but KMS via the primary node" idea is great but
> >>>>> orthogonal
> >>>>>
> >>>>>    - the series does address issues that are vendor-agnostic
> >>>>>
> >>>>>    - by default this series does _not_ cause any regression be that for
> >>>>> new or old userspace
> >>>>>
> >>>>>    - there are two trivial solutions, if the AMD team has concerns about
> >>>>> closed-source/private stack depending on the old behaviour
> >>>>> If they want I can even write the patches ;-)
> >>>>>
> >>>>>
> >>>>> That said, the notable comments received so far are:
> >>>>>    - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> >>>>> handle. I'm OK but this will change the return code - from EACCES to
> >>>>> ENOSYS
> >>>>>
> >>>>>    - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> >>>>> planning to drop nearly all DRM_AUTH instances in their driver.
> >>>>>
> >>>>>
> >>>>> Christian, as mentioned before - this series does _not_ add
> >>>>> functionality to render nodes. It effectively paves a way towards
> >>>>> removing DRM_AUTH.
> >>>> But it adds functionality to the primary node.
> >>>>
> >>> Behaviour is adjusted - functionality was there since day 1.
> >>>
> >>>>> I understand the series may feel a bit dirty. Yet I would gladly address
> >>>>> any technical concerns you have.
> >>>> Well putting compatibility issues aside my concern is that this is
> >>>> simply a bad design decision which we can't revert later on.
> >>>>
> >>> As sad above - any concerns (theoretical or actual regressions) can be
> >>> trivially fixed _without_ reverting any of this.
> >>>
> >>> I am more than happy to step up and address any regressions in timely
> >>> manner.
> >>>
> >>>
> >>> As a reminder without this series, some of your customers are forced to
> >>> run their applications as root.
> >> I'm torn here on whether this is worth it. Have we got more use cases
> >> to justify it?
> >>
> > Should have mentioned: three DRM drivers (not counting i915) have
> > dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
> >
> > Apart from the libva, kmscube + gst and mesa, I'm expecting other
> > projects to make the same mistake. Since the former three define the
> > norm of using DRM.
> >
> > The "fix" for all of these being "run as root" :-\
> >
> >> I'm wary of opening this up just because we can.
> >>
> > What can I do to alleviate that worry? I have spent over a week auditing
> > code and designed so that we can reinstate the authentication only where
> > needed.
> 
> Well I don't think the worry here is about regressions,
Glad to hear.

> but rather about 
> a design decision we will never be able to revert.
> 
Can you think of any reason/issue why we would want to revert this? I
will gladly spend some thing exploring how to address it.

> So the question we have to ask is rather if it's a good design decision 
> to resurrect the primary node with all its related compability burdens 
> to work around an issue which is essentially an userspace coding error.
> 

Can see you're not happy on the topic - I'm not too excited either. The
truth to the matter is - DRM drivers have dropped DRM_AUTH regardless of
my work.

It's very unfortunate, if AMDGPU stands out. Perhaps after some time and
unhappy users you'll reconsider.

I believe that Linus has pointed out a number of times that kernel
developers should care about our users. Even when it's an userspace
error.


HTH
Emil
Christian König May 31, 2019, 12:20 p.m. UTC | #25
Am 29.05.19 um 18:29 schrieb Emil Velikov:
> On 2019/05/29, Koenig, Christian wrote:
>> Am 29.05.19 um 15:03 schrieb Emil Velikov:
>>> On 2019/05/29, Dave Airlie wrote:
>>>> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
>>>>> On 2019/05/28, Koenig, Christian wrote:
>>>>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
>>>>>>> On 2019/05/28, Daniel Vetter wrote:
>>>>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
>>>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
>>>>>>>>>> [SNIP]
>>>>>>>>>>> Might be a good idea looking into reverting it partially, so that at
>>>>>>>>>>> least command submission and buffer allocation is still blocked.
>>>>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
>>>>>>>>>> hacked up compositor under the sun getting this wrong one way or
>>>>>>>>>> another. Thinking about this some more, I also have no idea how you'd
>>>>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
>>>>>>>>>> that breaks -modesetting already, and probably lots more compositors.
>>>>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
>>>>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
>>>>>>>>>> we could just break them all (on new hardware) and tell them to all
>>>>>>>>>> suck it up. But I don't think that's a great option. And just
>>>>>>>>>> deprecating this on amdgpu is going to be even harder, since then
>>>>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
>>>>>>>>>> broken.
>>>>>>>>>>
>>>>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
>>>>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
>>>>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
>>>>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
>>>>>>>>>> the various soc combinations out there), and this looks like a
>>>>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
>>>>>>>>>> else would perfectly support multi gpu and only use render nodes for
>>>>>>>>>> rendering, and only primary nodes for display. But reality is that
>>>>>>>>>> people hack on stuff until gears on screen and then move on to more
>>>>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
>>>>>>>>> Yeah, but this is a classic case of working around user space issues by
>>>>>>>>> making kernel changes instead of fixing user space.
>>>>>>>>>
>>>>>>>>> Having privileged (output control) and unprivileged (rendering control)
>>>>>>>>> functionality behind the same node is a mistake we have made a long time
>>>>>>>>> ago and render nodes finally seemed to be a way to fix that.
>>>>>>>>>
>>>>>>>>> I mean why are compositors using the primary node in the first place?
>>>>>>>>> Because they want to have access to privileged resources I think and in
>>>>>>>>> this case it is perfectly ok to do so.
>>>>>>>>>
>>>>>>>>> Now extending unprivileged access to the primary node actually sounds
>>>>>>>>> like a step into the wrong direction to me.
>>>>>>>>>
>>>>>>>>> I rather think that we should go down the route of completely dropping
>>>>>>>>> command submission and buffer allocation through the primary node for
>>>>>>>>> non master clients. And then as next step at some point drop support for
>>>>>>>>> authentication/flink.
>>>>>>>>>
>>>>>>>>> I mean we have done this with UMS as well and I don't see much other way
>>>>>>>>> to move forward and get rid of those ancient interface in the long term.
>>>>>>>> Well kms had some really good benefits that drove quick adoption, like
>>>>>>>> "suspend/resume actually has a chance of working" or "comes with
>>>>>>>> buffer management so you can run multiple gears".
>>>>>>>>
>>>>>>>> The render node thing is a lot more niche use case (prime, better priv
>>>>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
>>>>>>>> is something that empirically doesn't seem to matter :-/ Just two
>>>>>>>> examples:
>>>>>>>> - KHR_display/leases just iterated display resources on the fd needed
>>>>>>>> for rendering (and iirc there was even a patch to expose that for
>>>>>>>> render nodes too so it works with DRI3), because implementing
>>>>>>>> protocols is too hard. Barely managed to stop that one before it
>>>>>>>> happened.
>>>>>>>> - Various video players use the vblank ioctl on directly to schedule
>>>>>>>> frames, without telling the compositor. I discovered that when I
>>>>>>>> wanted to limite the vblank ioctl to master clients only. Again,
>>>>>>>> apparently too hard to use the existing extensions, or fix the bugs in
>>>>>>>> there, or whatever. One userspace got fixed last year, but it'll
>>>>>>>> probably get copypasted around forever :-/
>>>>>>>>
>>>>>>>> So I don't think we'll ever manage to roll a clean split out, and best
>>>>>>>> we can do is give in and just hand userspace what it wants. As much as
>>>>>>>> that's misguided and unclean and all that. Maybe it'll result in a
>>>>>>>> least fewer stuff getting run as root to hack around this, because
>>>>>>>> fixing properly seems not to be on the table.
>>>>>>>>
>>>>>>>> The beauty of kms is that we've achieved the mission, everyone's
>>>>>>>> writing their own thing. Which is also terrible, and I don't think
>>>>>>>> it'll get better.
>>>>>>> With the risk of coming rude I will repeat my earlier comment:
>>>>>>>
>>>>>>> The problem is _neither_ Intel nor libva specific.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That said, let's step back for a moment and consider:
>>>>>>>
>>>>>>>     - the "block everything but KMS via the primary node" idea is great but
>>>>>>> orthogonal
>>>>>>>
>>>>>>>     - the series does address issues that are vendor-agnostic
>>>>>>>
>>>>>>>     - by default this series does _not_ cause any regression be that for
>>>>>>> new or old userspace
>>>>>>>
>>>>>>>     - there are two trivial solutions, if the AMD team has concerns about
>>>>>>> closed-source/private stack depending on the old behaviour
>>>>>>> If they want I can even write the patches ;-)
>>>>>>>
>>>>>>>
>>>>>>> That said, the notable comments received so far are:
>>>>>>>     - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
>>>>>>> handle. I'm OK but this will change the return code - from EACCES to
>>>>>>> ENOSYS
>>>>>>>
>>>>>>>     - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
>>>>>>> planning to drop nearly all DRM_AUTH instances in their driver.
>>>>>>>
>>>>>>>
>>>>>>> Christian, as mentioned before - this series does _not_ add
>>>>>>> functionality to render nodes. It effectively paves a way towards
>>>>>>> removing DRM_AUTH.
>>>>>> But it adds functionality to the primary node.
>>>>>>
>>>>> Behaviour is adjusted - functionality was there since day 1.
>>>>>
>>>>>>> I understand the series may feel a bit dirty. Yet I would gladly address
>>>>>>> any technical concerns you have.
>>>>>> Well putting compatibility issues aside my concern is that this is
>>>>>> simply a bad design decision which we can't revert later on.
>>>>>>
>>>>> As sad above - any concerns (theoretical or actual regressions) can be
>>>>> trivially fixed _without_ reverting any of this.
>>>>>
>>>>> I am more than happy to step up and address any regressions in timely
>>>>> manner.
>>>>>
>>>>>
>>>>> As a reminder without this series, some of your customers are forced to
>>>>> run their applications as root.
>>>> I'm torn here on whether this is worth it. Have we got more use cases
>>>> to justify it?
>>>>
>>> Should have mentioned: three DRM drivers (not counting i915) have
>>> dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
>>>
>>> Apart from the libva, kmscube + gst and mesa, I'm expecting other
>>> projects to make the same mistake. Since the former three define the
>>> norm of using DRM.
>>>
>>> The "fix" for all of these being "run as root" :-\
>>>
>>>> I'm wary of opening this up just because we can.
>>>>
>>> What can I do to alleviate that worry? I have spent over a week auditing
>>> code and designed so that we can reinstate the authentication only where
>>> needed.
>> Well I don't think the worry here is about regressions,
> Glad to hear.
>
>> but rather about
>> a design decision we will never be able to revert.
>>
> Can you think of any reason/issue why we would want to revert this? I
> will gladly spend some thing exploring how to address it.

Well, to finally get rid of the primary node for non display hardware.

And in general to have a clean separation between display and rendering.

>> So the question we have to ask is rather if it's a good design decision
>> to resurrect the primary node with all its related compability burdens
>> to work around an issue which is essentially an userspace coding error.
>>
> Can see you're not happy on the topic - I'm not too excited either. The
> truth to the matter is - DRM drivers have dropped DRM_AUTH regardless of
> my work.

Then we should probably consider stopping doing this and enforce that 
the primary node is not used that widely any more.

Regards,
Christian.

>
> It's very unfortunate, if AMDGPU stands out. Perhaps after some time and
> unhappy users you'll reconsider.
>
> I believe that Linus has pointed out a number of times that kernel
> developers should care about our users. Even when it's an userspace
> error.
>
>
> HTH
> Emil
Michel Dänzer June 4, 2019, 10:50 a.m. UTC | #26
On 2019-05-28 10:03 a.m., Koenig, Christian wrote:
> 
> I rather think that we should go down the route of completely dropping 
> command submission and buffer allocation through the primary node for 
> non master clients. And then as next step at some point drop support for 
> authentication/flink.

Keep in mind that even display servers aren't DRM master while their VT
isn't active, so this might be problematic if a display server needs to
do some command submission / buffer allocation during that time.
Christian König June 4, 2019, 11:24 a.m. UTC | #27
Am 04.06.19 um 12:50 schrieb Michel Dänzer:
> On 2019-05-28 10:03 a.m., Koenig, Christian wrote:
>> I rather think that we should go down the route of completely dropping
>> command submission and buffer allocation through the primary node for
>> non master clients. And then as next step at some point drop support for
>> authentication/flink.
> Keep in mind that even display servers aren't DRM master while their VT
> isn't active, so this might be problematic if a display server needs to
> do some command submission / buffer allocation during that time.

If I understand it correctly the DRM fd stays master even when the VT is 
switched away, it's just not the current master any more.

So in this case fpriv->is_master stays true, but 
drm_is_current_master(fpriv) returns false.

And yes we mixed that up in amdgpu, i915 and vmwgfx. Somebody should 
probably write patches to fix this.

Regards,
Christian.
Daniel Vetter June 4, 2019, 1:28 p.m. UTC | #28
On Tue, Jun 4, 2019 at 1:24 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 04.06.19 um 12:50 schrieb Michel Dänzer:
> > On 2019-05-28 10:03 a.m., Koenig, Christian wrote:
> >> I rather think that we should go down the route of completely dropping
> >> command submission and buffer allocation through the primary node for
> >> non master clients. And then as next step at some point drop support for
> >> authentication/flink.
> > Keep in mind that even display servers aren't DRM master while their VT
> > isn't active, so this might be problematic if a display server needs to
> > do some command submission / buffer allocation during that time.
>
> If I understand it correctly the DRM fd stays master even when the VT is
> switched away, it's just not the current master any more.
>
> So in this case fpriv->is_master stays true, but
> drm_is_current_master(fpriv) returns false.
>
> And yes we mixed that up in amdgpu, i915 and vmwgfx. Somebody should
> probably write patches to fix this.

master should always be authenticated, so should be able to continue
rendering. Well ... except on drivers who do take isolation somewhat
serious, but don't have full per-client isolation. Those actually
refuse rendering/gpu access for any authenticated client if their
corresponding master isn't the current master. But I think only vmwgfx
does that.

Per-client isolation has taken over anyway, so all a bit moot, at
least on modern hw.
-Daniel
Emil Velikov June 4, 2019, 5:59 p.m. UTC | #29
On 2019/05/31, Koenig, Christian wrote:
> Am 29.05.19 um 18:29 schrieb Emil Velikov:
> > On 2019/05/29, Koenig, Christian wrote:
> >> Am 29.05.19 um 15:03 schrieb Emil Velikov:
> >>> On 2019/05/29, Dave Airlie wrote:
> >>>> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@gmail.com> wrote:
> >>>>> On 2019/05/28, Koenig, Christian wrote:
> >>>>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
> >>>>>>> On 2019/05/28, Daniel Vetter wrote:
> >>>>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> >>>>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> >>>>>>>>>> [SNIP]
> >>>>>>>>>>> Might be a good idea looking into reverting it partially, so that at
> >>>>>>>>>>> least command submission and buffer allocation is still blocked.
> >>>>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
> >>>>>>>>>> hacked up compositor under the sun getting this wrong one way or
> >>>>>>>>>> another. Thinking about this some more, I also have no idea how you'd
> >>>>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
> >>>>>>>>>> that breaks -modesetting already, and probably lots more compositors.
> >>>>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
> >>>>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
> >>>>>>>>>> we could just break them all (on new hardware) and tell them to all
> >>>>>>>>>> suck it up. But I don't think that's a great option. And just
> >>>>>>>>>> deprecating this on amdgpu is going to be even harder, since then
> >>>>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> >>>>>>>>>> broken.
> >>>>>>>>>>
> >>>>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> >>>>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
> >>>>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> >>>>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> >>>>>>>>>> the various soc combinations out there), and this looks like a
> >>>>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
> >>>>>>>>>> else would perfectly support multi gpu and only use render nodes for
> >>>>>>>>>> rendering, and only primary nodes for display. But reality is that
> >>>>>>>>>> people hack on stuff until gears on screen and then move on to more
> >>>>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
> >>>>>>>>> Yeah, but this is a classic case of working around user space issues by
> >>>>>>>>> making kernel changes instead of fixing user space.
> >>>>>>>>>
> >>>>>>>>> Having privileged (output control) and unprivileged (rendering control)
> >>>>>>>>> functionality behind the same node is a mistake we have made a long time
> >>>>>>>>> ago and render nodes finally seemed to be a way to fix that.
> >>>>>>>>>
> >>>>>>>>> I mean why are compositors using the primary node in the first place?
> >>>>>>>>> Because they want to have access to privileged resources I think and in
> >>>>>>>>> this case it is perfectly ok to do so.
> >>>>>>>>>
> >>>>>>>>> Now extending unprivileged access to the primary node actually sounds
> >>>>>>>>> like a step into the wrong direction to me.
> >>>>>>>>>
> >>>>>>>>> I rather think that we should go down the route of completely dropping
> >>>>>>>>> command submission and buffer allocation through the primary node for
> >>>>>>>>> non master clients. And then as next step at some point drop support for
> >>>>>>>>> authentication/flink.
> >>>>>>>>>
> >>>>>>>>> I mean we have done this with UMS as well and I don't see much other way
> >>>>>>>>> to move forward and get rid of those ancient interface in the long term.
> >>>>>>>> Well kms had some really good benefits that drove quick adoption, like
> >>>>>>>> "suspend/resume actually has a chance of working" or "comes with
> >>>>>>>> buffer management so you can run multiple gears".
> >>>>>>>>
> >>>>>>>> The render node thing is a lot more niche use case (prime, better priv
> >>>>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
> >>>>>>>> is something that empirically doesn't seem to matter :-/ Just two
> >>>>>>>> examples:
> >>>>>>>> - KHR_display/leases just iterated display resources on the fd needed
> >>>>>>>> for rendering (and iirc there was even a patch to expose that for
> >>>>>>>> render nodes too so it works with DRI3), because implementing
> >>>>>>>> protocols is too hard. Barely managed to stop that one before it
> >>>>>>>> happened.
> >>>>>>>> - Various video players use the vblank ioctl on directly to schedule
> >>>>>>>> frames, without telling the compositor. I discovered that when I
> >>>>>>>> wanted to limite the vblank ioctl to master clients only. Again,
> >>>>>>>> apparently too hard to use the existing extensions, or fix the bugs in
> >>>>>>>> there, or whatever. One userspace got fixed last year, but it'll
> >>>>>>>> probably get copypasted around forever :-/
> >>>>>>>>
> >>>>>>>> So I don't think we'll ever manage to roll a clean split out, and best
> >>>>>>>> we can do is give in and just hand userspace what it wants. As much as
> >>>>>>>> that's misguided and unclean and all that. Maybe it'll result in a
> >>>>>>>> least fewer stuff getting run as root to hack around this, because
> >>>>>>>> fixing properly seems not to be on the table.
> >>>>>>>>
> >>>>>>>> The beauty of kms is that we've achieved the mission, everyone's
> >>>>>>>> writing their own thing. Which is also terrible, and I don't think
> >>>>>>>> it'll get better.
> >>>>>>> With the risk of coming rude I will repeat my earlier comment:
> >>>>>>>
> >>>>>>> The problem is _neither_ Intel nor libva specific.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> That said, let's step back for a moment and consider:
> >>>>>>>
> >>>>>>>     - the "block everything but KMS via the primary node" idea is great but
> >>>>>>> orthogonal
> >>>>>>>
> >>>>>>>     - the series does address issues that are vendor-agnostic
> >>>>>>>
> >>>>>>>     - by default this series does _not_ cause any regression be that for
> >>>>>>> new or old userspace
> >>>>>>>
> >>>>>>>     - there are two trivial solutions, if the AMD team has concerns about
> >>>>>>> closed-source/private stack depending on the old behaviour
> >>>>>>> If they want I can even write the patches ;-)
> >>>>>>>
> >>>>>>>
> >>>>>>> That said, the notable comments received so far are:
> >>>>>>>     - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> >>>>>>> handle. I'm OK but this will change the return code - from EACCES to
> >>>>>>> ENOSYS
> >>>>>>>
> >>>>>>>     - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> >>>>>>> planning to drop nearly all DRM_AUTH instances in their driver.
> >>>>>>>
> >>>>>>>
> >>>>>>> Christian, as mentioned before - this series does _not_ add
> >>>>>>> functionality to render nodes. It effectively paves a way towards
> >>>>>>> removing DRM_AUTH.
> >>>>>> But it adds functionality to the primary node.
> >>>>>>
> >>>>> Behaviour is adjusted - functionality was there since day 1.
> >>>>>
> >>>>>>> I understand the series may feel a bit dirty. Yet I would gladly address
> >>>>>>> any technical concerns you have.
> >>>>>> Well putting compatibility issues aside my concern is that this is
> >>>>>> simply a bad design decision which we can't revert later on.
> >>>>>>
> >>>>> As sad above - any concerns (theoretical or actual regressions) can be
> >>>>> trivially fixed _without_ reverting any of this.
> >>>>>
> >>>>> I am more than happy to step up and address any regressions in timely
> >>>>> manner.
> >>>>>
> >>>>>
> >>>>> As a reminder without this series, some of your customers are forced to
> >>>>> run their applications as root.
> >>>> I'm torn here on whether this is worth it. Have we got more use cases
> >>>> to justify it?
> >>>>
> >>> Should have mentioned: three DRM drivers (not counting i915) have
> >>> dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
> >>>
> >>> Apart from the libva, kmscube + gst and mesa, I'm expecting other
> >>> projects to make the same mistake. Since the former three define the
> >>> norm of using DRM.
> >>>
> >>> The "fix" for all of these being "run as root" :-\
> >>>
> >>>> I'm wary of opening this up just because we can.
> >>>>
> >>> What can I do to alleviate that worry? I have spent over a week auditing
> >>> code and designed so that we can reinstate the authentication only where
> >>> needed.
> >> Well I don't think the worry here is about regressions,
> > Glad to hear.
> >
> >> but rather about
> >> a design decision we will never be able to revert.
> >>
> > Can you think of any reason/issue why we would want to revert this? I
> > will gladly spend some thing exploring how to address it.
> 
> Well, to finally get rid of the primary node for non display hardware.
> 
> And in general to have a clean separation between display and rendering.
> 
Clean separation of display and rendering is a good end goal to have,
but is going to break the current userspace. In DRM we try to enforce
the Linux guarantee of not breaking userspace.

If we want to maintain this guarantee we are effectively stuck with
providing render functionality in the primary node, so we might as well
do it properly (i.e., relax the auth restrictions).

But most importantly, if we are to pursue KMS-only primary nodes:
 - we should really have a wider discussion about it, and
 - this series does NOT impede any of if


Thanks
Emil
Emil Velikov June 14, 2019, 12:09 p.m. UTC | #30
On 2019/05/27, Emil Velikov wrote:
> From: Emil Velikov <emil.velikov@collabora.com>
> 
> Currently one can circumvent DRM_AUTH, when the ioctl is exposed via the
> render node. A seemingly deliberate design decision.
> 
> Hence we can drop the DRM_AUTH all together (details in follow-up patch)
> yet not all userspace checks if it's authenticated, but instead uses
> uncommon assumptions.
> 
> After days of digging through git log and testing, only a single (ab)use
> was spotted - the Mesa RADV driver, using the AMDGPU_INFO ioctl and
> assuming that failure implies lack of authentication.
> 
> Affected versions are:
>  - the whole 18.2.x series, which is EOL
>  - the whole 18.3.x series, which is EOL
>  - the 19.0.x series, prior to 19.0.4
> 
> Add a special quirk for that case, thus we can drop DRM_AUTH bits as
> mentioned earlier.
> 
> Since all the affected userspace is EOL, we also add a kconfig option
> to disable this quirk.
> 
> The whole approach is inspired by DRIVER_KMS_LEGACY_CONTEXT
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/Kconfig      | 16 ++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 12 +++++++++++-
>  drivers/gpu/drm/drm_ioctl.c             |  5 +++++
>  include/drm/drm_ioctl.h                 | 17 +++++++++++++++++
>  4 files changed, 49 insertions(+), 1 deletion(-)
> 

Hi Christian,


In the following, I would like to summarise and emphasize the need for
DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
extra reading it.


Today DRM drivers* do not make any distinction between primary and
render node clients. Thus for a render capable driver, any premise of
separation, security or otherwise imposed via DRM_AUTH is a fallacy.

Considering the examples of flaky or outright missing drmAuth in prime
open-source projects (mesa, kmscube, libva) we can reasonably assume
other projects exbibit the same problem.

For these and/or other reasons, two DRM drivers have dropped DRM_AUTH
since day one.

Since we are interested in providing consistent and predictable
behaviour to user-space, dropping DRM_AUTH seems to be the most
effective way forward.

Of course, I'd be more than happy to hear for any other way to achieve
the goal outlined.

Based on the series, other maintainers agree with the idea/goal here.
Amdgpu not following the same pattern would compromise predictability
across drivers and complicate userspace, so I would kindly ask you to
reconsider.

Alternatively, I see two ways to special case:
 - New flag annotating amdgpu/radeon - akin to the one proposed earlier
 - Check for amdgpu/radeon in core DRM



Now on your pain point - not allowing render iocts via the primary node,
and how this patch could make it harder to achieve that goal.

While I love the idea, there are number of obstacles that prevent us
from doing so at this time:
 - Ensuring both old and new userspace does not regress
 - Drivers (QXL, others?) do not expose a render node
 - We want to avoid driver specific behaviour

The only way forward that I can see is:
 - Address QXL/others to expose render nodes
 - Add a Kconfig toggle to disable !KMS ioctls via the primary node
 - Jump-start ^^ with distribution X
 - Fix user-space fallouts
 - X months down the line, flip the Kconfig
 - In case of regressions - revert the KConfig, goto Fix user-space...


That said, the proposal will not conflict with the DRM_AUTH removal. If
anything it is step 0.5 of the grand master plan.


Tl;Dr: Removing DRM_AUTH is orthogonal to not allowing render iocts via
the primary node. Here's an idea how to achieve the latter.


Thanks
Emil
Christian König June 14, 2019, 12:55 p.m. UTC | #31
Am 14.06.19 um 14:09 schrieb Emil Velikov:
> On 2019/05/27, Emil Velikov wrote:
> [SNIP]
> Hi Christian,
>
>
> In the following, I would like to summarise and emphasize the need for
> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> extra reading it.
>
>
> Today DRM drivers* do not make any distinction between primary and
> render node clients.

That is actually not 100% correct. We have a special case where a DRM 
master is allowed to change the priority of render node clients.

> Thus for a render capable driver, any premise of
> separation, security or otherwise imposed via DRM_AUTH is a fallacy.

Yeah, that's what I agree on. I just don't think that removing DRM_AUTH 
now is the right direction to take.

> Considering the examples of flaky or outright missing drmAuth in prime
> open-source projects (mesa, kmscube, libva) we can reasonably assume
> other projects exbibit the same problem.
>
> For these and/or other reasons, two DRM drivers have dropped DRM_AUTH
> since day one.
>
> Since we are interested in providing consistent and predictable
> behaviour to user-space, dropping DRM_AUTH seems to be the most
> effective way forward.

Well and what do you do with drivers which doesn't implement render nodes?

> Of course, I'd be more than happy to hear for any other way to achieve
> the goal outlined.
>
> Based on the series, other maintainers agree with the idea/goal here.
> Amdgpu not following the same pattern would compromise predictability
> across drivers and complicate userspace, so I would kindly ask you to
> reconsider.
>
> Alternatively, I see two ways to special case:
>   - New flag annotating amdgpu/radeon - akin to the one proposed earlier
>   - Check for amdgpu/radeon in core DRM

I perfectly agree that I don't want any special handling for amdgpu/radeon.

My key point is that with DRM_AUTH we created an authorization and 
authentication mess in DRM which is functionality which doesn't belong 
into the DRM subsystem in the first place.

> Now on your pain point - not allowing render iocts via the primary node,
> and how this patch could make it harder to achieve that goal.
>
> While I love the idea, there are number of obstacles that prevent us
> from doing so at this time:
>   - Ensuring both old and new userspace does not regress

Yeah, agree totally. That's why I said we should probably start doing 
this for only for upcoming hardware generations.

>   - Drivers (QXL, others?) do not expose a render node

Well isn't that is a rather big problem for the removal of DRM_AUTH in 
general?

E.g. at least QXL would then expose functionality on the primary node 
without authentication.

>   - We want to avoid driver specific behaviour
>
> The only way forward that I can see is:
>   - Address QXL/others to expose render nodes
>   - Add a Kconfig toggle to disable !KMS ioctls via the primary node
>   - Jump-start ^^ with distribution X
>   - Fix user-space fallouts
>   - X months down the line, flip the Kconfig
>   - In case of regressions - revert the KConfig, goto Fix user-space...

Well that at least sounds like a plan :) Let's to this!

> That said, the proposal will not conflict with the DRM_AUTH removal. If
> anything it is step 0.5 of the grand master plan.

That's the point I strongly disagree on.

By lowering the DRM_AUTH restriction you are encouraging userspace to 
use the shortcut and use the primary node for rendering instead of 
fixing the code and using the render node.

So at the end of the day userspace will most likely completely drop 
support for the render node, simply for the reason that it became 
superfluous. You can just open up the primary node and get the same 
functionality.

I absolutely can't understand how you can argue that this won't make it 
harder to cleanly separate rendering and primary node functionality.

Regards,
Christian.

>
>
> Tl;Dr: Removing DRM_AUTH is orthogonal to not allowing render iocts via
> the primary node. Here's an idea how to achieve the latter.
>
>
> Thanks
> Emil
Michel Dänzer June 14, 2019, 2:16 p.m. UTC | #32
On 2019-06-14 2:55 p.m., Koenig, Christian wrote:
> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> 
>> That said, the proposal will not conflict with the DRM_AUTH removal. If
>> anything it is step 0.5 of the grand master plan.
> 
> That's the point I strongly disagree on.
> 
> By lowering the DRM_AUTH restriction you are encouraging userspace to 
> use the shortcut and use the primary node for rendering instead of 
> fixing the code and using the render node.
> 
> So at the end of the day userspace will most likely completely drop 
> support for the render node, simply for the reason that it became 
> superfluous. You can just open up the primary node and get the same 
> functionality.
> 
> I absolutely can't understand how you can argue that this won't make it 
> harder to cleanly separate rendering and primary node functionality.

FWIW, I agree with Christian on this.


Also FWIW, the solution I'm working on for
https://bugs.freedesktop.org/110903 should allow making libdrm_amdgpu
always use a render node for rendering. For backwards compatibility
it'll probably require adding a new variant of amdgpu_device_initialize
though, since existing users may rely on the first amdgpu_device for a
GPU using the DRM file description passed to amdgpu_device_initialize
for rendering.
Emil Velikov June 14, 2019, 3:53 p.m. UTC | #33
On 2019/06/14, Koenig, Christian wrote:
> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> > On 2019/05/27, Emil Velikov wrote:
> > [SNIP]
> > Hi Christian,
> >
> >
> > In the following, I would like to summarise and emphasize the need for
> > DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> > extra reading it.
> >
> >
> > Today DRM drivers* do not make any distinction between primary and
> > render node clients.
> 
> That is actually not 100% correct. We have a special case where a DRM 
> master is allowed to change the priority of render node clients.
> 
Can you provide a link? I cannot find that code.

> > Thus for a render capable driver, any premise of
> > separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> 
> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH 
> now is the right direction to take.
> 
Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
ioctls.

That aside, can you propose an alternative solution that addresses this
and the second point just below?

> > Considering the examples of flaky or outright missing drmAuth in prime
> > open-source projects (mesa, kmscube, libva) we can reasonably assume
> > other projects exbibit the same problem.
> >
> > For these and/or other reasons, two DRM drivers have dropped DRM_AUTH
> > since day one.
> >
> > Since we are interested in providing consistent and predictable
> > behaviour to user-space, dropping DRM_AUTH seems to be the most
> > effective way forward.
> 
> Well and what do you do with drivers which doesn't implement render nodes?
> 
AFAICT there is a single non-DC driver which does not expose - QXL.
I would expect it runs on a rather customised stack.

Would be great to fix that, but ETIME and it's the only exception to the
rule.


> > Of course, I'd be more than happy to hear for any other way to achieve
> > the goal outlined.
> >
> > Based on the series, other maintainers agree with the idea/goal here.
> > Amdgpu not following the same pattern would compromise predictability
> > across drivers and complicate userspace, so I would kindly ask you to
> > reconsider.
> >
> > Alternatively, I see two ways to special case:
> >   - New flag annotating amdgpu/radeon - akin to the one proposed earlier
> >   - Check for amdgpu/radeon in core DRM
> 
> I perfectly agree that I don't want any special handling for amdgpu/radeon.
> 
> My key point is that with DRM_AUTH we created an authorization and 
> authentication mess in DRM which is functionality which doesn't belong 
> into the DRM subsystem in the first place.
> 
Precisely and I've outlined below how to resolve this in the long run.

> > Now on your pain point - not allowing render iocts via the primary node,
> > and how this patch could make it harder to achieve that goal.
> >
> > While I love the idea, there are number of obstacles that prevent us
> > from doing so at this time:
> >   - Ensuring both old and new userspace does not regress
> 
> Yeah, agree totally. That's why I said we should probably start doing 
> this for only for upcoming hardware generations.
> 
That will side-step the regression issue, but will enforce driver
specific behaviour outlined before.

> >   - Drivers (QXL, others?) do not expose a render node
> 
> Well isn't that is a rather big problem for the removal of DRM_AUTH in 
> general?
> 
> E.g. at least QXL would then expose functionality on the primary node 
> without authentication.
> 
With this series QXL remains functionally unchanged. I would love to fix
that as well yet ETIME as mentioned above.

> >   - We want to avoid driver specific behaviour
> >
> > The only way forward that I can see is:
> >   - Address QXL/others to expose render nodes
> >   - Add a Kconfig toggle to disable !KMS ioctls via the primary node
> >   - Jump-start ^^ with distribution X
> >   - Fix user-space fallouts
> >   - X months down the line, flip the Kconfig
> >   - In case of regressions - revert the KConfig, goto Fix user-space...
> 
> Well that at least sounds like a plan :) Let's to this!
> 
We're talking about months if not years until it comes to fruition. We
need something quicker.

That said, I'm quite happy to take the first stab, yet this is not a
replacement for this series.

> > That said, the proposal will not conflict with the DRM_AUTH removal. If
> > anything it is step 0.5 of the grand master plan.
> 
> That's the point I strongly disagree on.
> 
> By lowering the DRM_AUTH restriction you are encouraging userspace to 
> use the shortcut and use the primary node for rendering instead of 
> fixing the code and using the render node.
> 
Have to disagree here. After working on the user-space side and fixing
issues in various projects I can honestly say that most user-space is
sane and developers _care_ about doing things correctly.

> So at the end of the day userspace will most likely completely drop 
> support for the render node, simply for the reason that it became 
> superfluous. You can just open up the primary node and get the same 
> functionality.
> 
I think you're being overly pessimistic here. This is coming from a
person who is often closer to the pessimistic side of things.

As a middle ground how about we do the following:
 - Get this series as-is, alongside
 - picking the first three items from the grand master plan
   - happy to start a discussion about QXL
   - a patch for the Kconfig toggle, is coming in a few minutes
   - happy to prod my distribution of choice (Arch) and work with them

What do you think?

Thanks
Emil
Christian König June 14, 2019, 4 p.m. UTC | #34
Am 14.06.19 um 17:53 schrieb Emil Velikov:
> On 2019/06/14, Koenig, Christian wrote:
>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
>>> On 2019/05/27, Emil Velikov wrote:
>>> [SNIP]
>>> Hi Christian,
>>>
>>>
>>> In the following, I would like to summarise and emphasize the need for
>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
>>> extra reading it.
>>>
>>>
>>> Today DRM drivers* do not make any distinction between primary and
>>> render node clients.
>> That is actually not 100% correct. We have a special case where a DRM
>> master is allowed to change the priority of render node clients.
>>
> Can you provide a link? I cannot find that code.

See amdgpu_sched_ioctl().

>>> Thus for a render capable driver, any premise of
>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
>> now is the right direction to take.
>>
> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> ioctls.
>
> That aside, can you propose an alternative solution that addresses this
> and the second point just below?

Give me a few days to work on this, it's already Friday 6pm here.

Christian.

>
>>> Considering the examples of flaky or outright missing drmAuth in prime
>>> open-source projects (mesa, kmscube, libva) we can reasonably assume
>>> other projects exbibit the same problem.
>>>
>>> For these and/or other reasons, two DRM drivers have dropped DRM_AUTH
>>> since day one.
>>>
>>> Since we are interested in providing consistent and predictable
>>> behaviour to user-space, dropping DRM_AUTH seems to be the most
>>> effective way forward.
>> Well and what do you do with drivers which doesn't implement render nodes?
>>
> AFAICT there is a single non-DC driver which does not expose - QXL.
> I would expect it runs on a rather customised stack.
>
> Would be great to fix that, but ETIME and it's the only exception to the
> rule.
>
>
>>> Of course, I'd be more than happy to hear for any other way to achieve
>>> the goal outlined.
>>>
>>> Based on the series, other maintainers agree with the idea/goal here.
>>> Amdgpu not following the same pattern would compromise predictability
>>> across drivers and complicate userspace, so I would kindly ask you to
>>> reconsider.
>>>
>>> Alternatively, I see two ways to special case:
>>>    - New flag annotating amdgpu/radeon - akin to the one proposed earlier
>>>    - Check for amdgpu/radeon in core DRM
>> I perfectly agree that I don't want any special handling for amdgpu/radeon.
>>
>> My key point is that with DRM_AUTH we created an authorization and
>> authentication mess in DRM which is functionality which doesn't belong
>> into the DRM subsystem in the first place.
>>
> Precisely and I've outlined below how to resolve this in the long run.
>
>>> Now on your pain point - not allowing render iocts via the primary node,
>>> and how this patch could make it harder to achieve that goal.
>>>
>>> While I love the idea, there are number of obstacles that prevent us
>>> from doing so at this time:
>>>    - Ensuring both old and new userspace does not regress
>> Yeah, agree totally. That's why I said we should probably start doing
>> this for only for upcoming hardware generations.
>>
> That will side-step the regression issue, but will enforce driver
> specific behaviour outlined before.
>
>>>    - Drivers (QXL, others?) do not expose a render node
>> Well isn't that is a rather big problem for the removal of DRM_AUTH in
>> general?
>>
>> E.g. at least QXL would then expose functionality on the primary node
>> without authentication.
>>
> With this series QXL remains functionally unchanged. I would love to fix
> that as well yet ETIME as mentioned above.
>
>>>    - We want to avoid driver specific behaviour
>>>
>>> The only way forward that I can see is:
>>>    - Address QXL/others to expose render nodes
>>>    - Add a Kconfig toggle to disable !KMS ioctls via the primary node
>>>    - Jump-start ^^ with distribution X
>>>    - Fix user-space fallouts
>>>    - X months down the line, flip the Kconfig
>>>    - In case of regressions - revert the KConfig, goto Fix user-space...
>> Well that at least sounds like a plan :) Let's to this!
>>
> We're talking about months if not years until it comes to fruition. We
> need something quicker.
>
> That said, I'm quite happy to take the first stab, yet this is not a
> replacement for this series.
>
>>> That said, the proposal will not conflict with the DRM_AUTH removal. If
>>> anything it is step 0.5 of the grand master plan.
>> That's the point I strongly disagree on.
>>
>> By lowering the DRM_AUTH restriction you are encouraging userspace to
>> use the shortcut and use the primary node for rendering instead of
>> fixing the code and using the render node.
>>
> Have to disagree here. After working on the user-space side and fixing
> issues in various projects I can honestly say that most user-space is
> sane and developers _care_ about doing things correctly.
>
>> So at the end of the day userspace will most likely completely drop
>> support for the render node, simply for the reason that it became
>> superfluous. You can just open up the primary node and get the same
>> functionality.
>>
> I think you're being overly pessimistic here. This is coming from a
> person who is often closer to the pessimistic side of things.
>
> As a middle ground how about we do the following:
>   - Get this series as-is, alongside
>   - picking the first three items from the grand master plan
>     - happy to start a discussion about QXL
>     - a patch for the Kconfig toggle, is coming in a few minutes
>     - happy to prod my distribution of choice (Arch) and work with them
>
> What do you think?
>
> Thanks
> Emil
>
Emil Velikov June 14, 2019, 4:25 p.m. UTC | #35
On 2019/06/14, Koenig, Christian wrote:
> Am 14.06.19 um 17:53 schrieb Emil Velikov:
> > On 2019/06/14, Koenig, Christian wrote:
> >> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> >>> On 2019/05/27, Emil Velikov wrote:
> >>> [SNIP]
> >>> Hi Christian,
> >>>
> >>>
> >>> In the following, I would like to summarise and emphasize the need for
> >>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> >>> extra reading it.
> >>>
> >>>
> >>> Today DRM drivers* do not make any distinction between primary and
> >>> render node clients.
> >> That is actually not 100% correct. We have a special case where a DRM
> >> master is allowed to change the priority of render node clients.
> >>
> > Can you provide a link? I cannot find that code.
> 
> See amdgpu_sched_ioctl().
> 
> >>> Thus for a render capable driver, any premise of
> >>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> >> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
> >> now is the right direction to take.
> >>
> > Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> > ioctls.
> >
> > That aside, can you propose an alternative solution that addresses this
> > and the second point just below?
> 
> Give me a few days to work on this, it's already Friday 6pm here.
> 
Great thanks. Fwiw I was asking for a ideas/proposal, was not expecting you
to write the patches ;-)

Emil
Emil Velikov June 20, 2019, 4:30 p.m. UTC | #36
On 2019/06/14, Koenig, Christian wrote:
> Am 14.06.19 um 17:53 schrieb Emil Velikov:
> > On 2019/06/14, Koenig, Christian wrote:
> >> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> >>> On 2019/05/27, Emil Velikov wrote:
> >>> [SNIP]
> >>> Hi Christian,
> >>>
> >>>
> >>> In the following, I would like to summarise and emphasize the need for
> >>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> >>> extra reading it.
> >>>
> >>>
> >>> Today DRM drivers* do not make any distinction between primary and
> >>> render node clients.
> >> That is actually not 100% correct. We have a special case where a DRM
> >> master is allowed to change the priority of render node clients.
> >>
> > Can you provide a link? I cannot find that code.
> 
> See amdgpu_sched_ioctl().
> 
> >>> Thus for a render capable driver, any premise of
> >>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> >> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
> >> now is the right direction to take.
> >>
> > Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> > ioctls.
> >
> > That aside, can you propose an alternative solution that addresses this
> > and the second point just below?
> 
> Give me a few days to work on this, it's already Friday 6pm here.
> 
Hi Christian,

Any progress? As mentioned earlier, I'm OK with writing the patches although
I would love to hear your plan.

Thanks
Emil
Christian König June 21, 2019, 7:12 a.m. UTC | #37
Am 20.06.19 um 18:30 schrieb Emil Velikov:
> On 2019/06/14, Koenig, Christian wrote:
>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
>>> On 2019/06/14, Koenig, Christian wrote:
>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
>>>>> On 2019/05/27, Emil Velikov wrote:
>>>>> [SNIP]
>>>>> Hi Christian,
>>>>>
>>>>>
>>>>> In the following, I would like to summarise and emphasize the need for
>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
>>>>> extra reading it.
>>>>>
>>>>>
>>>>> Today DRM drivers* do not make any distinction between primary and
>>>>> render node clients.
>>>> That is actually not 100% correct. We have a special case where a DRM
>>>> master is allowed to change the priority of render node clients.
>>>>
>>> Can you provide a link? I cannot find that code.
>> See amdgpu_sched_ioctl().
>>
>>>>> Thus for a render capable driver, any premise of
>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
>>>> now is the right direction to take.
>>>>
>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
>>> ioctls.
>>>
>>> That aside, can you propose an alternative solution that addresses this
>>> and the second point just below?
>> Give me a few days to work on this, it's already Friday 6pm here.
>>
> Hi Christian,
>
> Any progress? As mentioned earlier, I'm OK with writing the patches although
> I would love to hear your plan.

First of all I tried to disable DRM authentication completely with a 
kernel config option. Surprisingly that actually works out of the box at 
least on the AMDGPU stack.

This effectively allows us to get rid of DRI2 and the related security 
problems. Only thing left for that is that I'm just not sure how to 
signal this to userspace so that the DDX wouldn't advertise DRI2 at all 
any more.


As a next step I looked into if we can disable the command submission 
for DRM master. Turned out that this is relatively easy as well.

All we have to do is to fix the bug Michel pointed out about KMS handles 
for display and let the DDX use a render node instead of the DRM master 
for Glamor. Still need to sync up with Michel and/or Marek whats the 
best way of doing this.


The last step (and that's what I haven't tried yet) would be to disable 
DRM authentication for Navi even when it is still compiled into the 
kernel. But that is more or less just a typing exercise.

Regards,
Christian.

>
> Thanks
> Emil
Michel Dänzer June 21, 2019, 7:41 a.m. UTC | #38
On 2019-06-21 9:12 a.m., Koenig, Christian wrote:
> 
> First of all I tried to disable DRM authentication completely with a 
> kernel config option. Surprisingly that actually works out of the box at 
> least on the AMDGPU stack.
> 
> This effectively allows us to get rid of DRI2 and the related security 
> problems. Only thing left for that is that I'm just not sure how to 
> signal this to userspace so that the DDX wouldn't advertise DRI2 at all 
> any more.

FWIW, getting rid of DRI2 also needs to be discussed with amdgpu-pro
OpenGL driver folks.


> As a next step I looked into if we can disable the command submission 
> for DRM master. Turned out that this is relatively easy as well.
> 
> All we have to do is to fix the bug Michel pointed out about KMS handles 
> for display

I'm working on that, consider it fixed.


> and let the DDX use a render node instead of the DRM master for Glamor.
> Still need to sync up with Michel and/or Marek whats the best way of
> doing this.

My suggestion was to add a new variant of amdgpu_device_initialize. When
the new variant is called, libdrm_amdgpu internally uses a render node
for command submission etc. whenever possible.
Christian König June 21, 2019, 8:23 a.m. UTC | #39
Am 21.06.19 um 09:41 schrieb Michel Dänzer:
> On 2019-06-21 9:12 a.m., Koenig, Christian wrote:
>> First of all I tried to disable DRM authentication completely with a
>> kernel config option. Surprisingly that actually works out of the box at
>> least on the AMDGPU stack.
>>
>> This effectively allows us to get rid of DRI2 and the related security
>> problems. Only thing left for that is that I'm just not sure how to
>> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
>> any more.
> FWIW, getting rid of DRI2 also needs to be discussed with amdgpu-pro
> OpenGL driver folks.

Good point, but I don't expect much problems from that direction.

IIRC they where quite happy to not have to support DRI2 except for a 
very very old X server version.

>> As a next step I looked into if we can disable the command submission
>> for DRM master. Turned out that this is relatively easy as well.
>>
>> All we have to do is to fix the bug Michel pointed out about KMS handles
>> for display
> I'm working on that, consider it fixed.
>
>
>> and let the DDX use a render node instead of the DRM master for Glamor.
>> Still need to sync up with Michel and/or Marek whats the best way of
>> doing this.
> My suggestion was to add a new variant of amdgpu_device_initialize. When
> the new variant is called, libdrm_amdgpu internally uses a render node
> for command submission etc. whenever possible.

Yeah, sounds like a plan to me.

Christian.
Daniel Vetter June 21, 2019, 9:09 a.m. UTC | #40
On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
> Am 20.06.19 um 18:30 schrieb Emil Velikov:
> > On 2019/06/14, Koenig, Christian wrote:
> >> Am 14.06.19 um 17:53 schrieb Emil Velikov:
> >>> On 2019/06/14, Koenig, Christian wrote:
> >>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> >>>>> On 2019/05/27, Emil Velikov wrote:
> >>>>> [SNIP]
> >>>>> Hi Christian,
> >>>>>
> >>>>>
> >>>>> In the following, I would like to summarise and emphasize the need for
> >>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> >>>>> extra reading it.
> >>>>>
> >>>>>
> >>>>> Today DRM drivers* do not make any distinction between primary and
> >>>>> render node clients.
> >>>> That is actually not 100% correct. We have a special case where a DRM
> >>>> master is allowed to change the priority of render node clients.
> >>>>
> >>> Can you provide a link? I cannot find that code.
> >> See amdgpu_sched_ioctl().
> >>
> >>>>> Thus for a render capable driver, any premise of
> >>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> >>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
> >>>> now is the right direction to take.
> >>>>
> >>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> >>> ioctls.
> >>>
> >>> That aside, can you propose an alternative solution that addresses this
> >>> and the second point just below?
> >> Give me a few days to work on this, it's already Friday 6pm here.
> >>
> > Hi Christian,
> >
> > Any progress? As mentioned earlier, I'm OK with writing the patches although
> > I would love to hear your plan.
> 
> First of all I tried to disable DRM authentication completely with a 
> kernel config option. Surprisingly that actually works out of the box at 
> least on the AMDGPU stack.
> 
> This effectively allows us to get rid of DRI2 and the related security 
> problems. Only thing left for that is that I'm just not sure how to 
> signal this to userspace so that the DDX wouldn't advertise DRI2 at all 
> any more.
> 
> 
> As a next step I looked into if we can disable the command submission 
> for DRM master. Turned out that this is relatively easy as well.
> 
> All we have to do is to fix the bug Michel pointed out about KMS handles 
> for display and let the DDX use a render node instead of the DRM master 
> for Glamor. Still need to sync up with Michel and/or Marek whats the 
> best way of doing this.
> 
> 
> The last step (and that's what I haven't tried yet) would be to disable 
> DRM authentication for Navi even when it is still compiled into the 
> kernel. But that is more or less just a typing exercise.

So just to get this straight, we're now full on embracing a subsystem
split here:
- amdgpu plans to ditch dri2/rendering on primary nodes
- bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
  removal
- others are just hanging in there somehow

"best of both^W worlds" ftw?
-Daniel
Christian König June 21, 2019, 9:25 a.m. UTC | #41
Am 21.06.19 um 11:09 schrieb Daniel Vetter:
> On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
>> Am 20.06.19 um 18:30 schrieb Emil Velikov:
>>> On 2019/06/14, Koenig, Christian wrote:
>>>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
>>>>> On 2019/06/14, Koenig, Christian wrote:
>>>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
>>>>>>> On 2019/05/27, Emil Velikov wrote:
>>>>>>> [SNIP]
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>>
>>>>>>> In the following, I would like to summarise and emphasize the need for
>>>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
>>>>>>> extra reading it.
>>>>>>>
>>>>>>>
>>>>>>> Today DRM drivers* do not make any distinction between primary and
>>>>>>> render node clients.
>>>>>> That is actually not 100% correct. We have a special case where a DRM
>>>>>> master is allowed to change the priority of render node clients.
>>>>>>
>>>>> Can you provide a link? I cannot find that code.
>>>> See amdgpu_sched_ioctl().
>>>>
>>>>>>> Thus for a render capable driver, any premise of
>>>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
>>>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
>>>>>> now is the right direction to take.
>>>>>>
>>>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
>>>>> ioctls.
>>>>>
>>>>> That aside, can you propose an alternative solution that addresses this
>>>>> and the second point just below?
>>>> Give me a few days to work on this, it's already Friday 6pm here.
>>>>
>>> Hi Christian,
>>>
>>> Any progress? As mentioned earlier, I'm OK with writing the patches although
>>> I would love to hear your plan.
>> First of all I tried to disable DRM authentication completely with a
>> kernel config option. Surprisingly that actually works out of the box at
>> least on the AMDGPU stack.
>>
>> This effectively allows us to get rid of DRI2 and the related security
>> problems. Only thing left for that is that I'm just not sure how to
>> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
>> any more.
>>
>>
>> As a next step I looked into if we can disable the command submission
>> for DRM master. Turned out that this is relatively easy as well.
>>
>> All we have to do is to fix the bug Michel pointed out about KMS handles
>> for display and let the DDX use a render node instead of the DRM master
>> for Glamor. Still need to sync up with Michel and/or Marek whats the
>> best way of doing this.
>>
>>
>> The last step (and that's what I haven't tried yet) would be to disable
>> DRM authentication for Navi even when it is still compiled into the
>> kernel. But that is more or less just a typing exercise.
> So just to get this straight, we're now full on embracing a subsystem
> split here:
> - amdgpu plans to ditch dri2/rendering on primary nodes
> - bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
>    removal
> - others are just hanging in there somehow
>
> "best of both^W worlds" ftw?

Yes, exactly!

Think a step further, when this is upstream we can apply the DRM_AUTH 
removal to amdgpu as well.

Because we simply made sure that userspace is using the render node for 
command submission and not the primary node.

So nobody can go ahead and remove render node support any more :)

Regards,
Christian.

> -Daniel
Daniel Vetter June 21, 2019, 9:35 a.m. UTC | #42
On Fri, Jun 21, 2019 at 11:25 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 21.06.19 um 11:09 schrieb Daniel Vetter:
> > On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
> >> Am 20.06.19 um 18:30 schrieb Emil Velikov:
> >>> On 2019/06/14, Koenig, Christian wrote:
> >>>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
> >>>>> On 2019/06/14, Koenig, Christian wrote:
> >>>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> >>>>>>> On 2019/05/27, Emil Velikov wrote:
> >>>>>>> [SNIP]
> >>>>>>> Hi Christian,
> >>>>>>>
> >>>>>>>
> >>>>>>> In the following, I would like to summarise and emphasize the need for
> >>>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> >>>>>>> extra reading it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Today DRM drivers* do not make any distinction between primary and
> >>>>>>> render node clients.
> >>>>>> That is actually not 100% correct. We have a special case where a DRM
> >>>>>> master is allowed to change the priority of render node clients.
> >>>>>>
> >>>>> Can you provide a link? I cannot find that code.
> >>>> See amdgpu_sched_ioctl().
> >>>>
> >>>>>>> Thus for a render capable driver, any premise of
> >>>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> >>>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
> >>>>>> now is the right direction to take.
> >>>>>>
> >>>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> >>>>> ioctls.
> >>>>>
> >>>>> That aside, can you propose an alternative solution that addresses this
> >>>>> and the second point just below?
> >>>> Give me a few days to work on this, it's already Friday 6pm here.
> >>>>
> >>> Hi Christian,
> >>>
> >>> Any progress? As mentioned earlier, I'm OK with writing the patches although
> >>> I would love to hear your plan.
> >> First of all I tried to disable DRM authentication completely with a
> >> kernel config option. Surprisingly that actually works out of the box at
> >> least on the AMDGPU stack.
> >>
> >> This effectively allows us to get rid of DRI2 and the related security
> >> problems. Only thing left for that is that I'm just not sure how to
> >> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
> >> any more.
> >>
> >>
> >> As a next step I looked into if we can disable the command submission
> >> for DRM master. Turned out that this is relatively easy as well.
> >>
> >> All we have to do is to fix the bug Michel pointed out about KMS handles
> >> for display and let the DDX use a render node instead of the DRM master
> >> for Glamor. Still need to sync up with Michel and/or Marek whats the
> >> best way of doing this.
> >>
> >>
> >> The last step (and that's what I haven't tried yet) would be to disable
> >> DRM authentication for Navi even when it is still compiled into the
> >> kernel. But that is more or less just a typing exercise.
> > So just to get this straight, we're now full on embracing a subsystem
> > split here:
> > - amdgpu plans to ditch dri2/rendering on primary nodes
> > - bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
> >    removal
> > - others are just hanging in there somehow
> >
> > "best of both^W worlds" ftw?
>
> Yes, exactly!
>
> Think a step further, when this is upstream we can apply the DRM_AUTH
> removal to amdgpu as well.
>
> Because we simply made sure that userspace is using the render node for
> command submission and not the primary node.
>
> So nobody can go ahead and remove render node support any more :)

How does this work? I thought the entire problem of DRM_AUTH removal
for you was that it broke stuff, and you didn't want to deal with
that. We still have that exact problem afaics ... old userspace
doesn't simply vanish, even if you entirely nuke it going forward.

Also, testing on the amdgpu stack isn't really testing a hole lot
here, it's all the various silly compositors out there I'd expect to
fall over. Will gbm/radeonsi/whatever just internally do all the
necessary dma-buf import/export between the primary node and display
node to keep this all working?
-Daniel
Christian König June 21, 2019, 10:16 a.m. UTC | #43
Am 21.06.19 um 11:35 schrieb Daniel Vetter:
> On Fri, Jun 21, 2019 at 11:25 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 21.06.19 um 11:09 schrieb Daniel Vetter:
>>> On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
>>>> Am 20.06.19 um 18:30 schrieb Emil Velikov:
>>>>> On 2019/06/14, Koenig, Christian wrote:
>>>>>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
>>>>>>> On 2019/06/14, Koenig, Christian wrote:
>>>>>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
>>>>>>>>> On 2019/05/27, Emil Velikov wrote:
>>>>>>>>> [SNIP]
>>>>>>>>> Hi Christian,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the following, I would like to summarise and emphasize the need for
>>>>>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
>>>>>>>>> extra reading it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Today DRM drivers* do not make any distinction between primary and
>>>>>>>>> render node clients.
>>>>>>>> That is actually not 100% correct. We have a special case where a DRM
>>>>>>>> master is allowed to change the priority of render node clients.
>>>>>>>>
>>>>>>> Can you provide a link? I cannot find that code.
>>>>>> See amdgpu_sched_ioctl().
>>>>>>
>>>>>>>>> Thus for a render capable driver, any premise of
>>>>>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
>>>>>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
>>>>>>>> now is the right direction to take.
>>>>>>>>
>>>>>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
>>>>>>> ioctls.
>>>>>>>
>>>>>>> That aside, can you propose an alternative solution that addresses this
>>>>>>> and the second point just below?
>>>>>> Give me a few days to work on this, it's already Friday 6pm here.
>>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> Any progress? As mentioned earlier, I'm OK with writing the patches although
>>>>> I would love to hear your plan.
>>>> First of all I tried to disable DRM authentication completely with a
>>>> kernel config option. Surprisingly that actually works out of the box at
>>>> least on the AMDGPU stack.
>>>>
>>>> This effectively allows us to get rid of DRI2 and the related security
>>>> problems. Only thing left for that is that I'm just not sure how to
>>>> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
>>>> any more.
>>>>
>>>>
>>>> As a next step I looked into if we can disable the command submission
>>>> for DRM master. Turned out that this is relatively easy as well.
>>>>
>>>> All we have to do is to fix the bug Michel pointed out about KMS handles
>>>> for display and let the DDX use a render node instead of the DRM master
>>>> for Glamor. Still need to sync up with Michel and/or Marek whats the
>>>> best way of doing this.
>>>>
>>>>
>>>> The last step (and that's what I haven't tried yet) would be to disable
>>>> DRM authentication for Navi even when it is still compiled into the
>>>> kernel. But that is more or less just a typing exercise.
>>> So just to get this straight, we're now full on embracing a subsystem
>>> split here:
>>> - amdgpu plans to ditch dri2/rendering on primary nodes
>>> - bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
>>>     removal
>>> - others are just hanging in there somehow
>>>
>>> "best of both^W worlds" ftw?
>> Yes, exactly!
>>
>> Think a step further, when this is upstream we can apply the DRM_AUTH
>> removal to amdgpu as well.
>>
>> Because we simply made sure that userspace is using the render node for
>> command submission and not the primary node.
>>
>> So nobody can go ahead and remove render node support any more :)
> How does this work? I thought the entire problem of DRM_AUTH removal
> for you was that it broke stuff, and you didn't want to deal with
> that.

Yeah, that is indeed still true.

It's just that we have done way to many projects with radeon/amdgpu and 
customized userspace stuff.

> We still have that exact problem afaics ... old userspace
> doesn't simply vanish, even if you entirely nuke it going forward.

Well old userspace doesn't work with new hardware either.

So the idea is to keep old userspace for old hardware working, but only 
disable old stuff for new hardware.

> Also, testing on the amdgpu stack isn't really testing a hole lot
> here, it's all the various silly compositors out there I'd expect to
> fall over. Will gbm/radeonsi/whatever just internally do all the
> necessary dma-buf import/export between the primary node and display
> node to keep this all working?

Yes, at least that's how I understand Michel's idea.

We keep both file descriptors for primary and render node around at the 
same time anyway. So the change is actually not that much.

This also solves the problem that people are running things as root, 
since we now always use the render node and never the primary node for 
everything except KMS.

Christian.

> -Daniel
Emil Velikov June 21, 2019, 10:20 a.m. UTC | #44
On 2019/06/21, Daniel Vetter wrote:
> On Fri, Jun 21, 2019 at 11:25 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> >
> > Am 21.06.19 um 11:09 schrieb Daniel Vetter:
> > > On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
> > >> Am 20.06.19 um 18:30 schrieb Emil Velikov:
> > >>> On 2019/06/14, Koenig, Christian wrote:
> > >>>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
> > >>>>> On 2019/06/14, Koenig, Christian wrote:
> > >>>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
> > >>>>>>> On 2019/05/27, Emil Velikov wrote:
> > >>>>>>> [SNIP]
> > >>>>>>> Hi Christian,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> In the following, I would like to summarise and emphasize the need for
> > >>>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
> > >>>>>>> extra reading it.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Today DRM drivers* do not make any distinction between primary and
> > >>>>>>> render node clients.
> > >>>>>> That is actually not 100% correct. We have a special case where a DRM
> > >>>>>> master is allowed to change the priority of render node clients.
> > >>>>>>
> > >>>>> Can you provide a link? I cannot find that code.
> > >>>> See amdgpu_sched_ioctl().
> > >>>>
> > >>>>>>> Thus for a render capable driver, any premise of
> > >>>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
> > >>>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
> > >>>>>> now is the right direction to take.
> > >>>>>>
> > >>>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
> > >>>>> ioctls.
> > >>>>>
> > >>>>> That aside, can you propose an alternative solution that addresses this
> > >>>>> and the second point just below?
> > >>>> Give me a few days to work on this, it's already Friday 6pm here.
> > >>>>
> > >>> Hi Christian,
> > >>>
> > >>> Any progress? As mentioned earlier, I'm OK with writing the patches although
> > >>> I would love to hear your plan.
> > >> First of all I tried to disable DRM authentication completely with a
> > >> kernel config option. Surprisingly that actually works out of the box at
> > >> least on the AMDGPU stack.
> > >>
> > >> This effectively allows us to get rid of DRI2 and the related security
> > >> problems. Only thing left for that is that I'm just not sure how to
> > >> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
> > >> any more.
> > >>
> > >>
> > >> As a next step I looked into if we can disable the command submission
> > >> for DRM master. Turned out that this is relatively easy as well.
> > >>
> > >> All we have to do is to fix the bug Michel pointed out about KMS handles
> > >> for display and let the DDX use a render node instead of the DRM master
> > >> for Glamor. Still need to sync up with Michel and/or Marek whats the
> > >> best way of doing this.
> > >>
> > >>
> > >> The last step (and that's what I haven't tried yet) would be to disable
> > >> DRM authentication for Navi even when it is still compiled into the
> > >> kernel. But that is more or less just a typing exercise.
> > > So just to get this straight, we're now full on embracing a subsystem
> > > split here:
> > > - amdgpu plans to ditch dri2/rendering on primary nodes
> > > - bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
> > >    removal
> > > - others are just hanging in there somehow
> > >
> > > "best of both^W worlds" ftw?
> >
> > Yes, exactly!
> >
> > Think a step further, when this is upstream we can apply the DRM_AUTH
> > removal to amdgpu as well.
> >
So this is effectively what I suggested/planned with a couple of caveats:
 - making amdgpu behave differently from the rest of drm ;-(
 - having the KConfig amdgpu only and merged before this DRM_AUTH

While I suggested:
 - keeping amdgpu consistent with the rest
 - exposing the KConfig option for the whole of the kernel, and
 - merging it alongside this work

So I effectively waited for weeks, instead of simply going ahead and
writing/submitting patches. That's a bit unfortunate...

> > Because we simply made sure that userspace is using the render node for
> > command submission and not the primary node.
> >
> > So nobody can go ahead and remove render node support any more :)
> 
> How does this work? I thought the entire problem of DRM_AUTH removal
> for you was that it broke stuff, and you didn't want to deal with
> that. We still have that exact problem afaics ... old userspace
> doesn't simply vanish, even if you entirely nuke it going forward.
> 
> Also, testing on the amdgpu stack isn't really testing a hole lot
> here, it's all the various silly compositors out there I'd expect to
> fall over. Will gbm/radeonsi/whatever just internally do all the
> necessary dma-buf import/export between the primary node and display
> node to keep this all working?

If I understood Christian, feel free to correct me, the fact that my
earlier patch broke RADV was not the argument. The problem was was the
patch does.

In particular, that user-space will "remove" render nodes.

I'm really sad that amdgpu insists on standing out, hope one day it will
converge. Yet since all other maintainers are ok with the series, I'll
start merging patches in a few hours. I'm really sad that amdgpu wants
to stand out, hope it will converge sooner rather than later.

Christian, how would you like amdgpu handled - with a separate flag in
the driver or shall we special case amdgpu within core drm?

Thanks
Emil
Christian König June 21, 2019, 10:31 a.m. UTC | #45
Am 21.06.19 um 12:20 schrieb Emil Velikov:
> On 2019/06/21, Daniel Vetter wrote:
>> On Fri, Jun 21, 2019 at 11:25 AM Koenig, Christian
>> <Christian.Koenig@amd.com> wrote:
>>> Am 21.06.19 um 11:09 schrieb Daniel Vetter:
>>>> On Fri, Jun 21, 2019 at 07:12:14AM +0000, Koenig, Christian wrote:
>>>>> Am 20.06.19 um 18:30 schrieb Emil Velikov:
>>>>>> On 2019/06/14, Koenig, Christian wrote:
>>>>>>> Am 14.06.19 um 17:53 schrieb Emil Velikov:
>>>>>>>> On 2019/06/14, Koenig, Christian wrote:
>>>>>>>>> Am 14.06.19 um 14:09 schrieb Emil Velikov:
>>>>>>>>>> On 2019/05/27, Emil Velikov wrote:
>>>>>>>>>> [SNIP]
>>>>>>>>>> Hi Christian,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the following, I would like to summarise and emphasize the need for
>>>>>>>>>> DRM_AUTH removal. I would kindly ask you to spend a couple of minutes
>>>>>>>>>> extra reading it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Today DRM drivers* do not make any distinction between primary and
>>>>>>>>>> render node clients.
>>>>>>>>> That is actually not 100% correct. We have a special case where a DRM
>>>>>>>>> master is allowed to change the priority of render node clients.
>>>>>>>>>
>>>>>>>> Can you provide a link? I cannot find that code.
>>>>>>> See amdgpu_sched_ioctl().
>>>>>>>
>>>>>>>>>> Thus for a render capable driver, any premise of
>>>>>>>>>> separation, security or otherwise imposed via DRM_AUTH is a fallacy.
>>>>>>>>> Yeah, that's what I agree on. I just don't think that removing DRM_AUTH
>>>>>>>>> now is the right direction to take.
>>>>>>>>>
>>>>>>>> Could have been clearer - I'm talking about DRM_AUTH | DRM_RENDER_ALLOW
>>>>>>>> ioctls.
>>>>>>>>
>>>>>>>> That aside, can you propose an alternative solution that addresses this
>>>>>>>> and the second point just below?
>>>>>>> Give me a few days to work on this, it's already Friday 6pm here.
>>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> Any progress? As mentioned earlier, I'm OK with writing the patches although
>>>>>> I would love to hear your plan.
>>>>> First of all I tried to disable DRM authentication completely with a
>>>>> kernel config option. Surprisingly that actually works out of the box at
>>>>> least on the AMDGPU stack.
>>>>>
>>>>> This effectively allows us to get rid of DRI2 and the related security
>>>>> problems. Only thing left for that is that I'm just not sure how to
>>>>> signal this to userspace so that the DDX wouldn't advertise DRI2 at all
>>>>> any more.
>>>>>
>>>>>
>>>>> As a next step I looked into if we can disable the command submission
>>>>> for DRM master. Turned out that this is relatively easy as well.
>>>>>
>>>>> All we have to do is to fix the bug Michel pointed out about KMS handles
>>>>> for display and let the DDX use a render node instead of the DRM master
>>>>> for Glamor. Still need to sync up with Michel and/or Marek whats the
>>>>> best way of doing this.
>>>>>
>>>>>
>>>>> The last step (and that's what I haven't tried yet) would be to disable
>>>>> DRM authentication for Navi even when it is still compiled into the
>>>>> kernel. But that is more or less just a typing exercise.
>>>> So just to get this straight, we're now full on embracing a subsystem
>>>> split here:
>>>> - amdgpu plans to ditch dri2/rendering on primary nodes
>>>> - bunch of drivers (I think more than i915 by now) merged the DRM_AUTH
>>>>     removal
>>>> - others are just hanging in there somehow
>>>>
>>>> "best of both^W worlds" ftw?
>>> Yes, exactly!
>>>
>>> Think a step further, when this is upstream we can apply the DRM_AUTH
>>> removal to amdgpu as well.
>>>
> So this is effectively what I suggested/planned with a couple of caveats:
>   - making amdgpu behave differently from the rest of drm ;-(
>   - having the KConfig amdgpu only and merged before this DRM_AUTH

No this should apply to all drivers, not just amdgpu.

> While I suggested:
>   - keeping amdgpu consistent with the rest
>   - exposing the KConfig option for the whole of the kernel, and
>   - merging it alongside this work
>
> So I effectively waited for weeks, instead of simply going ahead and
> writing/submitting patches. That's a bit unfortunate...
>
>>> Because we simply made sure that userspace is using the render node for
>>> command submission and not the primary node.
>>>
>>> So nobody can go ahead and remove render node support any more :)
>> How does this work? I thought the entire problem of DRM_AUTH removal
>> for you was that it broke stuff, and you didn't want to deal with
>> that. We still have that exact problem afaics ... old userspace
>> doesn't simply vanish, even if you entirely nuke it going forward.
>>
>> Also, testing on the amdgpu stack isn't really testing a hole lot
>> here, it's all the various silly compositors out there I'd expect to
>> fall over. Will gbm/radeonsi/whatever just internally do all the
>> necessary dma-buf import/export between the primary node and display
>> node to keep this all working?
> If I understood Christian, feel free to correct me, the fact that my
> earlier patch broke RADV was not the argument. The problem was was the
> patch does.

Well partially. That RADV broke is unfortunate, but we have done so many 
customized userspace stuff both based on Mesa/DDX as well as closed 
source components that it is just highly likely that we would break 
something else as well.

> In particular, that user-space will "remove" render nodes.

Yes, that is my main concern here. You basically make render nodes 
superfluously. That gives userspace all legitimization to also remove 
support for them. That is not stupid or evil or whatever, userspace 
would just follow the kernel design.

> I'm really sad that amdgpu insists on standing out, hope one day it will
> converge. Yet since all other maintainers are ok with the series, I'll
> start merging patches in a few hours. I'm really sad that amdgpu wants
> to stand out, hope it will converge sooner rather than later.
>
> Christian, how would you like amdgpu handled - with a separate flag in
> the driver or shall we special case amdgpu within core drm?

No, I ask you to please stick around DRM_AUTH for at least amdgpu and 
radeon. And NOT enable any render node functionality for them on the 
primary node.

Thanks,
Christian.

>
> Thanks
> Emil
Emil Velikov June 21, 2019, 10:53 a.m. UTC | #46
On 2019/06/21, Koenig, Christian wrote:
> No this should apply to all drivers, not just amdgpu.
> 
> > While I suggested:
> >   - keeping amdgpu consistent with the rest
> >   - exposing the KConfig option for the whole of the kernel, and
> >   - merging it alongside this work
> >
> > So I effectively waited for weeks, instead of simply going ahead and
> > writing/submitting patches. That's a bit unfortunate...
> >
> >>> Because we simply made sure that userspace is using the render node for
> >>> command submission and not the primary node.
> >>>
> >>> So nobody can go ahead and remove render node support any more :)
> >> How does this work? I thought the entire problem of DRM_AUTH removal
> >> for you was that it broke stuff, and you didn't want to deal with
> >> that. We still have that exact problem afaics ... old userspace
> >> doesn't simply vanish, even if you entirely nuke it going forward.
> >>
> >> Also, testing on the amdgpu stack isn't really testing a hole lot
> >> here, it's all the various silly compositors out there I'd expect to
> >> fall over. Will gbm/radeonsi/whatever just internally do all the
> >> necessary dma-buf import/export between the primary node and display
> >> node to keep this all working?
> > If I understood Christian, feel free to correct me, the fact that my
> > earlier patch broke RADV was not the argument. The problem was was the
> > patch does.
> 
> Well partially. That RADV broke is unfortunate, but we have done so many 
> customized userspace stuff both based on Mesa/DDX as well as closed 
> source components that it is just highly likely that we would break 
> something else as well.
> 
As an engineer I like to work with tangibles. The highly likely part is
probable, but as mentioned multiple times the series allows for a _dead_
trivial way to address any such problems.

> > In particular, that user-space will "remove" render nodes.
> 
> Yes, that is my main concern here. You basically make render nodes 
> superfluously. That gives userspace all legitimization to also remove 
> support for them. That is not stupid or evil or whatever, userspace 
> would just follow the kernel design.
> 
Do you have an example of userspace doing such an extremely silly thing?
It does seem like suspect you're a couple of steps beyond overcautious,
perhaps rightfully so. Maybe you've seen some closed-source user-space
going crazy? Or any other projects?

In other words, being cautious is great, but without references of
misuse it's very hard for others to draw the full picture.

> > I'm really sad that amdgpu insists on standing out, hope one day it will
> > converge. Yet since all other maintainers are ok with the series, I'll
> > start merging patches in a few hours. I'm really sad that amdgpu wants
> > to stand out, hope it will converge sooner rather than later.
> >
> > Christian, how would you like amdgpu handled - with a separate flag in
> > the driver or shall we special case amdgpu within core drm?
> 
> No, I ask you to please stick around DRM_AUTH for at least amdgpu and 
> radeon. And NOT enable any render node functionality for them on the 
> primary node.
> 
My question is how do you want this handled:
 - DRM_PLEASE_FORCE_AUTH - added to AMDGPU/RADEON, or
 - driver_name == amdgpu, in core DRM.


Thanks
Emil
Daniel Vetter June 21, 2019, 11:03 a.m. UTC | #47
On Fri, Jun 21, 2019 at 12:31 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
> Am 21.06.19 um 12:20 schrieb Emil Velikov:
> > In particular, that user-space will "remove" render nodes.
>
> Yes, that is my main concern here. You basically make render nodes
> superfluously. That gives userspace all legitimization to also remove
> support for them. That is not stupid or evil or whatever, userspace
> would just follow the kernel design.

This already happened. At least for kms-only display socs we had to
hide the separate render node (and there you really have to deal with
the separate render node, because it's a distinct driver) entirely
within gbm/mesa.

So if you want to depracate render functionality on primary nodes, you
_have_ to do that hiding in userspace. Or you'll break a lot of
compositors everywhere. Just testing -amdgpu doesn't really prove
anything here. So you won't move the larger ecosystem with this at
all, that ship sailed.

At that point this all feels like a bikeshed, and for a bikeshed I
don't care what the color is we pick, as long as they're all painted
the same.

Once we picked a color it's a simple technical matter of how to roll
it out, using Kconfig options, or only enabling on new hw, or "merge
and fix the regressions as they come" or any of the other well proven
paths forward.

So if you want to do this, please start with the mesa side work (as
the biggest userspace, not all of it) with patches to redirect all
primary node rendering to render nodes. The code is there already for
socs, just needs to be rolled out more. Or we go with the DRM_AUTH
horrors. Or a 3rd option, I really don't care which it is, as long as
its consistent.

tldr; consistent color choice on this bikeshed please.

Thanks, Daniel
Christian König June 21, 2019, 11:07 a.m. UTC | #48
Am 21.06.19 um 12:53 schrieb Emil Velikov:
> On 2019/06/21, Koenig, Christian wrote:
>> [SNIP]
>> Well partially. That RADV broke is unfortunate, but we have done so many
>> customized userspace stuff both based on Mesa/DDX as well as closed
>> source components that it is just highly likely that we would break
>> something else as well.
>>
> As an engineer I like to work with tangibles. The highly likely part is
> probable, but as mentioned multiple times the series allows for a _dead_
> trivial way to address any such problems.

Well to clarify my concern is that this won't be so trivial.

You implemented on workaround for one specific case and it is perfectly 
possible that for other cases we would have to completely revert the 
removal of DRM_AUTH.

>>> In particular, that user-space will "remove" render nodes.
>> Yes, that is my main concern here. You basically make render nodes
>> superfluously. That gives userspace all legitimization to also remove
>> support for them. That is not stupid or evil or whatever, userspace
>> would just follow the kernel design.
>>
> Do you have an example of userspace doing such an extremely silly thing?
> It does seem like suspect you're a couple of steps beyond overcautious,
> perhaps rightfully so. Maybe you've seen some closed-source user-space
> going crazy? Or any other projects?

The key point is that I don't think this is silly or strange or crazy at 
all. See the kernel defines the interface userspace can and should use.

When the kernel defines that everything will work with the primary node 
it is perfectly valid for userspace to drop support for the render node.

I mean why should they keep this? Just because we tell them not to do this?

> In other words, being cautious is great, but without references of
> misuse it's very hard for others to draw the full picture.
>
>>> I'm really sad that amdgpu insists on standing out, hope one day it will
>>> converge. Yet since all other maintainers are ok with the series, I'll
>>> start merging patches in a few hours. I'm really sad that amdgpu wants
>>> to stand out, hope it will converge sooner rather than later.
>>>
>>> Christian, how would you like amdgpu handled - with a separate flag in
>>> the driver or shall we special case amdgpu within core drm?
>> No, I ask you to please stick around DRM_AUTH for at least amdgpu and
>> radeon. And NOT enable any render node functionality for them on the
>> primary node.
>>
> My question is how do you want this handled:
>   - DRM_PLEASE_FORCE_AUTH - added to AMDGPU/RADEON, or
>   - driver_name == amdgpu, in core DRM.

I want to keep DRM_AUTH in amdgpu and radeon for at least the next 5-10 
years.

At least until we have established that nobody is using the primary node 
for command submission any more. Plus some grace time to make sure that 
this has been spread enough.

Regards,
Christian.

>
>
> Thanks
> Emil
Christian König June 21, 2019, 11:37 a.m. UTC | #49
Am 21.06.19 um 13:03 schrieb Daniel Vetter:
> On Fri, Jun 21, 2019 at 12:31 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 21.06.19 um 12:20 schrieb Emil Velikov:
>>> In particular, that user-space will "remove" render nodes.
>> Yes, that is my main concern here. You basically make render nodes
>> superfluously. That gives userspace all legitimization to also remove
>> support for them. That is not stupid or evil or whatever, userspace
>> would just follow the kernel design.
> This already happened. At least for kms-only display socs we had to
> hide the separate render node (and there you really have to deal with
> the separate render node, because it's a distinct driver) entirely
> within gbm/mesa.

Ok, that is something I didn't knew before and is rather interesting.

> So if you want to depracate render functionality on primary nodes, you
> _have_ to do that hiding in userspace. Or you'll break a lot of
> compositors everywhere. Just testing -amdgpu doesn't really prove
> anything here. So you won't move the larger ecosystem with this at
> all, that ship sailed.

So what else can we do? That sounds like you suggest we should 
completely drop render node functionality.

I mean it's not my preferred option, but certainly something that 
everybody could do.

My primary concern is that we somehow need to get rid of thinks like GEM 
flink and all that other crufty stuff we still have around on the 
primary node (we should probably make a list of that).

Switching everything over to render nodes just sounded like the best 
alternative so far to archive that.

> At that point this all feels like a bikeshed, and for a bikeshed I
> don't care what the color is we pick, as long as they're all painted
> the same.
>
> Once we picked a color it's a simple technical matter of how to roll
> it out, using Kconfig options, or only enabling on new hw, or "merge
> and fix the regressions as they come" or any of the other well proven
> paths forward.

Yeah, the problem is I don't see an option which currently works for 
everyone.

I absolutely need a grace time of multiple years until we can apply this 
to amdgpu/radeon.

And that under the prerequisite to have a plan to somehow enable that 
functionality now to make it at least painful for userspace to rely on 
hack around that.

> So if you want to do this, please start with the mesa side work (as
> the biggest userspace, not all of it) with patches to redirect all
> primary node rendering to render nodes. The code is there already for
> socs, just needs to be rolled out more. Or we go with the DRM_AUTH
> horrors. Or a 3rd option, I really don't care which it is, as long as
> its consistent.

How about this:
1. We keep DRM_AUTH around for amdgpu/radeon for now.
2. We add a Kconfig option to ignore DRM_AUTH, currently default to N. 
This also works as a workaround for old RADV.
3. We start to work on further removing old cruft from the primary node. 
Possible the Kconfig option I suggested to disable GEM flink.

Regards,
Christian.

>
> tldr; consistent color choice on this bikeshed please.
>
> Thanks, Daniel
Daniel Vetter June 21, 2019, 11:50 a.m. UTC | #50
On Fri, Jun 21, 2019 at 1:37 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 21.06.19 um 13:03 schrieb Daniel Vetter:
> > On Fri, Jun 21, 2019 at 12:31 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 21.06.19 um 12:20 schrieb Emil Velikov:
> >>> In particular, that user-space will "remove" render nodes.
> >> Yes, that is my main concern here. You basically make render nodes
> >> superfluously. That gives userspace all legitimization to also remove
> >> support for them. That is not stupid or evil or whatever, userspace
> >> would just follow the kernel design.
> > This already happened. At least for kms-only display socs we had to
> > hide the separate render node (and there you really have to deal with
> > the separate render node, because it's a distinct driver) entirely
> > within gbm/mesa.
>
> Ok, that is something I didn't knew before and is rather interesting.
>
> > So if you want to depracate render functionality on primary nodes, you
> > _have_ to do that hiding in userspace. Or you'll break a lot of
> > compositors everywhere. Just testing -amdgpu doesn't really prove
> > anything here. So you won't move the larger ecosystem with this at
> > all, that ship sailed.
>
> So what else can we do? That sounds like you suggest we should
> completely drop render node functionality.
>
> I mean it's not my preferred option, but certainly something that
> everybody could do.
>
> My primary concern is that we somehow need to get rid of thinks like GEM
> flink and all that other crufty stuff we still have around on the
> primary node (we should probably make a list of that).
>
> Switching everything over to render nodes just sounded like the best
> alternative so far to archive that.

tbh I do like that plan too, but it's a lot more work. And I think to
have any push whatsoever we probably need to roll it out in gbm as a
hack to keep things going. But probably not enough.

I also think that at least some compositors will bother to do the
right thing, and actually bother with render nodes and all that
correctly. It's just that there's way more which dont.

Also for server rendering it'll be render nodes all the way down I
hope (or we need to seriously educate cloud people about the
permissions they set on their default images, and distros on how this
cloud stuff should work.

> > At that point this all feels like a bikeshed, and for a bikeshed I
> > don't care what the color is we pick, as long as they're all painted
> > the same.
> >
> > Once we picked a color it's a simple technical matter of how to roll
> > it out, using Kconfig options, or only enabling on new hw, or "merge
> > and fix the regressions as they come" or any of the other well proven
> > paths forward.
>
> Yeah, the problem is I don't see an option which currently works for
> everyone.
>
> I absolutely need a grace time of multiple years until we can apply this
> to amdgpu/radeon.

Yeah that's what I meant with "pick a color, pick a way to roll it
out". "enable for new hw, roll out years and years later" is one of
the options for roll out.

> And that under the prerequisite to have a plan to somehow enable that
> functionality now to make it at least painful for userspace to rely on
> hack around that.
>
> > So if you want to do this, please start with the mesa side work (as
> > the biggest userspace, not all of it) with patches to redirect all
> > primary node rendering to render nodes. The code is there already for
> > socs, just needs to be rolled out more. Or we go with the DRM_AUTH
> > horrors. Or a 3rd option, I really don't care which it is, as long as
> > its consistent.
>
> How about this:
> 1. We keep DRM_AUTH around for amdgpu/radeon for now.
> 2. We add a Kconfig option to ignore DRM_AUTH, currently default to N.
> This also works as a workaround for old RADV.
> 3. We start to work on further removing old cruft from the primary node.
> Possible the Kconfig option I suggested to disable GEM flink.

Hm I'm not worried about flink really. It's gem_open which is the
security gap, and that one has to stay master-only forever. I guess we
could try to disable that so people have to deal with dma-buf (and
once you have that render nodes become a lot easier to use). But then
I still think we have drivers which don't do dma-buf self-import, so
again we're stuck. So maybe first step is to just roll out a default
self-import dma-buf support for every gem driver. Then start ditching
flink/gem_open. Then start ditching render support on primary nodes.
Every step in the way taking a few years of prodding userspace plus
even more years to wait until it's all gone.

Another option would be to extract the kms stuff from primary nodes,
but we've tried that with control nodes. Until I realized a few years
back that with control nodes it's impossible to get any rendered
buffer in there at all, so useless, and I removed it. No one ever
complained.

So yeah would be really nice if we could fix this, but the universe
conspires against us too much it seems. Hence the fallback of "please
at least lets aim for a consistent color for this mess, whatever it
is".

Cheers, Daniel
Emil Velikov June 21, 2019, 11:58 a.m. UTC | #51
On 2019/06/21, Koenig, Christian wrote:
> Am 21.06.19 um 12:53 schrieb Emil Velikov:
> > On 2019/06/21, Koenig, Christian wrote:
> >> [SNIP]
> >> Well partially. That RADV broke is unfortunate, but we have done so many
> >> customized userspace stuff both based on Mesa/DDX as well as closed
> >> source components that it is just highly likely that we would break
> >> something else as well.
> >>
> > As an engineer I like to work with tangibles. The highly likely part is
> > probable, but as mentioned multiple times the series allows for a _dead_
> > trivial way to address any such problems.
> 
> Well to clarify my concern is that this won't be so trivial.
> 
> You implemented on workaround for one specific case and it is perfectly 
> possible that for other cases we would have to completely revert the 
> removal of DRM_AUTH.
> 
I would encourage you to take a closer look at my patch and point out
how parcicular cases cannot be handled.

> >>> In particular, that user-space will "remove" render nodes.
> >> Yes, that is my main concern here. You basically make render nodes
> >> superfluously. That gives userspace all legitimization to also remove
> >> support for them. That is not stupid or evil or whatever, userspace
> >> would just follow the kernel design.
> >>
> > Do you have an example of userspace doing such an extremely silly thing?
> > It does seem like suspect you're a couple of steps beyond overcautious,
> > perhaps rightfully so. Maybe you've seen some closed-source user-space
> > going crazy? Or any other projects?
> 
> The key point is that I don't think this is silly or strange or crazy at 
> all. See the kernel defines the interface userspace can and should use.
> 
> When the kernel defines that everything will work with the primary node 
> it is perfectly valid for userspace to drop support for the render node.
> 
> I mean why should they keep this? Just because we tell them not to do this?
> 
From your experiense, do user-space developers care about doing (or
generally do) the right thing?

In either case, as pointed already the cat is out of the bag - has been
for years, and if developers did behave as you describe them they would
have "removed" render nodes already.

> > In other words, being cautious is great, but without references of
> > misuse it's very hard for others to draw the full picture.
> >
> >>> I'm really sad that amdgpu insists on standing out, hope one day it will
> >>> converge. Yet since all other maintainers are ok with the series, I'll
> >>> start merging patches in a few hours. I'm really sad that amdgpu wants
> >>> to stand out, hope it will converge sooner rather than later.
> >>>
> >>> Christian, how would you like amdgpu handled - with a separate flag in
> >>> the driver or shall we special case amdgpu within core drm?
> >> No, I ask you to please stick around DRM_AUTH for at least amdgpu and
> >> radeon. And NOT enable any render node functionality for them on the
> >> primary node.
> >>
> > My question is how do you want this handled:
> >   - DRM_PLEASE_FORCE_AUTH - added to AMDGPU/RADEON, or
> >   - driver_name == amdgpu, in core DRM.
> 
> I want to keep DRM_AUTH in amdgpu and radeon for at least the next 5-10 
> years.
> 
Believe we're all fully aware of that fact. I'm asking which _approach_
you prefer. That said, I'll send a new patch/series and we'll nitpick it
there.

Thanks
-Emil
Daniel Vetter June 21, 2019, 11:59 a.m. UTC | #52
On Fri, Jun 21, 2019 at 1:50 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, Jun 21, 2019 at 1:37 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Am 21.06.19 um 13:03 schrieb Daniel Vetter:
> > > On Fri, Jun 21, 2019 at 12:31 PM Koenig, Christian
> > > <Christian.Koenig@amd.com> wrote:
> > >> Am 21.06.19 um 12:20 schrieb Emil Velikov:
> > >>> In particular, that user-space will "remove" render nodes.
> > >> Yes, that is my main concern here. You basically make render nodes
> > >> superfluously. That gives userspace all legitimization to also remove
> > >> support for them. That is not stupid or evil or whatever, userspace
> > >> would just follow the kernel design.
> > > This already happened. At least for kms-only display socs we had to
> > > hide the separate render node (and there you really have to deal with
> > > the separate render node, because it's a distinct driver) entirely
> > > within gbm/mesa.
> >
> > Ok, that is something I didn't knew before and is rather interesting.
> >
> > > So if you want to depracate render functionality on primary nodes, you
> > > _have_ to do that hiding in userspace. Or you'll break a lot of
> > > compositors everywhere. Just testing -amdgpu doesn't really prove
> > > anything here. So you won't move the larger ecosystem with this at
> > > all, that ship sailed.
> >
> > So what else can we do? That sounds like you suggest we should
> > completely drop render node functionality.
> >
> > I mean it's not my preferred option, but certainly something that
> > everybody could do.
> >
> > My primary concern is that we somehow need to get rid of thinks like GEM
> > flink and all that other crufty stuff we still have around on the
> > primary node (we should probably make a list of that).
> >
> > Switching everything over to render nodes just sounded like the best
> > alternative so far to archive that.
>
> tbh I do like that plan too, but it's a lot more work. And I think to
> have any push whatsoever we probably need to roll it out in gbm as a
> hack to keep things going. But probably not enough.
>
> I also think that at least some compositors will bother to do the
> right thing, and actually bother with render nodes and all that
> correctly. It's just that there's way more which dont.
>
> Also for server rendering it'll be render nodes all the way down I
> hope (or we need to seriously educate cloud people about the
> permissions they set on their default images, and distros on how this
> cloud stuff should work.
>
> > > At that point this all feels like a bikeshed, and for a bikeshed I
> > > don't care what the color is we pick, as long as they're all painted
> > > the same.
> > >
> > > Once we picked a color it's a simple technical matter of how to roll
> > > it out, using Kconfig options, or only enabling on new hw, or "merge
> > > and fix the regressions as they come" or any of the other well proven
> > > paths forward.
> >
> > Yeah, the problem is I don't see an option which currently works for
> > everyone.
> >
> > I absolutely need a grace time of multiple years until we can apply this
> > to amdgpu/radeon.
>
> Yeah that's what I meant with "pick a color, pick a way to roll it
> out". "enable for new hw, roll out years and years later" is one of
> the options for roll out.
>
> > And that under the prerequisite to have a plan to somehow enable that
> > functionality now to make it at least painful for userspace to rely on
> > hack around that.
> >
> > > So if you want to do this, please start with the mesa side work (as
> > > the biggest userspace, not all of it) with patches to redirect all
> > > primary node rendering to render nodes. The code is there already for
> > > socs, just needs to be rolled out more. Or we go with the DRM_AUTH
> > > horrors. Or a 3rd option, I really don't care which it is, as long as
> > > its consistent.
> >
> > How about this:
> > 1. We keep DRM_AUTH around for amdgpu/radeon for now.
> > 2. We add a Kconfig option to ignore DRM_AUTH, currently default to N.
> > This also works as a workaround for old RADV.
> > 3. We start to work on further removing old cruft from the primary node.
> > Possible the Kconfig option I suggested to disable GEM flink.
>
> Hm I'm not worried about flink really. It's gem_open which is the
> security gap, and that one has to stay master-only forever. I guess we
> could try to disable that so people have to deal with dma-buf (and
> once you have that render nodes become a lot easier to use). But then
> I still think we have drivers which don't do dma-buf self-import, so
> again we're stuck. So maybe first step is to just roll out a default
> self-import dma-buf support for every gem driver. Then start ditching
> flink/gem_open. Then start ditching render support on primary nodes.
> Every step in the way taking a few years of prodding userspace plus
> even more years to wait until it's all gone.

I forgot one step here: I think we even still have drivers without
render node support. As long as those exists (and are still relevant)
then userspace needs primary node rendering + flink code anyway. And
they're not going to be happy about us telling them to add more. So I
think that would need to be fixed first. Hence rough plan:

- Make sure all gem drivers with rendering have render nodes.
- Roll out self-import of dma-buf for all gem drivers (we can do that
with 0 driver support, it's like flink).
- Roll out mesa gbm for everyone to ignore primary nodes for rendering
as much as possible. Maybe needs more gbm work so that compositors can
ask for the display drmfd and the render drmfd.
- wait. like really long time :-/
- slowly deprecate flink for new hw as additional forcing function to
get people to move to dma-buf and render nodes
- wait even longer
- ditch rendering on primary nodes.

Lots of work, and I really mean _lots_, but I think this has a chance
of actually working.
-Daniel


> Another option would be to extract the kms stuff from primary nodes,
> but we've tried that with control nodes. Until I realized a few years
> back that with control nodes it's impossible to get any rendered
> buffer in there at all, so useless, and I removed it. No one ever
> complained.
>
> So yeah would be really nice if we could fix this, but the universe
> conspires against us too much it seems. Hence the fallback of "please
> at least lets aim for a consistent color for this mess, whatever it
> is".
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Emil Velikov June 21, 2019, 12:01 p.m. UTC | #53
On 2019/06/21, Daniel Vetter wrote:
> On Fri, Jun 21, 2019 at 1:50 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, Jun 21, 2019 at 1:37 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > >
> > > Am 21.06.19 um 13:03 schrieb Daniel Vetter:
> > > > On Fri, Jun 21, 2019 at 12:31 PM Koenig, Christian
> > > > <Christian.Koenig@amd.com> wrote:
> > > >> Am 21.06.19 um 12:20 schrieb Emil Velikov:
> > > >>> In particular, that user-space will "remove" render nodes.
> > > >> Yes, that is my main concern here. You basically make render nodes
> > > >> superfluously. That gives userspace all legitimization to also remove
> > > >> support for them. That is not stupid or evil or whatever, userspace
> > > >> would just follow the kernel design.
> > > > This already happened. At least for kms-only display socs we had to
> > > > hide the separate render node (and there you really have to deal with
> > > > the separate render node, because it's a distinct driver) entirely
> > > > within gbm/mesa.
> > >
> > > Ok, that is something I didn't knew before and is rather interesting.
> > >
> > > > So if you want to depracate render functionality on primary nodes, you
> > > > _have_ to do that hiding in userspace. Or you'll break a lot of
> > > > compositors everywhere. Just testing -amdgpu doesn't really prove
> > > > anything here. So you won't move the larger ecosystem with this at
> > > > all, that ship sailed.
> > >
> > > So what else can we do? That sounds like you suggest we should
> > > completely drop render node functionality.
> > >
> > > I mean it's not my preferred option, but certainly something that
> > > everybody could do.
> > >
> > > My primary concern is that we somehow need to get rid of thinks like GEM
> > > flink and all that other crufty stuff we still have around on the
> > > primary node (we should probably make a list of that).
> > >
> > > Switching everything over to render nodes just sounded like the best
> > > alternative so far to archive that.
> >
> > tbh I do like that plan too, but it's a lot more work. And I think to
> > have any push whatsoever we probably need to roll it out in gbm as a
> > hack to keep things going. But probably not enough.
> >
> > I also think that at least some compositors will bother to do the
> > right thing, and actually bother with render nodes and all that
> > correctly. It's just that there's way more which dont.
> >
> > Also for server rendering it'll be render nodes all the way down I
> > hope (or we need to seriously educate cloud people about the
> > permissions they set on their default images, and distros on how this
> > cloud stuff should work.
> >
> > > > At that point this all feels like a bikeshed, and for a bikeshed I
> > > > don't care what the color is we pick, as long as they're all painted
> > > > the same.
> > > >
> > > > Once we picked a color it's a simple technical matter of how to roll
> > > > it out, using Kconfig options, or only enabling on new hw, or "merge
> > > > and fix the regressions as they come" or any of the other well proven
> > > > paths forward.
> > >
> > > Yeah, the problem is I don't see an option which currently works for
> > > everyone.
> > >
> > > I absolutely need a grace time of multiple years until we can apply this
> > > to amdgpu/radeon.
> >
> > Yeah that's what I meant with "pick a color, pick a way to roll it
> > out". "enable for new hw, roll out years and years later" is one of
> > the options for roll out.
> >
> > > And that under the prerequisite to have a plan to somehow enable that
> > > functionality now to make it at least painful for userspace to rely on
> > > hack around that.
> > >
> > > > So if you want to do this, please start with the mesa side work (as
> > > > the biggest userspace, not all of it) with patches to redirect all
> > > > primary node rendering to render nodes. The code is there already for
> > > > socs, just needs to be rolled out more. Or we go with the DRM_AUTH
> > > > horrors. Or a 3rd option, I really don't care which it is, as long as
> > > > its consistent.
> > >
> > > How about this:
> > > 1. We keep DRM_AUTH around for amdgpu/radeon for now.
> > > 2. We add a Kconfig option to ignore DRM_AUTH, currently default to N.
> > > This also works as a workaround for old RADV.
> > > 3. We start to work on further removing old cruft from the primary node.
> > > Possible the Kconfig option I suggested to disable GEM flink.
> >
> > Hm I'm not worried about flink really. It's gem_open which is the
> > security gap, and that one has to stay master-only forever. I guess we
> > could try to disable that so people have to deal with dma-buf (and
> > once you have that render nodes become a lot easier to use). But then
> > I still think we have drivers which don't do dma-buf self-import, so
> > again we're stuck. So maybe first step is to just roll out a default
> > self-import dma-buf support for every gem driver. Then start ditching
> > flink/gem_open. Then start ditching render support on primary nodes.
> > Every step in the way taking a few years of prodding userspace plus
> > even more years to wait until it's all gone.
> 
> I forgot one step here: I think we even still have drivers without
> render node support. As long as those exists (and are still relevant)
> then userspace needs primary node rendering + flink code anyway. And
> they're not going to be happy about us telling them to add more. So I
> think that would need to be fixed first. Hence rough plan:
> 
> - Make sure all gem drivers with rendering have render nodes.
> - Roll out self-import of dma-buf for all gem drivers (we can do that
> with 0 driver support, it's like flink).
> - Roll out mesa gbm for everyone to ignore primary nodes for rendering
> as much as possible. Maybe needs more gbm work so that compositors can
> ask for the display drmfd and the render drmfd.
> - wait. like really long time :-/
> - slowly deprecate flink for new hw as additional forcing function to
> get people to move to dma-buf and render nodes
> - wait even longer
> - ditch rendering on primary nodes.
> 
> Lots of work, and I really mean _lots_, but I think this has a chance
> of actually working.
Thanks for the extensive proposal/list Daniel. I mentioned a, perhaps
too short, version of this a while back.

Above sounds perfectly reasonable IMHO.

-Emil
Christian König June 21, 2019, 12:13 p.m. UTC | #54
Am 21.06.19 um 13:58 schrieb Emil Velikov:
> On 2019/06/21, Koenig, Christian wrote:
>> Am 21.06.19 um 12:53 schrieb Emil Velikov:
>>> On 2019/06/21, Koenig, Christian wrote:
>>>> [SNIP]
>>>> Well partially. That RADV broke is unfortunate, but we have done so many
>>>> customized userspace stuff both based on Mesa/DDX as well as closed
>>>> source components that it is just highly likely that we would break
>>>> something else as well.
>>>>
>>> As an engineer I like to work with tangibles. The highly likely part is
>>> probable, but as mentioned multiple times the series allows for a _dead_
>>> trivial way to address any such problems.
>> Well to clarify my concern is that this won't be so trivial.
>>
>> You implemented on workaround for one specific case and it is perfectly
>> possible that for other cases we would have to completely revert the
>> removal of DRM_AUTH.
>>
> I would encourage you to take a closer look at my patch and point out
> how parcicular cases cannot be handled.

Well the last time I looked it only blocked the first call to the IOCTL.

If that is no longer the case then what is the actual difference to 
DRM_AUTH+DRM_ALLOW_RENDER?

>>>>> In particular, that user-space will "remove" render nodes.
>>>> Yes, that is my main concern here. You basically make render nodes
>>>> superfluously. That gives userspace all legitimization to also remove
>>>> support for them. That is not stupid or evil or whatever, userspace
>>>> would just follow the kernel design.
>>>>
>>> Do you have an example of userspace doing such an extremely silly thing?
>>> It does seem like suspect you're a couple of steps beyond overcautious,
>>> perhaps rightfully so. Maybe you've seen some closed-source user-space
>>> going crazy? Or any other projects?
>> The key point is that I don't think this is silly or strange or crazy at
>> all. See the kernel defines the interface userspace can and should use.
>>
>> When the kernel defines that everything will work with the primary node
>> it is perfectly valid for userspace to drop support for the render node.
>>
>> I mean why should they keep this? Just because we tell them not to do this?
>>
>  From your experiense, do user-space developers care about doing (or
> generally do) the right thing?

No, not at all. Except for Marek and Michel they just take what works 
and even Marek is often short-cutting on this.

> In either case, as pointed already the cat is out of the bag - has been
> for years, and if developers did behave as you describe them they would
> have "removed" render nodes already.

Currently render nodes are mandatory because they are needed for 
headless operation.

E.g. we have a whole bunch of customers which do transcoding with that 
on GPUs which don't even have a CRTC or even X running.

Regards,
Christian.
Emil Velikov June 21, 2019, 12:47 p.m. UTC | #55
On 2019/06/21, Koenig, Christian wrote:
> Am 21.06.19 um 13:58 schrieb Emil Velikov:
> > On 2019/06/21, Koenig, Christian wrote:
> >> Am 21.06.19 um 12:53 schrieb Emil Velikov:
> >>> On 2019/06/21, Koenig, Christian wrote:
> >>>> [SNIP]
> >>>> Well partially. That RADV broke is unfortunate, but we have done so many
> >>>> customized userspace stuff both based on Mesa/DDX as well as closed
> >>>> source components that it is just highly likely that we would break
> >>>> something else as well.
> >>>>
> >>> As an engineer I like to work with tangibles. The highly likely part is
> >>> probable, but as mentioned multiple times the series allows for a _dead_
> >>> trivial way to address any such problems.
> >> Well to clarify my concern is that this won't be so trivial.
> >>
> >> You implemented on workaround for one specific case and it is perfectly
> >> possible that for other cases we would have to completely revert the
> >> removal of DRM_AUTH.
> >>
> > I would encourage you to take a closer look at my patch and point out
> > how parcicular cases cannot be handled.
> 
> Well the last time I looked it only blocked the first call to the IOCTL.
> 
Hmm interesting, you're replied to my patch without even reading it :'-(
Can you please give it a close look and reply inline?

[1] https://lists.freedesktop.org/archives/dri-devel/2019-May/219238.html

> >  From your experiense, do user-space developers care about doing (or
> > generally do) the right thing?
> 
> No, not at all. Except for Marek and Michel they just take what works 
> and even Marek is often short-cutting on this.
> 
Interesting, guess I should have asked this question from the start.

Thanks
Emil
Christian König June 21, 2019, 1 p.m. UTC | #56
Am 21.06.19 um 14:47 schrieb Emil Velikov:
> On 2019/06/21, Koenig, Christian wrote:
>> Am 21.06.19 um 13:58 schrieb Emil Velikov:
>>> On 2019/06/21, Koenig, Christian wrote:
>>>> Am 21.06.19 um 12:53 schrieb Emil Velikov:
>>>>> On 2019/06/21, Koenig, Christian wrote:
>>>>>> [SNIP]
>>>>>> Well partially. That RADV broke is unfortunate, but we have done so many
>>>>>> customized userspace stuff both based on Mesa/DDX as well as closed
>>>>>> source components that it is just highly likely that we would break
>>>>>> something else as well.
>>>>>>
>>>>> As an engineer I like to work with tangibles. The highly likely part is
>>>>> probable, but as mentioned multiple times the series allows for a _dead_
>>>>> trivial way to address any such problems.
>>>> Well to clarify my concern is that this won't be so trivial.
>>>>
>>>> You implemented on workaround for one specific case and it is perfectly
>>>> possible that for other cases we would have to completely revert the
>>>> removal of DRM_AUTH.
>>>>
>>> I would encourage you to take a closer look at my patch and point out
>>> how parcicular cases cannot be handled.
>> Well the last time I looked it only blocked the first call to the IOCTL.
>>
> Hmm interesting, you're replied to my patch without even reading it :'-(

Well I've NAKed the series because of the exposure of the render node 
functionality

> Can you please give it a close look and reply inline?
>
> [1] https://lists.freedesktop.org/archives/dri-devel/2019-May/219238.html

So the workaround no longer just blocks the first IOCTL.

But then the question is why don't you just keep the DRM_AUTH flag 
instead of adding the same functionality with a new one?

>>>   From your experiense, do user-space developers care about doing (or
>>> generally do) the right thing?
>> No, not at all. Except for Marek and Michel they just take what works
>> and even Marek is often short-cutting on this.
>>
> Interesting, guess I should have asked this question from the start.

Well sounds like you don't have to work with a closed source driver 
team. But as I said I honestly would do the same as user space developer.

I mean it's really a bunch of more code to maintain, and getting rid of 
code is always less work in the long term then keeping it.

That kernel developers scream: No no, please don't do that we want to 
keep using it is completely irrelevant for this.

Christian.

>
> Thanks
> Emil
Michel Dänzer June 21, 2019, 3:15 p.m. UTC | #57
On 2019-06-21 1:50 p.m., Daniel Vetter wrote:
> On Fri, Jun 21, 2019 at 1:37 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 21.06.19 um 13:03 schrieb Daniel Vetter:
>>> So if you want to depracate render functionality on primary nodes, you
>>> _have_ to do that hiding in userspace. Or you'll break a lot of
>>> compositors everywhere. Just testing -amdgpu doesn't really prove
>>> anything here. So you won't move the larger ecosystem with this at
>>> all, that ship sailed.
>>
>> So what else can we do? That sounds like you suggest we should
>> completely drop render node functionality.
>>
>> I mean it's not my preferred option, but certainly something that
>> everybody could do.
>>
>> My primary concern is that we somehow need to get rid of thinks like GEM
>> flink and all that other crufty stuff we still have around on the
>> primary node (we should probably make a list of that).
>>
>> Switching everything over to render nodes just sounded like the best
>> alternative so far to archive that.
> 
> tbh I do like that plan too, but it's a lot more work. And I think to
> have any push whatsoever we probably need to roll it out in gbm as a
> hack to keep things going. But probably not enough.
> 
> I also think that at least some compositors will bother to do the
> right thing, and actually bother with render nodes and all that
> correctly. It's just that there's way more which dont.

With amdgpu, we can make userspace always use render nodes for rendering
with changes to libdrm_amdgpu/radeonsi/xf86-video-amdgpu (and maybe some
amdgpu-pro components) only. No GBM/compositor changes needed.


>>> At that point this all feels like a bikeshed, and for a bikeshed I
>>> don't care what the color is we pick, as long as they're all painted
>>> the same.

Then some sheds shouldn't have been re-painted without DRM_AUTH already...


>>> Once we picked a color it's a simple technical matter of how to roll
>>> it out, using Kconfig options, or only enabling on new hw, or "merge
>>> and fix the regressions as they come" or any of the other well proven
>>> paths forward.
>>
>> Yeah, the problem is I don't see an option which currently works for
>> everyone.
>>
>> I absolutely need a grace time of multiple years until we can apply this
>> to amdgpu/radeon.
> 
> Yeah that's what I meant with "pick a color, pick a way to roll it
> out". "enable for new hw, roll out years and years later" is one of
> the options for roll out.

One gotcha being that generic userspace such as the Xorg modesetting
driver may still try to use phased out functionality such as DRI2 with
new hardware.


>> How about this:
>> 1. We keep DRM_AUTH around for amdgpu/radeon for now.
>> 2. We add a Kconfig option to ignore DRM_AUTH, currently default to N.
>> This also works as a workaround for old RADV.
>> 3. We start to work on further removing old cruft from the primary node.
>> Possible the Kconfig option I suggested to disable GEM flink.
> 
> Hm I'm not worried about flink really. It's gem_open which is the
> security gap, and that one has to stay master-only forever.

GEM_OPEN is used by DRI2 clients, it can't be master-only. It's probably
one of the main reasons for the existence of DRM_AUTH.
Michel Dänzer June 21, 2019, 3:24 p.m. UTC | #58
On 2019-06-21 1:58 p.m., Emil Velikov wrote:
> On 2019/06/21, Koenig, Christian wrote:
>> Am 21.06.19 um 12:53 schrieb Emil Velikov:
>>> On 2019/06/21, Koenig, Christian wrote:
> 
>>>>> In particular, that user-space will "remove" render nodes.
>>>> Yes, that is my main concern here. You basically make render nodes
>>>> superfluously. That gives userspace all legitimization to also remove
>>>> support for them. That is not stupid or evil or whatever, userspace
>>>> would just follow the kernel design.
>>>>
>>> Do you have an example of userspace doing such an extremely silly thing?
>>> It does seem like suspect you're a couple of steps beyond overcautious,
>>> perhaps rightfully so. Maybe you've seen some closed-source user-space
>>> going crazy? Or any other projects?
>>
>> The key point is that I don't think this is silly or strange or crazy at 
>> all. See the kernel defines the interface userspace can and should use.
>>
>> When the kernel defines that everything will work with the primary node 
>> it is perfectly valid for userspace to drop support for the render node.
>>
>> I mean why should they keep this? Just because we tell them not to do this?
>>
> From your experiense, do user-space developers care about doing (or
> generally do) the right thing?
> 
> In either case, as pointed already the cat is out of the bag - has been
> for years, and if developers did behave as you describe them they would
> have "removed" render nodes already.

That may be the case with userspace specific to DRM_AUTH-less kernel
drivers, but such userspace couldn't work with all the other drivers. So
I'd argue that while the bag may be open and the cat's tail may even be
sticking out already, the cat is still firmly in the bag. :)
Daniel Vetter June 21, 2019, 3:37 p.m. UTC | #59
On Fri, Jun 21, 2019 at 01:00:17PM +0000, Koenig, Christian wrote:
> Am 21.06.19 um 14:47 schrieb Emil Velikov:
> > On 2019/06/21, Koenig, Christian wrote:
> >> Am 21.06.19 um 13:58 schrieb Emil Velikov:
> >>> On 2019/06/21, Koenig, Christian wrote:
> >>>> Am 21.06.19 um 12:53 schrieb Emil Velikov:
> >>>>> On 2019/06/21, Koenig, Christian wrote:
> >>>>>> [SNIP]
> >>>>>> Well partially. That RADV broke is unfortunate, but we have done so many
> >>>>>> customized userspace stuff both based on Mesa/DDX as well as closed
> >>>>>> source components that it is just highly likely that we would break
> >>>>>> something else as well.
> >>>>>>
> >>>>> As an engineer I like to work with tangibles. The highly likely part is
> >>>>> probable, but as mentioned multiple times the series allows for a _dead_
> >>>>> trivial way to address any such problems.
> >>>> Well to clarify my concern is that this won't be so trivial.
> >>>>
> >>>> You implemented on workaround for one specific case and it is perfectly
> >>>> possible that for other cases we would have to completely revert the
> >>>> removal of DRM_AUTH.
> >>>>
> >>> I would encourage you to take a closer look at my patch and point out
> >>> how parcicular cases cannot be handled.
> >> Well the last time I looked it only blocked the first call to the IOCTL.
> >>
> > Hmm interesting, you're replied to my patch without even reading it :'-(
> 
> Well I've NAKed the series because of the exposure of the render node 
> functionality
> 
> > Can you please give it a close look and reply inline?
> >
> > [1] https://lists.freedesktop.org/archives/dri-devel/2019-May/219238.html
> 
> So the workaround no longer just blocks the first IOCTL.
> 
> But then the question is why don't you just keep the DRM_AUTH flag 
> instead of adding the same functionality with a new one?
> 
> >>>   From your experiense, do user-space developers care about doing (or
> >>> generally do) the right thing?
> >> No, not at all. Except for Marek and Michel they just take what works
> >> and even Marek is often short-cutting on this.
> >>
> > Interesting, guess I should have asked this question from the start.
> 
> Well sounds like you don't have to work with a closed source driver 
> team. But as I said I honestly would do the same as user space developer.

From an upstream kernel pov I've never cared about the blobs. I don't
upstream should terribly start caring about blobs - if they ship hacks
that don't reflect what the open stack does, their problem.
 
Speaking as someone who's pissed off closed source driver teams to no end.
I'm happy to be of service here too :-)

> I mean it's really a bunch of more code to maintain, and getting rid of 
> code is always less work in the long term then keeping it.
> 
> That kernel developers scream: No no, please don't do that we want to 
> keep using it is completely irrelevant for this.

Jokes aside, I think we should look at what's the right thing to do
looking at open source only, and if that gets the blobby folks shafted,
well so be it. Really not my problem, and I'm pretty sure Dave doesn't
care one hair of an inch more.
-Daniel
Daniel Vetter June 21, 2019, 3:44 p.m. UTC | #60
On Fri, Jun 21, 2019 at 05:15:22PM +0200, Michel Dänzer wrote:
> On 2019-06-21 1:50 p.m., Daniel Vetter wrote:
> > On Fri, Jun 21, 2019 at 1:37 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 21.06.19 um 13:03 schrieb Daniel Vetter:
> >>> So if you want to depracate render functionality on primary nodes, you
> >>> _have_ to do that hiding in userspace. Or you'll break a lot of
> >>> compositors everywhere. Just testing -amdgpu doesn't really prove
> >>> anything here. So you won't move the larger ecosystem with this at
> >>> all, that ship sailed.
> >>
> >> So what else can we do? That sounds like you suggest we should
> >> completely drop render node functionality.
> >>
> >> I mean it's not my preferred option, but certainly something that
> >> everybody could do.
> >>
> >> My primary concern is that we somehow need to get rid of thinks like GEM
> >> flink and all that other crufty stuff we still have around on the
> >> primary node (we should probably make a list of that).
> >>
> >> Switching everything over to render nodes just sounded like the best
> >> alternative so far to archive that.
> > 
> > tbh I do like that plan too, but it's a lot more work. And I think to
> > have any push whatsoever we probably need to roll it out in gbm as a
> > hack to keep things going. But probably not enough.
> > 
> > I also think that at least some compositors will bother to do the
> > right thing, and actually bother with render nodes and all that
> > correctly. It's just that there's way more which dont.
> 
> With amdgpu, we can make userspace always use render nodes for rendering
> with changes to libdrm_amdgpu/radeonsi/xf86-video-amdgpu (and maybe some
> amdgpu-pro components) only. No GBM/compositor changes needed.

This is a very non-exhaustive list of userspace that runs on your driver
... This wayland compositor thing, actually shipping now, and there's many :-)

> >>> At that point this all feels like a bikeshed, and for a bikeshed I
> >>> don't care what the color is we pick, as long as they're all painted
> >>> the same.
> 
> Then some sheds shouldn't have been re-painted without DRM_AUTH already...

Christian proposed that and then didn't feel like reverting it, plus vc4,
and tegra never bothered with it. There's quite a bit more than the tail
out of this particular bag already.

> >>> Once we picked a color it's a simple technical matter of how to roll
> >>> it out, using Kconfig options, or only enabling on new hw, or "merge
> >>> and fix the regressions as they come" or any of the other well proven
> >>> paths forward.
> >>
> >> Yeah, the problem is I don't see an option which currently works for
> >> everyone.
> >>
> >> I absolutely need a grace time of multiple years until we can apply this
> >> to amdgpu/radeon.
> > 
> > Yeah that's what I meant with "pick a color, pick a way to roll it
> > out". "enable for new hw, roll out years and years later" is one of
> > the options for roll out.
> 
> One gotcha being that generic userspace such as the Xorg modesetting
> driver may still try to use phased out functionality such as DRI2 with
> new hardware.

There's a lot more generic userspace than -modesetting. That was the
entire point of kms, and it succeed really well. That's why I don't think
amdgpu doing their own flavour pick is going to help anyone here, except
maybe if all you care about is the amd stand-alone stack only. But then
why do you bother with this upstream stuff at all ...

> >> How about this:
> >> 1. We keep DRM_AUTH around for amdgpu/radeon for now.
> >> 2. We add a Kconfig option to ignore DRM_AUTH, currently default to N.
> >> This also works as a workaround for old RADV.
> >> 3. We start to work on further removing old cruft from the primary node.
> >> Possible the Kconfig option I suggested to disable GEM flink.
> > 
> > Hm I'm not worried about flink really. It's gem_open which is the
> > security gap, and that one has to stay master-only forever.
> 
> GEM_OPEN is used by DRI2 clients, it can't be master-only. It's probably
> one of the main reasons for the existence of DRM_AUTH.

Oh sweet I forgot we're allowing this both ways :-/ Well doesn't change that
much really, once we break one of these the other isn't useful anymore
either.
-Daniel
Michel Dänzer June 21, 2019, 3:52 p.m. UTC | #61
On 2019-06-21 5:44 p.m., Daniel Vetter wrote:
> On Fri, Jun 21, 2019 at 05:15:22PM +0200, Michel Dänzer wrote:
>> On 2019-06-21 1:50 p.m., Daniel Vetter wrote:
>>> On Fri, Jun 21, 2019 at 1:37 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 21.06.19 um 13:03 schrieb Daniel Vetter:
>>>>> So if you want to depracate render functionality on primary nodes, you
>>>>> _have_ to do that hiding in userspace. Or you'll break a lot of
>>>>> compositors everywhere. Just testing -amdgpu doesn't really prove
>>>>> anything here. So you won't move the larger ecosystem with this at
>>>>> all, that ship sailed.
>>>>
>>>> So what else can we do? That sounds like you suggest we should
>>>> completely drop render node functionality.
>>>>
>>>> I mean it's not my preferred option, but certainly something that
>>>> everybody could do.
>>>>
>>>> My primary concern is that we somehow need to get rid of thinks like GEM
>>>> flink and all that other crufty stuff we still have around on the
>>>> primary node (we should probably make a list of that).
>>>>
>>>> Switching everything over to render nodes just sounded like the best
>>>> alternative so far to archive that.
>>>
>>> tbh I do like that plan too, but it's a lot more work. And I think to
>>> have any push whatsoever we probably need to roll it out in gbm as a
>>> hack to keep things going. But probably not enough.
>>>
>>> I also think that at least some compositors will bother to do the
>>> right thing, and actually bother with render nodes and all that
>>> correctly. It's just that there's way more which dont.
>>
>> With amdgpu, we can make userspace always use render nodes for rendering
>> with changes to libdrm_amdgpu/radeonsi/xf86-video-amdgpu (and maybe some
>> amdgpu-pro components) only. No GBM/compositor changes needed.
> 
> This is a very non-exhaustive list of userspace that runs on your driver
> ... This wayland compositor thing, actually shipping now, and there's many :-)

You don't seem to understand what I wrote. Everything will work at least
as well as now, without any other changes.


>>>>> Once we picked a color it's a simple technical matter of how to roll
>>>>> it out, using Kconfig options, or only enabling on new hw, or "merge
>>>>> and fix the regressions as they come" or any of the other well proven
>>>>> paths forward.
>>>>
>>>> Yeah, the problem is I don't see an option which currently works for
>>>> everyone.
>>>>
>>>> I absolutely need a grace time of multiple years until we can apply this
>>>> to amdgpu/radeon.
>>>
>>> Yeah that's what I meant with "pick a color, pick a way to roll it
>>> out". "enable for new hw, roll out years and years later" is one of
>>> the options for roll out.
>>
>> One gotcha being that generic userspace such as the Xorg modesetting
>> driver may still try to use phased out functionality such as DRI2 with
>> new hardware.
> 
> There's a lot more generic userspace than -modesetting.

That is affected by phasing out DRI2 related functionality? (I thought
that was the context of this sub-sub-thread)


> That was the entire point of kms, and it succeed really well. That's
> why I don't think amdgpu doing their own flavour pick is going to help
> anyone here,
Other drivers are welcome to emulate amdgpu's design making the above
possible. :)


> except maybe if all you care about is the amd stand-alone> stack only.
> But then why do you bother with this upstream stuff at all> ...

I'm afraid you've lost me here.
Michel Dänzer June 24, 2019, 9:37 a.m. UTC | #62
On 2019-06-21 5:52 p.m., Michel Dänzer wrote:
> On 2019-06-21 5:44 p.m., Daniel Vetter wrote:
>> On Fri, Jun 21, 2019 at 05:15:22PM +0200, Michel Dänzer wrote:
>>> On 2019-06-21 1:50 p.m., Daniel Vetter wrote:
>>>> On Fri, Jun 21, 2019 at 1:37 PM Christian König
>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>> Am 21.06.19 um 13:03 schrieb Daniel Vetter:
>>>>>> So if you want to depracate render functionality on primary nodes, you
>>>>>> _have_ to do that hiding in userspace. Or you'll break a lot of
>>>>>> compositors everywhere. Just testing -amdgpu doesn't really prove
>>>>>> anything here. So you won't move the larger ecosystem with this at
>>>>>> all, that ship sailed.
>>>>>
>>>>> So what else can we do? That sounds like you suggest we should
>>>>> completely drop render node functionality.
>>>>>
>>>>> I mean it's not my preferred option, but certainly something that
>>>>> everybody could do.
>>>>>
>>>>> My primary concern is that we somehow need to get rid of thinks like GEM
>>>>> flink and all that other crufty stuff we still have around on the
>>>>> primary node (we should probably make a list of that).
>>>>>
>>>>> Switching everything over to render nodes just sounded like the best
>>>>> alternative so far to archive that.
>>>>
>>>> tbh I do like that plan too, but it's a lot more work. And I think to
>>>> have any push whatsoever we probably need to roll it out in gbm as a
>>>> hack to keep things going. But probably not enough.
>>>>
>>>> I also think that at least some compositors will bother to do the
>>>> right thing, and actually bother with render nodes and all that
>>>> correctly. It's just that there's way more which dont.
>>>
>>> With amdgpu, we can make userspace always use render nodes for rendering
>>> with changes to libdrm_amdgpu/radeonsi/xf86-video-amdgpu (and maybe some
>>> amdgpu-pro components) only. No GBM/compositor changes needed.
>>
>> This is a very non-exhaustive list of userspace that runs on your driver
>> ... This wayland compositor thing, actually shipping now, and there's many :-)
> 
> You don't seem to understand what I wrote. Everything will work at least
> as well as now, without any other changes.
> 
> [...]
> 
>> That was the entire point of kms, and it succeed really well. That's
>> why I don't think amdgpu doing their own flavour pick is going to help
>> anyone here,
> Other drivers are welcome to emulate amdgpu's design making the above
> possible. :)

Seriously though, I don't think any changes outside of libdrm/Mesa
should be needed with other drivers either. Fundamentally there's
nothing magic about it, just sharing BOs via PRIME between the render
node file description and that passed in via the EGL/GBM/... API.
Daniel Vetter June 24, 2019, 9:48 a.m. UTC | #63
On Mon, Jun 24, 2019 at 11:37 AM Michel Dänzer <michel@daenzer.net> wrote:
> On 2019-06-21 5:52 p.m., Michel Dänzer wrote:
> >> That was the entire point of kms, and it succeed really well. That's
> >> why I don't think amdgpu doing their own flavour pick is going to help
> >> anyone here,
> > Other drivers are welcome to emulate amdgpu's design making the above
> > possible. :)
>
> Seriously though, I don't think any changes outside of libdrm/Mesa
> should be needed with other drivers either. Fundamentally there's
> nothing magic about it, just sharing BOs via PRIME between the render
> node file description and that passed in via the EGL/GBM/... API.

The libdrm/mesa design isn't the hard part, we have that even as a
helper for etnaviv and all those drivers. Being inconsistent here is a
think more a nuisance for integrators, who might want to sandbox gpu
clients real hard and really want to know what access rights they need
to give out.

But then we have much, much bigger fish to fry from a cross-driver
consistency pov, so *shrug*. Just feels somewhat silly we can't even
get agreement or some kind of consistent plan on something fairly
simple like this.
-Daniel
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
index 9221e5489069..da415f445187 100644
--- a/drivers/gpu/drm/amd/amdgpu/Kconfig
+++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
@@ -40,6 +40,22 @@  config DRM_AMDGPU_GART_DEBUGFS
 	  Selecting this option creates a debugfs file to inspect the mapped
 	  pages. Uses more memory for housekeeping, enable only for debugging.
 
+config DRM_AMDGPU_FORCE_AUTH
+	bool "Force authentication check on AMDGPU_INFO ioctl"
+	default y
+	help
+	  There were some version of the Mesa RADV drivers, which relied on
+	  the ioctl failing, if the client is not authenticated.
+
+	  Namely, the following versions are affected:
+	    - the whole 18.2.x series, which is EOL
+	    - the whole 18.3.x series, which is EOL
+	    - the 19.0.x series, prior to 19.0.4
+
+	  Modern distributions, should disable this. That will allow various
+	  other clients to work, that would otherwise require root privileges.
+
+
 source "drivers/gpu/drm/amd/acp/Kconfig"
 source "drivers/gpu/drm/amd/display/Kconfig"
 source "drivers/gpu/drm/amd/amdkfd/Kconfig"
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index b17d0545728e..b8076929440b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1214,7 +1214,17 @@  const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
 	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_MMAP, amdgpu_gem_mmap_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_WAIT_IDLE, amdgpu_gem_wait_idle_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(AMDGPU_CS, amdgpu_cs_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
-	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
+	/* The DRM_FORCE_AUTH is effectively a workaround for the RADV Mesa driver.
+	 * This is required for Mesa:
+	 *  - the whole 18.2.x series, which is EOL
+	 *  - the whole 18.3.x series, which is EOL
+	 *  - the 19.0.x series, prior to 19.0.4
+	 */
+	DRM_IOCTL_DEF_DRV(AMDGPU_INFO, amdgpu_info_ioctl,
+#if defined(DRM_AMDGPU_FORCE_AUTH)
+		DRM_FORCE_AUTH|
+#endif
+		DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_CS, amdgpu_cs_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(AMDGPU_WAIT_FENCES, amdgpu_cs_wait_fences_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(AMDGPU_GEM_METADATA, amdgpu_gem_metadata_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 2263e3ddd822..9841c0076f02 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -544,6 +544,11 @@  int drm_ioctl_permit(u32 flags, struct drm_file *file_priv)
 		     drm_is_render_client(file_priv)))
 		return -EACCES;
 
+	/* FORCE_AUTH is only for authenticated or render client */
+	if (unlikely((flags & DRM_FORCE_AUTH) && !drm_is_render_client(file_priv) &&
+		     !file_priv->authenticated))
+		return -EACCES;
+
 	return 0;
 }
 EXPORT_SYMBOL(drm_ioctl_permit);
diff --git a/include/drm/drm_ioctl.h b/include/drm/drm_ioctl.h
index fafb6f592c4b..6084ee32043d 100644
--- a/include/drm/drm_ioctl.h
+++ b/include/drm/drm_ioctl.h
@@ -126,6 +126,23 @@  enum drm_ioctl_flags {
 	 * not set DRM_AUTH because they do not require authentication.
 	 */
 	DRM_RENDER_ALLOW	= BIT(5),
+	/**
+	 * @DRM_FORCE_AUTH:
+	 *
+	 * Authentication of the primary node is mandatory. Regardless that the
+	 * user can usually circumvent that by using the render node with exact
+	 * same ioctl.
+	 *
+	 * Note: this is effectively a workaround for AMDGPU AMDGPU_INFO ioctl
+	 * and the RADV Mesa driver. This is required for Mesa:
+	 *  - the whole 18.2.x series, which is EOL
+	 *  - the whole 18.3.x series, which is EOL
+	 *  - the 19.0.x series, prior to 19.0.4
+	 *
+	 * Note: later patch will effectively drop the DRM_AUTH for ioctls
+	 * annotated as DRM_AUTH | DRM_RENDER_ALLOW.
+	 */
+	DRM_FORCE_AUTH          = BIT(6),
 };
 
 /**