drm/amdgpu: fix drm leases being broken on radv

Message ID	20190416214046.3551-1-andresx7@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Andres Rodriguez <andresx7@gmail.com> To: dri-devel@lists.freedesktop.org Subject: [PATCH] drm/amdgpu: fix drm leases being broken on radv Date: Tue, 16 Apr 2019 17:40:46 -0400 Message-Id: <20190416214046.3551-1-andresx7@gmail.com> MIME-Version: 1.0 X-Mailman-Original-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=U5AHxbmpyWYS8z9WhanPsdGra+atXNzCnP48uk7Pd7c=; b=HiF7Q8LecD0W+VWVupJWOThd8bERWPxoxDfu7iCm9MS7tD3+RiwhKU3aNeT7HM1OKx KrIQbXXtryrTTH36NDrGAiGYPtd0j5mImPq8rdnLigh3eNq/X2zBgvloEsB8mL7rvJTk Ha2WsQXZJyzEGziiNw0nZuw2gWJPTVG1CUt5sW3DAZxcd77b61E+UdEFUiZuhOgqOWhJ DFySlX8g3lwv9rSSgDmtEsapQ8AjNgJet66OgoKyux/PoSTmX4jMLNWLKTzAe/hZs4A+ kNS0eXVfBjxojDBpj5PD+62FHc9e6G7n+laI4s8ORjOHsut8X96oU3xwdeS4gQxotk8M naBw== Precedence: list Cc: Keith Packard <keithp@keithp.com>, David Airlie <airlied@linux.ie>, Andres Rodriguez <andresx7@gmail.com>, Alex Deucher <alexander.deucher@amd.com>, Emil Velikov <emil.velikov@collabora.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/amdgpu: fix drm leases being broken on radv \| expand drm/amdgpu: fix drm leases being broken on radv

Andres Rodriguez April 16, 2019, 9:40 p.m. UTC

After a recent commit, access to the DRM_AUTH ioctls become more
permissive. This resulted in a buggy check for drm_master capabilities
inside radv stop working.

This commit adds a backwards compatibility workaround so that the radv
drm_master check keeps working as previously expected.

This fixes SteamVR and other drm lease based apps being broken since
v5.1-rc1

From the usermode side, radv will also be fixed to properly test for
master capabilities:
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669

Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls")
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Emil Velikov <emil.velikov@collabora.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Keith Packard <keithp@keithp.com>
Reviewed-by: Keith Packard <keithp@keithp.com>
Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
 3 files changed, 27 insertions(+)

Christian König April 17, 2019, 7:09 a.m. UTC | #1

Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
> After a recent commit, access to the DRM_AUTH ioctls become more
> permissive. This resulted in a buggy check for drm_master capabilities
> inside radv stop working.
>
> This commit adds a backwards compatibility workaround so that the radv
> drm_master check keeps working as previously expected.
>
> This fixes SteamVR and other drm lease based apps being broken since
> v5.1-rc1
>
>  From the usermode side, radv will also be fixed to properly test for
> master capabilities:
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
>
> Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls")
> Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Emil Velikov <emil.velikov@collabora.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Keith Packard <keithp@keithp.com>
> Reviewed-by: Keith Packard <keithp@keithp.com>
> Reviewed-by: Daniel Vetter <daniel@ffwll.ch>

Well definitely a NAK. IIRC we have unit tests where the exactly first 
thing they do is querying AMDGPU_INFO_ACCEL_WORKING.

And I definitely not going to risk breaking those just to fix buggy 
behavior in radv.

Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
>   3 files changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 8d0d7f3dd5fb..e67bfee8d411 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
>   	struct mutex		bo_list_lock;
>   	struct idr		bo_list_handles;
>   	struct amdgpu_ctx_mgr	ctx_mgr;
> +
> +	/* Part of a backwards compatibility hack */
> +	bool			is_first_ioctl;
>   };
>   
>   int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 7419ea8a388b..cd438afa7092 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>   		      unsigned int cmd, unsigned long arg)
>   {
>   	struct drm_file *file_priv = filp->private_data;
> +	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>   	struct drm_device *dev;
>   	long ret;
>   	dev = file_priv->minor->dev;
> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>   		return ret;
>   
>   	ret = drm_ioctl(filp, cmd, arg);
> +	fpriv->is_first_ioctl = false;
>   
>   	pm_runtime_mark_last_busy(dev->dev);
>   	pm_runtime_put_autosuspend(dev->dev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index e860412043bb..00c0a2fb2a69 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
>   static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>   {
>   	struct amdgpu_device *adev = dev->dev_private;
> +	struct amdgpu_fpriv *fpriv = filp->driver_priv;
>   	struct drm_amdgpu_info *info = data;
>   	struct amdgpu_mode_info *minfo = &adev->mode_info;
>   	void __user *out = (void __user *)(uintptr_t)info->return_pointer;
> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>   
>   	switch (info->query) {
>   	case AMDGPU_INFO_ACCEL_WORKING:
> +		/* HACK: radv has a fragile open-coded check for drm master
> +		 * The pattern for the call is:
> +		 *     open(DRM_NODE_PRIMARY)
> +		 *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
> +		 *
> +		 * After commit 8059add04 this check stopped working due to
> +		 * operations on a primary node from non-master clients becoming
> +		 * more permissive.
> +		 *
> +		 * This backwards compatibility hack targets the calling pattern
> +		 * above to fail radv from thinking it has master access to the
> +		 * display system ( without reverting 8059add04 ).
> +		 *
> +		 * Users of libdrm will not be affected as some other ioctls are
> +		 * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
> +		 * query.
> +		 */
> +		if (fpriv->is_first_ioctl && !filp->is_master)
> +			return -EACCES;
> +
>   		ui32 = adev->accel_working;
>   		return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0;
>   	case AMDGPU_INFO_CRTC_FROM_ID:
> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   			goto error_vm;
>   	}
>   
> +	fpriv->is_first_ioctl = true;
>   	mutex_init(&fpriv->bo_list_lock);
>   	idr_init(&fpriv->bo_list_handles);
>

Daniel Vetter April 17, 2019, 8:10 a.m. UTC | #2

On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote:
> Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
> > After a recent commit, access to the DRM_AUTH ioctls become more
> > permissive. This resulted in a buggy check for drm_master capabilities
> > inside radv stop working.
> > 
> > This commit adds a backwards compatibility workaround so that the radv
> > drm_master check keeps working as previously expected.
> > 
> > This fixes SteamVR and other drm lease based apps being broken since
> > v5.1-rc1
> > 
> >  From the usermode side, radv will also be fixed to properly test for
> > master capabilities:
> > https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
> > 
> > Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls")
> > Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Cc: David Airlie <airlied@linux.ie>
> > Cc: Emil Velikov <emil.velikov@collabora.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Keith Packard <keithp@keithp.com>
> > Reviewed-by: Keith Packard <keithp@keithp.com>
> > Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
> 
> Well definitely a NAK. IIRC we have unit tests where the exactly first thing
> they do is querying AMDGPU_INFO_ACCEL_WORKING.
> 
> And I definitely not going to risk breaking those just to fix buggy behavior
> in radv.

s/buggy/fragile :-)

Option B would be to disable 8059add0478e ("drm: allow render capable
master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more
tricky to implement I guess (since it'd leak all over the place).

Option C is to revert 8059add04, but that's a bit silly since it seems to
work everywhere.

Breaking radv isn't an option, because no regression. Aside: No one is
stopping amd folks from reviewing radv patches and making sure there's no
fragile stuff in there.

We discussed this quite a bit on irc with Ben and Keith and others, and
figured option A is the most promising to go forward with. Anything using
amdgpu_device_init (which I think are all the umds, but I didn't check)
will keep working, as will radv leases/vkdisplay, plus we can keep
8059add04 for everyone (not just everony except amdgpu). If that means
breaking a few unit tests ... *shrugh*. Needs patches to fix them ofc, but
shouldn't be that much work really.
-Daniel

> 
> Christian.
> 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
> >   3 files changed, 27 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 8d0d7f3dd5fb..e67bfee8d411 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
> >   	struct mutex		bo_list_lock;
> >   	struct idr		bo_list_handles;
> >   	struct amdgpu_ctx_mgr	ctx_mgr;
> > +
> > +	/* Part of a backwards compatibility hack */
> > +	bool			is_first_ioctl;
> >   };
> >   int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 7419ea8a388b..cd438afa7092 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
> >   		      unsigned int cmd, unsigned long arg)
> >   {
> >   	struct drm_file *file_priv = filp->private_data;
> > +	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
> >   	struct drm_device *dev;
> >   	long ret;
> >   	dev = file_priv->minor->dev;
> > @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
> >   		return ret;
> >   	ret = drm_ioctl(filp, cmd, arg);
> > +	fpriv->is_first_ioctl = false;
> >   	pm_runtime_mark_last_busy(dev->dev);
> >   	pm_runtime_put_autosuspend(dev->dev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > index e860412043bb..00c0a2fb2a69 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> >   {
> >   	struct amdgpu_device *adev = dev->dev_private;
> > +	struct amdgpu_fpriv *fpriv = filp->driver_priv;
> >   	struct drm_amdgpu_info *info = data;
> >   	struct amdgpu_mode_info *minfo = &adev->mode_info;
> >   	void __user *out = (void __user *)(uintptr_t)info->return_pointer;
> > @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
> >   	switch (info->query) {
> >   	case AMDGPU_INFO_ACCEL_WORKING:
> > +		/* HACK: radv has a fragile open-coded check for drm master
> > +		 * The pattern for the call is:
> > +		 *     open(DRM_NODE_PRIMARY)
> > +		 *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
> > +		 *
> > +		 * After commit 8059add04 this check stopped working due to
> > +		 * operations on a primary node from non-master clients becoming
> > +		 * more permissive.
> > +		 *
> > +		 * This backwards compatibility hack targets the calling pattern
> > +		 * above to fail radv from thinking it has master access to the
> > +		 * display system ( without reverting 8059add04 ).
> > +		 *
> > +		 * Users of libdrm will not be affected as some other ioctls are
> > +		 * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
> > +		 * query.
> > +		 */
> > +		if (fpriv->is_first_ioctl && !filp->is_master)
> > +			return -EACCES;
> > +
> >   		ui32 = adev->accel_working;
> >   		return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0;
> >   	case AMDGPU_INFO_CRTC_FROM_ID:
> > @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
> >   			goto error_vm;
> >   	}
> > +	fpriv->is_first_ioctl = true;
> >   	mutex_init(&fpriv->bo_list_lock);
> >   	idr_init(&fpriv->bo_list_handles);
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König April 17, 2019, 9:18 a.m. UTC | #3

Am 17.04.19 um 10:10 schrieb Daniel Vetter:
> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote:
>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
>>> After a recent commit, access to the DRM_AUTH ioctls become more
>>> permissive. This resulted in a buggy check for drm_master capabilities
>>> inside radv stop working.
>>>
>>> This commit adds a backwards compatibility workaround so that the radv
>>> drm_master check keeps working as previously expected.
>>>
>>> This fixes SteamVR and other drm lease based apps being broken since
>>> v5.1-rc1
>>>
>>>   From the usermode side, radv will also be fixed to properly test for
>>> master capabilities:
>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
>>>
>>> Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls")
>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> Cc: David Airlie <airlied@linux.ie>
>>> Cc: Emil Velikov <emil.velikov@collabora.com>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: Keith Packard <keithp@keithp.com>
>>> Reviewed-by: Keith Packard <keithp@keithp.com>
>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
>> Well definitely a NAK. IIRC we have unit tests where the exactly first thing
>> they do is querying AMDGPU_INFO_ACCEL_WORKING.
>>
>> And I definitely not going to risk breaking those just to fix buggy behavior
>> in radv.
> s/buggy/fragile :-)
>
> Option B would be to disable 8059add0478e ("drm: allow render capable
> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more
> tricky to implement I guess (since it'd leak all over the place).
>
> Option C is to revert 8059add04, but that's a bit silly since it seems to
> work everywhere.
>
> Breaking radv isn't an option, because no regression. Aside: No one is
> stopping amd folks from reviewing radv patches and making sure there's no
> fragile stuff in there.
>
> We discussed this quite a bit on irc with Ben and Keith and others, and
> figured option A is the most promising to go forward with. Anything using
> amdgpu_device_init (which I think are all the umds, but I didn't check)
> will keep working, as will radv leases/vkdisplay, plus we can keep
> 8059add04 for everyone (not just everony except amdgpu). If that means
> breaking a few unit tests ... *shrugh*. Needs patches to fix them ofc, but
> shouldn't be that much work really.

I think you miss the point here: This patch will break the render node 
interface!

E.g. take another look at those lines of code:
> +		if (fpriv->is_first_ioctl && !filp->is_master)
> +			return -EACCES;
This will return -EACCES on the first accel working query which is 
doesn't comes from the DRM master. And it is irrelevant if its the 
primary or the render node.

So this patch most likely breaks tons of things and is *definitely* a 
complete NAK.

Additional to that IIRC we have (rather old) code which behaves 
differently depending if it is called by the render node or the primary 
node. In other words it assumes that the primary node is authenticated.

If I understood correctly what 8059add04 does than this is NOT a good to 
do in general.

Regards,
Christian.

> -Daniel
>
>> Christian.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
>>>    3 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 8d0d7f3dd5fb..e67bfee8d411 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
>>>    	struct mutex		bo_list_lock;
>>>    	struct idr		bo_list_handles;
>>>    	struct amdgpu_ctx_mgr	ctx_mgr;
>>> +
>>> +	/* Part of a backwards compatibility hack */
>>> +	bool			is_first_ioctl;
>>>    };
>>>    int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 7419ea8a388b..cd438afa7092 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>    		      unsigned int cmd, unsigned long arg)
>>>    {
>>>    	struct drm_file *file_priv = filp->private_data;
>>> +	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>    	struct drm_device *dev;
>>>    	long ret;
>>>    	dev = file_priv->minor->dev;
>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>    		return ret;
>>>    	ret = drm_ioctl(filp, cmd, arg);
>>> +	fpriv->is_first_ioctl = false;
>>>    	pm_runtime_mark_last_busy(dev->dev);
>>>    	pm_runtime_put_autosuspend(dev->dev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index e860412043bb..00c0a2fb2a69 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
>>>    static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>    {
>>>    	struct amdgpu_device *adev = dev->dev_private;
>>> +	struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>>    	struct drm_amdgpu_info *info = data;
>>>    	struct amdgpu_mode_info *minfo = &adev->mode_info;
>>>    	void __user *out = (void __user *)(uintptr_t)info->return_pointer;
>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>    	switch (info->query) {
>>>    	case AMDGPU_INFO_ACCEL_WORKING:
>>> +		/* HACK: radv has a fragile open-coded check for drm master
>>> +		 * The pattern for the call is:
>>> +		 *     open(DRM_NODE_PRIMARY)
>>> +		 *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
>>> +		 *
>>> +		 * After commit 8059add04 this check stopped working due to
>>> +		 * operations on a primary node from non-master clients becoming
>>> +		 * more permissive.
>>> +		 *
>>> +		 * This backwards compatibility hack targets the calling pattern
>>> +		 * above to fail radv from thinking it has master access to the
>>> +		 * display system ( without reverting 8059add04 ).
>>> +		 *
>>> +		 * Users of libdrm will not be affected as some other ioctls are
>>> +		 * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
>>> +		 * query.
>>> +		 */
>>> +		if (fpriv->is_first_ioctl && !filp->is_master)
>>> +			return -EACCES;
>>> +
>>>    		ui32 = adev->accel_working;
>>>    		return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0;
>>>    	case AMDGPU_INFO_CRTC_FROM_ID:
>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>    			goto error_vm;
>>>    	}
>>> +	fpriv->is_first_ioctl = true;
>>>    	mutex_init(&fpriv->bo_list_lock);
>>>    	idr_init(&fpriv->bo_list_handles);
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König April 17, 2019, 10:29 a.m. UTC | #4

Hi guys,

well going back to the beginning once more because something doesn't fit 
together here.

I mean what exactly is the purpose of 8059add0478e "drm: allow render 
capable master with DRM_AUTH ioctls"?

When I read the code correctly the effect is that we ignore the DRM_AUTH 
flag when also the DRM_RENDER_ALLOW flag is set and the driver is a 
render node driver.

Well when this is correct then why the heck aren't we removing the 
DRM_AUTH flag from the affected IOCTLs instead?

The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD 
and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability 
flags separately.

So the patch 8059add0478e doesn't make any sense as far as I can tell. 
What I'm missing?

Regards,
Christian.

Am 17.04.19 um 11:18 schrieb Christian König:
> Am 17.04.19 um 10:10 schrieb Daniel Vetter:
>> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote:
>>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
>>>> After a recent commit, access to the DRM_AUTH ioctls become more
>>>> permissive. This resulted in a buggy check for drm_master capabilities
>>>> inside radv stop working.
>>>>
>>>> This commit adds a backwards compatibility workaround so that the radv
>>>> drm_master check keeps working as previously expected.
>>>>
>>>> This fixes SteamVR and other drm lease based apps being broken since
>>>> v5.1-rc1
>>>>
>>>>   From the usermode side, radv will also be fixed to properly test for
>>>> master capabilities:
>>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
>>>>
>>>> Fixes: 8059add0478e ("drm: allow render capable master with 
>>>> DRM_AUTH ioctls")
>>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>> Cc: David Airlie <airlied@linux.ie>
>>>> Cc: Emil Velikov <emil.velikov@collabora.com>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: Keith Packard <keithp@keithp.com>
>>>> Reviewed-by: Keith Packard <keithp@keithp.com>
>>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
>>> Well definitely a NAK. IIRC we have unit tests where the exactly 
>>> first thing
>>> they do is querying AMDGPU_INFO_ACCEL_WORKING.
>>>
>>> And I definitely not going to risk breaking those just to fix buggy 
>>> behavior
>>> in radv.
>> s/buggy/fragile :-)
>>
>> Option B would be to disable 8059add0478e ("drm: allow render capable
>> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more
>> tricky to implement I guess (since it'd leak all over the place).
>>
>> Option C is to revert 8059add04, but that's a bit silly since it 
>> seems to
>> work everywhere.
>>
>> Breaking radv isn't an option, because no regression. Aside: No one is
>> stopping amd folks from reviewing radv patches and making sure 
>> there's no
>> fragile stuff in there.
>>
>> We discussed this quite a bit on irc with Ben and Keith and others, and
>> figured option A is the most promising to go forward with. Anything 
>> using
>> amdgpu_device_init (which I think are all the umds, but I didn't check)
>> will keep working, as will radv leases/vkdisplay, plus we can keep
>> 8059add04 for everyone (not just everony except amdgpu). If that means
>> breaking a few unit tests ... *shrugh*. Needs patches to fix them 
>> ofc, but
>> shouldn't be that much work really.
>
> I think you miss the point here: This patch will break the render node 
> interface!
>
> E.g. take another look at those lines of code:
>> +        if (fpriv->is_first_ioctl && !filp->is_master)
>> +            return -EACCES;
> This will return -EACCES on the first accel working query which is 
> doesn't comes from the DRM master. And it is irrelevant if its the 
> primary or the render node.
>
> So this patch most likely breaks tons of things and is *definitely* a 
> complete NAK.
>
> Additional to that IIRC we have (rather old) code which behaves 
> differently depending if it is called by the render node or the 
> primary node. In other words it assumes that the primary node is 
> authenticated.
>
> If I understood correctly what 8059add04 does than this is NOT a good 
> to do in general.
>
> Regards,
> Christian.
>
>> -Daniel
>>
>>> Christian.
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
>>>>    3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 8d0d7f3dd5fb..e67bfee8d411 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
>>>>        struct mutex        bo_list_lock;
>>>>        struct idr        bo_list_handles;
>>>>        struct amdgpu_ctx_mgr    ctx_mgr;
>>>> +
>>>> +    /* Part of a backwards compatibility hack */
>>>> +    bool            is_first_ioctl;
>>>>    };
>>>>    int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv 
>>>> **fpriv);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 7419ea8a388b..cd438afa7092 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>>                  unsigned int cmd, unsigned long arg)
>>>>    {
>>>>        struct drm_file *file_priv = filp->private_data;
>>>> +    struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>>        struct drm_device *dev;
>>>>        long ret;
>>>>        dev = file_priv->minor->dev;
>>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>>            return ret;
>>>>        ret = drm_ioctl(filp, cmd, arg);
>>>> +    fpriv->is_first_ioctl = false;
>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index e860412043bb..00c0a2fb2a69 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct 
>>>> amdgpu_device *adev,
>>>>    static int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
>>>> struct drm_file *filp)
>>>>    {
>>>>        struct amdgpu_device *adev = dev->dev_private;
>>>> +    struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>>>        struct drm_amdgpu_info *info = data;
>>>>        struct amdgpu_mode_info *minfo = &adev->mode_info;
>>>>        void __user *out = (void __user 
>>>> *)(uintptr_t)info->return_pointer;
>>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device 
>>>> *dev, void *data, struct drm_file
>>>>        switch (info->query) {
>>>>        case AMDGPU_INFO_ACCEL_WORKING:
>>>> +        /* HACK: radv has a fragile open-coded check for drm master
>>>> +         * The pattern for the call is:
>>>> +         *     open(DRM_NODE_PRIMARY)
>>>> +         *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
>>>> +         *
>>>> +         * After commit 8059add04 this check stopped working due to
>>>> +         * operations on a primary node from non-master clients 
>>>> becoming
>>>> +         * more permissive.
>>>> +         *
>>>> +         * This backwards compatibility hack targets the calling 
>>>> pattern
>>>> +         * above to fail radv from thinking it has master access 
>>>> to the
>>>> +         * display system ( without reverting 8059add04 ).
>>>> +         *
>>>> +         * Users of libdrm will not be affected as some other 
>>>> ioctls are
>>>> +         * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
>>>> +         * query.
>>>> +         */
>>>> +        if (fpriv->is_first_ioctl && !filp->is_master)
>>>> +            return -EACCES;
>>>> +
>>>>            ui32 = adev->accel_working;
>>>>            return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT 
>>>> : 0;
>>>>        case AMDGPU_INFO_CRTC_FROM_ID:
>>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device 
>>>> *dev, struct drm_file *file_priv)
>>>>                goto error_vm;
>>>>        }
>>>> +    fpriv->is_first_ioctl = true;
>>>>        mutex_init(&fpriv->bo_list_lock);
>>>>        idr_init(&fpriv->bo_list_handles);
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Daniel Vetter April 17, 2019, 11:06 a.m. UTC | #5

On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Hi guys,
>
> well going back to the beginning once more because something doesn't fit
> together here.
>
> I mean what exactly is the purpose of 8059add0478e "drm: allow render
> capable master with DRM_AUTH ioctls"?
>
> When I read the code correctly the effect is that we ignore the DRM_AUTH
> flag when also the DRM_RENDER_ALLOW flag is set and the driver is a
> render node driver.
>
> Well when this is correct then why the heck aren't we removing the
> DRM_AUTH flag from the affected IOCTLs instead?
>
> The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD
> and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability
> flags separately.
>
> So the patch 8059add0478e doesn't make any sense as far as I can tell.
> What I'm missing?

The idea is that if your driver supports render node (i.e. isolates
unpriviledged clients), then we could allow the same on primary nodes.
There's a bunch of userspace which blindly only opens primary nodes
(instead of also opening render nodes), and this makes them work
without requiring root or similar. Afaiui the main offender is libva
being really dense (and semi-abandonded, partially also because intel
has changed for the worse in that area).

Note that the capability checks are for all ioctl, including driver
private ones, i.e. rendering will Just Work. I think it makes sense to
have that.

I guess there might be 100% overlap between DRM_AUTH and
DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean
auditing all the driver ioctl tables. I guess we could add that as a
todo.rst item. It would make sense that either the driver is isolating
clients sufficiently to not need authenticated clients only, or allow
them all.

I guess we could lament why rendernodes are still not yet fully rolled
out everywhere, despite all these years. But then I stopped being
surprised about ecosystem inertia, and in the end a strict split
between display (done on the primary) and rendering (preferrably done
on render nodes) is not required on intel/amd/nv, so not many care to
make this work better. The patch from Emil seemed like a neat little
trick to get there faster, until the radv hiccup.
-Daniel

> Regards,
> Christian.
>
> Am 17.04.19 um 11:18 schrieb Christian König:
> > Am 17.04.19 um 10:10 schrieb Daniel Vetter:
> >> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote:
> >>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
> >>>> After a recent commit, access to the DRM_AUTH ioctls become more
> >>>> permissive. This resulted in a buggy check for drm_master capabilities
> >>>> inside radv stop working.
> >>>>
> >>>> This commit adds a backwards compatibility workaround so that the radv
> >>>> drm_master check keeps working as previously expected.
> >>>>
> >>>> This fixes SteamVR and other drm lease based apps being broken since
> >>>> v5.1-rc1
> >>>>
> >>>>   From the usermode side, radv will also be fixed to properly test for
> >>>> master capabilities:
> >>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
> >>>>
> >>>> Fixes: 8059add0478e ("drm: allow render capable master with
> >>>> DRM_AUTH ioctls")
> >>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
> >>>> Cc: Daniel Vetter <daniel@ffwll.ch>
> >>>> Cc: David Airlie <airlied@linux.ie>
> >>>> Cc: Emil Velikov <emil.velikov@collabora.com>
> >>>> Cc: Alex Deucher <alexander.deucher@amd.com>
> >>>> Cc: Keith Packard <keithp@keithp.com>
> >>>> Reviewed-by: Keith Packard <keithp@keithp.com>
> >>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
> >>> Well definitely a NAK. IIRC we have unit tests where the exactly
> >>> first thing
> >>> they do is querying AMDGPU_INFO_ACCEL_WORKING.
> >>>
> >>> And I definitely not going to risk breaking those just to fix buggy
> >>> behavior
> >>> in radv.
> >> s/buggy/fragile :-)
> >>
> >> Option B would be to disable 8059add0478e ("drm: allow render capable
> >> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more
> >> tricky to implement I guess (since it'd leak all over the place).
> >>
> >> Option C is to revert 8059add04, but that's a bit silly since it
> >> seems to
> >> work everywhere.
> >>
> >> Breaking radv isn't an option, because no regression. Aside: No one is
> >> stopping amd folks from reviewing radv patches and making sure
> >> there's no
> >> fragile stuff in there.
> >>
> >> We discussed this quite a bit on irc with Ben and Keith and others, and
> >> figured option A is the most promising to go forward with. Anything
> >> using
> >> amdgpu_device_init (which I think are all the umds, but I didn't check)
> >> will keep working, as will radv leases/vkdisplay, plus we can keep
> >> 8059add04 for everyone (not just everony except amdgpu). If that means
> >> breaking a few unit tests ... *shrugh*. Needs patches to fix them
> >> ofc, but
> >> shouldn't be that much work really.
> >
> > I think you miss the point here: This patch will break the render node
> > interface!
> >
> > E.g. take another look at those lines of code:
> >> +        if (fpriv->is_first_ioctl && !filp->is_master)
> >> +            return -EACCES;
> > This will return -EACCES on the first accel working query which is
> > doesn't comes from the DRM master. And it is irrelevant if its the
> > primary or the render node.
> >
> > So this patch most likely breaks tons of things and is *definitely* a
> > complete NAK.
> >
> > Additional to that IIRC we have (rather old) code which behaves
> > differently depending if it is called by the render node or the
> > primary node. In other words it assumes that the primary node is
> > authenticated.
> >
> > If I understood correctly what 8059add04 does than this is NOT a good
> > to do in general.
> >
> > Regards,
> > Christian.
> >
> >> -Daniel
> >>
> >>> Christian.
> >>>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
> >>>>    3 files changed, 27 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> index 8d0d7f3dd5fb..e67bfee8d411 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
> >>>>        struct mutex        bo_list_lock;
> >>>>        struct idr        bo_list_handles;
> >>>>        struct amdgpu_ctx_mgr    ctx_mgr;
> >>>> +
> >>>> +    /* Part of a backwards compatibility hack */
> >>>> +    bool            is_first_ioctl;
> >>>>    };
> >>>>    int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv
> >>>> **fpriv);
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> index 7419ea8a388b..cd438afa7092 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
> >>>>                  unsigned int cmd, unsigned long arg)
> >>>>    {
> >>>>        struct drm_file *file_priv = filp->private_data;
> >>>> +    struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
> >>>>        struct drm_device *dev;
> >>>>        long ret;
> >>>>        dev = file_priv->minor->dev;
> >>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
> >>>>            return ret;
> >>>>        ret = drm_ioctl(filp, cmd, arg);
> >>>> +    fpriv->is_first_ioctl = false;
> >>>>        pm_runtime_mark_last_busy(dev->dev);
> >>>>        pm_runtime_put_autosuspend(dev->dev);
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>>> index e860412043bb..00c0a2fb2a69 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct
> >>>> amdgpu_device *adev,
> >>>>    static int amdgpu_info_ioctl(struct drm_device *dev, void *data,
> >>>> struct drm_file *filp)
> >>>>    {
> >>>>        struct amdgpu_device *adev = dev->dev_private;
> >>>> +    struct amdgpu_fpriv *fpriv = filp->driver_priv;
> >>>>        struct drm_amdgpu_info *info = data;
> >>>>        struct amdgpu_mode_info *minfo = &adev->mode_info;
> >>>>        void __user *out = (void __user
> >>>> *)(uintptr_t)info->return_pointer;
> >>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device
> >>>> *dev, void *data, struct drm_file
> >>>>        switch (info->query) {
> >>>>        case AMDGPU_INFO_ACCEL_WORKING:
> >>>> +        /* HACK: radv has a fragile open-coded check for drm master
> >>>> +         * The pattern for the call is:
> >>>> +         *     open(DRM_NODE_PRIMARY)
> >>>> +         *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
> >>>> +         *
> >>>> +         * After commit 8059add04 this check stopped working due to
> >>>> +         * operations on a primary node from non-master clients
> >>>> becoming
> >>>> +         * more permissive.
> >>>> +         *
> >>>> +         * This backwards compatibility hack targets the calling
> >>>> pattern
> >>>> +         * above to fail radv from thinking it has master access
> >>>> to the
> >>>> +         * display system ( without reverting 8059add04 ).
> >>>> +         *
> >>>> +         * Users of libdrm will not be affected as some other
> >>>> ioctls are
> >>>> +         * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
> >>>> +         * query.
> >>>> +         */
> >>>> +        if (fpriv->is_first_ioctl && !filp->is_master)
> >>>> +            return -EACCES;
> >>>> +
> >>>>            ui32 = adev->accel_working;
> >>>>            return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT
> >>>> : 0;
> >>>>        case AMDGPU_INFO_CRTC_FROM_ID:
> >>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device
> >>>> *dev, struct drm_file *file_priv)
> >>>>                goto error_vm;
> >>>>        }
> >>>> +    fpriv->is_first_ioctl = true;
> >>>>        mutex_init(&fpriv->bo_list_lock);
> >>>>        idr_init(&fpriv->bo_list_handles);
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
>

Christian König April 17, 2019, 11:18 a.m. UTC | #6

Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Hi guys,
>>
>> well going back to the beginning once more because something doesn't fit
>> together here.
>>
>> I mean what exactly is the purpose of 8059add0478e "drm: allow render
>> capable master with DRM_AUTH ioctls"?
>>
>> When I read the code correctly the effect is that we ignore the DRM_AUTH
>> flag when also the DRM_RENDER_ALLOW flag is set and the driver is a
>> render node driver.
>>
>> Well when this is correct then why the heck aren't we removing the
>> DRM_AUTH flag from the affected IOCTLs instead?
>>
>> The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD
>> and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability
>> flags separately.
>>
>> So the patch 8059add0478e doesn't make any sense as far as I can tell.
>> What I'm missing?
> The idea is that if your driver supports render node (i.e. isolates
> unpriviledged clients), then we could allow the same on primary nodes.
> There's a bunch of userspace which blindly only opens primary nodes
> (instead of also opening render nodes), and this makes them work
> without requiring root or similar. Afaiui the main offender is libva
> being really dense (and semi-abandonded, partially also because intel
> has changed for the worse in that area).

And exactly that's the problem, we can't do this. It's not only radv 
which breaks, but also older libdrm_amdgpu versions as well as a bunch 
of other stuff.

Key problem is that older libdrm_amdgpu code did exactly the same thing 
and called an IOCTL to check if a handle is authenticated or not. If I'm 
not completely mistaken the radv code could actually be copy&pasted from 
there.

Additional to that we might run into problems with the libdrm unit 
tests, but those probably didn't break obviously because they are almost 
all the time run as root.

> Note that the capability checks are for all ioctl, including driver
> private ones, i.e. rendering will Just Work. I think it makes sense to
> have that.
>
> I guess there might be 100% overlap between DRM_AUTH and
> DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean
> auditing all the driver ioctl tables. I guess we could add that as a
> todo.rst item. It would make sense that either the driver is isolating
> clients sufficiently to not need authenticated clients only, or allow
> them all.
>
> I guess we could lament why rendernodes are still not yet fully rolled
> out everywhere, despite all these years. But then I stopped being
> surprised about ecosystem inertia, and in the end a strict split
> between display (done on the primary) and rendering (preferrably done
> on render nodes) is not required on intel/amd/nv, so not many care to
> make this work better. The patch from Emil seemed like a neat little
> trick to get there faster, until the radv hiccup.

Well that is a clear NAK for that approach.

If you have problems with libva (which are design problems in libva and 
not kernel related at all) then please work around that by removing 
DRM_AUTH from the affected IOCTLs and not by messing up general 
authentication code.

Well, what you guys did here is a serious no-go.

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>> Am 17.04.19 um 11:18 schrieb Christian König:
>>> Am 17.04.19 um 10:10 schrieb Daniel Vetter:
>>>> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote:
>>>>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez:
>>>>>> After a recent commit, access to the DRM_AUTH ioctls become more
>>>>>> permissive. This resulted in a buggy check for drm_master capabilities
>>>>>> inside radv stop working.
>>>>>>
>>>>>> This commit adds a backwards compatibility workaround so that the radv
>>>>>> drm_master check keeps working as previously expected.
>>>>>>
>>>>>> This fixes SteamVR and other drm lease based apps being broken since
>>>>>> v5.1-rc1
>>>>>>
>>>>>>    From the usermode side, radv will also be fixed to properly test for
>>>>>> master capabilities:
>>>>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669
>>>>>>
>>>>>> Fixes: 8059add0478e ("drm: allow render capable master with
>>>>>> DRM_AUTH ioctls")
>>>>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
>>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>>> Cc: David Airlie <airlied@linux.ie>
>>>>>> Cc: Emil Velikov <emil.velikov@collabora.com>
>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>> Cc: Keith Packard <keithp@keithp.com>
>>>>>> Reviewed-by: Keith Packard <keithp@keithp.com>
>>>>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
>>>>> Well definitely a NAK. IIRC we have unit tests where the exactly
>>>>> first thing
>>>>> they do is querying AMDGPU_INFO_ACCEL_WORKING.
>>>>>
>>>>> And I definitely not going to risk breaking those just to fix buggy
>>>>> behavior
>>>>> in radv.
>>>> s/buggy/fragile :-)
>>>>
>>>> Option B would be to disable 8059add0478e ("drm: allow render capable
>>>> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more
>>>> tricky to implement I guess (since it'd leak all over the place).
>>>>
>>>> Option C is to revert 8059add04, but that's a bit silly since it
>>>> seems to
>>>> work everywhere.
>>>>
>>>> Breaking radv isn't an option, because no regression. Aside: No one is
>>>> stopping amd folks from reviewing radv patches and making sure
>>>> there's no
>>>> fragile stuff in there.
>>>>
>>>> We discussed this quite a bit on irc with Ben and Keith and others, and
>>>> figured option A is the most promising to go forward with. Anything
>>>> using
>>>> amdgpu_device_init (which I think are all the umds, but I didn't check)
>>>> will keep working, as will radv leases/vkdisplay, plus we can keep
>>>> 8059add04 for everyone (not just everony except amdgpu). If that means
>>>> breaking a few unit tests ... *shrugh*. Needs patches to fix them
>>>> ofc, but
>>>> shouldn't be that much work really.
>>> I think you miss the point here: This patch will break the render node
>>> interface!
>>>
>>> E.g. take another look at those lines of code:
>>>> +        if (fpriv->is_first_ioctl && !filp->is_master)
>>>> +            return -EACCES;
>>> This will return -EACCES on the first accel working query which is
>>> doesn't comes from the DRM master. And it is irrelevant if its the
>>> primary or the render node.
>>>
>>> So this patch most likely breaks tons of things and is *definitely* a
>>> complete NAK.
>>>
>>> Additional to that IIRC we have (rather old) code which behaves
>>> differently depending if it is called by the render node or the
>>> primary node. In other words it assumes that the primary node is
>>> authenticated.
>>>
>>> If I understood correctly what 8059add04 does than this is NOT a good
>>> to do in general.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> -Daniel
>>>>
>>>>> Christian.
>>>>>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  3 +++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++
>>>>>>     3 files changed, 27 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> index 8d0d7f3dd5fb..e67bfee8d411 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv {
>>>>>>         struct mutex        bo_list_lock;
>>>>>>         struct idr        bo_list_handles;
>>>>>>         struct amdgpu_ctx_mgr    ctx_mgr;
>>>>>> +
>>>>>> +    /* Part of a backwards compatibility hack */
>>>>>> +    bool            is_first_ioctl;
>>>>>>     };
>>>>>>     int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv
>>>>>> **fpriv);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> index 7419ea8a388b..cd438afa7092 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>>>>                   unsigned int cmd, unsigned long arg)
>>>>>>     {
>>>>>>         struct drm_file *file_priv = filp->private_data;
>>>>>> +    struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>>>>         struct drm_device *dev;
>>>>>>         long ret;
>>>>>>         dev = file_priv->minor->dev;
>>>>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp,
>>>>>>             return ret;
>>>>>>         ret = drm_ioctl(filp, cmd, arg);
>>>>>> +    fpriv->is_first_ioctl = false;
>>>>>>         pm_runtime_mark_last_busy(dev->dev);
>>>>>>         pm_runtime_put_autosuspend(dev->dev);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>>> index e860412043bb..00c0a2fb2a69 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct
>>>>>> amdgpu_device *adev,
>>>>>>     static int amdgpu_info_ioctl(struct drm_device *dev, void *data,
>>>>>> struct drm_file *filp)
>>>>>>     {
>>>>>>         struct amdgpu_device *adev = dev->dev_private;
>>>>>> +    struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>>>>>         struct drm_amdgpu_info *info = data;
>>>>>>         struct amdgpu_mode_info *minfo = &adev->mode_info;
>>>>>>         void __user *out = (void __user
>>>>>> *)(uintptr_t)info->return_pointer;
>>>>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device
>>>>>> *dev, void *data, struct drm_file
>>>>>>         switch (info->query) {
>>>>>>         case AMDGPU_INFO_ACCEL_WORKING:
>>>>>> +        /* HACK: radv has a fragile open-coded check for drm master
>>>>>> +         * The pattern for the call is:
>>>>>> +         *     open(DRM_NODE_PRIMARY)
>>>>>> +         *     drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING )
>>>>>> +         *
>>>>>> +         * After commit 8059add04 this check stopped working due to
>>>>>> +         * operations on a primary node from non-master clients
>>>>>> becoming
>>>>>> +         * more permissive.
>>>>>> +         *
>>>>>> +         * This backwards compatibility hack targets the calling
>>>>>> pattern
>>>>>> +         * above to fail radv from thinking it has master access
>>>>>> to the
>>>>>> +         * display system ( without reverting 8059add04 ).
>>>>>> +         *
>>>>>> +         * Users of libdrm will not be affected as some other
>>>>>> ioctls are
>>>>>> +         * performed between open() and the AMDGPU_INFO_ACCEL_WORKING
>>>>>> +         * query.
>>>>>> +         */
>>>>>> +        if (fpriv->is_first_ioctl && !filp->is_master)
>>>>>> +            return -EACCES;
>>>>>> +
>>>>>>             ui32 = adev->accel_working;
>>>>>>             return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT
>>>>>> : 0;
>>>>>>         case AMDGPU_INFO_CRTC_FROM_ID:
>>>>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device
>>>>>> *dev, struct drm_file *file_priv)
>>>>>>                 goto error_vm;
>>>>>>         }
>>>>>> +    fpriv->is_first_ioctl = true;
>>>>>>         mutex_init(&fpriv->bo_list_lock);
>>>>>>         idr_init(&fpriv->bo_list_handles);
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Daniel Vetter April 17, 2019, noon UTC | #7

On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Hi guys,
> >>
> >> well going back to the beginning once more because something doesn't fit
> >> together here.
> >>
> >> I mean what exactly is the purpose of 8059add0478e "drm: allow render
> >> capable master with DRM_AUTH ioctls"?
> >>
> >> When I read the code correctly the effect is that we ignore the DRM_AUTH
> >> flag when also the DRM_RENDER_ALLOW flag is set and the driver is a
> >> render node driver.
> >>
> >> Well when this is correct then why the heck aren't we removing the
> >> DRM_AUTH flag from the affected IOCTLs instead?
> >>
> >> The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD
> >> and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability
> >> flags separately.
> >>
> >> So the patch 8059add0478e doesn't make any sense as far as I can tell.
> >> What I'm missing?
> > The idea is that if your driver supports render node (i.e. isolates
> > unpriviledged clients), then we could allow the same on primary nodes.
> > There's a bunch of userspace which blindly only opens primary nodes
> > (instead of also opening render nodes), and this makes them work
> > without requiring root or similar. Afaiui the main offender is libva
> > being really dense (and semi-abandonded, partially also because intel
> > has changed for the worse in that area).
> 
> And exactly that's the problem, we can't do this. It's not only radv 
> which breaks, but also older libdrm_amdgpu versions as well as a bunch 
> of other stuff.
> 
> Key problem is that older libdrm_amdgpu code did exactly the same thing 
> and called an IOCTL to check if a handle is authenticated or not. If I'm 
> not completely mistaken the radv code could actually be copy&pasted from 
> there.
> 
> Additional to that we might run into problems with the libdrm unit 
> tests, but those probably didn't break obviously because they are almost 
> all the time run as root.

I looked at the very first version of amdgpu_device_initialize. That still
does some other ioctl calls before querying AMDGPU_INFO_ACCEL_WORKING. So
should work even on the very first amdgpu based userspace release.

I also quickly checked for these unit tests, and at least the ones in
libdrm don't have a call to AMDGPU_INFO_ACCEL_WORKING directly after
opening the file. If you want I can also crawl through rocm, amdvlk and
ddx sources.

> > Note that the capability checks are for all ioctl, including driver
> > private ones, i.e. rendering will Just Work. I think it makes sense to
> > have that.
> >
> > I guess there might be 100% overlap between DRM_AUTH and
> > DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean
> > auditing all the driver ioctl tables. I guess we could add that as a
> > todo.rst item. It would make sense that either the driver is isolating
> > clients sufficiently to not need authenticated clients only, or allow
> > them all.
> >
> > I guess we could lament why rendernodes are still not yet fully rolled
> > out everywhere, despite all these years. But then I stopped being
> > surprised about ecosystem inertia, and in the end a strict split
> > between display (done on the primary) and rendering (preferrably done
> > on render nodes) is not required on intel/amd/nv, so not many care to
> > make this work better. The patch from Emil seemed like a neat little
> > trick to get there faster, until the radv hiccup.
> 
> Well that is a clear NAK for that approach.
> 
> If you have problems with libva (which are design problems in libva and 
> not kernel related at all) then please work around that by removing 
> DRM_AUTH from the affected IOCTLs and not by messing up general 
> authentication code.

That's what we've done defacto, except we've done it for all drivers. As
mentioned above, option C would be to disable that for amdgpu. Which seems
silly, given that the workaround hack is just 3 lines.

I also think that the goal of the original patch is pretty sound and
overall well motivated, given the reality that rendernodes aren't super
popular unfortunately. Also, from an access permission pov it's the right
thing to do really.

> Well, what you guys did here is a serious no-go.

Not really understanding your reasons for an unconditional nak on all this
still.
-Daniel

Christian König April 17, 2019, 12:06 p.m. UTC | #8

Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>> [SNIP]
>> Well, what you guys did here is a serious no-go.
> Not really understanding your reasons for an unconditional nak on all this
> still.

Well, let me summarize: You worked around a problem with libva by 
breaking another driver and now proposing a rather ugly and only halve 
backed workaround for that driver.

This is a serious no-go. I've already send out a i915 specific change to 
remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert 
the offending patch.

Please review,
Christian.

> -Daniel

Daniel Vetter April 17, 2019, 12:35 p.m. UTC | #9

On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>> [SNIP]
> >> Well, what you guys did here is a serious no-go.
> > Not really understanding your reasons for an unconditional nak on all this
> > still.
> 
> Well, let me summarize: You worked around a problem with libva by 
> breaking another driver and now proposing a rather ugly and only halve 
> backed workaround for that driver.
> 
> This is a serious no-go. I've already send out a i915 specific change to 
> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert 
> the offending patch.

Oh I'm totally fine with the revert if that's what we go with. But that
still leaves the issue that it would make sense to drop DRM_AUTH from
DRIVER_RENDER capable drivers (not just for libva, but in general). And
there's fundamentally 2 options to do that:

- This one here (or anything implementing the same idea), to keep radv
  working.

- We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
  exactly that's doen doesn't matter, but it leaves amdgpu out in the
  cold. It also doesn't matter whether we get there with revert + lots of
  patches, or a special hack for amdgpu, in the end amdgpu would be
  different. Also note that your patch isn't enough, since it doesn't fix
  the core ioctls.

I think from an overall platform pov, the first is the better option.
After all we try really hard to avoid driver special cases for these kinds
of things.
-Daniel

Christian König April 17, 2019, 12:46 p.m. UTC | #10

Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>>>> <Christian.Koenig@amd.com> wrote:
>>>>> [SNIP]
>>>> Well, what you guys did here is a serious no-go.
>>> Not really understanding your reasons for an unconditional nak on all this
>>> still.
>> Well, let me summarize: You worked around a problem with libva by
>> breaking another driver and now proposing a rather ugly and only halve
>> backed workaround for that driver.
>>
>> This is a serious no-go. I've already send out a i915 specific change to
>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
>> the offending patch.
> Oh I'm totally fine with the revert if that's what we go with. But that
> still leaves the issue that it would make sense to drop DRM_AUTH from
> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> there's fundamentally 2 options to do that:
>
> - This one here (or anything implementing the same idea), to keep radv
>    working.
>
> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
>    exactly that's doen doesn't matter, but it leaves amdgpu out in the
>    cold. It also doesn't matter whether we get there with revert + lots of
>    patches, or a special hack for amdgpu, in the end amdgpu would be
>    different. Also note that your patch isn't enough, since it doesn't fix
>    the core ioctls.
>
> I think from an overall platform pov, the first is the better option.
> After all we try really hard to avoid driver special cases for these kinds
> of things.

Well, I initially thought that radv is doing something unusual or 
broken, but after looking deeper into this that is actually not the case.

What radv does is calling an IOCTL and expecting an error message when 
it is not authenticated.

It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT 
to figure out if it is authenticated or not, but I clearly remember that 
we haven't done this from the beginning.

Thinking more about this I don't think that this problem is actually 
amdgpu specific. So I wouldn't drop DRM_AUTH from all render node 
drivers at all, but only from those who have the specific issue with libva.

Christian.

> -Daniel

Daniel Vetter April 17, 2019, 1:34 p.m. UTC | #11

On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> > On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> > > Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> > > > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> > > > > Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> > > > > > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> > > > > > <Christian.Koenig@amd.com> wrote:
> > > > > > [SNIP]
> > > > > Well, what you guys did here is a serious no-go.
> > > > Not really understanding your reasons for an unconditional nak on all this
> > > > still.
> > > Well, let me summarize: You worked around a problem with libva by
> > > breaking another driver and now proposing a rather ugly and only halve
> > > backed workaround for that driver.
> > > 
> > > This is a serious no-go. I've already send out a i915 specific change to
> > > remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> > > the offending patch.
> > Oh I'm totally fine with the revert if that's what we go with. But that
> > still leaves the issue that it would make sense to drop DRM_AUTH from
> > DRIVER_RENDER capable drivers (not just for libva, but in general). And
> > there's fundamentally 2 options to do that:
> > 
> > - This one here (or anything implementing the same idea), to keep radv
> >    working.
> > 
> > - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >    exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >    cold. It also doesn't matter whether we get there with revert + lots of
> >    patches, or a special hack for amdgpu, in the end amdgpu would be
> >    different. Also note that your patch isn't enough, since it doesn't fix
> >    the core ioctls.
> > 
> > I think from an overall platform pov, the first is the better option.
> > After all we try really hard to avoid driver special cases for these kinds
> > of things.
> 
> Well, I initially thought that radv is doing something unusual or broken,
> but after looking deeper into this that is actually not the case.
> 
> What radv does is calling an IOCTL and expecting an error message when it is
> not authenticated.
> 
> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> figure out if it is authenticated or not, but I clearly remember that we
> haven't done this from the beginning.

Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
First public commit has all the bits: getauth, GetVersion, then the accel
query.

> Thinking more about this I don't think that this problem is actually amdgpu
> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> but only from those who have the specific issue with libva.

libva was the initial motivation, the goal of Emil's patch wasn't to just
duct-tape over libva. That would be easier to fix in libva. The point was
that we should be able to allow this in general.

And we can, except for the radv issue uncovered here.

So please don't get 100% hung up on the libva thing, that wasn't really
the goal, just the initial motivation. And I still thinks it makes for all
drivers. So again we have:

- radv hack
- make amdgpu behave different from everyone else
- keep tilting windmills about "pls use rendernodes, thx"

I neither like tilting windmills nor making drivers behave different for
no reaason at all.
-Daniel

Emil Velikov April 17, 2019, 3:51 p.m. UTC | #12

Hi guys,

On 2019/04/17, Daniel Vetter wrote:
> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> > Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> > > On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> > > > Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> > > > > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> > > > > > Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> > > > > > > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> > > > > > > <Christian.Koenig@amd.com> wrote:
> > > > > > > [SNIP]
> > > > > > Well, what you guys did here is a serious no-go.
> > > > > Not really understanding your reasons for an unconditional nak on all this
> > > > > still.
> > > > Well, let me summarize: You worked around a problem with libva by
> > > > breaking another driver and now proposing a rather ugly and only halve
> > > > backed workaround for that driver.
> > > > 
> > > > This is a serious no-go. I've already send out a i915 specific change to
> > > > remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> > > > the offending patch.
> > > Oh I'm totally fine with the revert if that's what we go with. But that
> > > still leaves the issue that it would make sense to drop DRM_AUTH from
> > > DRIVER_RENDER capable drivers (not just for libva, but in general). And
> > > there's fundamentally 2 options to do that:
> > > 
> > > - This one here (or anything implementing the same idea), to keep radv
> > >    working.
> > > 
> > > - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> > >    exactly that's doen doesn't matter, but it leaves amdgpu out in the
> > >    cold. It also doesn't matter whether we get there with revert + lots of
> > >    patches, or a special hack for amdgpu, in the end amdgpu would be
> > >    different. Also note that your patch isn't enough, since it doesn't fix
> > >    the core ioctls.
> > > 
> > > I think from an overall platform pov, the first is the better option.
> > > After all we try really hard to avoid driver special cases for these kinds
> > > of things.
> > 
> > Well, I initially thought that radv is doing something unusual or broken,
> > but after looking deeper into this that is actually not the case.
> > 
> > What radv does is calling an IOCTL and expecting an error message when it is
> > not authenticated.
> > 
> > It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> > figure out if it is authenticated or not, but I clearly remember that we
> > haven't done this from the beginning.
> 
> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> First public commit has all the bits: getauth, GetVersion, then the accel
> query.
> 
> > Thinking more about this I don't think that this problem is actually amdgpu
> > specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> > but only from those who have the specific issue with libva.
> 
> libva was the initial motivation, the goal of Emil's patch wasn't to just
> duct-tape over libva. That would be easier to fix in libva. The point was
> that we should be able to allow this in general.
> 
> And we can, except for the radv issue uncovered here.
> 
> So please don't get 100% hung up on the libva thing, that wasn't really
> the goal, just the initial motivation. And I still thinks it makes for all
> drivers. So again we have:
> 
> - radv hack
> - make amdgpu behave different from everyone else
> - keep tilting windmills about "pls use rendernodes, thx"
> 
> I neither like tilting windmills nor making drivers behave different for
> no reaason at all.

Allow me to jump-in some emails down the line.

First and foremost, sincere apologies for upsetting you Christian. If
it's of any consolidation - let me assure you the goal is _not_ to break
amdgpu or any other driver.

Secondly, I would like to ask you for a list of projects so we can look and
investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
it.

On the topic of "working around libva" - sadly libva is _not_ the only
offender. We also had bugs in mesa and kmscube.

Considering those are taken as a prime example of "the right way", it's very
likely for the mistakes to be repeated in many other projects.

Where the common "fix" seems to be "run as root"...


As Daniel pointed out, we could be fighting the windmills or we could have a
small, admittedly ugly, workaround for amdgpu.

If you don't like that workaround in the driver we could move it to core DRM.

In either case, I would like to focus on a pragramic solution which works with
both old and new userspace.

Thanks
Emil

Christian König April 17, 2019, 5:30 p.m. UTC | #13

Am 17.04.19 um 17:51 schrieb Emil Velikov:
> Hi guys,
>
> On 2019/04/17, Daniel Vetter wrote:
>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>>> [SNIP]
>>>>>>> Well, what you guys did here is a serious no-go.
>>>>>> Not really understanding your reasons for an unconditional nak on all this
>>>>>> still.
>>>>> Well, let me summarize: You worked around a problem with libva by
>>>>> breaking another driver and now proposing a rather ugly and only halve
>>>>> backed workaround for that driver.
>>>>>
>>>>> This is a serious no-go. I've already send out a i915 specific change to
>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
>>>>> the offending patch.
>>>> Oh I'm totally fine with the revert if that's what we go with. But that
>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
>>>> there's fundamentally 2 options to do that:
>>>>
>>>> - This one here (or anything implementing the same idea), to keep radv
>>>>     working.
>>>>
>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
>>>>     exactly that's doen doesn't matter, but it leaves amdgpu out in the
>>>>     cold. It also doesn't matter whether we get there with revert + lots of
>>>>     patches, or a special hack for amdgpu, in the end amdgpu would be
>>>>     different. Also note that your patch isn't enough, since it doesn't fix
>>>>     the core ioctls.
>>>>
>>>> I think from an overall platform pov, the first is the better option.
>>>> After all we try really hard to avoid driver special cases for these kinds
>>>> of things.
>>> Well, I initially thought that radv is doing something unusual or broken,
>>> but after looking deeper into this that is actually not the case.
>>>
>>> What radv does is calling an IOCTL and expecting an error message when it is
>>> not authenticated.
>>>
>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
>>> figure out if it is authenticated or not, but I clearly remember that we
>>> haven't done this from the beginning.
>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
>> First public commit has all the bits: getauth, GetVersion, then the accel
>> query.
>>
>>> Thinking more about this I don't think that this problem is actually amdgpu
>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
>>> but only from those who have the specific issue with libva.
>> libva was the initial motivation, the goal of Emil's patch wasn't to just
>> duct-tape over libva. That would be easier to fix in libva. The point was
>> that we should be able to allow this in general.
>>
>> And we can, except for the radv issue uncovered here.
>>
>> So please don't get 100% hung up on the libva thing, that wasn't really
>> the goal, just the initial motivation. And I still thinks it makes for all
>> drivers. So again we have:
>>
>> - radv hack
>> - make amdgpu behave different from everyone else
>> - keep tilting windmills about "pls use rendernodes, thx"
>>
>> I neither like tilting windmills nor making drivers behave different for
>> no reaason at all.
> Allow me to jump-in some emails down the line.
>
> First and foremost, sincere apologies for upsetting you Christian. If
> it's of any consolidation - let me assure you the goal is _not_ to break
> amdgpu or any other driver.
>
> Secondly, I would like to ask you for a list of projects so we can look and
> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> it.

That is rather hard to come by, because you would also need to look at 
all previously existing driver stacks.

E.g. with the classic R300 driver for example.

> On the topic of "working around libva" - sadly libva is _not_ the only
> offender. We also had bugs in mesa and kmscube.
>
> Considering those are taken as a prime example of "the right way", it's very
> likely for the mistakes to be repeated in many other projects.
>
> Where the common "fix" seems to be "run as root"...
>
>
> As Daniel pointed out, we could be fighting the windmills or we could have a
> small, admittedly ugly, workaround for amdgpu.
>
> If you don't like that workaround in the driver we could move it to core DRM.
>
> In either case, I would like to focus on a pragramic solution which works with
> both old and new userspace.

Well, I seriously think the original committed solution could cause a 
lot of problems and the issue with radv is only the tip of the iceberg.

I mean we just have to ask our self why have we created render nodes in 
the first place? The obvious alternative was to just removed the 
authentication restrictions on the primary node which would have been 
way less code and maintenance burden.

I need to dig up the mailing list archive, but I strongly think that one 
of the main arguments for this approach was to NOT break existing userspace.

Even taking into account that we now don't need to deal with UMS and 
really really old userspace drivers any more it still feels like a to 
high risk going down that route.

Christian.

>
> Thanks
> Emil

Dave Airlie April 17, 2019, 8:49 p.m. UTC | #14

On Thu, 18 Apr 2019 at 03:30, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 17.04.19 um 17:51 schrieb Emil Velikov:
> > Hi guys,
> >
> > On 2019/04/17, Daniel Vetter wrote:
> >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>>>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>>> [SNIP]
> >>>>>>> Well, what you guys did here is a serious no-go.
> >>>>>> Not really understanding your reasons for an unconditional nak on all this
> >>>>>> still.
> >>>>> Well, let me summarize: You worked around a problem with libva by
> >>>>> breaking another driver and now proposing a rather ugly and only halve
> >>>>> backed workaround for that driver.
> >>>>>
> >>>>> This is a serious no-go. I've already send out a i915 specific change to
> >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> >>>>> the offending patch.
> >>>> Oh I'm totally fine with the revert if that's what we go with. But that
> >>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> >>>> there's fundamentally 2 options to do that:
> >>>>
> >>>> - This one here (or anything implementing the same idea), to keep radv
> >>>>     working.
> >>>>
> >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >>>>     exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >>>>     cold. It also doesn't matter whether we get there with revert + lots of
> >>>>     patches, or a special hack for amdgpu, in the end amdgpu would be
> >>>>     different. Also note that your patch isn't enough, since it doesn't fix
> >>>>     the core ioctls.
> >>>>
> >>>> I think from an overall platform pov, the first is the better option.
> >>>> After all we try really hard to avoid driver special cases for these kinds
> >>>> of things.
> >>> Well, I initially thought that radv is doing something unusual or broken,
> >>> but after looking deeper into this that is actually not the case.
> >>>
> >>> What radv does is calling an IOCTL and expecting an error message when it is
> >>> not authenticated.
> >>>
> >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> >>> figure out if it is authenticated or not, but I clearly remember that we
> >>> haven't done this from the beginning.
> >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> >> First public commit has all the bits: getauth, GetVersion, then the accel
> >> query.
> >>
> >>> Thinking more about this I don't think that this problem is actually amdgpu
> >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> >>> but only from those who have the specific issue with libva.
> >> libva was the initial motivation, the goal of Emil's patch wasn't to just
> >> duct-tape over libva. That would be easier to fix in libva. The point was
> >> that we should be able to allow this in general.
> >>
> >> And we can, except for the radv issue uncovered here.
> >>
> >> So please don't get 100% hung up on the libva thing, that wasn't really
> >> the goal, just the initial motivation. And I still thinks it makes for all
> >> drivers. So again we have:
> >>
> >> - radv hack
> >> - make amdgpu behave different from everyone else
> >> - keep tilting windmills about "pls use rendernodes, thx"
> >>
> >> I neither like tilting windmills nor making drivers behave different for
> >> no reaason at all.
> > Allow me to jump-in some emails down the line.
> >
> > First and foremost, sincere apologies for upsetting you Christian. If
> > it's of any consolidation - let me assure you the goal is _not_ to break
> > amdgpu or any other driver.
> >
> > Secondly, I would like to ask you for a list of projects so we can look and
> > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> > it.
>
> That is rather hard to come by, because you would also need to look at
> all previously existing driver stacks.
>
> E.g. with the classic R300 driver for example.
>
> > On the topic of "working around libva" - sadly libva is _not_ the only
> > offender. We also had bugs in mesa and kmscube.
> >
> > Considering those are taken as a prime example of "the right way", it's very
> > likely for the mistakes to be repeated in many other projects.
> >
> > Where the common "fix" seems to be "run as root"...
> >
> >
> > As Daniel pointed out, we could be fighting the windmills or we could have a
> > small, admittedly ugly, workaround for amdgpu.
> >
> > If you don't like that workaround in the driver we could move it to core DRM.
> >
> > In either case, I would like to focus on a pragramic solution which works with
> > both old and new userspace.
>
> Well, I seriously think the original committed solution could cause a
> lot of problems and the issue with radv is only the tip of the iceberg.
>
> I mean we just have to ask our self why have we created render nodes in
> the first place? The obvious alternative was to just removed the
> authentication restrictions on the primary node which would have been
> way less code and maintenance burden.
>
> I need to dig up the mailing list archive, but I strongly think that one
> of the main arguments for this approach was to NOT break existing userspace.
>
> Even taking into account that we now don't need to deal with UMS and
> really really old userspace drivers any more it still feels like a to
> high risk going down that route.

I'm going to revert the original patch in drm-next now. We've been
getting a bit lax lately on the regressions need to get reverted no
matter what rule, and we are getting close to rc6 and it's Easter, so
I don't want to have to care about this, forget about it, remember it
again, end up at 5.2.

Personally I'm rather in favour of cleaning up the driver ioctls since
I don't like older drivers suddenly having a new ABI without
consideration of side effects.

Dave.

Daniel Vetter April 18, 2019, 6:46 a.m. UTC | #15

On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 17.04.19 um 17:51 schrieb Emil Velikov:
> > Hi guys,
> >
> > On 2019/04/17, Daniel Vetter wrote:
> >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>>>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>>> [SNIP]
> >>>>>>> Well, what you guys did here is a serious no-go.
> >>>>>> Not really understanding your reasons for an unconditional nak on all this
> >>>>>> still.
> >>>>> Well, let me summarize: You worked around a problem with libva by
> >>>>> breaking another driver and now proposing a rather ugly and only halve
> >>>>> backed workaround for that driver.
> >>>>>
> >>>>> This is a serious no-go. I've already send out a i915 specific change to
> >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> >>>>> the offending patch.
> >>>> Oh I'm totally fine with the revert if that's what we go with. But that
> >>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> >>>> there's fundamentally 2 options to do that:
> >>>>
> >>>> - This one here (or anything implementing the same idea), to keep radv
> >>>>     working.
> >>>>
> >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >>>>     exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >>>>     cold. It also doesn't matter whether we get there with revert + lots of
> >>>>     patches, or a special hack for amdgpu, in the end amdgpu would be
> >>>>     different. Also note that your patch isn't enough, since it doesn't fix
> >>>>     the core ioctls.
> >>>>
> >>>> I think from an overall platform pov, the first is the better option.
> >>>> After all we try really hard to avoid driver special cases for these kinds
> >>>> of things.
> >>> Well, I initially thought that radv is doing something unusual or broken,
> >>> but after looking deeper into this that is actually not the case.
> >>>
> >>> What radv does is calling an IOCTL and expecting an error message when it is
> >>> not authenticated.
> >>>
> >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> >>> figure out if it is authenticated or not, but I clearly remember that we
> >>> haven't done this from the beginning.
> >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> >> First public commit has all the bits: getauth, GetVersion, then the accel
> >> query.
> >>
> >>> Thinking more about this I don't think that this problem is actually amdgpu
> >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> >>> but only from those who have the specific issue with libva.
> >> libva was the initial motivation, the goal of Emil's patch wasn't to just
> >> duct-tape over libva. That would be easier to fix in libva. The point was
> >> that we should be able to allow this in general.
> >>
> >> And we can, except for the radv issue uncovered here.
> >>
> >> So please don't get 100% hung up on the libva thing, that wasn't really
> >> the goal, just the initial motivation. And I still thinks it makes for all
> >> drivers. So again we have:
> >>
> >> - radv hack
> >> - make amdgpu behave different from everyone else
> >> - keep tilting windmills about "pls use rendernodes, thx"
> >>
> >> I neither like tilting windmills nor making drivers behave different for
> >> no reaason at all.
> > Allow me to jump-in some emails down the line.
> >
> > First and foremost, sincere apologies for upsetting you Christian. If
> > it's of any consolidation - let me assure you the goal is _not_ to break
> > amdgpu or any other driver.
> >
> > Secondly, I would like to ask you for a list of projects so we can look and
> > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> > it.
>
> That is rather hard to come by, because you would also need to look at
> all previously existing driver stacks.
>
> E.g. with the classic R300 driver for example.
>
> > On the topic of "working around libva" - sadly libva is _not_ the only
> > offender. We also had bugs in mesa and kmscube.
> >
> > Considering those are taken as a prime example of "the right way", it's very
> > likely for the mistakes to be repeated in many other projects.
> >
> > Where the common "fix" seems to be "run as root"...
> >
> >
> > As Daniel pointed out, we could be fighting the windmills or we could have a
> > small, admittedly ugly, workaround for amdgpu.
> >
> > If you don't like that workaround in the driver we could move it to core DRM.
> >
> > In either case, I would like to focus on a pragramic solution which works with
> > both old and new userspace.
>
> Well, I seriously think the original committed solution could cause a
> lot of problems and the issue with radv is only the tip of the iceberg.
>
> I mean we just have to ask our self why have we created render nodes in
> the first place? The obvious alternative was to just removed the
> authentication restrictions on the primary node which would have been
> way less code and maintenance burden.

The render nodes exist so you can have different unix permissions for
rendering as for displaying stuff. E.g. run the compositor as a
distinct user from all its clients. The other reason was to have a
clean interface without any of the historical baggage (to make stuff
like container/vm pass-through easier, since we'd know that
render-node userspace won't ever touch any of the old crap ioctls). I
don't think there's ever been a concern about breaking backwards
compatibility.

The issue here is that radv checks for "can I render", when it wants
to check for "can I display". So anything that only renders we can
ignore. That leaves the various ddx around, which historically have
run as root, or if they run as non-root, logind or similar is giving
them the master kms node already in master mode (afaiui at least).
That leaves the non-main compositor userspace, of which there's only
vkdisplay. I don't see much additional possibilities for regressions
to creep in.

Emil probably has a better grasp on this, since he's got a bunch of
patches to roll out drmIsMaster all over.

> I need to dig up the mailing list archive, but I strongly think that one
> of the main arguments for this approach was to NOT break existing userspace.
>
> Even taking into account that we now don't need to deal with UMS and
> really really old userspace drivers any more it still feels like a to
> high risk going down that route.

We change the undefined corners of the uapi all the time to improve
the overall ecosystem, and occasionally we end up with
DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
think the choice here is how exactly amdgpu.ko copes with radv:
- the hack proposed here
- a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
let you folks deal with a (likely, not guaranteed) influx "but this
works on $any_other_driver"

I guess up to you which patch Emil should include when he resubmits
the patch for 5.3. We've done both in the past, but generally we're
trying to limit the impact of these hacks as much as possible. Hence
why I'm still in favour of the first one.

"keep tilting windmills" is imo not an option, and I'm happy to handle
any other fallout Emil's patch uncovers, if there is anything else - I
did sketch this hack here quickly on irc right when we got the report.
-Daniel

Christian König April 18, 2019, 8:06 a.m. UTC | #16

Am 18.04.19 um 08:46 schrieb Daniel Vetter:
> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 17.04.19 um 17:51 schrieb Emil Velikov:
>>> Hi guys,
>>>
>>> On 2019/04/17, Daniel Vetter wrote:
>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>>>>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>>>>> [SNIP]
>>>>>>>>> Well, what you guys did here is a serious no-go.
>>>>>>>> Not really understanding your reasons for an unconditional nak on all this
>>>>>>>> still.
>>>>>>> Well, let me summarize: You worked around a problem with libva by
>>>>>>> breaking another driver and now proposing a rather ugly and only halve
>>>>>>> backed workaround for that driver.
>>>>>>>
>>>>>>> This is a serious no-go. I've already send out a i915 specific change to
>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
>>>>>>> the offending patch.
>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that
>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
>>>>>> there's fundamentally 2 options to do that:
>>>>>>
>>>>>> - This one here (or anything implementing the same idea), to keep radv
>>>>>>      working.
>>>>>>
>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
>>>>>>      exactly that's doen doesn't matter, but it leaves amdgpu out in the
>>>>>>      cold. It also doesn't matter whether we get there with revert + lots of
>>>>>>      patches, or a special hack for amdgpu, in the end amdgpu would be
>>>>>>      different. Also note that your patch isn't enough, since it doesn't fix
>>>>>>      the core ioctls.
>>>>>>
>>>>>> I think from an overall platform pov, the first is the better option.
>>>>>> After all we try really hard to avoid driver special cases for these kinds
>>>>>> of things.
>>>>> Well, I initially thought that radv is doing something unusual or broken,
>>>>> but after looking deeper into this that is actually not the case.
>>>>>
>>>>> What radv does is calling an IOCTL and expecting an error message when it is
>>>>> not authenticated.
>>>>>
>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
>>>>> figure out if it is authenticated or not, but I clearly remember that we
>>>>> haven't done this from the beginning.
>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
>>>> First public commit has all the bits: getauth, GetVersion, then the accel
>>>> query.
>>>>
>>>>> Thinking more about this I don't think that this problem is actually amdgpu
>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
>>>>> but only from those who have the specific issue with libva.
>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just
>>>> duct-tape over libva. That would be easier to fix in libva. The point was
>>>> that we should be able to allow this in general.
>>>>
>>>> And we can, except for the radv issue uncovered here.
>>>>
>>>> So please don't get 100% hung up on the libva thing, that wasn't really
>>>> the goal, just the initial motivation. And I still thinks it makes for all
>>>> drivers. So again we have:
>>>>
>>>> - radv hack
>>>> - make amdgpu behave different from everyone else
>>>> - keep tilting windmills about "pls use rendernodes, thx"
>>>>
>>>> I neither like tilting windmills nor making drivers behave different for
>>>> no reaason at all.
>>> Allow me to jump-in some emails down the line.
>>>
>>> First and foremost, sincere apologies for upsetting you Christian. If
>>> it's of any consolidation - let me assure you the goal is _not_ to break
>>> amdgpu or any other driver.
>>>
>>> Secondly, I would like to ask you for a list of projects so we can look and
>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
>>> it.
>> That is rather hard to come by, because you would also need to look at
>> all previously existing driver stacks.
>>
>> E.g. with the classic R300 driver for example.
>>
>>> On the topic of "working around libva" - sadly libva is _not_ the only
>>> offender. We also had bugs in mesa and kmscube.
>>>
>>> Considering those are taken as a prime example of "the right way", it's very
>>> likely for the mistakes to be repeated in many other projects.
>>>
>>> Where the common "fix" seems to be "run as root"...
>>>
>>>
>>> As Daniel pointed out, we could be fighting the windmills or we could have a
>>> small, admittedly ugly, workaround for amdgpu.
>>>
>>> If you don't like that workaround in the driver we could move it to core DRM.
>>>
>>> In either case, I would like to focus on a pragramic solution which works with
>>> both old and new userspace.
>> Well, I seriously think the original committed solution could cause a
>> lot of problems and the issue with radv is only the tip of the iceberg.
>>
>> I mean we just have to ask our self why have we created render nodes in
>> the first place? The obvious alternative was to just removed the
>> authentication restrictions on the primary node which would have been
>> way less code and maintenance burden.
> The render nodes exist so you can have different unix permissions for
> rendering as for displaying stuff. E.g. run the compositor as a
> distinct user from all its clients. The other reason was to have a
> clean interface without any of the historical baggage (to make stuff
> like container/vm pass-through easier, since we'd know that
> render-node userspace won't ever touch any of the old crap ioctls). I
> don't think there's ever been a concern about breaking backwards
> compatibility.
>
> The issue here is that radv checks for "can I render", when it wants
> to check for "can I display". So anything that only renders we can
> ignore. That leaves the various ddx around, which historically have
> run as root, or if they run as non-root, logind or similar is giving
> them the master kms node already in master mode (afaiui at least).
> That leaves the non-main compositor userspace, of which there's only
> vkdisplay. I don't see much additional possibilities for regressions
> to creep in.
>
> Emil probably has a better grasp on this, since he's got a bunch of
> patches to roll out drmIsMaster all over.
>
>> I need to dig up the mailing list archive, but I strongly think that one
>> of the main arguments for this approach was to NOT break existing userspace.
>>
>> Even taking into account that we now don't need to deal with UMS and
>> really really old userspace drivers any more it still feels like a to
>> high risk going down that route.
> We change the undefined corners of the uapi all the time to improve
> the overall ecosystem, and occasionally we end up with
> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
> think the choice here is how exactly amdgpu.ko copes with radv:
> - the hack proposed here
> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
> let you folks deal with a (likely, not guaranteed) influx "but this
> works on $any_other_driver"
>
> I guess up to you which patch Emil should include when he resubmits
> the patch for 5.3. We've done both in the past, but generally we're
> trying to limit the impact of these hacks as much as possible. Hence
> why I'm still in favour of the first one.

Well and I will still NAK both approaches.

You can of course go ahead and remove the DRM_AUTH flags from the i915 
IOCTLs to make libva happy, but please not a general approach which 
touches all DRM drivers.

And from Dave's response I think he is following my argumentation here, 
so I don't see a general approach to be added for 5.3 either.

> "keep tilting windmills" is imo not an option, and I'm happy to handle
> any other fallout Emil's patch uncovers, if there is anything else - I
> did sketch this hack here quickly on irc right when we got the report.

Well this is not about tilting windmills, but rather good maintenance of 
backward compatibility and rejecting changing with potential unforeseen 
consequences.

Regards,
Christian.

> -Daniel

Daniel Vetter April 18, 2019, 8:26 a.m. UTC | #17

On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote:
> Am 18.04.19 um 08:46 schrieb Daniel Vetter:
> > On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 17.04.19 um 17:51 schrieb Emil Velikov:
> >>> Hi guys,
> >>>
> >>> On 2019/04/17, Daniel Vetter wrote:
> >>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> >>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> >>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> >>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> >>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>>>>>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>>>>> [SNIP]
> >>>>>>>>> Well, what you guys did here is a serious no-go.
> >>>>>>>> Not really understanding your reasons for an unconditional nak on all this
> >>>>>>>> still.
> >>>>>>> Well, let me summarize: You worked around a problem with libva by
> >>>>>>> breaking another driver and now proposing a rather ugly and only halve
> >>>>>>> backed workaround for that driver.
> >>>>>>>
> >>>>>>> This is a serious no-go. I've already send out a i915 specific change to
> >>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> >>>>>>> the offending patch.
> >>>>>> Oh I'm totally fine with the revert if that's what we go with. But that
> >>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> >>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> >>>>>> there's fundamentally 2 options to do that:
> >>>>>>
> >>>>>> - This one here (or anything implementing the same idea), to keep radv
> >>>>>>      working.
> >>>>>>
> >>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >>>>>>      exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >>>>>>      cold. It also doesn't matter whether we get there with revert + lots of
> >>>>>>      patches, or a special hack for amdgpu, in the end amdgpu would be
> >>>>>>      different. Also note that your patch isn't enough, since it doesn't fix
> >>>>>>      the core ioctls.
> >>>>>>
> >>>>>> I think from an overall platform pov, the first is the better option.
> >>>>>> After all we try really hard to avoid driver special cases for these kinds
> >>>>>> of things.
> >>>>> Well, I initially thought that radv is doing something unusual or broken,
> >>>>> but after looking deeper into this that is actually not the case.
> >>>>>
> >>>>> What radv does is calling an IOCTL and expecting an error message when it is
> >>>>> not authenticated.
> >>>>>
> >>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> >>>>> figure out if it is authenticated or not, but I clearly remember that we
> >>>>> haven't done this from the beginning.
> >>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> >>>> First public commit has all the bits: getauth, GetVersion, then the accel
> >>>> query.
> >>>>
> >>>>> Thinking more about this I don't think that this problem is actually amdgpu
> >>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> >>>>> but only from those who have the specific issue with libva.
> >>>> libva was the initial motivation, the goal of Emil's patch wasn't to just
> >>>> duct-tape over libva. That would be easier to fix in libva. The point was
> >>>> that we should be able to allow this in general.
> >>>>
> >>>> And we can, except for the radv issue uncovered here.
> >>>>
> >>>> So please don't get 100% hung up on the libva thing, that wasn't really
> >>>> the goal, just the initial motivation. And I still thinks it makes for all
> >>>> drivers. So again we have:
> >>>>
> >>>> - radv hack
> >>>> - make amdgpu behave different from everyone else
> >>>> - keep tilting windmills about "pls use rendernodes, thx"
> >>>>
> >>>> I neither like tilting windmills nor making drivers behave different for
> >>>> no reaason at all.
> >>> Allow me to jump-in some emails down the line.
> >>>
> >>> First and foremost, sincere apologies for upsetting you Christian. If
> >>> it's of any consolidation - let me assure you the goal is _not_ to break
> >>> amdgpu or any other driver.
> >>>
> >>> Secondly, I would like to ask you for a list of projects so we can look and
> >>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> >>> it.
> >> That is rather hard to come by, because you would also need to look at
> >> all previously existing driver stacks.
> >>
> >> E.g. with the classic R300 driver for example.
> >>
> >>> On the topic of "working around libva" - sadly libva is _not_ the only
> >>> offender. We also had bugs in mesa and kmscube.
> >>>
> >>> Considering those are taken as a prime example of "the right way", it's very
> >>> likely for the mistakes to be repeated in many other projects.
> >>>
> >>> Where the common "fix" seems to be "run as root"...
> >>>
> >>>
> >>> As Daniel pointed out, we could be fighting the windmills or we could have a
> >>> small, admittedly ugly, workaround for amdgpu.
> >>>
> >>> If you don't like that workaround in the driver we could move it to core DRM.
> >>>
> >>> In either case, I would like to focus on a pragramic solution which works with
> >>> both old and new userspace.
> >> Well, I seriously think the original committed solution could cause a
> >> lot of problems and the issue with radv is only the tip of the iceberg.
> >>
> >> I mean we just have to ask our self why have we created render nodes in
> >> the first place? The obvious alternative was to just removed the
> >> authentication restrictions on the primary node which would have been
> >> way less code and maintenance burden.
> > The render nodes exist so you can have different unix permissions for
> > rendering as for displaying stuff. E.g. run the compositor as a
> > distinct user from all its clients. The other reason was to have a
> > clean interface without any of the historical baggage (to make stuff
> > like container/vm pass-through easier, since we'd know that
> > render-node userspace won't ever touch any of the old crap ioctls). I
> > don't think there's ever been a concern about breaking backwards
> > compatibility.
> >
> > The issue here is that radv checks for "can I render", when it wants
> > to check for "can I display". So anything that only renders we can
> > ignore. That leaves the various ddx around, which historically have
> > run as root, or if they run as non-root, logind or similar is giving
> > them the master kms node already in master mode (afaiui at least).
> > That leaves the non-main compositor userspace, of which there's only
> > vkdisplay. I don't see much additional possibilities for regressions
> > to creep in.
> >
> > Emil probably has a better grasp on this, since he's got a bunch of
> > patches to roll out drmIsMaster all over.
> >
> >> I need to dig up the mailing list archive, but I strongly think that one
> >> of the main arguments for this approach was to NOT break existing userspace.
> >>
> >> Even taking into account that we now don't need to deal with UMS and
> >> really really old userspace drivers any more it still feels like a to
> >> high risk going down that route.
> > We change the undefined corners of the uapi all the time to improve
> > the overall ecosystem, and occasionally we end up with
> > DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
> > think the choice here is how exactly amdgpu.ko copes with radv:
> > - the hack proposed here
> > - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
> > let you folks deal with a (likely, not guaranteed) influx "but this
> > works on $any_other_driver"
> >
> > I guess up to you which patch Emil should include when he resubmits
> > the patch for 5.3. We've done both in the past, but generally we're
> > trying to limit the impact of these hacks as much as possible. Hence
> > why I'm still in favour of the first one.
> 
> Well and I will still NAK both approaches.
> 
> You can of course go ahead and remove the DRM_AUTH flags from the i915 
> IOCTLs to make libva happy, but please not a general approach which 
> touches all DRM drivers.
> 
> And from Dave's response I think he is following my argumentation here, 
> so I don't see a general approach to be added for 5.3 either.
> 
> > "keep tilting windmills" is imo not an option, and I'm happy to handle
> > any other fallout Emil's patch uncovers, if there is anything else - I
> > did sketch this hack here quickly on irc right when we got the report.
> 
> Well this is not about tilting windmills, but rather good maintenance of 
> backward compatibility and rejecting changing with potential unforeseen 
> consequences.

We (well I) have been doing stuff like this since forever. It's just that
thus far it hasn't been amd that stuck out with the need for a hack, but
other drivers. It's a tradeoff in the end, and a tradeoff we're can do due
to our open source userspace requirement - in case something does go
sideways, we have the full history of anything affected. So you're
proposed we revert all these other changes and cleanups too, because they
could be risky (many turned out to actually be risky, like this one here
too)?

Ok correction: amd has stuck out in the past too, there was some vblank vs
pageflip stuff where we needed to do some pretty clever tricks to both
have the new stricter semantics for everyone, while keeping the amd ddx
happy still. This aint new at all, we fixed up the regressions and moved
on.

tldr; You're categorical rejection doesn't make sense to me.
Slightly over the top, but this feels like gut reaction to radv being a
thorn for amd.
-Daniel

Michel Dänzer April 18, 2019, 8:52 a.m. UTC | #18

On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
> 
> Ok correction: amd has stuck out in the past too, there was some vblank vs
> pageflip stuff where we needed to do some pretty clever tricks to both
> have the new stricter semantics for everyone, while keeping the amd ddx
> happy still. This aint new at all, we fixed up the regressions and moved
> on.

That sounds like something I would have been involved in, but I'm not
sure what you mean...

I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
userspace to take advantage of our hardware's capabilities to avoid
delaying a page flip by one refresh cycle in some cases. That's all.

> tldr; You're categorical rejection doesn't make sense to me.
> Slightly over the top, but this feels like gut reaction to radv being a
> thorn for amd.

That's a weird insinuation. If RADV was a "thorn for AMD", surely we'd
be having a party about this regression instead of insisting that it be
fixed thoroughly.

Christian König April 18, 2019, 8:56 a.m. UTC | #19

Am 18.04.19 um 10:26 schrieb Daniel Vetter:
> On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote:
>> Am 18.04.19 um 08:46 schrieb Daniel Vetter:
>>> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 17.04.19 um 17:51 schrieb Emil Velikov:
>>>>> Hi guys,
>>>>>
>>>>> On 2019/04/17, Daniel Vetter wrote:
>>>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
>>>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
>>>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
>>>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
>>>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>>>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>>>>>>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>>>>>>> [SNIP]
>>>>>>>>>>> Well, what you guys did here is a serious no-go.
>>>>>>>>>> Not really understanding your reasons for an unconditional nak on all this
>>>>>>>>>> still.
>>>>>>>>> Well, let me summarize: You worked around a problem with libva by
>>>>>>>>> breaking another driver and now proposing a rather ugly and only halve
>>>>>>>>> backed workaround for that driver.
>>>>>>>>>
>>>>>>>>> This is a serious no-go. I've already send out a i915 specific change to
>>>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
>>>>>>>>> the offending patch.
>>>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that
>>>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
>>>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
>>>>>>>> there's fundamentally 2 options to do that:
>>>>>>>>
>>>>>>>> - This one here (or anything implementing the same idea), to keep radv
>>>>>>>>       working.
>>>>>>>>
>>>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
>>>>>>>>       exactly that's doen doesn't matter, but it leaves amdgpu out in the
>>>>>>>>       cold. It also doesn't matter whether we get there with revert + lots of
>>>>>>>>       patches, or a special hack for amdgpu, in the end amdgpu would be
>>>>>>>>       different. Also note that your patch isn't enough, since it doesn't fix
>>>>>>>>       the core ioctls.
>>>>>>>>
>>>>>>>> I think from an overall platform pov, the first is the better option.
>>>>>>>> After all we try really hard to avoid driver special cases for these kinds
>>>>>>>> of things.
>>>>>>> Well, I initially thought that radv is doing something unusual or broken,
>>>>>>> but after looking deeper into this that is actually not the case.
>>>>>>>
>>>>>>> What radv does is calling an IOCTL and expecting an error message when it is
>>>>>>> not authenticated.
>>>>>>>
>>>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
>>>>>>> figure out if it is authenticated or not, but I clearly remember that we
>>>>>>> haven't done this from the beginning.
>>>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
>>>>>> First public commit has all the bits: getauth, GetVersion, then the accel
>>>>>> query.
>>>>>>
>>>>>>> Thinking more about this I don't think that this problem is actually amdgpu
>>>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
>>>>>>> but only from those who have the specific issue with libva.
>>>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just
>>>>>> duct-tape over libva. That would be easier to fix in libva. The point was
>>>>>> that we should be able to allow this in general.
>>>>>>
>>>>>> And we can, except for the radv issue uncovered here.
>>>>>>
>>>>>> So please don't get 100% hung up on the libva thing, that wasn't really
>>>>>> the goal, just the initial motivation. And I still thinks it makes for all
>>>>>> drivers. So again we have:
>>>>>>
>>>>>> - radv hack
>>>>>> - make amdgpu behave different from everyone else
>>>>>> - keep tilting windmills about "pls use rendernodes, thx"
>>>>>>
>>>>>> I neither like tilting windmills nor making drivers behave different for
>>>>>> no reaason at all.
>>>>> Allow me to jump-in some emails down the line.
>>>>>
>>>>> First and foremost, sincere apologies for upsetting you Christian. If
>>>>> it's of any consolidation - let me assure you the goal is _not_ to break
>>>>> amdgpu or any other driver.
>>>>>
>>>>> Secondly, I would like to ask you for a list of projects so we can look and
>>>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
>>>>> it.
>>>> That is rather hard to come by, because you would also need to look at
>>>> all previously existing driver stacks.
>>>>
>>>> E.g. with the classic R300 driver for example.
>>>>
>>>>> On the topic of "working around libva" - sadly libva is _not_ the only
>>>>> offender. We also had bugs in mesa and kmscube.
>>>>>
>>>>> Considering those are taken as a prime example of "the right way", it's very
>>>>> likely for the mistakes to be repeated in many other projects.
>>>>>
>>>>> Where the common "fix" seems to be "run as root"...
>>>>>
>>>>>
>>>>> As Daniel pointed out, we could be fighting the windmills or we could have a
>>>>> small, admittedly ugly, workaround for amdgpu.
>>>>>
>>>>> If you don't like that workaround in the driver we could move it to core DRM.
>>>>>
>>>>> In either case, I would like to focus on a pragramic solution which works with
>>>>> both old and new userspace.
>>>> Well, I seriously think the original committed solution could cause a
>>>> lot of problems and the issue with radv is only the tip of the iceberg.
>>>>
>>>> I mean we just have to ask our self why have we created render nodes in
>>>> the first place? The obvious alternative was to just removed the
>>>> authentication restrictions on the primary node which would have been
>>>> way less code and maintenance burden.
>>> The render nodes exist so you can have different unix permissions for
>>> rendering as for displaying stuff. E.g. run the compositor as a
>>> distinct user from all its clients. The other reason was to have a
>>> clean interface without any of the historical baggage (to make stuff
>>> like container/vm pass-through easier, since we'd know that
>>> render-node userspace won't ever touch any of the old crap ioctls). I
>>> don't think there's ever been a concern about breaking backwards
>>> compatibility.
>>>
>>> The issue here is that radv checks for "can I render", when it wants
>>> to check for "can I display". So anything that only renders we can
>>> ignore. That leaves the various ddx around, which historically have
>>> run as root, or if they run as non-root, logind or similar is giving
>>> them the master kms node already in master mode (afaiui at least).
>>> That leaves the non-main compositor userspace, of which there's only
>>> vkdisplay. I don't see much additional possibilities for regressions
>>> to creep in.
>>>
>>> Emil probably has a better grasp on this, since he's got a bunch of
>>> patches to roll out drmIsMaster all over.
>>>
>>>> I need to dig up the mailing list archive, but I strongly think that one
>>>> of the main arguments for this approach was to NOT break existing userspace.
>>>>
>>>> Even taking into account that we now don't need to deal with UMS and
>>>> really really old userspace drivers any more it still feels like a to
>>>> high risk going down that route.
>>> We change the undefined corners of the uapi all the time to improve
>>> the overall ecosystem, and occasionally we end up with
>>> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
>>> think the choice here is how exactly amdgpu.ko copes with radv:
>>> - the hack proposed here
>>> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
>>> let you folks deal with a (likely, not guaranteed) influx "but this
>>> works on $any_other_driver"
>>>
>>> I guess up to you which patch Emil should include when he resubmits
>>> the patch for 5.3. We've done both in the past, but generally we're
>>> trying to limit the impact of these hacks as much as possible. Hence
>>> why I'm still in favour of the first one.
>> Well and I will still NAK both approaches.
>>
>> You can of course go ahead and remove the DRM_AUTH flags from the i915
>> IOCTLs to make libva happy, but please not a general approach which
>> touches all DRM drivers.
>>
>> And from Dave's response I think he is following my argumentation here,
>> so I don't see a general approach to be added for 5.3 either.
>>
>>> "keep tilting windmills" is imo not an option, and I'm happy to handle
>>> any other fallout Emil's patch uncovers, if there is anything else - I
>>> did sketch this hack here quickly on irc right when we got the report.
>> Well this is not about tilting windmills, but rather good maintenance of
>> backward compatibility and rejecting changing with potential unforeseen
>> consequences.
> We (well I) have been doing stuff like this since forever. It's just that
> thus far it hasn't been amd that stuck out with the need for a hack, but
> other drivers. It's a tradeoff in the end, and a tradeoff we're can do due
> to our open source userspace requirement - in case something does go
> sideways, we have the full history of anything affected. So you're
> proposed we revert all these other changes and cleanups too, because they
> could be risky (many turned out to actually be risky, like this one here
> too)?

Well in general yes, a lot of those changes looked unnecessary risky to me.

We of course can't revert those changes now because they are now UAPI as 
well and it would be risky to modify them. But I think we should have a 
much stricter approach to modifying exiting UAPI.

Additional to that I actually think that the requirement of open source 
userspace stacks is a mistake.

This requirement came just from exactly that approach to try to modify 
exiting UAPI instead of accepting the fact that UAPI backward 
compatibility is something you should not mess with.

Nobody else in the kernel is having that open source user space 
requirement and it actually seems that a lot of people see that as just 
a justification to keep Mesa alive.

What you can do is to require unit tests so that everybody is able to 
test for regressions independent of installed userspace software.

> Ok correction: amd has stuck out in the past too, there was some vblank vs
> pageflip stuff where we needed to do some pretty clever tricks to both
> have the new stricter semantics for everyone, while keeping the amd ddx
> happy still. This aint new at all, we fixed up the regressions and moved
> on.

I'm wasn't much involved into this, so I can't say anything about it.

> tldr; You're categorical rejection doesn't make sense to me.
> Slightly over the top, but this feels like gut reaction to radv being a
> thorn for amd.

Well radv is pointing out exactly what's going wrong with AMDs internal 
development process, so I'm really in favor of that.

And I'm certainly in favor or open source projects in general, but as 
kernel developer I must not favor a closed or open userspace stack 
running on top of my driver.

What I care about is good UAPI design, maintainable code and not 
re-inventing the wheel to often, but what stack runs on top is a 
political decision and not a technical one.

Regards,
Christian.

> -Daniel

Daniel Vetter April 18, 2019, 9:11 a.m. UTC | #20

On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
> >
> > Ok correction: amd has stuck out in the past too, there was some vblank vs
> > pageflip stuff where we needed to do some pretty clever tricks to both
> > have the new stricter semantics for everyone, while keeping the amd ddx
> > happy still. This aint new at all, we fixed up the regressions and moved
> > on.
>
> That sounds like something I would have been involved in, but I'm not
> sure what you mean...
>
> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
> userspace to take advantage of our hardware's capabilities to avoid
> delaying a page flip by one refresh cycle in some cases. That's all.

That's the one I meant. At least I remember quite a pile of
discussions around why i915 gets to define how this stuff works, and
amdgpu/radeon having to follow.

> > tldr; You're categorical rejection doesn't make sense to me.
> > Slightly over the top, but this feels like gut reaction to radv being a
> > thorn for amd.
>
> That's a weird insinuation. If RADV was a "thorn for AMD", surely we'd
> be having a party about this regression instead of insisting that it be
> fixed thoroughly.

It is fixed. The discussion is who we can evolve the uapi for primary
nodes in an imo useful way (Emil explained the motivations much better
than me), without breaking radv. Christian's take that we flat out
never do that kind of (risky, but it's a trade-off) uapi evolution
just isn't true. And the radv hack under discussion here also isn't
the worst hack we've ever shipped I think in the past to handle the
occasional fallout.

Some other examples that score pretty high on the awful scale:
- legacy context hack for nouvea. Leaves a bunch of root-holes open
for nouveau, in the name of backwards compat. Dave just posted a patch
to at least close those on modern distros
- adfb hack for nouveau 10 bpc rbg formats
- the native endian trickery from Gerd Hoffman
- the vblank vs flip clarifiation that required new uapi and upgraded
amd ddx (really not pretty if we have to somewhat break a performance
feature for some users)
- ...

That's just the ones I remember from the top of my mind, because they
all still leave a gross aftertastes behind. This stuff is pretty
common. Many of these took a few rounds to get there and catch all the
fallout and handle it somehow, and occasionally we just drop it and
live with uapi that's a bit ill-defined or inconsistent across drivers
or not as useful as we'd like it to be. This is the first time I get
an unconditional Nack on such an effort.
-Daniel

Michel Dänzer April 18, 2019, 9:22 a.m. UTC | #21

On 2019-04-18 11:11 a.m., Daniel Vetter wrote:
> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote:
>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
>>>
>>> Ok correction: amd has stuck out in the past too, there was some vblank vs
>>> pageflip stuff where we needed to do some pretty clever tricks to both
>>> have the new stricter semantics for everyone, while keeping the amd ddx
>>> happy still. This aint new at all, we fixed up the regressions and moved
>>> on.
>>
>> That sounds like something I would have been involved in, but I'm not
>> sure what you mean...
>>
>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
>> userspace to take advantage of our hardware's capabilities to avoid
>> delaying a page flip by one refresh cycle in some cases. That's all.
> 
> That's the one I meant. At least I remember quite a pile of
> discussions around why i915 gets to define how this stuff works, and
> amdgpu/radeon having to follow.
> 
> [...]
> 
> - the vblank vs flip clarifiation that required new uapi and upgraded
> amd ddx (really not pretty if we have to somewhat break a performance
> feature for some users)

You're totally mis-representing that IMO.

The amdgpu/radeon kernel drivers were simply broken WRT established UAPI
rules, in a way which affected our DDX drivers as well. The kernel
driver fixes didn't require any corresponding userspace changes.

New UAPI was added to allow taking advantage of hardware capabilities in
some cases without breaking other cases.

It's a pretty text-book example of how to manage UAPI.

Daniel Vetter April 18, 2019, 9:25 a.m. UTC | #22

On Thu, Apr 18, 2019 at 10:56 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 18.04.19 um 10:26 schrieb Daniel Vetter:
> > On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote:
> >> Am 18.04.19 um 08:46 schrieb Daniel Vetter:
> >>> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Am 17.04.19 um 17:51 schrieb Emil Velikov:
> >>>>> Hi guys,
> >>>>>
> >>>>> On 2019/04/17, Daniel Vetter wrote:
> >>>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> >>>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> >>>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> >>>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> >>>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >>>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>>>>>>>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>>>>>>> [SNIP]
> >>>>>>>>>>> Well, what you guys did here is a serious no-go.
> >>>>>>>>>> Not really understanding your reasons for an unconditional nak on all this
> >>>>>>>>>> still.
> >>>>>>>>> Well, let me summarize: You worked around a problem with libva by
> >>>>>>>>> breaking another driver and now proposing a rather ugly and only halve
> >>>>>>>>> backed workaround for that driver.
> >>>>>>>>>
> >>>>>>>>> This is a serious no-go. I've already send out a i915 specific change to
> >>>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> >>>>>>>>> the offending patch.
> >>>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that
> >>>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> >>>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> >>>>>>>> there's fundamentally 2 options to do that:
> >>>>>>>>
> >>>>>>>> - This one here (or anything implementing the same idea), to keep radv
> >>>>>>>>       working.
> >>>>>>>>
> >>>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >>>>>>>>       exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >>>>>>>>       cold. It also doesn't matter whether we get there with revert + lots of
> >>>>>>>>       patches, or a special hack for amdgpu, in the end amdgpu would be
> >>>>>>>>       different. Also note that your patch isn't enough, since it doesn't fix
> >>>>>>>>       the core ioctls.
> >>>>>>>>
> >>>>>>>> I think from an overall platform pov, the first is the better option.
> >>>>>>>> After all we try really hard to avoid driver special cases for these kinds
> >>>>>>>> of things.
> >>>>>>> Well, I initially thought that radv is doing something unusual or broken,
> >>>>>>> but after looking deeper into this that is actually not the case.
> >>>>>>>
> >>>>>>> What radv does is calling an IOCTL and expecting an error message when it is
> >>>>>>> not authenticated.
> >>>>>>>
> >>>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> >>>>>>> figure out if it is authenticated or not, but I clearly remember that we
> >>>>>>> haven't done this from the beginning.
> >>>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> >>>>>> First public commit has all the bits: getauth, GetVersion, then the accel
> >>>>>> query.
> >>>>>>
> >>>>>>> Thinking more about this I don't think that this problem is actually amdgpu
> >>>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> >>>>>>> but only from those who have the specific issue with libva.
> >>>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just
> >>>>>> duct-tape over libva. That would be easier to fix in libva. The point was
> >>>>>> that we should be able to allow this in general.
> >>>>>>
> >>>>>> And we can, except for the radv issue uncovered here.
> >>>>>>
> >>>>>> So please don't get 100% hung up on the libva thing, that wasn't really
> >>>>>> the goal, just the initial motivation. And I still thinks it makes for all
> >>>>>> drivers. So again we have:
> >>>>>>
> >>>>>> - radv hack
> >>>>>> - make amdgpu behave different from everyone else
> >>>>>> - keep tilting windmills about "pls use rendernodes, thx"
> >>>>>>
> >>>>>> I neither like tilting windmills nor making drivers behave different for
> >>>>>> no reaason at all.
> >>>>> Allow me to jump-in some emails down the line.
> >>>>>
> >>>>> First and foremost, sincere apologies for upsetting you Christian. If
> >>>>> it's of any consolidation - let me assure you the goal is _not_ to break
> >>>>> amdgpu or any other driver.
> >>>>>
> >>>>> Secondly, I would like to ask you for a list of projects so we can look and
> >>>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> >>>>> it.
> >>>> That is rather hard to come by, because you would also need to look at
> >>>> all previously existing driver stacks.
> >>>>
> >>>> E.g. with the classic R300 driver for example.
> >>>>
> >>>>> On the topic of "working around libva" - sadly libva is _not_ the only
> >>>>> offender. We also had bugs in mesa and kmscube.
> >>>>>
> >>>>> Considering those are taken as a prime example of "the right way", it's very
> >>>>> likely for the mistakes to be repeated in many other projects.
> >>>>>
> >>>>> Where the common "fix" seems to be "run as root"...
> >>>>>
> >>>>>
> >>>>> As Daniel pointed out, we could be fighting the windmills or we could have a
> >>>>> small, admittedly ugly, workaround for amdgpu.
> >>>>>
> >>>>> If you don't like that workaround in the driver we could move it to core DRM.
> >>>>>
> >>>>> In either case, I would like to focus on a pragramic solution which works with
> >>>>> both old and new userspace.
> >>>> Well, I seriously think the original committed solution could cause a
> >>>> lot of problems and the issue with radv is only the tip of the iceberg.
> >>>>
> >>>> I mean we just have to ask our self why have we created render nodes in
> >>>> the first place? The obvious alternative was to just removed the
> >>>> authentication restrictions on the primary node which would have been
> >>>> way less code and maintenance burden.
> >>> The render nodes exist so you can have different unix permissions for
> >>> rendering as for displaying stuff. E.g. run the compositor as a
> >>> distinct user from all its clients. The other reason was to have a
> >>> clean interface without any of the historical baggage (to make stuff
> >>> like container/vm pass-through easier, since we'd know that
> >>> render-node userspace won't ever touch any of the old crap ioctls). I
> >>> don't think there's ever been a concern about breaking backwards
> >>> compatibility.
> >>>
> >>> The issue here is that radv checks for "can I render", when it wants
> >>> to check for "can I display". So anything that only renders we can
> >>> ignore. That leaves the various ddx around, which historically have
> >>> run as root, or if they run as non-root, logind or similar is giving
> >>> them the master kms node already in master mode (afaiui at least).
> >>> That leaves the non-main compositor userspace, of which there's only
> >>> vkdisplay. I don't see much additional possibilities for regressions
> >>> to creep in.
> >>>
> >>> Emil probably has a better grasp on this, since he's got a bunch of
> >>> patches to roll out drmIsMaster all over.
> >>>
> >>>> I need to dig up the mailing list archive, but I strongly think that one
> >>>> of the main arguments for this approach was to NOT break existing userspace.
> >>>>
> >>>> Even taking into account that we now don't need to deal with UMS and
> >>>> really really old userspace drivers any more it still feels like a to
> >>>> high risk going down that route.
> >>> We change the undefined corners of the uapi all the time to improve
> >>> the overall ecosystem, and occasionally we end up with
> >>> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
> >>> think the choice here is how exactly amdgpu.ko copes with radv:
> >>> - the hack proposed here
> >>> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
> >>> let you folks deal with a (likely, not guaranteed) influx "but this
> >>> works on $any_other_driver"
> >>>
> >>> I guess up to you which patch Emil should include when he resubmits
> >>> the patch for 5.3. We've done both in the past, but generally we're
> >>> trying to limit the impact of these hacks as much as possible. Hence
> >>> why I'm still in favour of the first one.
> >> Well and I will still NAK both approaches.
> >>
> >> You can of course go ahead and remove the DRM_AUTH flags from the i915
> >> IOCTLs to make libva happy, but please not a general approach which
> >> touches all DRM drivers.
> >>
> >> And from Dave's response I think he is following my argumentation here,
> >> so I don't see a general approach to be added for 5.3 either.
> >>
> >>> "keep tilting windmills" is imo not an option, and I'm happy to handle
> >>> any other fallout Emil's patch uncovers, if there is anything else - I
> >>> did sketch this hack here quickly on irc right when we got the report.
> >> Well this is not about tilting windmills, but rather good maintenance of
> >> backward compatibility and rejecting changing with potential unforeseen
> >> consequences.
> > We (well I) have been doing stuff like this since forever. It's just that
> > thus far it hasn't been amd that stuck out with the need for a hack, but
> > other drivers. It's a tradeoff in the end, and a tradeoff we're can do due
> > to our open source userspace requirement - in case something does go
> > sideways, we have the full history of anything affected. So you're
> > proposed we revert all these other changes and cleanups too, because they
> > could be risky (many turned out to actually be risky, like this one here
> > too)?
>
> Well in general yes, a lot of those changes looked unnecessary risky to me.

What else should we do then?

- everyone does their own driver private extensions, and we don't
bother to have standard interfaces/behaviour across drivers. Entirely
defeats the point of having an upstream project imo, might as well
ship vendor drivers only then.

- we freeze and become irrelevant.

- we copypaste drm every time someone comes up with a new way they
want to use our stack. We don't have the people for that, plus Linus
gets pissed when we abandon the old versions (like you can do with
your vendor driver that's not upstream).

I agree it's not a great option, but it's the least shitty one.

And yes from the point of view of a single driver, almost all these
changes look like unecessary risk, since you're not the one who cares
about that new use-case.

> We of course can't revert those changes now because they are now UAPI as
> well and it would be risky to modify them. But I think we should have a
> much stricter approach to modifying exiting UAPI.

We just recently agreed on requiring some standard unit tests. We have
probably the strictest review requirements for new uapi in the entire
kernel, bar none (and see below, lots of people think we're misguided
with some of those). Not sure what else we can do, while still being
able to evolve the uapi. What are you suggesting? From your comments
below it sounds like you're suggesting we drop the userspace
requirement (dropping the review requirement by a lot) and just
copypaste drivers every time we find a mistake in our uapi.
-Daniel

> Additional to that I actually think that the requirement of open source
> userspace stacks is a mistake.
>
> This requirement came just from exactly that approach to try to modify
> exiting UAPI instead of accepting the fact that UAPI backward
> compatibility is something you should not mess with.
>
> Nobody else in the kernel is having that open source user space
> requirement and it actually seems that a lot of people see that as just
> a justification to keep Mesa alive.
>
> What you can do is to require unit tests so that everybody is able to
> test for regressions independent of installed userspace software.
>
> > Ok correction: amd has stuck out in the past too, there was some vblank vs
> > pageflip stuff where we needed to do some pretty clever tricks to both
> > have the new stricter semantics for everyone, while keeping the amd ddx
> > happy still. This aint new at all, we fixed up the regressions and moved
> > on.
>
> I'm wasn't much involved into this, so I can't say anything about it.
>
> > tldr; You're categorical rejection doesn't make sense to me.
> > Slightly over the top, but this feels like gut reaction to radv being a
> > thorn for amd.
>
> Well radv is pointing out exactly what's going wrong with AMDs internal
> development process, so I'm really in favor of that.
>
> And I'm certainly in favor or open source projects in general, but as
> kernel developer I must not favor a closed or open userspace stack
> running on top of my driver.
>
> What I care about is good UAPI design, maintainable code and not
> re-inventing the wheel to often, but what stack runs on top is a
> political decision and not a technical one.
>
> Regards,
> Christian.
>
> > -Daniel
>

Daniel Vetter April 18, 2019, 9:46 a.m. UTC | #23

On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-04-18 11:11 a.m., Daniel Vetter wrote:
> > On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
> >>>
> >>> Ok correction: amd has stuck out in the past too, there was some vblank vs
> >>> pageflip stuff where we needed to do some pretty clever tricks to both
> >>> have the new stricter semantics for everyone, while keeping the amd ddx
> >>> happy still. This aint new at all, we fixed up the regressions and moved
> >>> on.
> >>
> >> That sounds like something I would have been involved in, but I'm not
> >> sure what you mean...
> >>
> >> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
> >> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
> >> userspace to take advantage of our hardware's capabilities to avoid
> >> delaying a page flip by one refresh cycle in some cases. That's all.
> >
> > That's the one I meant. At least I remember quite a pile of
> > discussions around why i915 gets to define how this stuff works, and
> > amdgpu/radeon having to follow.
> >
> > [...]
> >
> > - the vblank vs flip clarifiation that required new uapi and upgraded
> > amd ddx (really not pretty if we have to somewhat break a performance
> > feature for some users)
>
> You're totally mis-representing that IMO.
>
> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI
> rules, in a way which affected our DDX drivers as well. The kernel
> driver fixes didn't require any corresponding userspace changes.

It wasn't ever defined when we merged the page-flip ioctl, and the
vblank ioctl was retrofitted/abused to also work with kms. i915
happened to guarantee that a flip immediately after a vblank wait will
hit the next vblank, radeon (this stuff is really old) had more wiggle
room. And for unthrottled vblank, this mattered since it would allow
you to update frames still, even when the rendering was really late.

The a lot of people started writing kms compositors, and due to random
reasons they mostly developed those on i915. Relying on the
i915-specific guarantee between vblank waits and page-flips.

Then the entire atomic saga happened, and since I spent way too much
time reading other kms drivers already I wanted to standardize a bunch
of the uapi grey areas. This was starting with 2014/2015. Stuff I
still remember:
- dpms behaviour, standardized on what i915 did in it's private
modeset code (the crtc helper behaviour had a pile of duct-tape
patches, but still ill-defined corner cases).
- vblank semantics around crtc on/off (atomic helpers fully relied on
drivers calling drm_vblank_on/off, which thus far only i915 did)
- and later on vblank vs timestamp, when the nonblocking helpers landed

As a result, even more people started relying on this stuff.
Eventually someone noticed that amdgpu works differently. There was a
pretty heated irc debate between you and me about why exactly amdgpu
needs to follow the i915 standard, and why intel gets to set all these
standards and a few others things. This was around 2016 or so, per git
blame of relevant at least.

We ended up adjusting amd drivers to follow i915 semantics, which
fixed scheduled flips on lots of compositors, and broke unthrottled
flipping (very slightly, but it did) on amd. page_flip_target remedied
that.

One of the lessons I took from all this that retrofitting uapi is best
done across the board, and not just for i915. It's occasionally not
possible.

> New UAPI was added to allow taking advantage of hardware capabilities in
> some cases without breaking other cases.
>
> It's a pretty text-book example of how to manage UAPI.

Well I guess we remember different aspects of that story, but to me it
was a textbook case of how retrofitting uapi has challenges. One of
the options you proposed was that we'd have a flag to select the i915
behaviour, and leave the default undefined. I think that would have
handled the vblank vs pageflip issue in a way Christian is advertising
for here. It's actually what we've done, by adding distinct
rendernodes. Unfortunate reality is that adoption of those is very
slow, and bad examples using primary nodes get copypasted all around
still. Iirc I similarly argued in the vblank vs pageflip case that
even with an explicit flag, the implicit uapi as defined by what most
people run out there wouldn't change, and amgpu/radeon alone wouldn't
be enough by far to change that alone.
-Daniel

Michel Dänzer April 18, 2019, 10:04 a.m. UTC | #24

On 2019-04-18 11:46 a.m., Daniel Vetter wrote:
> On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote:
>> On 2019-04-18 11:11 a.m., Daniel Vetter wrote:
>>> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote:
>>>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
>>>>>
>>>>> Ok correction: amd has stuck out in the past too, there was some vblank vs
>>>>> pageflip stuff where we needed to do some pretty clever tricks to both
>>>>> have the new stricter semantics for everyone, while keeping the amd ddx
>>>>> happy still. This aint new at all, we fixed up the regressions and moved
>>>>> on.
>>>>
>>>> That sounds like something I would have been involved in, but I'm not
>>>> sure what you mean...
>>>>
>>>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
>>>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
>>>> userspace to take advantage of our hardware's capabilities to avoid
>>>> delaying a page flip by one refresh cycle in some cases. That's all.
>>>
>>> That's the one I meant. At least I remember quite a pile of
>>> discussions around why i915 gets to define how this stuff works, and
>>> amdgpu/radeon having to follow.
>>>
>>> [...]
>>>
>>> - the vblank vs flip clarifiation that required new uapi and upgraded
>>> amd ddx (really not pretty if we have to somewhat break a performance
>>> feature for some users)
>>
>> You're totally mis-representing that IMO.
>>
>> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI
>> rules, in a way which affected our DDX drivers as well. The kernel
>> driver fixes didn't require any corresponding userspace changes.
> 
> It wasn't ever defined when we merged the page-flip ioctl, and the
> vblank ioctl was retrofitted/abused to also work with kms. i915
> happened to guarantee that a flip immediately after a vblank wait will
> hit the next vblank, radeon (this stuff is really old) had more wiggle
> room. And for unthrottled vblank, this mattered since it would allow
> you to update frames still, even when the rendering was really late.
> 
> The a lot of people started writing kms compositors, and due to random
> reasons they mostly developed those on i915. Relying on the
> i915-specific guarantee between vblank waits and page-flips.
> 
> Then the entire atomic saga happened, and since I spent way too much
> time reading other kms drivers already I wanted to standardize a bunch
> of the uapi grey areas. This was starting with 2014/2015. Stuff I
> still remember:
> [...]
> - and later on vblank vs timestamp, when the nonblocking helpers landed
> 
> As a result, even more people started relying on this stuff.
> Eventually someone noticed that amdgpu works differently. There was a
> pretty heated irc debate between you and me about why exactly amdgpu
> needs to follow the i915 standard, and why intel gets to set all these
> standards and a few others things. This was around 2016 or so, per git
> blame of relevant at least.

Sounds like I did some knee-jerk ranting on IRC at the time, sorry.


> We ended up adjusting amd drivers to follow i915 semantics, which
> fixed scheduled flips on lots of compositors, 

It also fixed them with our DDX drivers, as otherwise there was no way
for userspace to prevent page flips from completing one refresh cycle
too early.

> and broke unthrottled flipping (very slightly, but it did) on amd.
> page_flip_target remedied that.

That's just a minor optimization.


>> New UAPI was added to allow taking advantage of hardware capabilities in
>> some cases without breaking other cases.
>>
>> It's a pretty text-book example of how to manage UAPI.
> 
> Well I guess we remember different aspects of that story, but to me it
> was a textbook case of how retrofitting uapi has challenges.

I don't agree "retrofitting" is accurate in this case. This was always
the intention for the UAPI, the amdgpu/radeon kernel drivers just
accidentally didn't implement it correctly.

> One of the options you proposed was that we'd have a flag to select the i915
> behaviour, and leave the default undefined.

So I made a bad proposal at the time. :) I honestly didn't even remember
that. I really don't see that episode having much to do with the current
discussion.

Daniel Vetter April 18, 2019, 10:35 a.m. UTC | #25

On Thu, Apr 18, 2019 at 12:04 PM Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-04-18 11:46 a.m., Daniel Vetter wrote:
> > On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2019-04-18 11:11 a.m., Daniel Vetter wrote:
> >>> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote:
> >>>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
> >>>>>
> >>>>> Ok correction: amd has stuck out in the past too, there was some vblank vs
> >>>>> pageflip stuff where we needed to do some pretty clever tricks to both
> >>>>> have the new stricter semantics for everyone, while keeping the amd ddx
> >>>>> happy still. This aint new at all, we fixed up the regressions and moved
> >>>>> on.
> >>>>
> >>>> That sounds like something I would have been involved in, but I'm not
> >>>> sure what you mean...
> >>>>
> >>>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
> >>>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
> >>>> userspace to take advantage of our hardware's capabilities to avoid
> >>>> delaying a page flip by one refresh cycle in some cases. That's all.
> >>>
> >>> That's the one I meant. At least I remember quite a pile of
> >>> discussions around why i915 gets to define how this stuff works, and
> >>> amdgpu/radeon having to follow.
> >>>
> >>> [...]
> >>>
> >>> - the vblank vs flip clarifiation that required new uapi and upgraded
> >>> amd ddx (really not pretty if we have to somewhat break a performance
> >>> feature for some users)
> >>
> >> You're totally mis-representing that IMO.
> >>
> >> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI
> >> rules, in a way which affected our DDX drivers as well. The kernel
> >> driver fixes didn't require any corresponding userspace changes.
> >
> > It wasn't ever defined when we merged the page-flip ioctl, and the
> > vblank ioctl was retrofitted/abused to also work with kms. i915
> > happened to guarantee that a flip immediately after a vblank wait will
> > hit the next vblank, radeon (this stuff is really old) had more wiggle
> > room. And for unthrottled vblank, this mattered since it would allow
> > you to update frames still, even when the rendering was really late.
> >
> > The a lot of people started writing kms compositors, and due to random
> > reasons they mostly developed those on i915. Relying on the
> > i915-specific guarantee between vblank waits and page-flips.
> >
> > Then the entire atomic saga happened, and since I spent way too much
> > time reading other kms drivers already I wanted to standardize a bunch
> > of the uapi grey areas. This was starting with 2014/2015. Stuff I
> > still remember:
> > [...]
> > - and later on vblank vs timestamp, when the nonblocking helpers landed
> >
> > As a result, even more people started relying on this stuff.
> > Eventually someone noticed that amdgpu works differently. There was a
> > pretty heated irc debate between you and me about why exactly amdgpu
> > needs to follow the i915 standard, and why intel gets to set all these
> > standards and a few others things. This was around 2016 or so, per git
> > blame of relevant at least.
>
> Sounds like I did some knee-jerk ranting on IRC at the time, sorry.
>
>
> > We ended up adjusting amd drivers to follow i915 semantics, which
> > fixed scheduled flips on lots of compositors,
>
> It also fixed them with our DDX drivers, as otherwise there was no way
> for userspace to prevent page flips from completing one refresh cycle
> too early.
>
> > and broke unthrottled flipping (very slightly, but it did) on amd.
> > page_flip_target remedied that.
>
> That's just a minor optimization.
>
>
> >> New UAPI was added to allow taking advantage of hardware capabilities in
> >> some cases without breaking other cases.
> >>
> >> It's a pretty text-book example of how to manage UAPI.
> >
> > Well I guess we remember different aspects of that story, but to me it
> > was a textbook case of how retrofitting uapi has challenges.
>
> I don't agree "retrofitting" is accurate in this case. This was always
> the intention for the UAPI, the amdgpu/radeon kernel drivers just
> accidentally didn't implement it correctly.
>
> > One of the options you proposed was that we'd have a flag to select the i915
> > behaviour, and leave the default undefined.
>
> So I made a bad proposal at the time. :) I honestly didn't even remember
> that. I really don't see that episode having much to do with the current
> discussion.

Yeah it's definitely a lot less serious uapi retrofitting than what
we're discussing here now. But I do see a lot of parallels:

- We've implemented rendernodes as a clean/new uapi years ago. git
says we've made them the default in 2014.

- Lots of work happened to make render nodes the main thing. Despite
all that effort (well over 5 years, and a few more, initial render
nodes merge was 2013), adoption is fairly lack-luster and defacto uapi
is "use primary nodes, cause it works if you're root".

We can keep telling everyone they're wrong and that they should use
render nodes, or we can give in and make the uapi fit more with what
people out there seem to actually use. What I don't want to do is some
gradual thing where we slowly change things one driver at a time (like
Christian's patch proposes, starting with i915), cause that leads to
lots of confusion and eventually a lone driver standing in the rain
:-/

Now maybe the right approach is to keep telling everyone they should
use render nodes, but I'm not sure another 5 years is really going to
move this needle. But it is a tradeoff between doing the right thing
from how uapi should evolve, and the pragmatic thing of shrugging and
moving on and implementing what seems to be actually in use. And we've
done quite a bit of the pragmatic kind of uapi
clarifictation/adjusting to actual reality and mostly gotten away with
it. This here doesn't seem like a much different case.

Anyway, to unstuck this discussion a bit I chatted with Emil on irc,
and came up with a new idea for an rfc patch series:
- include both options discussed here for handling amdgpu/radv
- include the original patch again
- include patches for all DRIVER_RENDER drivers to drop DRM_AUTH from
them. That should give us a lot more eyes and reviewers to figure out
whether there's other place this might break userspace.

If that results in lots of negative feedback, we can confidently drop
this idea on the floor as a bad one, and will have to keep telling
everyone to use rendernodes more. If not, then we should know a lot
better whether it will blow up somewhere else than for radv only. And
Emil seems happy to type the patches.

Cheers, Daniel

Michel Dänzer April 18, 2019, 10:53 a.m. UTC | #26

On 2019-04-18 12:35 p.m., Daniel Vetter wrote:
> 
> - We've implemented rendernodes as a clean/new uapi years ago. git
> says we've made them the default in 2014.
> 
> - Lots of work happened to make render nodes the main thing. Despite
> all that effort (well over 5 years, and a few more, initial render
> nodes merge was 2013), adoption is fairly lack-luster and defacto uapi
> is "use primary nodes, cause it works if you're root".

FWIW, xf86-video-amdgpu/ati have been using render nodes for the DRM FDs
sent to DRI3 clients since the 1.3/7.9 releases from March 2017.
Xwayland will do the same in the next xserver release. A notable holdout
is the Xorg modesetting driver.

Emil Velikov April 18, 2019, 1:59 p.m. UTC | #27

On Wed, 17 Apr 2019 at 21:49, Dave Airlie <airlied@gmail.com> wrote:
>
> On Thu, 18 Apr 2019 at 03:30, Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> >
> > Am 17.04.19 um 17:51 schrieb Emil Velikov:
> > > Hi guys,
> > >
> > > On 2019/04/17, Daniel Vetter wrote:
> > >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> > >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> > >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> > >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> > >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> > >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> > >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> > >>>>>>>> <Christian.Koenig@amd.com> wrote:
> > >>>>>>>> [SNIP]
> > >>>>>>> Well, what you guys did here is a serious no-go.
> > >>>>>> Not really understanding your reasons for an unconditional nak on all this
> > >>>>>> still.
> > >>>>> Well, let me summarize: You worked around a problem with libva by
> > >>>>> breaking another driver and now proposing a rather ugly and only halve
> > >>>>> backed workaround for that driver.
> > >>>>>
> > >>>>> This is a serious no-go. I've already send out a i915 specific change to
> > >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> > >>>>> the offending patch.
> > >>>> Oh I'm totally fine with the revert if that's what we go with. But that
> > >>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> > >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> > >>>> there's fundamentally 2 options to do that:
> > >>>>
> > >>>> - This one here (or anything implementing the same idea), to keep radv
> > >>>>     working.
> > >>>>
> > >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> > >>>>     exactly that's doen doesn't matter, but it leaves amdgpu out in the
> > >>>>     cold. It also doesn't matter whether we get there with revert + lots of
> > >>>>     patches, or a special hack for amdgpu, in the end amdgpu would be
> > >>>>     different. Also note that your patch isn't enough, since it doesn't fix
> > >>>>     the core ioctls.
> > >>>>
> > >>>> I think from an overall platform pov, the first is the better option.
> > >>>> After all we try really hard to avoid driver special cases for these kinds
> > >>>> of things.
> > >>> Well, I initially thought that radv is doing something unusual or broken,
> > >>> but after looking deeper into this that is actually not the case.
> > >>>
> > >>> What radv does is calling an IOCTL and expecting an error message when it is
> > >>> not authenticated.
> > >>>
> > >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> > >>> figure out if it is authenticated or not, but I clearly remember that we
> > >>> haven't done this from the beginning.
> > >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> > >> First public commit has all the bits: getauth, GetVersion, then the accel
> > >> query.
> > >>
> > >>> Thinking more about this I don't think that this problem is actually amdgpu
> > >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> > >>> but only from those who have the specific issue with libva.
> > >> libva was the initial motivation, the goal of Emil's patch wasn't to just
> > >> duct-tape over libva. That would be easier to fix in libva. The point was
> > >> that we should be able to allow this in general.
> > >>
> > >> And we can, except for the radv issue uncovered here.
> > >>
> > >> So please don't get 100% hung up on the libva thing, that wasn't really
> > >> the goal, just the initial motivation. And I still thinks it makes for all
> > >> drivers. So again we have:
> > >>
> > >> - radv hack
> > >> - make amdgpu behave different from everyone else
> > >> - keep tilting windmills about "pls use rendernodes, thx"
> > >>
> > >> I neither like tilting windmills nor making drivers behave different for
> > >> no reaason at all.
> > > Allow me to jump-in some emails down the line.
> > >
> > > First and foremost, sincere apologies for upsetting you Christian. If
> > > it's of any consolidation - let me assure you the goal is _not_ to break
> > > amdgpu or any other driver.
> > >
> > > Secondly, I would like to ask you for a list of projects so we can look and
> > > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> > > it.
> >
> > That is rather hard to come by, because you would also need to look at
> > all previously existing driver stacks.
> >
> > E.g. with the classic R300 driver for example.
> >
> > > On the topic of "working around libva" - sadly libva is _not_ the only
> > > offender. We also had bugs in mesa and kmscube.
> > >
> > > Considering those are taken as a prime example of "the right way", it's very
> > > likely for the mistakes to be repeated in many other projects.
> > >
> > > Where the common "fix" seems to be "run as root"...
> > >
> > >
> > > As Daniel pointed out, we could be fighting the windmills or we could have a
> > > small, admittedly ugly, workaround for amdgpu.
> > >
> > > If you don't like that workaround in the driver we could move it to core DRM.
> > >
> > > In either case, I would like to focus on a pragramic solution which works with
> > > both old and new userspace.
> >
> > Well, I seriously think the original committed solution could cause a
> > lot of problems and the issue with radv is only the tip of the iceberg.
> >
> > I mean we just have to ask our self why have we created render nodes in
> > the first place? The obvious alternative was to just removed the
> > authentication restrictions on the primary node which would have been
> > way less code and maintenance burden.
> >
> > I need to dig up the mailing list archive, but I strongly think that one
> > of the main arguments for this approach was to NOT break existing userspace.
> >
> > Even taking into account that we now don't need to deal with UMS and
> > really really old userspace drivers any more it still feels like a to
> > high risk going down that route.
>
> I'm going to revert the original patch in drm-next now. We've been
> getting a bit lax lately on the regressions need to get reverted no
> matter what rule, and we are getting close to rc6 and it's Easter, so
> I don't want to have to care about this, forget about it, remember it
> again, end up at 5.2.
>
Makes sense, thanks Dave.

Feel free to add the following to the revert:
Acked-by: Emil Velikov <emil.velikov@collabora.com>

> Personally I'm rather in favour of cleaning up the driver ioctls since
> I don't like older drivers suddenly having a new ABI without
> consideration of side effects.
>
Fwiw I think we're overloading the meaning of ABI - this is a functional change.
Which admittedly I'll have a loser and harder look that it doesn't
break things ;-)

Thanks
Emil

drm/amdgpu: fix drm leases being broken on radv

Commit Message

Comments

Patch