Message ID | 20190416214046.3551-1-andresx7@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/amdgpu: fix drm leases being broken on radv | expand |
Am 16.04.19 um 23:40 schrieb Andres Rodriguez: > After a recent commit, access to the DRM_AUTH ioctls become more > permissive. This resulted in a buggy check for drm_master capabilities > inside radv stop working. > > This commit adds a backwards compatibility workaround so that the radv > drm_master check keeps working as previously expected. > > This fixes SteamVR and other drm lease based apps being broken since > v5.1-rc1 > > From the usermode side, radv will also be fixed to properly test for > master capabilities: > https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 > > Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls") > Signed-off-by: Andres Rodriguez <andresx7@gmail.com> > Cc: Daniel Vetter <daniel@ffwll.ch> > Cc: David Airlie <airlied@linux.ie> > Cc: Emil Velikov <emil.velikov@collabora.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Cc: Keith Packard <keithp@keithp.com> > Reviewed-by: Keith Packard <keithp@keithp.com> > Reviewed-by: Daniel Vetter <daniel@ffwll.ch> Well definitely a NAK. IIRC we have unit tests where the exactly first thing they do is querying AMDGPU_INFO_ACCEL_WORKING. And I definitely not going to risk breaking those just to fix buggy behavior in radv. Christian. > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ > 3 files changed, 27 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 8d0d7f3dd5fb..e67bfee8d411 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -409,6 +409,9 @@ struct amdgpu_fpriv { > struct mutex bo_list_lock; > struct idr bo_list_handles; > struct amdgpu_ctx_mgr ctx_mgr; > + > + /* Part of a backwards compatibility hack */ > + bool is_first_ioctl; > }; > > int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv); > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > index 7419ea8a388b..cd438afa7092 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, > unsigned int cmd, unsigned long arg) > { > struct drm_file *file_priv = filp->private_data; > + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; > struct drm_device *dev; > long ret; > dev = file_priv->minor->dev; > @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, > return ret; > > ret = drm_ioctl(filp, cmd, arg); > + fpriv->is_first_ioctl = false; > > pm_runtime_mark_last_busy(dev->dev); > pm_runtime_put_autosuspend(dev->dev); > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > index e860412043bb..00c0a2fb2a69 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) > { > struct amdgpu_device *adev = dev->dev_private; > + struct amdgpu_fpriv *fpriv = filp->driver_priv; > struct drm_amdgpu_info *info = data; > struct amdgpu_mode_info *minfo = &adev->mode_info; > void __user *out = (void __user *)(uintptr_t)info->return_pointer; > @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file > > switch (info->query) { > case AMDGPU_INFO_ACCEL_WORKING: > + /* HACK: radv has a fragile open-coded check for drm master > + * The pattern for the call is: > + * open(DRM_NODE_PRIMARY) > + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) > + * > + * After commit 8059add04 this check stopped working due to > + * operations on a primary node from non-master clients becoming > + * more permissive. > + * > + * This backwards compatibility hack targets the calling pattern > + * above to fail radv from thinking it has master access to the > + * display system ( without reverting 8059add04 ). > + * > + * Users of libdrm will not be affected as some other ioctls are > + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING > + * query. > + */ > + if (fpriv->is_first_ioctl && !filp->is_master) > + return -EACCES; > + > ui32 = adev->accel_working; > return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0; > case AMDGPU_INFO_CRTC_FROM_ID: > @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) > goto error_vm; > } > > + fpriv->is_first_ioctl = true; > mutex_init(&fpriv->bo_list_lock); > idr_init(&fpriv->bo_list_handles); >
On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote: > Am 16.04.19 um 23:40 schrieb Andres Rodriguez: > > After a recent commit, access to the DRM_AUTH ioctls become more > > permissive. This resulted in a buggy check for drm_master capabilities > > inside radv stop working. > > > > This commit adds a backwards compatibility workaround so that the radv > > drm_master check keeps working as previously expected. > > > > This fixes SteamVR and other drm lease based apps being broken since > > v5.1-rc1 > > > > From the usermode side, radv will also be fixed to properly test for > > master capabilities: > > https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 > > > > Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls") > > Signed-off-by: Andres Rodriguez <andresx7@gmail.com> > > Cc: Daniel Vetter <daniel@ffwll.ch> > > Cc: David Airlie <airlied@linux.ie> > > Cc: Emil Velikov <emil.velikov@collabora.com> > > Cc: Alex Deucher <alexander.deucher@amd.com> > > Cc: Keith Packard <keithp@keithp.com> > > Reviewed-by: Keith Packard <keithp@keithp.com> > > Reviewed-by: Daniel Vetter <daniel@ffwll.ch> > > Well definitely a NAK. IIRC we have unit tests where the exactly first thing > they do is querying AMDGPU_INFO_ACCEL_WORKING. > > And I definitely not going to risk breaking those just to fix buggy behavior > in radv. s/buggy/fragile :-) Option B would be to disable 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more tricky to implement I guess (since it'd leak all over the place). Option C is to revert 8059add04, but that's a bit silly since it seems to work everywhere. Breaking radv isn't an option, because no regression. Aside: No one is stopping amd folks from reviewing radv patches and making sure there's no fragile stuff in there. We discussed this quite a bit on irc with Ben and Keith and others, and figured option A is the most promising to go forward with. Anything using amdgpu_device_init (which I think are all the umds, but I didn't check) will keep working, as will radv leases/vkdisplay, plus we can keep 8059add04 for everyone (not just everony except amdgpu). If that means breaking a few unit tests ... *shrugh*. Needs patches to fix them ofc, but shouldn't be that much work really. -Daniel > > Christian. > > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ > > 3 files changed, 27 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > index 8d0d7f3dd5fb..e67bfee8d411 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > @@ -409,6 +409,9 @@ struct amdgpu_fpriv { > > struct mutex bo_list_lock; > > struct idr bo_list_handles; > > struct amdgpu_ctx_mgr ctx_mgr; > > + > > + /* Part of a backwards compatibility hack */ > > + bool is_first_ioctl; > > }; > > int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > index 7419ea8a388b..cd438afa7092 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, > > unsigned int cmd, unsigned long arg) > > { > > struct drm_file *file_priv = filp->private_data; > > + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; > > struct drm_device *dev; > > long ret; > > dev = file_priv->minor->dev; > > @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, > > return ret; > > ret = drm_ioctl(filp, cmd, arg); > > + fpriv->is_first_ioctl = false; > > pm_runtime_mark_last_busy(dev->dev); > > pm_runtime_put_autosuspend(dev->dev); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > index e860412043bb..00c0a2fb2a69 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) > > { > > struct amdgpu_device *adev = dev->dev_private; > > + struct amdgpu_fpriv *fpriv = filp->driver_priv; > > struct drm_amdgpu_info *info = data; > > struct amdgpu_mode_info *minfo = &adev->mode_info; > > void __user *out = (void __user *)(uintptr_t)info->return_pointer; > > @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file > > switch (info->query) { > > case AMDGPU_INFO_ACCEL_WORKING: > > + /* HACK: radv has a fragile open-coded check for drm master > > + * The pattern for the call is: > > + * open(DRM_NODE_PRIMARY) > > + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) > > + * > > + * After commit 8059add04 this check stopped working due to > > + * operations on a primary node from non-master clients becoming > > + * more permissive. > > + * > > + * This backwards compatibility hack targets the calling pattern > > + * above to fail radv from thinking it has master access to the > > + * display system ( without reverting 8059add04 ). > > + * > > + * Users of libdrm will not be affected as some other ioctls are > > + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING > > + * query. > > + */ > > + if (fpriv->is_first_ioctl && !filp->is_master) > > + return -EACCES; > > + > > ui32 = adev->accel_working; > > return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0; > > case AMDGPU_INFO_CRTC_FROM_ID: > > @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) > > goto error_vm; > > } > > + fpriv->is_first_ioctl = true; > > mutex_init(&fpriv->bo_list_lock); > > idr_init(&fpriv->bo_list_handles); > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Am 17.04.19 um 10:10 schrieb Daniel Vetter: > On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote: >> Am 16.04.19 um 23:40 schrieb Andres Rodriguez: >>> After a recent commit, access to the DRM_AUTH ioctls become more >>> permissive. This resulted in a buggy check for drm_master capabilities >>> inside radv stop working. >>> >>> This commit adds a backwards compatibility workaround so that the radv >>> drm_master check keeps working as previously expected. >>> >>> This fixes SteamVR and other drm lease based apps being broken since >>> v5.1-rc1 >>> >>> From the usermode side, radv will also be fixed to properly test for >>> master capabilities: >>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 >>> >>> Fixes: 8059add0478e ("drm: allow render capable master with DRM_AUTH ioctls") >>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com> >>> Cc: Daniel Vetter <daniel@ffwll.ch> >>> Cc: David Airlie <airlied@linux.ie> >>> Cc: Emil Velikov <emil.velikov@collabora.com> >>> Cc: Alex Deucher <alexander.deucher@amd.com> >>> Cc: Keith Packard <keithp@keithp.com> >>> Reviewed-by: Keith Packard <keithp@keithp.com> >>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> >> Well definitely a NAK. IIRC we have unit tests where the exactly first thing >> they do is querying AMDGPU_INFO_ACCEL_WORKING. >> >> And I definitely not going to risk breaking those just to fix buggy behavior >> in radv. > s/buggy/fragile :-) > > Option B would be to disable 8059add0478e ("drm: allow render capable > master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more > tricky to implement I guess (since it'd leak all over the place). > > Option C is to revert 8059add04, but that's a bit silly since it seems to > work everywhere. > > Breaking radv isn't an option, because no regression. Aside: No one is > stopping amd folks from reviewing radv patches and making sure there's no > fragile stuff in there. > > We discussed this quite a bit on irc with Ben and Keith and others, and > figured option A is the most promising to go forward with. Anything using > amdgpu_device_init (which I think are all the umds, but I didn't check) > will keep working, as will radv leases/vkdisplay, plus we can keep > 8059add04 for everyone (not just everony except amdgpu). If that means > breaking a few unit tests ... *shrugh*. Needs patches to fix them ofc, but > shouldn't be that much work really. I think you miss the point here: This patch will break the render node interface! E.g. take another look at those lines of code: > + if (fpriv->is_first_ioctl && !filp->is_master) > + return -EACCES; This will return -EACCES on the first accel working query which is doesn't comes from the DRM master. And it is irrelevant if its the primary or the render node. So this patch most likely breaks tons of things and is *definitely* a complete NAK. Additional to that IIRC we have (rather old) code which behaves differently depending if it is called by the render node or the primary node. In other words it assumes that the primary node is authenticated. If I understood correctly what 8059add04 does than this is NOT a good to do in general. Regards, Christian. > -Daniel > >> Christian. >> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ >>> 3 files changed, 27 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> index 8d0d7f3dd5fb..e67bfee8d411 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv { >>> struct mutex bo_list_lock; >>> struct idr bo_list_handles; >>> struct amdgpu_ctx_mgr ctx_mgr; >>> + >>> + /* Part of a backwards compatibility hack */ >>> + bool is_first_ioctl; >>> }; >>> int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> index 7419ea8a388b..cd438afa7092 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>> unsigned int cmd, unsigned long arg) >>> { >>> struct drm_file *file_priv = filp->private_data; >>> + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; >>> struct drm_device *dev; >>> long ret; >>> dev = file_priv->minor->dev; >>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>> return ret; >>> ret = drm_ioctl(filp, cmd, arg); >>> + fpriv->is_first_ioctl = false; >>> pm_runtime_mark_last_busy(dev->dev); >>> pm_runtime_put_autosuspend(dev->dev); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> index e860412043bb..00c0a2fb2a69 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, >>> static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) >>> { >>> struct amdgpu_device *adev = dev->dev_private; >>> + struct amdgpu_fpriv *fpriv = filp->driver_priv; >>> struct drm_amdgpu_info *info = data; >>> struct amdgpu_mode_info *minfo = &adev->mode_info; >>> void __user *out = (void __user *)(uintptr_t)info->return_pointer; >>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file >>> switch (info->query) { >>> case AMDGPU_INFO_ACCEL_WORKING: >>> + /* HACK: radv has a fragile open-coded check for drm master >>> + * The pattern for the call is: >>> + * open(DRM_NODE_PRIMARY) >>> + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) >>> + * >>> + * After commit 8059add04 this check stopped working due to >>> + * operations on a primary node from non-master clients becoming >>> + * more permissive. >>> + * >>> + * This backwards compatibility hack targets the calling pattern >>> + * above to fail radv from thinking it has master access to the >>> + * display system ( without reverting 8059add04 ). >>> + * >>> + * Users of libdrm will not be affected as some other ioctls are >>> + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING >>> + * query. >>> + */ >>> + if (fpriv->is_first_ioctl && !filp->is_master) >>> + return -EACCES; >>> + >>> ui32 = adev->accel_working; >>> return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0; >>> case AMDGPU_INFO_CRTC_FROM_ID: >>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) >>> goto error_vm; >>> } >>> + fpriv->is_first_ioctl = true; >>> mutex_init(&fpriv->bo_list_lock); >>> idr_init(&fpriv->bo_list_handles); >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Hi guys, well going back to the beginning once more because something doesn't fit together here. I mean what exactly is the purpose of 8059add0478e "drm: allow render capable master with DRM_AUTH ioctls"? When I read the code correctly the effect is that we ignore the DRM_AUTH flag when also the DRM_RENDER_ALLOW flag is set and the driver is a render node driver. Well when this is correct then why the heck aren't we removing the DRM_AUTH flag from the affected IOCTLs instead? The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability flags separately. So the patch 8059add0478e doesn't make any sense as far as I can tell. What I'm missing? Regards, Christian. Am 17.04.19 um 11:18 schrieb Christian König: > Am 17.04.19 um 10:10 schrieb Daniel Vetter: >> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote: >>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez: >>>> After a recent commit, access to the DRM_AUTH ioctls become more >>>> permissive. This resulted in a buggy check for drm_master capabilities >>>> inside radv stop working. >>>> >>>> This commit adds a backwards compatibility workaround so that the radv >>>> drm_master check keeps working as previously expected. >>>> >>>> This fixes SteamVR and other drm lease based apps being broken since >>>> v5.1-rc1 >>>> >>>> From the usermode side, radv will also be fixed to properly test for >>>> master capabilities: >>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 >>>> >>>> Fixes: 8059add0478e ("drm: allow render capable master with >>>> DRM_AUTH ioctls") >>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com> >>>> Cc: Daniel Vetter <daniel@ffwll.ch> >>>> Cc: David Airlie <airlied@linux.ie> >>>> Cc: Emil Velikov <emil.velikov@collabora.com> >>>> Cc: Alex Deucher <alexander.deucher@amd.com> >>>> Cc: Keith Packard <keithp@keithp.com> >>>> Reviewed-by: Keith Packard <keithp@keithp.com> >>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> >>> Well definitely a NAK. IIRC we have unit tests where the exactly >>> first thing >>> they do is querying AMDGPU_INFO_ACCEL_WORKING. >>> >>> And I definitely not going to risk breaking those just to fix buggy >>> behavior >>> in radv. >> s/buggy/fragile :-) >> >> Option B would be to disable 8059add0478e ("drm: allow render capable >> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more >> tricky to implement I guess (since it'd leak all over the place). >> >> Option C is to revert 8059add04, but that's a bit silly since it >> seems to >> work everywhere. >> >> Breaking radv isn't an option, because no regression. Aside: No one is >> stopping amd folks from reviewing radv patches and making sure >> there's no >> fragile stuff in there. >> >> We discussed this quite a bit on irc with Ben and Keith and others, and >> figured option A is the most promising to go forward with. Anything >> using >> amdgpu_device_init (which I think are all the umds, but I didn't check) >> will keep working, as will radv leases/vkdisplay, plus we can keep >> 8059add04 for everyone (not just everony except amdgpu). If that means >> breaking a few unit tests ... *shrugh*. Needs patches to fix them >> ofc, but >> shouldn't be that much work really. > > I think you miss the point here: This patch will break the render node > interface! > > E.g. take another look at those lines of code: >> + if (fpriv->is_first_ioctl && !filp->is_master) >> + return -EACCES; > This will return -EACCES on the first accel working query which is > doesn't comes from the DRM master. And it is irrelevant if its the > primary or the render node. > > So this patch most likely breaks tons of things and is *definitely* a > complete NAK. > > Additional to that IIRC we have (rather old) code which behaves > differently depending if it is called by the render node or the > primary node. In other words it assumes that the primary node is > authenticated. > > If I understood correctly what 8059add04 does than this is NOT a good > to do in general. > > Regards, > Christian. > >> -Daniel >> >>> Christian. >>> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ >>>> 3 files changed, 27 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> index 8d0d7f3dd5fb..e67bfee8d411 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv { >>>> struct mutex bo_list_lock; >>>> struct idr bo_list_handles; >>>> struct amdgpu_ctx_mgr ctx_mgr; >>>> + >>>> + /* Part of a backwards compatibility hack */ >>>> + bool is_first_ioctl; >>>> }; >>>> int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv >>>> **fpriv); >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> index 7419ea8a388b..cd438afa7092 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>>> unsigned int cmd, unsigned long arg) >>>> { >>>> struct drm_file *file_priv = filp->private_data; >>>> + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; >>>> struct drm_device *dev; >>>> long ret; >>>> dev = file_priv->minor->dev; >>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>>> return ret; >>>> ret = drm_ioctl(filp, cmd, arg); >>>> + fpriv->is_first_ioctl = false; >>>> pm_runtime_mark_last_busy(dev->dev); >>>> pm_runtime_put_autosuspend(dev->dev); >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> index e860412043bb..00c0a2fb2a69 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct >>>> amdgpu_device *adev, >>>> static int amdgpu_info_ioctl(struct drm_device *dev, void *data, >>>> struct drm_file *filp) >>>> { >>>> struct amdgpu_device *adev = dev->dev_private; >>>> + struct amdgpu_fpriv *fpriv = filp->driver_priv; >>>> struct drm_amdgpu_info *info = data; >>>> struct amdgpu_mode_info *minfo = &adev->mode_info; >>>> void __user *out = (void __user >>>> *)(uintptr_t)info->return_pointer; >>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device >>>> *dev, void *data, struct drm_file >>>> switch (info->query) { >>>> case AMDGPU_INFO_ACCEL_WORKING: >>>> + /* HACK: radv has a fragile open-coded check for drm master >>>> + * The pattern for the call is: >>>> + * open(DRM_NODE_PRIMARY) >>>> + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) >>>> + * >>>> + * After commit 8059add04 this check stopped working due to >>>> + * operations on a primary node from non-master clients >>>> becoming >>>> + * more permissive. >>>> + * >>>> + * This backwards compatibility hack targets the calling >>>> pattern >>>> + * above to fail radv from thinking it has master access >>>> to the >>>> + * display system ( without reverting 8059add04 ). >>>> + * >>>> + * Users of libdrm will not be affected as some other >>>> ioctls are >>>> + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING >>>> + * query. >>>> + */ >>>> + if (fpriv->is_first_ioctl && !filp->is_master) >>>> + return -EACCES; >>>> + >>>> ui32 = adev->accel_working; >>>> return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT >>>> : 0; >>>> case AMDGPU_INFO_CRTC_FROM_ID: >>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device >>>> *dev, struct drm_file *file_priv) >>>> goto error_vm; >>>> } >>>> + fpriv->is_first_ioctl = true; >>>> mutex_init(&fpriv->bo_list_lock); >>>> idr_init(&fpriv->bo_list_handles); >>> _______________________________________________ >>> dri-devel mailing list >>> dri-devel@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel >
On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Hi guys, > > well going back to the beginning once more because something doesn't fit > together here. > > I mean what exactly is the purpose of 8059add0478e "drm: allow render > capable master with DRM_AUTH ioctls"? > > When I read the code correctly the effect is that we ignore the DRM_AUTH > flag when also the DRM_RENDER_ALLOW flag is set and the driver is a > render node driver. > > Well when this is correct then why the heck aren't we removing the > DRM_AUTH flag from the affected IOCTLs instead? > > The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD > and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability > flags separately. > > So the patch 8059add0478e doesn't make any sense as far as I can tell. > What I'm missing? The idea is that if your driver supports render node (i.e. isolates unpriviledged clients), then we could allow the same on primary nodes. There's a bunch of userspace which blindly only opens primary nodes (instead of also opening render nodes), and this makes them work without requiring root or similar. Afaiui the main offender is libva being really dense (and semi-abandonded, partially also because intel has changed for the worse in that area). Note that the capability checks are for all ioctl, including driver private ones, i.e. rendering will Just Work. I think it makes sense to have that. I guess there might be 100% overlap between DRM_AUTH and DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean auditing all the driver ioctl tables. I guess we could add that as a todo.rst item. It would make sense that either the driver is isolating clients sufficiently to not need authenticated clients only, or allow them all. I guess we could lament why rendernodes are still not yet fully rolled out everywhere, despite all these years. But then I stopped being surprised about ecosystem inertia, and in the end a strict split between display (done on the primary) and rendering (preferrably done on render nodes) is not required on intel/amd/nv, so not many care to make this work better. The patch from Emil seemed like a neat little trick to get there faster, until the radv hiccup. -Daniel > Regards, > Christian. > > Am 17.04.19 um 11:18 schrieb Christian König: > > Am 17.04.19 um 10:10 schrieb Daniel Vetter: > >> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote: > >>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez: > >>>> After a recent commit, access to the DRM_AUTH ioctls become more > >>>> permissive. This resulted in a buggy check for drm_master capabilities > >>>> inside radv stop working. > >>>> > >>>> This commit adds a backwards compatibility workaround so that the radv > >>>> drm_master check keeps working as previously expected. > >>>> > >>>> This fixes SteamVR and other drm lease based apps being broken since > >>>> v5.1-rc1 > >>>> > >>>> From the usermode side, radv will also be fixed to properly test for > >>>> master capabilities: > >>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 > >>>> > >>>> Fixes: 8059add0478e ("drm: allow render capable master with > >>>> DRM_AUTH ioctls") > >>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com> > >>>> Cc: Daniel Vetter <daniel@ffwll.ch> > >>>> Cc: David Airlie <airlied@linux.ie> > >>>> Cc: Emil Velikov <emil.velikov@collabora.com> > >>>> Cc: Alex Deucher <alexander.deucher@amd.com> > >>>> Cc: Keith Packard <keithp@keithp.com> > >>>> Reviewed-by: Keith Packard <keithp@keithp.com> > >>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> > >>> Well definitely a NAK. IIRC we have unit tests where the exactly > >>> first thing > >>> they do is querying AMDGPU_INFO_ACCEL_WORKING. > >>> > >>> And I definitely not going to risk breaking those just to fix buggy > >>> behavior > >>> in radv. > >> s/buggy/fragile :-) > >> > >> Option B would be to disable 8059add0478e ("drm: allow render capable > >> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more > >> tricky to implement I guess (since it'd leak all over the place). > >> > >> Option C is to revert 8059add04, but that's a bit silly since it > >> seems to > >> work everywhere. > >> > >> Breaking radv isn't an option, because no regression. Aside: No one is > >> stopping amd folks from reviewing radv patches and making sure > >> there's no > >> fragile stuff in there. > >> > >> We discussed this quite a bit on irc with Ben and Keith and others, and > >> figured option A is the most promising to go forward with. Anything > >> using > >> amdgpu_device_init (which I think are all the umds, but I didn't check) > >> will keep working, as will radv leases/vkdisplay, plus we can keep > >> 8059add04 for everyone (not just everony except amdgpu). If that means > >> breaking a few unit tests ... *shrugh*. Needs patches to fix them > >> ofc, but > >> shouldn't be that much work really. > > > > I think you miss the point here: This patch will break the render node > > interface! > > > > E.g. take another look at those lines of code: > >> + if (fpriv->is_first_ioctl && !filp->is_master) > >> + return -EACCES; > > This will return -EACCES on the first accel working query which is > > doesn't comes from the DRM master. And it is irrelevant if its the > > primary or the render node. > > > > So this patch most likely breaks tons of things and is *definitely* a > > complete NAK. > > > > Additional to that IIRC we have (rather old) code which behaves > > differently depending if it is called by the render node or the > > primary node. In other words it assumes that the primary node is > > authenticated. > > > > If I understood correctly what 8059add04 does than this is NOT a good > > to do in general. > > > > Regards, > > Christian. > > > >> -Daniel > >> > >>> Christian. > >>> > >>>> --- > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ > >>>> 3 files changed, 27 insertions(+) > >>>> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>> index 8d0d7f3dd5fb..e67bfee8d411 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv { > >>>> struct mutex bo_list_lock; > >>>> struct idr bo_list_handles; > >>>> struct amdgpu_ctx_mgr ctx_mgr; > >>>> + > >>>> + /* Part of a backwards compatibility hack */ > >>>> + bool is_first_ioctl; > >>>> }; > >>>> int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv > >>>> **fpriv); > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>> index 7419ea8a388b..cd438afa7092 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, > >>>> unsigned int cmd, unsigned long arg) > >>>> { > >>>> struct drm_file *file_priv = filp->private_data; > >>>> + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; > >>>> struct drm_device *dev; > >>>> long ret; > >>>> dev = file_priv->minor->dev; > >>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, > >>>> return ret; > >>>> ret = drm_ioctl(filp, cmd, arg); > >>>> + fpriv->is_first_ioctl = false; > >>>> pm_runtime_mark_last_busy(dev->dev); > >>>> pm_runtime_put_autosuspend(dev->dev); > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > >>>> index e860412043bb..00c0a2fb2a69 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > >>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct > >>>> amdgpu_device *adev, > >>>> static int amdgpu_info_ioctl(struct drm_device *dev, void *data, > >>>> struct drm_file *filp) > >>>> { > >>>> struct amdgpu_device *adev = dev->dev_private; > >>>> + struct amdgpu_fpriv *fpriv = filp->driver_priv; > >>>> struct drm_amdgpu_info *info = data; > >>>> struct amdgpu_mode_info *minfo = &adev->mode_info; > >>>> void __user *out = (void __user > >>>> *)(uintptr_t)info->return_pointer; > >>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device > >>>> *dev, void *data, struct drm_file > >>>> switch (info->query) { > >>>> case AMDGPU_INFO_ACCEL_WORKING: > >>>> + /* HACK: radv has a fragile open-coded check for drm master > >>>> + * The pattern for the call is: > >>>> + * open(DRM_NODE_PRIMARY) > >>>> + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) > >>>> + * > >>>> + * After commit 8059add04 this check stopped working due to > >>>> + * operations on a primary node from non-master clients > >>>> becoming > >>>> + * more permissive. > >>>> + * > >>>> + * This backwards compatibility hack targets the calling > >>>> pattern > >>>> + * above to fail radv from thinking it has master access > >>>> to the > >>>> + * display system ( without reverting 8059add04 ). > >>>> + * > >>>> + * Users of libdrm will not be affected as some other > >>>> ioctls are > >>>> + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING > >>>> + * query. > >>>> + */ > >>>> + if (fpriv->is_first_ioctl && !filp->is_master) > >>>> + return -EACCES; > >>>> + > >>>> ui32 = adev->accel_working; > >>>> return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT > >>>> : 0; > >>>> case AMDGPU_INFO_CRTC_FROM_ID: > >>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device > >>>> *dev, struct drm_file *file_priv) > >>>> goto error_vm; > >>>> } > >>>> + fpriv->is_first_ioctl = true; > >>>> mutex_init(&fpriv->bo_list_lock); > >>>> idr_init(&fpriv->bo_list_handles); > >>> _______________________________________________ > >>> dri-devel mailing list > >>> dri-devel@lists.freedesktop.org > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > >
Am 17.04.19 um 13:06 schrieb Daniel Vetter: > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > <Christian.Koenig@amd.com> wrote: >> Hi guys, >> >> well going back to the beginning once more because something doesn't fit >> together here. >> >> I mean what exactly is the purpose of 8059add0478e "drm: allow render >> capable master with DRM_AUTH ioctls"? >> >> When I read the code correctly the effect is that we ignore the DRM_AUTH >> flag when also the DRM_RENDER_ALLOW flag is set and the driver is a >> render node driver. >> >> Well when this is correct then why the heck aren't we removing the >> DRM_AUTH flag from the affected IOCTLs instead? >> >> The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD >> and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability >> flags separately. >> >> So the patch 8059add0478e doesn't make any sense as far as I can tell. >> What I'm missing? > The idea is that if your driver supports render node (i.e. isolates > unpriviledged clients), then we could allow the same on primary nodes. > There's a bunch of userspace which blindly only opens primary nodes > (instead of also opening render nodes), and this makes them work > without requiring root or similar. Afaiui the main offender is libva > being really dense (and semi-abandonded, partially also because intel > has changed for the worse in that area). And exactly that's the problem, we can't do this. It's not only radv which breaks, but also older libdrm_amdgpu versions as well as a bunch of other stuff. Key problem is that older libdrm_amdgpu code did exactly the same thing and called an IOCTL to check if a handle is authenticated or not. If I'm not completely mistaken the radv code could actually be copy&pasted from there. Additional to that we might run into problems with the libdrm unit tests, but those probably didn't break obviously because they are almost all the time run as root. > Note that the capability checks are for all ioctl, including driver > private ones, i.e. rendering will Just Work. I think it makes sense to > have that. > > I guess there might be 100% overlap between DRM_AUTH and > DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean > auditing all the driver ioctl tables. I guess we could add that as a > todo.rst item. It would make sense that either the driver is isolating > clients sufficiently to not need authenticated clients only, or allow > them all. > > I guess we could lament why rendernodes are still not yet fully rolled > out everywhere, despite all these years. But then I stopped being > surprised about ecosystem inertia, and in the end a strict split > between display (done on the primary) and rendering (preferrably done > on render nodes) is not required on intel/amd/nv, so not many care to > make this work better. The patch from Emil seemed like a neat little > trick to get there faster, until the radv hiccup. Well that is a clear NAK for that approach. If you have problems with libva (which are design problems in libva and not kernel related at all) then please work around that by removing DRM_AUTH from the affected IOCTLs and not by messing up general authentication code. Well, what you guys did here is a serious no-go. Christian. > -Daniel > >> Regards, >> Christian. >> >> Am 17.04.19 um 11:18 schrieb Christian König: >>> Am 17.04.19 um 10:10 schrieb Daniel Vetter: >>>> On Wed, Apr 17, 2019 at 09:09:27AM +0200, Christian König wrote: >>>>> Am 16.04.19 um 23:40 schrieb Andres Rodriguez: >>>>>> After a recent commit, access to the DRM_AUTH ioctls become more >>>>>> permissive. This resulted in a buggy check for drm_master capabilities >>>>>> inside radv stop working. >>>>>> >>>>>> This commit adds a backwards compatibility workaround so that the radv >>>>>> drm_master check keeps working as previously expected. >>>>>> >>>>>> This fixes SteamVR and other drm lease based apps being broken since >>>>>> v5.1-rc1 >>>>>> >>>>>> From the usermode side, radv will also be fixed to properly test for >>>>>> master capabilities: >>>>>> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/669 >>>>>> >>>>>> Fixes: 8059add0478e ("drm: allow render capable master with >>>>>> DRM_AUTH ioctls") >>>>>> Signed-off-by: Andres Rodriguez <andresx7@gmail.com> >>>>>> Cc: Daniel Vetter <daniel@ffwll.ch> >>>>>> Cc: David Airlie <airlied@linux.ie> >>>>>> Cc: Emil Velikov <emil.velikov@collabora.com> >>>>>> Cc: Alex Deucher <alexander.deucher@amd.com> >>>>>> Cc: Keith Packard <keithp@keithp.com> >>>>>> Reviewed-by: Keith Packard <keithp@keithp.com> >>>>>> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> >>>>> Well definitely a NAK. IIRC we have unit tests where the exactly >>>>> first thing >>>>> they do is querying AMDGPU_INFO_ACCEL_WORKING. >>>>> >>>>> And I definitely not going to risk breaking those just to fix buggy >>>>> behavior >>>>> in radv. >>>> s/buggy/fragile :-) >>>> >>>> Option B would be to disable 8059add0478e ("drm: allow render capable >>>> master with DRM_AUTH ioctls") just for amdgpu.ko, but that's a bit more >>>> tricky to implement I guess (since it'd leak all over the place). >>>> >>>> Option C is to revert 8059add04, but that's a bit silly since it >>>> seems to >>>> work everywhere. >>>> >>>> Breaking radv isn't an option, because no regression. Aside: No one is >>>> stopping amd folks from reviewing radv patches and making sure >>>> there's no >>>> fragile stuff in there. >>>> >>>> We discussed this quite a bit on irc with Ben and Keith and others, and >>>> figured option A is the most promising to go forward with. Anything >>>> using >>>> amdgpu_device_init (which I think are all the umds, but I didn't check) >>>> will keep working, as will radv leases/vkdisplay, plus we can keep >>>> 8059add04 for everyone (not just everony except amdgpu). If that means >>>> breaking a few unit tests ... *shrugh*. Needs patches to fix them >>>> ofc, but >>>> shouldn't be that much work really. >>> I think you miss the point here: This patch will break the render node >>> interface! >>> >>> E.g. take another look at those lines of code: >>>> + if (fpriv->is_first_ioctl && !filp->is_master) >>>> + return -EACCES; >>> This will return -EACCES on the first accel working query which is >>> doesn't comes from the DRM master. And it is irrelevant if its the >>> primary or the render node. >>> >>> So this patch most likely breaks tons of things and is *definitely* a >>> complete NAK. >>> >>> Additional to that IIRC we have (rather old) code which behaves >>> differently depending if it is called by the render node or the >>> primary node. In other words it assumes that the primary node is >>> authenticated. >>> >>> If I understood correctly what 8059add04 does than this is NOT a good >>> to do in general. >>> >>> Regards, >>> Christian. >>> >>>> -Daniel >>>> >>>>> Christian. >>>>> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 22 ++++++++++++++++++++++ >>>>>> 3 files changed, 27 insertions(+) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>> index 8d0d7f3dd5fb..e67bfee8d411 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>> @@ -409,6 +409,9 @@ struct amdgpu_fpriv { >>>>>> struct mutex bo_list_lock; >>>>>> struct idr bo_list_handles; >>>>>> struct amdgpu_ctx_mgr ctx_mgr; >>>>>> + >>>>>> + /* Part of a backwards compatibility hack */ >>>>>> + bool is_first_ioctl; >>>>>> }; >>>>>> int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv >>>>>> **fpriv); >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>>> index 7419ea8a388b..cd438afa7092 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>>> @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>>>>> unsigned int cmd, unsigned long arg) >>>>>> { >>>>>> struct drm_file *file_priv = filp->private_data; >>>>>> + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; >>>>>> struct drm_device *dev; >>>>>> long ret; >>>>>> dev = file_priv->minor->dev; >>>>>> @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, >>>>>> return ret; >>>>>> ret = drm_ioctl(filp, cmd, arg); >>>>>> + fpriv->is_first_ioctl = false; >>>>>> pm_runtime_mark_last_busy(dev->dev); >>>>>> pm_runtime_put_autosuspend(dev->dev); >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>>>> index e860412043bb..00c0a2fb2a69 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>>>> @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct >>>>>> amdgpu_device *adev, >>>>>> static int amdgpu_info_ioctl(struct drm_device *dev, void *data, >>>>>> struct drm_file *filp) >>>>>> { >>>>>> struct amdgpu_device *adev = dev->dev_private; >>>>>> + struct amdgpu_fpriv *fpriv = filp->driver_priv; >>>>>> struct drm_amdgpu_info *info = data; >>>>>> struct amdgpu_mode_info *minfo = &adev->mode_info; >>>>>> void __user *out = (void __user >>>>>> *)(uintptr_t)info->return_pointer; >>>>>> @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device >>>>>> *dev, void *data, struct drm_file >>>>>> switch (info->query) { >>>>>> case AMDGPU_INFO_ACCEL_WORKING: >>>>>> + /* HACK: radv has a fragile open-coded check for drm master >>>>>> + * The pattern for the call is: >>>>>> + * open(DRM_NODE_PRIMARY) >>>>>> + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) >>>>>> + * >>>>>> + * After commit 8059add04 this check stopped working due to >>>>>> + * operations on a primary node from non-master clients >>>>>> becoming >>>>>> + * more permissive. >>>>>> + * >>>>>> + * This backwards compatibility hack targets the calling >>>>>> pattern >>>>>> + * above to fail radv from thinking it has master access >>>>>> to the >>>>>> + * display system ( without reverting 8059add04 ). >>>>>> + * >>>>>> + * Users of libdrm will not be affected as some other >>>>>> ioctls are >>>>>> + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING >>>>>> + * query. >>>>>> + */ >>>>>> + if (fpriv->is_first_ioctl && !filp->is_master) >>>>>> + return -EACCES; >>>>>> + >>>>>> ui32 = adev->accel_working; >>>>>> return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT >>>>>> : 0; >>>>>> case AMDGPU_INFO_CRTC_FROM_ID: >>>>>> @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device >>>>>> *dev, struct drm_file *file_priv) >>>>>> goto error_vm; >>>>>> } >>>>>> + fpriv->is_first_ioctl = true; >>>>>> mutex_init(&fpriv->bo_list_lock); >>>>>> idr_init(&fpriv->bo_list_handles); >>>>> _______________________________________________ >>>>> dri-devel mailing list >>>>> dri-devel@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel >
On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > Am 17.04.19 um 13:06 schrieb Daniel Vetter: > > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > > <Christian.Koenig@amd.com> wrote: > >> Hi guys, > >> > >> well going back to the beginning once more because something doesn't fit > >> together here. > >> > >> I mean what exactly is the purpose of 8059add0478e "drm: allow render > >> capable master with DRM_AUTH ioctls"? > >> > >> When I read the code correctly the effect is that we ignore the DRM_AUTH > >> flag when also the DRM_RENDER_ALLOW flag is set and the driver is a > >> render node driver. > >> > >> Well when this is correct then why the heck aren't we removing the > >> DRM_AUTH flag from the affected IOCTLs instead? > >> > >> The only common IOCTLs affecting this are DRM_IOCTL_PRIME_HANDLE_TO_FD > >> and DRM_IOCTL_PRIME_FD_TO_HANDLE and those are protected by capability > >> flags separately. > >> > >> So the patch 8059add0478e doesn't make any sense as far as I can tell. > >> What I'm missing? > > The idea is that if your driver supports render node (i.e. isolates > > unpriviledged clients), then we could allow the same on primary nodes. > > There's a bunch of userspace which blindly only opens primary nodes > > (instead of also opening render nodes), and this makes them work > > without requiring root or similar. Afaiui the main offender is libva > > being really dense (and semi-abandonded, partially also because intel > > has changed for the worse in that area). > > And exactly that's the problem, we can't do this. It's not only radv > which breaks, but also older libdrm_amdgpu versions as well as a bunch > of other stuff. > > Key problem is that older libdrm_amdgpu code did exactly the same thing > and called an IOCTL to check if a handle is authenticated or not. If I'm > not completely mistaken the radv code could actually be copy&pasted from > there. > > Additional to that we might run into problems with the libdrm unit > tests, but those probably didn't break obviously because they are almost > all the time run as root. I looked at the very first version of amdgpu_device_initialize. That still does some other ioctl calls before querying AMDGPU_INFO_ACCEL_WORKING. So should work even on the very first amdgpu based userspace release. I also quickly checked for these unit tests, and at least the ones in libdrm don't have a call to AMDGPU_INFO_ACCEL_WORKING directly after opening the file. If you want I can also crawl through rocm, amdvlk and ddx sources. > > Note that the capability checks are for all ioctl, including driver > > private ones, i.e. rendering will Just Work. I think it makes sense to > > have that. > > > > I guess there might be 100% overlap between DRM_AUTH and > > DRM_RENDER_ALLOW in ioctl flags now, but fixing that would mean > > auditing all the driver ioctl tables. I guess we could add that as a > > todo.rst item. It would make sense that either the driver is isolating > > clients sufficiently to not need authenticated clients only, or allow > > them all. > > > > I guess we could lament why rendernodes are still not yet fully rolled > > out everywhere, despite all these years. But then I stopped being > > surprised about ecosystem inertia, and in the end a strict split > > between display (done on the primary) and rendering (preferrably done > > on render nodes) is not required on intel/amd/nv, so not many care to > > make this work better. The patch from Emil seemed like a neat little > > trick to get there faster, until the radv hiccup. > > Well that is a clear NAK for that approach. > > If you have problems with libva (which are design problems in libva and > not kernel related at all) then please work around that by removing > DRM_AUTH from the affected IOCTLs and not by messing up general > authentication code. That's what we've done defacto, except we've done it for all drivers. As mentioned above, option C would be to disable that for amdgpu. Which seems silly, given that the workaround hack is just 3 lines. I also think that the goal of the original patch is pretty sound and overall well motivated, given the reality that rendernodes aren't super popular unfortunately. Also, from an access permission pov it's the right thing to do really. > Well, what you guys did here is a serious no-go. Not really understanding your reasons for an unconditional nak on all this still. -Daniel
Am 17.04.19 um 14:00 schrieb Daniel Vetter: > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: >> Am 17.04.19 um 13:06 schrieb Daniel Vetter: >>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian >>> <Christian.Koenig@amd.com> wrote: >>> [SNIP] >> Well, what you guys did here is a serious no-go. > Not really understanding your reasons for an unconditional nak on all this > still. Well, let me summarize: You worked around a problem with libva by breaking another driver and now proposing a rather ugly and only halve backed workaround for that driver. This is a serious no-go. I've already send out a i915 specific change to remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert the offending patch. Please review, Christian. > -Daniel
On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > Am 17.04.19 um 14:00 schrieb Daniel Vetter: > > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > >> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > >>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > >>> <Christian.Koenig@amd.com> wrote: > >>> [SNIP] > >> Well, what you guys did here is a serious no-go. > > Not really understanding your reasons for an unconditional nak on all this > > still. > > Well, let me summarize: You worked around a problem with libva by > breaking another driver and now proposing a rather ugly and only halve > backed workaround for that driver. > > This is a serious no-go. I've already send out a i915 specific change to > remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > the offending patch. Oh I'm totally fine with the revert if that's what we go with. But that still leaves the issue that it would make sense to drop DRM_AUTH from DRIVER_RENDER capable drivers (not just for libva, but in general). And there's fundamentally 2 options to do that: - This one here (or anything implementing the same idea), to keep radv working. - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How exactly that's doen doesn't matter, but it leaves amdgpu out in the cold. It also doesn't matter whether we get there with revert + lots of patches, or a special hack for amdgpu, in the end amdgpu would be different. Also note that your patch isn't enough, since it doesn't fix the core ioctls. I think from an overall platform pov, the first is the better option. After all we try really hard to avoid driver special cases for these kinds of things. -Daniel
Am 17.04.19 um 14:35 schrieb Daniel Vetter: > On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: >> Am 17.04.19 um 14:00 schrieb Daniel Vetter: >>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: >>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: >>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian >>>>> <Christian.Koenig@amd.com> wrote: >>>>> [SNIP] >>>> Well, what you guys did here is a serious no-go. >>> Not really understanding your reasons for an unconditional nak on all this >>> still. >> Well, let me summarize: You worked around a problem with libva by >> breaking another driver and now proposing a rather ugly and only halve >> backed workaround for that driver. >> >> This is a serious no-go. I've already send out a i915 specific change to >> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert >> the offending patch. > Oh I'm totally fine with the revert if that's what we go with. But that > still leaves the issue that it would make sense to drop DRM_AUTH from > DRIVER_RENDER capable drivers (not just for libva, but in general). And > there's fundamentally 2 options to do that: > > - This one here (or anything implementing the same idea), to keep radv > working. > > - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > exactly that's doen doesn't matter, but it leaves amdgpu out in the > cold. It also doesn't matter whether we get there with revert + lots of > patches, or a special hack for amdgpu, in the end amdgpu would be > different. Also note that your patch isn't enough, since it doesn't fix > the core ioctls. > > I think from an overall platform pov, the first is the better option. > After all we try really hard to avoid driver special cases for these kinds > of things. Well, I initially thought that radv is doing something unusual or broken, but after looking deeper into this that is actually not the case. What radv does is calling an IOCTL and expecting an error message when it is not authenticated. It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to figure out if it is authenticated or not, but I clearly remember that we haven't done this from the beginning. Thinking more about this I don't think that this problem is actually amdgpu specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, but only from those who have the specific issue with libva. Christian. > -Daniel
On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > Am 17.04.19 um 14:35 schrieb Daniel Vetter: > > On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > > > Am 17.04.19 um 14:00 schrieb Daniel Vetter: > > > > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > > > > > Am 17.04.19 um 13:06 schrieb Daniel Vetter: > > > > > > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > > > > > > <Christian.Koenig@amd.com> wrote: > > > > > > [SNIP] > > > > > Well, what you guys did here is a serious no-go. > > > > Not really understanding your reasons for an unconditional nak on all this > > > > still. > > > Well, let me summarize: You worked around a problem with libva by > > > breaking another driver and now proposing a rather ugly and only halve > > > backed workaround for that driver. > > > > > > This is a serious no-go. I've already send out a i915 specific change to > > > remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > > > the offending patch. > > Oh I'm totally fine with the revert if that's what we go with. But that > > still leaves the issue that it would make sense to drop DRM_AUTH from > > DRIVER_RENDER capable drivers (not just for libva, but in general). And > > there's fundamentally 2 options to do that: > > > > - This one here (or anything implementing the same idea), to keep radv > > working. > > > > - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > > exactly that's doen doesn't matter, but it leaves amdgpu out in the > > cold. It also doesn't matter whether we get there with revert + lots of > > patches, or a special hack for amdgpu, in the end amdgpu would be > > different. Also note that your patch isn't enough, since it doesn't fix > > the core ioctls. > > > > I think from an overall platform pov, the first is the better option. > > After all we try really hard to avoid driver special cases for these kinds > > of things. > > Well, I initially thought that radv is doing something unusual or broken, > but after looking deeper into this that is actually not the case. > > What radv does is calling an IOCTL and expecting an error message when it is > not authenticated. > > It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > figure out if it is authenticated or not, but I clearly remember that we > haven't done this from the beginning. Maybe in the radoen code, or some earlier internal amdgpu libdrm code? First public commit has all the bits: getauth, GetVersion, then the accel query. > Thinking more about this I don't think that this problem is actually amdgpu > specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > but only from those who have the specific issue with libva. libva was the initial motivation, the goal of Emil's patch wasn't to just duct-tape over libva. That would be easier to fix in libva. The point was that we should be able to allow this in general. And we can, except for the radv issue uncovered here. So please don't get 100% hung up on the libva thing, that wasn't really the goal, just the initial motivation. And I still thinks it makes for all drivers. So again we have: - radv hack - make amdgpu behave different from everyone else - keep tilting windmills about "pls use rendernodes, thx" I neither like tilting windmills nor making drivers behave different for no reaason at all. -Daniel
Hi guys, On 2019/04/17, Daniel Vetter wrote: > On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > > Am 17.04.19 um 14:35 schrieb Daniel Vetter: > > > On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > > > > Am 17.04.19 um 14:00 schrieb Daniel Vetter: > > > > > On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > > > > > > Am 17.04.19 um 13:06 schrieb Daniel Vetter: > > > > > > > On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > > > > > > > <Christian.Koenig@amd.com> wrote: > > > > > > > [SNIP] > > > > > > Well, what you guys did here is a serious no-go. > > > > > Not really understanding your reasons for an unconditional nak on all this > > > > > still. > > > > Well, let me summarize: You worked around a problem with libva by > > > > breaking another driver and now proposing a rather ugly and only halve > > > > backed workaround for that driver. > > > > > > > > This is a serious no-go. I've already send out a i915 specific change to > > > > remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > > > > the offending patch. > > > Oh I'm totally fine with the revert if that's what we go with. But that > > > still leaves the issue that it would make sense to drop DRM_AUTH from > > > DRIVER_RENDER capable drivers (not just for libva, but in general). And > > > there's fundamentally 2 options to do that: > > > > > > - This one here (or anything implementing the same idea), to keep radv > > > working. > > > > > > - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > > > exactly that's doen doesn't matter, but it leaves amdgpu out in the > > > cold. It also doesn't matter whether we get there with revert + lots of > > > patches, or a special hack for amdgpu, in the end amdgpu would be > > > different. Also note that your patch isn't enough, since it doesn't fix > > > the core ioctls. > > > > > > I think from an overall platform pov, the first is the better option. > > > After all we try really hard to avoid driver special cases for these kinds > > > of things. > > > > Well, I initially thought that radv is doing something unusual or broken, > > but after looking deeper into this that is actually not the case. > > > > What radv does is calling an IOCTL and expecting an error message when it is > > not authenticated. > > > > It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > > figure out if it is authenticated or not, but I clearly remember that we > > haven't done this from the beginning. > > Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > First public commit has all the bits: getauth, GetVersion, then the accel > query. > > > Thinking more about this I don't think that this problem is actually amdgpu > > specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > > but only from those who have the specific issue with libva. > > libva was the initial motivation, the goal of Emil's patch wasn't to just > duct-tape over libva. That would be easier to fix in libva. The point was > that we should be able to allow this in general. > > And we can, except for the radv issue uncovered here. > > So please don't get 100% hung up on the libva thing, that wasn't really > the goal, just the initial motivation. And I still thinks it makes for all > drivers. So again we have: > > - radv hack > - make amdgpu behave different from everyone else > - keep tilting windmills about "pls use rendernodes, thx" > > I neither like tilting windmills nor making drivers behave different for > no reaason at all. Allow me to jump-in some emails down the line. First and foremost, sincere apologies for upsetting you Christian. If it's of any consolidation - let me assure you the goal is _not_ to break amdgpu or any other driver. Secondly, I would like to ask you for a list of projects so we can look and investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into it. On the topic of "working around libva" - sadly libva is _not_ the only offender. We also had bugs in mesa and kmscube. Considering those are taken as a prime example of "the right way", it's very likely for the mistakes to be repeated in many other projects. Where the common "fix" seems to be "run as root"... As Daniel pointed out, we could be fighting the windmills or we could have a small, admittedly ugly, workaround for amdgpu. If you don't like that workaround in the driver we could move it to core DRM. In either case, I would like to focus on a pragramic solution which works with both old and new userspace. Thanks Emil
Am 17.04.19 um 17:51 schrieb Emil Velikov: > Hi guys, > > On 2019/04/17, Daniel Vetter wrote: >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian >>>>>>>> <Christian.Koenig@amd.com> wrote: >>>>>>>> [SNIP] >>>>>>> Well, what you guys did here is a serious no-go. >>>>>> Not really understanding your reasons for an unconditional nak on all this >>>>>> still. >>>>> Well, let me summarize: You worked around a problem with libva by >>>>> breaking another driver and now proposing a rather ugly and only halve >>>>> backed workaround for that driver. >>>>> >>>>> This is a serious no-go. I've already send out a i915 specific change to >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert >>>>> the offending patch. >>>> Oh I'm totally fine with the revert if that's what we go with. But that >>>> still leaves the issue that it would make sense to drop DRM_AUTH from >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And >>>> there's fundamentally 2 options to do that: >>>> >>>> - This one here (or anything implementing the same idea), to keep radv >>>> working. >>>> >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How >>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the >>>> cold. It also doesn't matter whether we get there with revert + lots of >>>> patches, or a special hack for amdgpu, in the end amdgpu would be >>>> different. Also note that your patch isn't enough, since it doesn't fix >>>> the core ioctls. >>>> >>>> I think from an overall platform pov, the first is the better option. >>>> After all we try really hard to avoid driver special cases for these kinds >>>> of things. >>> Well, I initially thought that radv is doing something unusual or broken, >>> but after looking deeper into this that is actually not the case. >>> >>> What radv does is calling an IOCTL and expecting an error message when it is >>> not authenticated. >>> >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to >>> figure out if it is authenticated or not, but I clearly remember that we >>> haven't done this from the beginning. >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? >> First public commit has all the bits: getauth, GetVersion, then the accel >> query. >> >>> Thinking more about this I don't think that this problem is actually amdgpu >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, >>> but only from those who have the specific issue with libva. >> libva was the initial motivation, the goal of Emil's patch wasn't to just >> duct-tape over libva. That would be easier to fix in libva. The point was >> that we should be able to allow this in general. >> >> And we can, except for the radv issue uncovered here. >> >> So please don't get 100% hung up on the libva thing, that wasn't really >> the goal, just the initial motivation. And I still thinks it makes for all >> drivers. So again we have: >> >> - radv hack >> - make amdgpu behave different from everyone else >> - keep tilting windmills about "pls use rendernodes, thx" >> >> I neither like tilting windmills nor making drivers behave different for >> no reaason at all. > Allow me to jump-in some emails down the line. > > First and foremost, sincere apologies for upsetting you Christian. If > it's of any consolidation - let me assure you the goal is _not_ to break > amdgpu or any other driver. > > Secondly, I would like to ask you for a list of projects so we can look and > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > it. That is rather hard to come by, because you would also need to look at all previously existing driver stacks. E.g. with the classic R300 driver for example. > On the topic of "working around libva" - sadly libva is _not_ the only > offender. We also had bugs in mesa and kmscube. > > Considering those are taken as a prime example of "the right way", it's very > likely for the mistakes to be repeated in many other projects. > > Where the common "fix" seems to be "run as root"... > > > As Daniel pointed out, we could be fighting the windmills or we could have a > small, admittedly ugly, workaround for amdgpu. > > If you don't like that workaround in the driver we could move it to core DRM. > > In either case, I would like to focus on a pragramic solution which works with > both old and new userspace. Well, I seriously think the original committed solution could cause a lot of problems and the issue with radv is only the tip of the iceberg. I mean we just have to ask our self why have we created render nodes in the first place? The obvious alternative was to just removed the authentication restrictions on the primary node which would have been way less code and maintenance burden. I need to dig up the mailing list archive, but I strongly think that one of the main arguments for this approach was to NOT break existing userspace. Even taking into account that we now don't need to deal with UMS and really really old userspace drivers any more it still feels like a to high risk going down that route. Christian. > > Thanks > Emil
On Thu, 18 Apr 2019 at 03:30, Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 17.04.19 um 17:51 schrieb Emil Velikov: > > Hi guys, > > > > On 2019/04/17, Daniel Vetter wrote: > >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: > >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: > >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > >>>>>>>> <Christian.Koenig@amd.com> wrote: > >>>>>>>> [SNIP] > >>>>>>> Well, what you guys did here is a serious no-go. > >>>>>> Not really understanding your reasons for an unconditional nak on all this > >>>>>> still. > >>>>> Well, let me summarize: You worked around a problem with libva by > >>>>> breaking another driver and now proposing a rather ugly and only halve > >>>>> backed workaround for that driver. > >>>>> > >>>>> This is a serious no-go. I've already send out a i915 specific change to > >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > >>>>> the offending patch. > >>>> Oh I'm totally fine with the revert if that's what we go with. But that > >>>> still leaves the issue that it would make sense to drop DRM_AUTH from > >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And > >>>> there's fundamentally 2 options to do that: > >>>> > >>>> - This one here (or anything implementing the same idea), to keep radv > >>>> working. > >>>> > >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > >>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the > >>>> cold. It also doesn't matter whether we get there with revert + lots of > >>>> patches, or a special hack for amdgpu, in the end amdgpu would be > >>>> different. Also note that your patch isn't enough, since it doesn't fix > >>>> the core ioctls. > >>>> > >>>> I think from an overall platform pov, the first is the better option. > >>>> After all we try really hard to avoid driver special cases for these kinds > >>>> of things. > >>> Well, I initially thought that radv is doing something unusual or broken, > >>> but after looking deeper into this that is actually not the case. > >>> > >>> What radv does is calling an IOCTL and expecting an error message when it is > >>> not authenticated. > >>> > >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > >>> figure out if it is authenticated or not, but I clearly remember that we > >>> haven't done this from the beginning. > >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > >> First public commit has all the bits: getauth, GetVersion, then the accel > >> query. > >> > >>> Thinking more about this I don't think that this problem is actually amdgpu > >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > >>> but only from those who have the specific issue with libva. > >> libva was the initial motivation, the goal of Emil's patch wasn't to just > >> duct-tape over libva. That would be easier to fix in libva. The point was > >> that we should be able to allow this in general. > >> > >> And we can, except for the radv issue uncovered here. > >> > >> So please don't get 100% hung up on the libva thing, that wasn't really > >> the goal, just the initial motivation. And I still thinks it makes for all > >> drivers. So again we have: > >> > >> - radv hack > >> - make amdgpu behave different from everyone else > >> - keep tilting windmills about "pls use rendernodes, thx" > >> > >> I neither like tilting windmills nor making drivers behave different for > >> no reaason at all. > > Allow me to jump-in some emails down the line. > > > > First and foremost, sincere apologies for upsetting you Christian. If > > it's of any consolidation - let me assure you the goal is _not_ to break > > amdgpu or any other driver. > > > > Secondly, I would like to ask you for a list of projects so we can look and > > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > > it. > > That is rather hard to come by, because you would also need to look at > all previously existing driver stacks. > > E.g. with the classic R300 driver for example. > > > On the topic of "working around libva" - sadly libva is _not_ the only > > offender. We also had bugs in mesa and kmscube. > > > > Considering those are taken as a prime example of "the right way", it's very > > likely for the mistakes to be repeated in many other projects. > > > > Where the common "fix" seems to be "run as root"... > > > > > > As Daniel pointed out, we could be fighting the windmills or we could have a > > small, admittedly ugly, workaround for amdgpu. > > > > If you don't like that workaround in the driver we could move it to core DRM. > > > > In either case, I would like to focus on a pragramic solution which works with > > both old and new userspace. > > Well, I seriously think the original committed solution could cause a > lot of problems and the issue with radv is only the tip of the iceberg. > > I mean we just have to ask our self why have we created render nodes in > the first place? The obvious alternative was to just removed the > authentication restrictions on the primary node which would have been > way less code and maintenance burden. > > I need to dig up the mailing list archive, but I strongly think that one > of the main arguments for this approach was to NOT break existing userspace. > > Even taking into account that we now don't need to deal with UMS and > really really old userspace drivers any more it still feels like a to > high risk going down that route. I'm going to revert the original patch in drm-next now. We've been getting a bit lax lately on the regressions need to get reverted no matter what rule, and we are getting close to rc6 and it's Easter, so I don't want to have to care about this, forget about it, remember it again, end up at 5.2. Personally I'm rather in favour of cleaning up the driver ioctls since I don't like older drivers suddenly having a new ABI without consideration of side effects. Dave.
On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 17.04.19 um 17:51 schrieb Emil Velikov: > > Hi guys, > > > > On 2019/04/17, Daniel Vetter wrote: > >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: > >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: > >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > >>>>>>>> <Christian.Koenig@amd.com> wrote: > >>>>>>>> [SNIP] > >>>>>>> Well, what you guys did here is a serious no-go. > >>>>>> Not really understanding your reasons for an unconditional nak on all this > >>>>>> still. > >>>>> Well, let me summarize: You worked around a problem with libva by > >>>>> breaking another driver and now proposing a rather ugly and only halve > >>>>> backed workaround for that driver. > >>>>> > >>>>> This is a serious no-go. I've already send out a i915 specific change to > >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > >>>>> the offending patch. > >>>> Oh I'm totally fine with the revert if that's what we go with. But that > >>>> still leaves the issue that it would make sense to drop DRM_AUTH from > >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And > >>>> there's fundamentally 2 options to do that: > >>>> > >>>> - This one here (or anything implementing the same idea), to keep radv > >>>> working. > >>>> > >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > >>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the > >>>> cold. It also doesn't matter whether we get there with revert + lots of > >>>> patches, or a special hack for amdgpu, in the end amdgpu would be > >>>> different. Also note that your patch isn't enough, since it doesn't fix > >>>> the core ioctls. > >>>> > >>>> I think from an overall platform pov, the first is the better option. > >>>> After all we try really hard to avoid driver special cases for these kinds > >>>> of things. > >>> Well, I initially thought that radv is doing something unusual or broken, > >>> but after looking deeper into this that is actually not the case. > >>> > >>> What radv does is calling an IOCTL and expecting an error message when it is > >>> not authenticated. > >>> > >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > >>> figure out if it is authenticated or not, but I clearly remember that we > >>> haven't done this from the beginning. > >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > >> First public commit has all the bits: getauth, GetVersion, then the accel > >> query. > >> > >>> Thinking more about this I don't think that this problem is actually amdgpu > >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > >>> but only from those who have the specific issue with libva. > >> libva was the initial motivation, the goal of Emil's patch wasn't to just > >> duct-tape over libva. That would be easier to fix in libva. The point was > >> that we should be able to allow this in general. > >> > >> And we can, except for the radv issue uncovered here. > >> > >> So please don't get 100% hung up on the libva thing, that wasn't really > >> the goal, just the initial motivation. And I still thinks it makes for all > >> drivers. So again we have: > >> > >> - radv hack > >> - make amdgpu behave different from everyone else > >> - keep tilting windmills about "pls use rendernodes, thx" > >> > >> I neither like tilting windmills nor making drivers behave different for > >> no reaason at all. > > Allow me to jump-in some emails down the line. > > > > First and foremost, sincere apologies for upsetting you Christian. If > > it's of any consolidation - let me assure you the goal is _not_ to break > > amdgpu or any other driver. > > > > Secondly, I would like to ask you for a list of projects so we can look and > > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > > it. > > That is rather hard to come by, because you would also need to look at > all previously existing driver stacks. > > E.g. with the classic R300 driver for example. > > > On the topic of "working around libva" - sadly libva is _not_ the only > > offender. We also had bugs in mesa and kmscube. > > > > Considering those are taken as a prime example of "the right way", it's very > > likely for the mistakes to be repeated in many other projects. > > > > Where the common "fix" seems to be "run as root"... > > > > > > As Daniel pointed out, we could be fighting the windmills or we could have a > > small, admittedly ugly, workaround for amdgpu. > > > > If you don't like that workaround in the driver we could move it to core DRM. > > > > In either case, I would like to focus on a pragramic solution which works with > > both old and new userspace. > > Well, I seriously think the original committed solution could cause a > lot of problems and the issue with radv is only the tip of the iceberg. > > I mean we just have to ask our self why have we created render nodes in > the first place? The obvious alternative was to just removed the > authentication restrictions on the primary node which would have been > way less code and maintenance burden. The render nodes exist so you can have different unix permissions for rendering as for displaying stuff. E.g. run the compositor as a distinct user from all its clients. The other reason was to have a clean interface without any of the historical baggage (to make stuff like container/vm pass-through easier, since we'd know that render-node userspace won't ever touch any of the old crap ioctls). I don't think there's ever been a concern about breaking backwards compatibility. The issue here is that radv checks for "can I render", when it wants to check for "can I display". So anything that only renders we can ignore. That leaves the various ddx around, which historically have run as root, or if they run as non-root, logind or similar is giving them the master kms node already in master mode (afaiui at least). That leaves the non-main compositor userspace, of which there's only vkdisplay. I don't see much additional possibilities for regressions to creep in. Emil probably has a better grasp on this, since he's got a bunch of patches to roll out drmIsMaster all over. > I need to dig up the mailing list archive, but I strongly think that one > of the main arguments for this approach was to NOT break existing userspace. > > Even taking into account that we now don't need to deal with UMS and > really really old userspace drivers any more it still feels like a to > high risk going down that route. We change the undefined corners of the uapi all the time to improve the overall ecosystem, and occasionally we end up with DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still think the choice here is how exactly amdgpu.ko copes with radv: - the hack proposed here - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and let you folks deal with a (likely, not guaranteed) influx "but this works on $any_other_driver" I guess up to you which patch Emil should include when he resubmits the patch for 5.3. We've done both in the past, but generally we're trying to limit the impact of these hacks as much as possible. Hence why I'm still in favour of the first one. "keep tilting windmills" is imo not an option, and I'm happy to handle any other fallout Emil's patch uncovers, if there is anything else - I did sketch this hack here quickly on irc right when we got the report. -Daniel
Am 18.04.19 um 08:46 schrieb Daniel Vetter: > On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian > <Christian.Koenig@amd.com> wrote: >> Am 17.04.19 um 17:51 schrieb Emil Velikov: >>> Hi guys, >>> >>> On 2019/04/17, Daniel Vetter wrote: >>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: >>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: >>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: >>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: >>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: >>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: >>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian >>>>>>>>>> <Christian.Koenig@amd.com> wrote: >>>>>>>>>> [SNIP] >>>>>>>>> Well, what you guys did here is a serious no-go. >>>>>>>> Not really understanding your reasons for an unconditional nak on all this >>>>>>>> still. >>>>>>> Well, let me summarize: You worked around a problem with libva by >>>>>>> breaking another driver and now proposing a rather ugly and only halve >>>>>>> backed workaround for that driver. >>>>>>> >>>>>>> This is a serious no-go. I've already send out a i915 specific change to >>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert >>>>>>> the offending patch. >>>>>> Oh I'm totally fine with the revert if that's what we go with. But that >>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from >>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And >>>>>> there's fundamentally 2 options to do that: >>>>>> >>>>>> - This one here (or anything implementing the same idea), to keep radv >>>>>> working. >>>>>> >>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How >>>>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the >>>>>> cold. It also doesn't matter whether we get there with revert + lots of >>>>>> patches, or a special hack for amdgpu, in the end amdgpu would be >>>>>> different. Also note that your patch isn't enough, since it doesn't fix >>>>>> the core ioctls. >>>>>> >>>>>> I think from an overall platform pov, the first is the better option. >>>>>> After all we try really hard to avoid driver special cases for these kinds >>>>>> of things. >>>>> Well, I initially thought that radv is doing something unusual or broken, >>>>> but after looking deeper into this that is actually not the case. >>>>> >>>>> What radv does is calling an IOCTL and expecting an error message when it is >>>>> not authenticated. >>>>> >>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to >>>>> figure out if it is authenticated or not, but I clearly remember that we >>>>> haven't done this from the beginning. >>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? >>>> First public commit has all the bits: getauth, GetVersion, then the accel >>>> query. >>>> >>>>> Thinking more about this I don't think that this problem is actually amdgpu >>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, >>>>> but only from those who have the specific issue with libva. >>>> libva was the initial motivation, the goal of Emil's patch wasn't to just >>>> duct-tape over libva. That would be easier to fix in libva. The point was >>>> that we should be able to allow this in general. >>>> >>>> And we can, except for the radv issue uncovered here. >>>> >>>> So please don't get 100% hung up on the libva thing, that wasn't really >>>> the goal, just the initial motivation. And I still thinks it makes for all >>>> drivers. So again we have: >>>> >>>> - radv hack >>>> - make amdgpu behave different from everyone else >>>> - keep tilting windmills about "pls use rendernodes, thx" >>>> >>>> I neither like tilting windmills nor making drivers behave different for >>>> no reaason at all. >>> Allow me to jump-in some emails down the line. >>> >>> First and foremost, sincere apologies for upsetting you Christian. If >>> it's of any consolidation - let me assure you the goal is _not_ to break >>> amdgpu or any other driver. >>> >>> Secondly, I would like to ask you for a list of projects so we can look and >>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into >>> it. >> That is rather hard to come by, because you would also need to look at >> all previously existing driver stacks. >> >> E.g. with the classic R300 driver for example. >> >>> On the topic of "working around libva" - sadly libva is _not_ the only >>> offender. We also had bugs in mesa and kmscube. >>> >>> Considering those are taken as a prime example of "the right way", it's very >>> likely for the mistakes to be repeated in many other projects. >>> >>> Where the common "fix" seems to be "run as root"... >>> >>> >>> As Daniel pointed out, we could be fighting the windmills or we could have a >>> small, admittedly ugly, workaround for amdgpu. >>> >>> If you don't like that workaround in the driver we could move it to core DRM. >>> >>> In either case, I would like to focus on a pragramic solution which works with >>> both old and new userspace. >> Well, I seriously think the original committed solution could cause a >> lot of problems and the issue with radv is only the tip of the iceberg. >> >> I mean we just have to ask our self why have we created render nodes in >> the first place? The obvious alternative was to just removed the >> authentication restrictions on the primary node which would have been >> way less code and maintenance burden. > The render nodes exist so you can have different unix permissions for > rendering as for displaying stuff. E.g. run the compositor as a > distinct user from all its clients. The other reason was to have a > clean interface without any of the historical baggage (to make stuff > like container/vm pass-through easier, since we'd know that > render-node userspace won't ever touch any of the old crap ioctls). I > don't think there's ever been a concern about breaking backwards > compatibility. > > The issue here is that radv checks for "can I render", when it wants > to check for "can I display". So anything that only renders we can > ignore. That leaves the various ddx around, which historically have > run as root, or if they run as non-root, logind or similar is giving > them the master kms node already in master mode (afaiui at least). > That leaves the non-main compositor userspace, of which there's only > vkdisplay. I don't see much additional possibilities for regressions > to creep in. > > Emil probably has a better grasp on this, since he's got a bunch of > patches to roll out drmIsMaster all over. > >> I need to dig up the mailing list archive, but I strongly think that one >> of the main arguments for this approach was to NOT break existing userspace. >> >> Even taking into account that we now don't need to deal with UMS and >> really really old userspace drivers any more it still feels like a to >> high risk going down that route. > We change the undefined corners of the uapi all the time to improve > the overall ecosystem, and occasionally we end up with > DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still > think the choice here is how exactly amdgpu.ko copes with radv: > - the hack proposed here > - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and > let you folks deal with a (likely, not guaranteed) influx "but this > works on $any_other_driver" > > I guess up to you which patch Emil should include when he resubmits > the patch for 5.3. We've done both in the past, but generally we're > trying to limit the impact of these hacks as much as possible. Hence > why I'm still in favour of the first one. Well and I will still NAK both approaches. You can of course go ahead and remove the DRM_AUTH flags from the i915 IOCTLs to make libva happy, but please not a general approach which touches all DRM drivers. And from Dave's response I think he is following my argumentation here, so I don't see a general approach to be added for 5.3 either. > "keep tilting windmills" is imo not an option, and I'm happy to handle > any other fallout Emil's patch uncovers, if there is anything else - I > did sketch this hack here quickly on irc right when we got the report. Well this is not about tilting windmills, but rather good maintenance of backward compatibility and rejecting changing with potential unforeseen consequences. Regards, Christian. > -Daniel
On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote: > Am 18.04.19 um 08:46 schrieb Daniel Vetter: > > On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian > > <Christian.Koenig@amd.com> wrote: > >> Am 17.04.19 um 17:51 schrieb Emil Velikov: > >>> Hi guys, > >>> > >>> On 2019/04/17, Daniel Vetter wrote: > >>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > >>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: > >>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > >>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: > >>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > >>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > >>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > >>>>>>>>>> <Christian.Koenig@amd.com> wrote: > >>>>>>>>>> [SNIP] > >>>>>>>>> Well, what you guys did here is a serious no-go. > >>>>>>>> Not really understanding your reasons for an unconditional nak on all this > >>>>>>>> still. > >>>>>>> Well, let me summarize: You worked around a problem with libva by > >>>>>>> breaking another driver and now proposing a rather ugly and only halve > >>>>>>> backed workaround for that driver. > >>>>>>> > >>>>>>> This is a serious no-go. I've already send out a i915 specific change to > >>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > >>>>>>> the offending patch. > >>>>>> Oh I'm totally fine with the revert if that's what we go with. But that > >>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from > >>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And > >>>>>> there's fundamentally 2 options to do that: > >>>>>> > >>>>>> - This one here (or anything implementing the same idea), to keep radv > >>>>>> working. > >>>>>> > >>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > >>>>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the > >>>>>> cold. It also doesn't matter whether we get there with revert + lots of > >>>>>> patches, or a special hack for amdgpu, in the end amdgpu would be > >>>>>> different. Also note that your patch isn't enough, since it doesn't fix > >>>>>> the core ioctls. > >>>>>> > >>>>>> I think from an overall platform pov, the first is the better option. > >>>>>> After all we try really hard to avoid driver special cases for these kinds > >>>>>> of things. > >>>>> Well, I initially thought that radv is doing something unusual or broken, > >>>>> but after looking deeper into this that is actually not the case. > >>>>> > >>>>> What radv does is calling an IOCTL and expecting an error message when it is > >>>>> not authenticated. > >>>>> > >>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > >>>>> figure out if it is authenticated or not, but I clearly remember that we > >>>>> haven't done this from the beginning. > >>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > >>>> First public commit has all the bits: getauth, GetVersion, then the accel > >>>> query. > >>>> > >>>>> Thinking more about this I don't think that this problem is actually amdgpu > >>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > >>>>> but only from those who have the specific issue with libva. > >>>> libva was the initial motivation, the goal of Emil's patch wasn't to just > >>>> duct-tape over libva. That would be easier to fix in libva. The point was > >>>> that we should be able to allow this in general. > >>>> > >>>> And we can, except for the radv issue uncovered here. > >>>> > >>>> So please don't get 100% hung up on the libva thing, that wasn't really > >>>> the goal, just the initial motivation. And I still thinks it makes for all > >>>> drivers. So again we have: > >>>> > >>>> - radv hack > >>>> - make amdgpu behave different from everyone else > >>>> - keep tilting windmills about "pls use rendernodes, thx" > >>>> > >>>> I neither like tilting windmills nor making drivers behave different for > >>>> no reaason at all. > >>> Allow me to jump-in some emails down the line. > >>> > >>> First and foremost, sincere apologies for upsetting you Christian. If > >>> it's of any consolidation - let me assure you the goal is _not_ to break > >>> amdgpu or any other driver. > >>> > >>> Secondly, I would like to ask you for a list of projects so we can look and > >>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > >>> it. > >> That is rather hard to come by, because you would also need to look at > >> all previously existing driver stacks. > >> > >> E.g. with the classic R300 driver for example. > >> > >>> On the topic of "working around libva" - sadly libva is _not_ the only > >>> offender. We also had bugs in mesa and kmscube. > >>> > >>> Considering those are taken as a prime example of "the right way", it's very > >>> likely for the mistakes to be repeated in many other projects. > >>> > >>> Where the common "fix" seems to be "run as root"... > >>> > >>> > >>> As Daniel pointed out, we could be fighting the windmills or we could have a > >>> small, admittedly ugly, workaround for amdgpu. > >>> > >>> If you don't like that workaround in the driver we could move it to core DRM. > >>> > >>> In either case, I would like to focus on a pragramic solution which works with > >>> both old and new userspace. > >> Well, I seriously think the original committed solution could cause a > >> lot of problems and the issue with radv is only the tip of the iceberg. > >> > >> I mean we just have to ask our self why have we created render nodes in > >> the first place? The obvious alternative was to just removed the > >> authentication restrictions on the primary node which would have been > >> way less code and maintenance burden. > > The render nodes exist so you can have different unix permissions for > > rendering as for displaying stuff. E.g. run the compositor as a > > distinct user from all its clients. The other reason was to have a > > clean interface without any of the historical baggage (to make stuff > > like container/vm pass-through easier, since we'd know that > > render-node userspace won't ever touch any of the old crap ioctls). I > > don't think there's ever been a concern about breaking backwards > > compatibility. > > > > The issue here is that radv checks for "can I render", when it wants > > to check for "can I display". So anything that only renders we can > > ignore. That leaves the various ddx around, which historically have > > run as root, or if they run as non-root, logind or similar is giving > > them the master kms node already in master mode (afaiui at least). > > That leaves the non-main compositor userspace, of which there's only > > vkdisplay. I don't see much additional possibilities for regressions > > to creep in. > > > > Emil probably has a better grasp on this, since he's got a bunch of > > patches to roll out drmIsMaster all over. > > > >> I need to dig up the mailing list archive, but I strongly think that one > >> of the main arguments for this approach was to NOT break existing userspace. > >> > >> Even taking into account that we now don't need to deal with UMS and > >> really really old userspace drivers any more it still feels like a to > >> high risk going down that route. > > We change the undefined corners of the uapi all the time to improve > > the overall ecosystem, and occasionally we end up with > > DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still > > think the choice here is how exactly amdgpu.ko copes with radv: > > - the hack proposed here > > - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and > > let you folks deal with a (likely, not guaranteed) influx "but this > > works on $any_other_driver" > > > > I guess up to you which patch Emil should include when he resubmits > > the patch for 5.3. We've done both in the past, but generally we're > > trying to limit the impact of these hacks as much as possible. Hence > > why I'm still in favour of the first one. > > Well and I will still NAK both approaches. > > You can of course go ahead and remove the DRM_AUTH flags from the i915 > IOCTLs to make libva happy, but please not a general approach which > touches all DRM drivers. > > And from Dave's response I think he is following my argumentation here, > so I don't see a general approach to be added for 5.3 either. > > > "keep tilting windmills" is imo not an option, and I'm happy to handle > > any other fallout Emil's patch uncovers, if there is anything else - I > > did sketch this hack here quickly on irc right when we got the report. > > Well this is not about tilting windmills, but rather good maintenance of > backward compatibility and rejecting changing with potential unforeseen > consequences. We (well I) have been doing stuff like this since forever. It's just that thus far it hasn't been amd that stuck out with the need for a hack, but other drivers. It's a tradeoff in the end, and a tradeoff we're can do due to our open source userspace requirement - in case something does go sideways, we have the full history of anything affected. So you're proposed we revert all these other changes and cleanups too, because they could be risky (many turned out to actually be risky, like this one here too)? Ok correction: amd has stuck out in the past too, there was some vblank vs pageflip stuff where we needed to do some pretty clever tricks to both have the new stricter semantics for everyone, while keeping the amd ddx happy still. This aint new at all, we fixed up the regressions and moved on. tldr; You're categorical rejection doesn't make sense to me. Slightly over the top, but this feels like gut reaction to radv being a thorn for amd. -Daniel
On 2019-04-18 10:26 a.m., Daniel Vetter wrote: > > Ok correction: amd has stuck out in the past too, there was some vblank vs > pageflip stuff where we needed to do some pretty clever tricks to both > have the new stricter semantics for everyone, while keeping the amd ddx > happy still. This aint new at all, we fixed up the regressions and moved > on. That sounds like something I would have been involved in, but I'm not sure what you mean... I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow userspace to take advantage of our hardware's capabilities to avoid delaying a page flip by one refresh cycle in some cases. That's all. > tldr; You're categorical rejection doesn't make sense to me. > Slightly over the top, but this feels like gut reaction to radv being a > thorn for amd. That's a weird insinuation. If RADV was a "thorn for AMD", surely we'd be having a party about this regression instead of insisting that it be fixed thoroughly.
Am 18.04.19 um 10:26 schrieb Daniel Vetter: > On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote: >> Am 18.04.19 um 08:46 schrieb Daniel Vetter: >>> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian >>> <Christian.Koenig@amd.com> wrote: >>>> Am 17.04.19 um 17:51 schrieb Emil Velikov: >>>>> Hi guys, >>>>> >>>>> On 2019/04/17, Daniel Vetter wrote: >>>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: >>>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: >>>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: >>>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: >>>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: >>>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: >>>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian >>>>>>>>>>>> <Christian.Koenig@amd.com> wrote: >>>>>>>>>>>> [SNIP] >>>>>>>>>>> Well, what you guys did here is a serious no-go. >>>>>>>>>> Not really understanding your reasons for an unconditional nak on all this >>>>>>>>>> still. >>>>>>>>> Well, let me summarize: You worked around a problem with libva by >>>>>>>>> breaking another driver and now proposing a rather ugly and only halve >>>>>>>>> backed workaround for that driver. >>>>>>>>> >>>>>>>>> This is a serious no-go. I've already send out a i915 specific change to >>>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert >>>>>>>>> the offending patch. >>>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that >>>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from >>>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And >>>>>>>> there's fundamentally 2 options to do that: >>>>>>>> >>>>>>>> - This one here (or anything implementing the same idea), to keep radv >>>>>>>> working. >>>>>>>> >>>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How >>>>>>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the >>>>>>>> cold. It also doesn't matter whether we get there with revert + lots of >>>>>>>> patches, or a special hack for amdgpu, in the end amdgpu would be >>>>>>>> different. Also note that your patch isn't enough, since it doesn't fix >>>>>>>> the core ioctls. >>>>>>>> >>>>>>>> I think from an overall platform pov, the first is the better option. >>>>>>>> After all we try really hard to avoid driver special cases for these kinds >>>>>>>> of things. >>>>>>> Well, I initially thought that radv is doing something unusual or broken, >>>>>>> but after looking deeper into this that is actually not the case. >>>>>>> >>>>>>> What radv does is calling an IOCTL and expecting an error message when it is >>>>>>> not authenticated. >>>>>>> >>>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to >>>>>>> figure out if it is authenticated or not, but I clearly remember that we >>>>>>> haven't done this from the beginning. >>>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? >>>>>> First public commit has all the bits: getauth, GetVersion, then the accel >>>>>> query. >>>>>> >>>>>>> Thinking more about this I don't think that this problem is actually amdgpu >>>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, >>>>>>> but only from those who have the specific issue with libva. >>>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just >>>>>> duct-tape over libva. That would be easier to fix in libva. The point was >>>>>> that we should be able to allow this in general. >>>>>> >>>>>> And we can, except for the radv issue uncovered here. >>>>>> >>>>>> So please don't get 100% hung up on the libva thing, that wasn't really >>>>>> the goal, just the initial motivation. And I still thinks it makes for all >>>>>> drivers. So again we have: >>>>>> >>>>>> - radv hack >>>>>> - make amdgpu behave different from everyone else >>>>>> - keep tilting windmills about "pls use rendernodes, thx" >>>>>> >>>>>> I neither like tilting windmills nor making drivers behave different for >>>>>> no reaason at all. >>>>> Allow me to jump-in some emails down the line. >>>>> >>>>> First and foremost, sincere apologies for upsetting you Christian. If >>>>> it's of any consolidation - let me assure you the goal is _not_ to break >>>>> amdgpu or any other driver. >>>>> >>>>> Secondly, I would like to ask you for a list of projects so we can look and >>>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into >>>>> it. >>>> That is rather hard to come by, because you would also need to look at >>>> all previously existing driver stacks. >>>> >>>> E.g. with the classic R300 driver for example. >>>> >>>>> On the topic of "working around libva" - sadly libva is _not_ the only >>>>> offender. We also had bugs in mesa and kmscube. >>>>> >>>>> Considering those are taken as a prime example of "the right way", it's very >>>>> likely for the mistakes to be repeated in many other projects. >>>>> >>>>> Where the common "fix" seems to be "run as root"... >>>>> >>>>> >>>>> As Daniel pointed out, we could be fighting the windmills or we could have a >>>>> small, admittedly ugly, workaround for amdgpu. >>>>> >>>>> If you don't like that workaround in the driver we could move it to core DRM. >>>>> >>>>> In either case, I would like to focus on a pragramic solution which works with >>>>> both old and new userspace. >>>> Well, I seriously think the original committed solution could cause a >>>> lot of problems and the issue with radv is only the tip of the iceberg. >>>> >>>> I mean we just have to ask our self why have we created render nodes in >>>> the first place? The obvious alternative was to just removed the >>>> authentication restrictions on the primary node which would have been >>>> way less code and maintenance burden. >>> The render nodes exist so you can have different unix permissions for >>> rendering as for displaying stuff. E.g. run the compositor as a >>> distinct user from all its clients. The other reason was to have a >>> clean interface without any of the historical baggage (to make stuff >>> like container/vm pass-through easier, since we'd know that >>> render-node userspace won't ever touch any of the old crap ioctls). I >>> don't think there's ever been a concern about breaking backwards >>> compatibility. >>> >>> The issue here is that radv checks for "can I render", when it wants >>> to check for "can I display". So anything that only renders we can >>> ignore. That leaves the various ddx around, which historically have >>> run as root, or if they run as non-root, logind or similar is giving >>> them the master kms node already in master mode (afaiui at least). >>> That leaves the non-main compositor userspace, of which there's only >>> vkdisplay. I don't see much additional possibilities for regressions >>> to creep in. >>> >>> Emil probably has a better grasp on this, since he's got a bunch of >>> patches to roll out drmIsMaster all over. >>> >>>> I need to dig up the mailing list archive, but I strongly think that one >>>> of the main arguments for this approach was to NOT break existing userspace. >>>> >>>> Even taking into account that we now don't need to deal with UMS and >>>> really really old userspace drivers any more it still feels like a to >>>> high risk going down that route. >>> We change the undefined corners of the uapi all the time to improve >>> the overall ecosystem, and occasionally we end up with >>> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still >>> think the choice here is how exactly amdgpu.ko copes with radv: >>> - the hack proposed here >>> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and >>> let you folks deal with a (likely, not guaranteed) influx "but this >>> works on $any_other_driver" >>> >>> I guess up to you which patch Emil should include when he resubmits >>> the patch for 5.3. We've done both in the past, but generally we're >>> trying to limit the impact of these hacks as much as possible. Hence >>> why I'm still in favour of the first one. >> Well and I will still NAK both approaches. >> >> You can of course go ahead and remove the DRM_AUTH flags from the i915 >> IOCTLs to make libva happy, but please not a general approach which >> touches all DRM drivers. >> >> And from Dave's response I think he is following my argumentation here, >> so I don't see a general approach to be added for 5.3 either. >> >>> "keep tilting windmills" is imo not an option, and I'm happy to handle >>> any other fallout Emil's patch uncovers, if there is anything else - I >>> did sketch this hack here quickly on irc right when we got the report. >> Well this is not about tilting windmills, but rather good maintenance of >> backward compatibility and rejecting changing with potential unforeseen >> consequences. > We (well I) have been doing stuff like this since forever. It's just that > thus far it hasn't been amd that stuck out with the need for a hack, but > other drivers. It's a tradeoff in the end, and a tradeoff we're can do due > to our open source userspace requirement - in case something does go > sideways, we have the full history of anything affected. So you're > proposed we revert all these other changes and cleanups too, because they > could be risky (many turned out to actually be risky, like this one here > too)? Well in general yes, a lot of those changes looked unnecessary risky to me. We of course can't revert those changes now because they are now UAPI as well and it would be risky to modify them. But I think we should have a much stricter approach to modifying exiting UAPI. Additional to that I actually think that the requirement of open source userspace stacks is a mistake. This requirement came just from exactly that approach to try to modify exiting UAPI instead of accepting the fact that UAPI backward compatibility is something you should not mess with. Nobody else in the kernel is having that open source user space requirement and it actually seems that a lot of people see that as just a justification to keep Mesa alive. What you can do is to require unit tests so that everybody is able to test for regressions independent of installed userspace software. > Ok correction: amd has stuck out in the past too, there was some vblank vs > pageflip stuff where we needed to do some pretty clever tricks to both > have the new stricter semantics for everyone, while keeping the amd ddx > happy still. This aint new at all, we fixed up the regressions and moved > on. I'm wasn't much involved into this, so I can't say anything about it. > tldr; You're categorical rejection doesn't make sense to me. > Slightly over the top, but this feels like gut reaction to radv being a > thorn for amd. Well radv is pointing out exactly what's going wrong with AMDs internal development process, so I'm really in favor of that. And I'm certainly in favor or open source projects in general, but as kernel developer I must not favor a closed or open userspace stack running on top of my driver. What I care about is good UAPI design, maintainable code and not re-inventing the wheel to often, but what stack runs on top is a political decision and not a technical one. Regards, Christian. > -Daniel
On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote: > > On 2019-04-18 10:26 a.m., Daniel Vetter wrote: > > > > Ok correction: amd has stuck out in the past too, there was some vblank vs > > pageflip stuff where we needed to do some pretty clever tricks to both > > have the new stricter semantics for everyone, while keeping the amd ddx > > happy still. This aint new at all, we fixed up the regressions and moved > > on. > > That sounds like something I would have been involved in, but I'm not > sure what you mean... > > I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules > for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow > userspace to take advantage of our hardware's capabilities to avoid > delaying a page flip by one refresh cycle in some cases. That's all. That's the one I meant. At least I remember quite a pile of discussions around why i915 gets to define how this stuff works, and amdgpu/radeon having to follow. > > tldr; You're categorical rejection doesn't make sense to me. > > Slightly over the top, but this feels like gut reaction to radv being a > > thorn for amd. > > That's a weird insinuation. If RADV was a "thorn for AMD", surely we'd > be having a party about this regression instead of insisting that it be > fixed thoroughly. It is fixed. The discussion is who we can evolve the uapi for primary nodes in an imo useful way (Emil explained the motivations much better than me), without breaking radv. Christian's take that we flat out never do that kind of (risky, but it's a trade-off) uapi evolution just isn't true. And the radv hack under discussion here also isn't the worst hack we've ever shipped I think in the past to handle the occasional fallout. Some other examples that score pretty high on the awful scale: - legacy context hack for nouvea. Leaves a bunch of root-holes open for nouveau, in the name of backwards compat. Dave just posted a patch to at least close those on modern distros - adfb hack for nouveau 10 bpc rbg formats - the native endian trickery from Gerd Hoffman - the vblank vs flip clarifiation that required new uapi and upgraded amd ddx (really not pretty if we have to somewhat break a performance feature for some users) - ... That's just the ones I remember from the top of my mind, because they all still leave a gross aftertastes behind. This stuff is pretty common. Many of these took a few rounds to get there and catch all the fallout and handle it somehow, and occasionally we just drop it and live with uapi that's a bit ill-defined or inconsistent across drivers or not as useful as we'd like it to be. This is the first time I get an unconditional Nack on such an effort. -Daniel
On 2019-04-18 11:11 a.m., Daniel Vetter wrote: > On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote: >> On 2019-04-18 10:26 a.m., Daniel Vetter wrote: >>> >>> Ok correction: amd has stuck out in the past too, there was some vblank vs >>> pageflip stuff where we needed to do some pretty clever tricks to both >>> have the new stricter semantics for everyone, while keeping the amd ddx >>> happy still. This aint new at all, we fixed up the regressions and moved >>> on. >> >> That sounds like something I would have been involved in, but I'm not >> sure what you mean... >> >> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules >> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow >> userspace to take advantage of our hardware's capabilities to avoid >> delaying a page flip by one refresh cycle in some cases. That's all. > > That's the one I meant. At least I remember quite a pile of > discussions around why i915 gets to define how this stuff works, and > amdgpu/radeon having to follow. > > [...] > > - the vblank vs flip clarifiation that required new uapi and upgraded > amd ddx (really not pretty if we have to somewhat break a performance > feature for some users) You're totally mis-representing that IMO. The amdgpu/radeon kernel drivers were simply broken WRT established UAPI rules, in a way which affected our DDX drivers as well. The kernel driver fixes didn't require any corresponding userspace changes. New UAPI was added to allow taking advantage of hardware capabilities in some cases without breaking other cases. It's a pretty text-book example of how to manage UAPI.
On Thu, Apr 18, 2019 at 10:56 AM Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 18.04.19 um 10:26 schrieb Daniel Vetter: > > On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote: > >> Am 18.04.19 um 08:46 schrieb Daniel Vetter: > >>> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian > >>> <Christian.Koenig@amd.com> wrote: > >>>> Am 17.04.19 um 17:51 schrieb Emil Velikov: > >>>>> Hi guys, > >>>>> > >>>>> On 2019/04/17, Daniel Vetter wrote: > >>>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > >>>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: > >>>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > >>>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: > >>>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > >>>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > >>>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > >>>>>>>>>>>> <Christian.Koenig@amd.com> wrote: > >>>>>>>>>>>> [SNIP] > >>>>>>>>>>> Well, what you guys did here is a serious no-go. > >>>>>>>>>> Not really understanding your reasons for an unconditional nak on all this > >>>>>>>>>> still. > >>>>>>>>> Well, let me summarize: You worked around a problem with libva by > >>>>>>>>> breaking another driver and now proposing a rather ugly and only halve > >>>>>>>>> backed workaround for that driver. > >>>>>>>>> > >>>>>>>>> This is a serious no-go. I've already send out a i915 specific change to > >>>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > >>>>>>>>> the offending patch. > >>>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that > >>>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from > >>>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And > >>>>>>>> there's fundamentally 2 options to do that: > >>>>>>>> > >>>>>>>> - This one here (or anything implementing the same idea), to keep radv > >>>>>>>> working. > >>>>>>>> > >>>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > >>>>>>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the > >>>>>>>> cold. It also doesn't matter whether we get there with revert + lots of > >>>>>>>> patches, or a special hack for amdgpu, in the end amdgpu would be > >>>>>>>> different. Also note that your patch isn't enough, since it doesn't fix > >>>>>>>> the core ioctls. > >>>>>>>> > >>>>>>>> I think from an overall platform pov, the first is the better option. > >>>>>>>> After all we try really hard to avoid driver special cases for these kinds > >>>>>>>> of things. > >>>>>>> Well, I initially thought that radv is doing something unusual or broken, > >>>>>>> but after looking deeper into this that is actually not the case. > >>>>>>> > >>>>>>> What radv does is calling an IOCTL and expecting an error message when it is > >>>>>>> not authenticated. > >>>>>>> > >>>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > >>>>>>> figure out if it is authenticated or not, but I clearly remember that we > >>>>>>> haven't done this from the beginning. > >>>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > >>>>>> First public commit has all the bits: getauth, GetVersion, then the accel > >>>>>> query. > >>>>>> > >>>>>>> Thinking more about this I don't think that this problem is actually amdgpu > >>>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > >>>>>>> but only from those who have the specific issue with libva. > >>>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just > >>>>>> duct-tape over libva. That would be easier to fix in libva. The point was > >>>>>> that we should be able to allow this in general. > >>>>>> > >>>>>> And we can, except for the radv issue uncovered here. > >>>>>> > >>>>>> So please don't get 100% hung up on the libva thing, that wasn't really > >>>>>> the goal, just the initial motivation. And I still thinks it makes for all > >>>>>> drivers. So again we have: > >>>>>> > >>>>>> - radv hack > >>>>>> - make amdgpu behave different from everyone else > >>>>>> - keep tilting windmills about "pls use rendernodes, thx" > >>>>>> > >>>>>> I neither like tilting windmills nor making drivers behave different for > >>>>>> no reaason at all. > >>>>> Allow me to jump-in some emails down the line. > >>>>> > >>>>> First and foremost, sincere apologies for upsetting you Christian. If > >>>>> it's of any consolidation - let me assure you the goal is _not_ to break > >>>>> amdgpu or any other driver. > >>>>> > >>>>> Secondly, I would like to ask you for a list of projects so we can look and > >>>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > >>>>> it. > >>>> That is rather hard to come by, because you would also need to look at > >>>> all previously existing driver stacks. > >>>> > >>>> E.g. with the classic R300 driver for example. > >>>> > >>>>> On the topic of "working around libva" - sadly libva is _not_ the only > >>>>> offender. We also had bugs in mesa and kmscube. > >>>>> > >>>>> Considering those are taken as a prime example of "the right way", it's very > >>>>> likely for the mistakes to be repeated in many other projects. > >>>>> > >>>>> Where the common "fix" seems to be "run as root"... > >>>>> > >>>>> > >>>>> As Daniel pointed out, we could be fighting the windmills or we could have a > >>>>> small, admittedly ugly, workaround for amdgpu. > >>>>> > >>>>> If you don't like that workaround in the driver we could move it to core DRM. > >>>>> > >>>>> In either case, I would like to focus on a pragramic solution which works with > >>>>> both old and new userspace. > >>>> Well, I seriously think the original committed solution could cause a > >>>> lot of problems and the issue with radv is only the tip of the iceberg. > >>>> > >>>> I mean we just have to ask our self why have we created render nodes in > >>>> the first place? The obvious alternative was to just removed the > >>>> authentication restrictions on the primary node which would have been > >>>> way less code and maintenance burden. > >>> The render nodes exist so you can have different unix permissions for > >>> rendering as for displaying stuff. E.g. run the compositor as a > >>> distinct user from all its clients. The other reason was to have a > >>> clean interface without any of the historical baggage (to make stuff > >>> like container/vm pass-through easier, since we'd know that > >>> render-node userspace won't ever touch any of the old crap ioctls). I > >>> don't think there's ever been a concern about breaking backwards > >>> compatibility. > >>> > >>> The issue here is that radv checks for "can I render", when it wants > >>> to check for "can I display". So anything that only renders we can > >>> ignore. That leaves the various ddx around, which historically have > >>> run as root, or if they run as non-root, logind or similar is giving > >>> them the master kms node already in master mode (afaiui at least). > >>> That leaves the non-main compositor userspace, of which there's only > >>> vkdisplay. I don't see much additional possibilities for regressions > >>> to creep in. > >>> > >>> Emil probably has a better grasp on this, since he's got a bunch of > >>> patches to roll out drmIsMaster all over. > >>> > >>>> I need to dig up the mailing list archive, but I strongly think that one > >>>> of the main arguments for this approach was to NOT break existing userspace. > >>>> > >>>> Even taking into account that we now don't need to deal with UMS and > >>>> really really old userspace drivers any more it still feels like a to > >>>> high risk going down that route. > >>> We change the undefined corners of the uapi all the time to improve > >>> the overall ecosystem, and occasionally we end up with > >>> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still > >>> think the choice here is how exactly amdgpu.ko copes with radv: > >>> - the hack proposed here > >>> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and > >>> let you folks deal with a (likely, not guaranteed) influx "but this > >>> works on $any_other_driver" > >>> > >>> I guess up to you which patch Emil should include when he resubmits > >>> the patch for 5.3. We've done both in the past, but generally we're > >>> trying to limit the impact of these hacks as much as possible. Hence > >>> why I'm still in favour of the first one. > >> Well and I will still NAK both approaches. > >> > >> You can of course go ahead and remove the DRM_AUTH flags from the i915 > >> IOCTLs to make libva happy, but please not a general approach which > >> touches all DRM drivers. > >> > >> And from Dave's response I think he is following my argumentation here, > >> so I don't see a general approach to be added for 5.3 either. > >> > >>> "keep tilting windmills" is imo not an option, and I'm happy to handle > >>> any other fallout Emil's patch uncovers, if there is anything else - I > >>> did sketch this hack here quickly on irc right when we got the report. > >> Well this is not about tilting windmills, but rather good maintenance of > >> backward compatibility and rejecting changing with potential unforeseen > >> consequences. > > We (well I) have been doing stuff like this since forever. It's just that > > thus far it hasn't been amd that stuck out with the need for a hack, but > > other drivers. It's a tradeoff in the end, and a tradeoff we're can do due > > to our open source userspace requirement - in case something does go > > sideways, we have the full history of anything affected. So you're > > proposed we revert all these other changes and cleanups too, because they > > could be risky (many turned out to actually be risky, like this one here > > too)? > > Well in general yes, a lot of those changes looked unnecessary risky to me. What else should we do then? - everyone does their own driver private extensions, and we don't bother to have standard interfaces/behaviour across drivers. Entirely defeats the point of having an upstream project imo, might as well ship vendor drivers only then. - we freeze and become irrelevant. - we copypaste drm every time someone comes up with a new way they want to use our stack. We don't have the people for that, plus Linus gets pissed when we abandon the old versions (like you can do with your vendor driver that's not upstream). I agree it's not a great option, but it's the least shitty one. And yes from the point of view of a single driver, almost all these changes look like unecessary risk, since you're not the one who cares about that new use-case. > We of course can't revert those changes now because they are now UAPI as > well and it would be risky to modify them. But I think we should have a > much stricter approach to modifying exiting UAPI. We just recently agreed on requiring some standard unit tests. We have probably the strictest review requirements for new uapi in the entire kernel, bar none (and see below, lots of people think we're misguided with some of those). Not sure what else we can do, while still being able to evolve the uapi. What are you suggesting? From your comments below it sounds like you're suggesting we drop the userspace requirement (dropping the review requirement by a lot) and just copypaste drivers every time we find a mistake in our uapi. -Daniel > Additional to that I actually think that the requirement of open source > userspace stacks is a mistake. > > This requirement came just from exactly that approach to try to modify > exiting UAPI instead of accepting the fact that UAPI backward > compatibility is something you should not mess with. > > Nobody else in the kernel is having that open source user space > requirement and it actually seems that a lot of people see that as just > a justification to keep Mesa alive. > > What you can do is to require unit tests so that everybody is able to > test for regressions independent of installed userspace software. > > > Ok correction: amd has stuck out in the past too, there was some vblank vs > > pageflip stuff where we needed to do some pretty clever tricks to both > > have the new stricter semantics for everyone, while keeping the amd ddx > > happy still. This aint new at all, we fixed up the regressions and moved > > on. > > I'm wasn't much involved into this, so I can't say anything about it. > > > tldr; You're categorical rejection doesn't make sense to me. > > Slightly over the top, but this feels like gut reaction to radv being a > > thorn for amd. > > Well radv is pointing out exactly what's going wrong with AMDs internal > development process, so I'm really in favor of that. > > And I'm certainly in favor or open source projects in general, but as > kernel developer I must not favor a closed or open userspace stack > running on top of my driver. > > What I care about is good UAPI design, maintainable code and not > re-inventing the wheel to often, but what stack runs on top is a > political decision and not a technical one. > > Regards, > Christian. > > > -Daniel >
On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote: > > On 2019-04-18 11:11 a.m., Daniel Vetter wrote: > > On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote: > >> On 2019-04-18 10:26 a.m., Daniel Vetter wrote: > >>> > >>> Ok correction: amd has stuck out in the past too, there was some vblank vs > >>> pageflip stuff where we needed to do some pretty clever tricks to both > >>> have the new stricter semantics for everyone, while keeping the amd ddx > >>> happy still. This aint new at all, we fixed up the regressions and moved > >>> on. > >> > >> That sounds like something I would have been involved in, but I'm not > >> sure what you mean... > >> > >> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules > >> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow > >> userspace to take advantage of our hardware's capabilities to avoid > >> delaying a page flip by one refresh cycle in some cases. That's all. > > > > That's the one I meant. At least I remember quite a pile of > > discussions around why i915 gets to define how this stuff works, and > > amdgpu/radeon having to follow. > > > > [...] > > > > - the vblank vs flip clarifiation that required new uapi and upgraded > > amd ddx (really not pretty if we have to somewhat break a performance > > feature for some users) > > You're totally mis-representing that IMO. > > The amdgpu/radeon kernel drivers were simply broken WRT established UAPI > rules, in a way which affected our DDX drivers as well. The kernel > driver fixes didn't require any corresponding userspace changes. It wasn't ever defined when we merged the page-flip ioctl, and the vblank ioctl was retrofitted/abused to also work with kms. i915 happened to guarantee that a flip immediately after a vblank wait will hit the next vblank, radeon (this stuff is really old) had more wiggle room. And for unthrottled vblank, this mattered since it would allow you to update frames still, even when the rendering was really late. The a lot of people started writing kms compositors, and due to random reasons they mostly developed those on i915. Relying on the i915-specific guarantee between vblank waits and page-flips. Then the entire atomic saga happened, and since I spent way too much time reading other kms drivers already I wanted to standardize a bunch of the uapi grey areas. This was starting with 2014/2015. Stuff I still remember: - dpms behaviour, standardized on what i915 did in it's private modeset code (the crtc helper behaviour had a pile of duct-tape patches, but still ill-defined corner cases). - vblank semantics around crtc on/off (atomic helpers fully relied on drivers calling drm_vblank_on/off, which thus far only i915 did) - and later on vblank vs timestamp, when the nonblocking helpers landed As a result, even more people started relying on this stuff. Eventually someone noticed that amdgpu works differently. There was a pretty heated irc debate between you and me about why exactly amdgpu needs to follow the i915 standard, and why intel gets to set all these standards and a few others things. This was around 2016 or so, per git blame of relevant at least. We ended up adjusting amd drivers to follow i915 semantics, which fixed scheduled flips on lots of compositors, and broke unthrottled flipping (very slightly, but it did) on amd. page_flip_target remedied that. One of the lessons I took from all this that retrofitting uapi is best done across the board, and not just for i915. It's occasionally not possible. > New UAPI was added to allow taking advantage of hardware capabilities in > some cases without breaking other cases. > > It's a pretty text-book example of how to manage UAPI. Well I guess we remember different aspects of that story, but to me it was a textbook case of how retrofitting uapi has challenges. One of the options you proposed was that we'd have a flag to select the i915 behaviour, and leave the default undefined. I think that would have handled the vblank vs pageflip issue in a way Christian is advertising for here. It's actually what we've done, by adding distinct rendernodes. Unfortunate reality is that adoption of those is very slow, and bad examples using primary nodes get copypasted all around still. Iirc I similarly argued in the vblank vs pageflip case that even with an explicit flag, the implicit uapi as defined by what most people run out there wouldn't change, and amgpu/radeon alone wouldn't be enough by far to change that alone. -Daniel
On 2019-04-18 11:46 a.m., Daniel Vetter wrote: > On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote: >> On 2019-04-18 11:11 a.m., Daniel Vetter wrote: >>> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote: >>>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote: >>>>> >>>>> Ok correction: amd has stuck out in the past too, there was some vblank vs >>>>> pageflip stuff where we needed to do some pretty clever tricks to both >>>>> have the new stricter semantics for everyone, while keeping the amd ddx >>>>> happy still. This aint new at all, we fixed up the regressions and moved >>>>> on. >>>> >>>> That sounds like something I would have been involved in, but I'm not >>>> sure what you mean... >>>> >>>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules >>>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow >>>> userspace to take advantage of our hardware's capabilities to avoid >>>> delaying a page flip by one refresh cycle in some cases. That's all. >>> >>> That's the one I meant. At least I remember quite a pile of >>> discussions around why i915 gets to define how this stuff works, and >>> amdgpu/radeon having to follow. >>> >>> [...] >>> >>> - the vblank vs flip clarifiation that required new uapi and upgraded >>> amd ddx (really not pretty if we have to somewhat break a performance >>> feature for some users) >> >> You're totally mis-representing that IMO. >> >> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI >> rules, in a way which affected our DDX drivers as well. The kernel >> driver fixes didn't require any corresponding userspace changes. > > It wasn't ever defined when we merged the page-flip ioctl, and the > vblank ioctl was retrofitted/abused to also work with kms. i915 > happened to guarantee that a flip immediately after a vblank wait will > hit the next vblank, radeon (this stuff is really old) had more wiggle > room. And for unthrottled vblank, this mattered since it would allow > you to update frames still, even when the rendering was really late. > > The a lot of people started writing kms compositors, and due to random > reasons they mostly developed those on i915. Relying on the > i915-specific guarantee between vblank waits and page-flips. > > Then the entire atomic saga happened, and since I spent way too much > time reading other kms drivers already I wanted to standardize a bunch > of the uapi grey areas. This was starting with 2014/2015. Stuff I > still remember: > [...] > - and later on vblank vs timestamp, when the nonblocking helpers landed > > As a result, even more people started relying on this stuff. > Eventually someone noticed that amdgpu works differently. There was a > pretty heated irc debate between you and me about why exactly amdgpu > needs to follow the i915 standard, and why intel gets to set all these > standards and a few others things. This was around 2016 or so, per git > blame of relevant at least. Sounds like I did some knee-jerk ranting on IRC at the time, sorry. > We ended up adjusting amd drivers to follow i915 semantics, which > fixed scheduled flips on lots of compositors, It also fixed them with our DDX drivers, as otherwise there was no way for userspace to prevent page flips from completing one refresh cycle too early. > and broke unthrottled flipping (very slightly, but it did) on amd. > page_flip_target remedied that. That's just a minor optimization. >> New UAPI was added to allow taking advantage of hardware capabilities in >> some cases without breaking other cases. >> >> It's a pretty text-book example of how to manage UAPI. > > Well I guess we remember different aspects of that story, but to me it > was a textbook case of how retrofitting uapi has challenges. I don't agree "retrofitting" is accurate in this case. This was always the intention for the UAPI, the amdgpu/radeon kernel drivers just accidentally didn't implement it correctly. > One of the options you proposed was that we'd have a flag to select the i915 > behaviour, and leave the default undefined. So I made a bad proposal at the time. :) I honestly didn't even remember that. I really don't see that episode having much to do with the current discussion.
On Thu, Apr 18, 2019 at 12:04 PM Michel Dänzer <michel@daenzer.net> wrote: > > On 2019-04-18 11:46 a.m., Daniel Vetter wrote: > > On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel@daenzer.net> wrote: > >> On 2019-04-18 11:11 a.m., Daniel Vetter wrote: > >>> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel@daenzer.net> wrote: > >>>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote: > >>>>> > >>>>> Ok correction: amd has stuck out in the past too, there was some vblank vs > >>>>> pageflip stuff where we needed to do some pretty clever tricks to both > >>>>> have the new stricter semantics for everyone, while keeping the amd ddx > >>>>> happy still. This aint new at all, we fixed up the regressions and moved > >>>>> on. > >>>> > >>>> That sounds like something I would have been involved in, but I'm not > >>>> sure what you mean... > >>>> > >>>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules > >>>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow > >>>> userspace to take advantage of our hardware's capabilities to avoid > >>>> delaying a page flip by one refresh cycle in some cases. That's all. > >>> > >>> That's the one I meant. At least I remember quite a pile of > >>> discussions around why i915 gets to define how this stuff works, and > >>> amdgpu/radeon having to follow. > >>> > >>> [...] > >>> > >>> - the vblank vs flip clarifiation that required new uapi and upgraded > >>> amd ddx (really not pretty if we have to somewhat break a performance > >>> feature for some users) > >> > >> You're totally mis-representing that IMO. > >> > >> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI > >> rules, in a way which affected our DDX drivers as well. The kernel > >> driver fixes didn't require any corresponding userspace changes. > > > > It wasn't ever defined when we merged the page-flip ioctl, and the > > vblank ioctl was retrofitted/abused to also work with kms. i915 > > happened to guarantee that a flip immediately after a vblank wait will > > hit the next vblank, radeon (this stuff is really old) had more wiggle > > room. And for unthrottled vblank, this mattered since it would allow > > you to update frames still, even when the rendering was really late. > > > > The a lot of people started writing kms compositors, and due to random > > reasons they mostly developed those on i915. Relying on the > > i915-specific guarantee between vblank waits and page-flips. > > > > Then the entire atomic saga happened, and since I spent way too much > > time reading other kms drivers already I wanted to standardize a bunch > > of the uapi grey areas. This was starting with 2014/2015. Stuff I > > still remember: > > [...] > > - and later on vblank vs timestamp, when the nonblocking helpers landed > > > > As a result, even more people started relying on this stuff. > > Eventually someone noticed that amdgpu works differently. There was a > > pretty heated irc debate between you and me about why exactly amdgpu > > needs to follow the i915 standard, and why intel gets to set all these > > standards and a few others things. This was around 2016 or so, per git > > blame of relevant at least. > > Sounds like I did some knee-jerk ranting on IRC at the time, sorry. > > > > We ended up adjusting amd drivers to follow i915 semantics, which > > fixed scheduled flips on lots of compositors, > > It also fixed them with our DDX drivers, as otherwise there was no way > for userspace to prevent page flips from completing one refresh cycle > too early. > > > and broke unthrottled flipping (very slightly, but it did) on amd. > > page_flip_target remedied that. > > That's just a minor optimization. > > > >> New UAPI was added to allow taking advantage of hardware capabilities in > >> some cases without breaking other cases. > >> > >> It's a pretty text-book example of how to manage UAPI. > > > > Well I guess we remember different aspects of that story, but to me it > > was a textbook case of how retrofitting uapi has challenges. > > I don't agree "retrofitting" is accurate in this case. This was always > the intention for the UAPI, the amdgpu/radeon kernel drivers just > accidentally didn't implement it correctly. > > > One of the options you proposed was that we'd have a flag to select the i915 > > behaviour, and leave the default undefined. > > So I made a bad proposal at the time. :) I honestly didn't even remember > that. I really don't see that episode having much to do with the current > discussion. Yeah it's definitely a lot less serious uapi retrofitting than what we're discussing here now. But I do see a lot of parallels: - We've implemented rendernodes as a clean/new uapi years ago. git says we've made them the default in 2014. - Lots of work happened to make render nodes the main thing. Despite all that effort (well over 5 years, and a few more, initial render nodes merge was 2013), adoption is fairly lack-luster and defacto uapi is "use primary nodes, cause it works if you're root". We can keep telling everyone they're wrong and that they should use render nodes, or we can give in and make the uapi fit more with what people out there seem to actually use. What I don't want to do is some gradual thing where we slowly change things one driver at a time (like Christian's patch proposes, starting with i915), cause that leads to lots of confusion and eventually a lone driver standing in the rain :-/ Now maybe the right approach is to keep telling everyone they should use render nodes, but I'm not sure another 5 years is really going to move this needle. But it is a tradeoff between doing the right thing from how uapi should evolve, and the pragmatic thing of shrugging and moving on and implementing what seems to be actually in use. And we've done quite a bit of the pragmatic kind of uapi clarifictation/adjusting to actual reality and mostly gotten away with it. This here doesn't seem like a much different case. Anyway, to unstuck this discussion a bit I chatted with Emil on irc, and came up with a new idea for an rfc patch series: - include both options discussed here for handling amdgpu/radv - include the original patch again - include patches for all DRIVER_RENDER drivers to drop DRM_AUTH from them. That should give us a lot more eyes and reviewers to figure out whether there's other place this might break userspace. If that results in lots of negative feedback, we can confidently drop this idea on the floor as a bad one, and will have to keep telling everyone to use rendernodes more. If not, then we should know a lot better whether it will blow up somewhere else than for radv only. And Emil seems happy to type the patches. Cheers, Daniel
On 2019-04-18 12:35 p.m., Daniel Vetter wrote: > > - We've implemented rendernodes as a clean/new uapi years ago. git > says we've made them the default in 2014. > > - Lots of work happened to make render nodes the main thing. Despite > all that effort (well over 5 years, and a few more, initial render > nodes merge was 2013), adoption is fairly lack-luster and defacto uapi > is "use primary nodes, cause it works if you're root". FWIW, xf86-video-amdgpu/ati have been using render nodes for the DRM FDs sent to DRI3 clients since the 1.3/7.9 releases from March 2017. Xwayland will do the same in the next xserver release. A notable holdout is the Xorg modesetting driver.
On Wed, 17 Apr 2019 at 21:49, Dave Airlie <airlied@gmail.com> wrote: > > On Thu, 18 Apr 2019 at 03:30, Koenig, Christian > <Christian.Koenig@amd.com> wrote: > > > > Am 17.04.19 um 17:51 schrieb Emil Velikov: > > > Hi guys, > > > > > > On 2019/04/17, Daniel Vetter wrote: > > >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote: > > >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter: > > >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote: > > >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter: > > >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote: > > >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter: > > >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian > > >>>>>>>> <Christian.Koenig@amd.com> wrote: > > >>>>>>>> [SNIP] > > >>>>>>> Well, what you guys did here is a serious no-go. > > >>>>>> Not really understanding your reasons for an unconditional nak on all this > > >>>>>> still. > > >>>>> Well, let me summarize: You worked around a problem with libva by > > >>>>> breaking another driver and now proposing a rather ugly and only halve > > >>>>> backed workaround for that driver. > > >>>>> > > >>>>> This is a serious no-go. I've already send out a i915 specific change to > > >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert > > >>>>> the offending patch. > > >>>> Oh I'm totally fine with the revert if that's what we go with. But that > > >>>> still leaves the issue that it would make sense to drop DRM_AUTH from > > >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And > > >>>> there's fundamentally 2 options to do that: > > >>>> > > >>>> - This one here (or anything implementing the same idea), to keep radv > > >>>> working. > > >>>> > > >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How > > >>>> exactly that's doen doesn't matter, but it leaves amdgpu out in the > > >>>> cold. It also doesn't matter whether we get there with revert + lots of > > >>>> patches, or a special hack for amdgpu, in the end amdgpu would be > > >>>> different. Also note that your patch isn't enough, since it doesn't fix > > >>>> the core ioctls. > > >>>> > > >>>> I think from an overall platform pov, the first is the better option. > > >>>> After all we try really hard to avoid driver special cases for these kinds > > >>>> of things. > > >>> Well, I initially thought that radv is doing something unusual or broken, > > >>> but after looking deeper into this that is actually not the case. > > >>> > > >>> What radv does is calling an IOCTL and expecting an error message when it is > > >>> not authenticated. > > >>> > > >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to > > >>> figure out if it is authenticated or not, but I clearly remember that we > > >>> haven't done this from the beginning. > > >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code? > > >> First public commit has all the bits: getauth, GetVersion, then the accel > > >> query. > > >> > > >>> Thinking more about this I don't think that this problem is actually amdgpu > > >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all, > > >>> but only from those who have the specific issue with libva. > > >> libva was the initial motivation, the goal of Emil's patch wasn't to just > > >> duct-tape over libva. That would be easier to fix in libva. The point was > > >> that we should be able to allow this in general. > > >> > > >> And we can, except for the radv issue uncovered here. > > >> > > >> So please don't get 100% hung up on the libva thing, that wasn't really > > >> the goal, just the initial motivation. And I still thinks it makes for all > > >> drivers. So again we have: > > >> > > >> - radv hack > > >> - make amdgpu behave different from everyone else > > >> - keep tilting windmills about "pls use rendernodes, thx" > > >> > > >> I neither like tilting windmills nor making drivers behave different for > > >> no reaason at all. > > > Allow me to jump-in some emails down the line. > > > > > > First and foremost, sincere apologies for upsetting you Christian. If > > > it's of any consolidation - let me assure you the goal is _not_ to break > > > amdgpu or any other driver. > > > > > > Secondly, I would like to ask you for a list of projects so we can look and > > > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into > > > it. > > > > That is rather hard to come by, because you would also need to look at > > all previously existing driver stacks. > > > > E.g. with the classic R300 driver for example. > > > > > On the topic of "working around libva" - sadly libva is _not_ the only > > > offender. We also had bugs in mesa and kmscube. > > > > > > Considering those are taken as a prime example of "the right way", it's very > > > likely for the mistakes to be repeated in many other projects. > > > > > > Where the common "fix" seems to be "run as root"... > > > > > > > > > As Daniel pointed out, we could be fighting the windmills or we could have a > > > small, admittedly ugly, workaround for amdgpu. > > > > > > If you don't like that workaround in the driver we could move it to core DRM. > > > > > > In either case, I would like to focus on a pragramic solution which works with > > > both old and new userspace. > > > > Well, I seriously think the original committed solution could cause a > > lot of problems and the issue with radv is only the tip of the iceberg. > > > > I mean we just have to ask our self why have we created render nodes in > > the first place? The obvious alternative was to just removed the > > authentication restrictions on the primary node which would have been > > way less code and maintenance burden. > > > > I need to dig up the mailing list archive, but I strongly think that one > > of the main arguments for this approach was to NOT break existing userspace. > > > > Even taking into account that we now don't need to deal with UMS and > > really really old userspace drivers any more it still feels like a to > > high risk going down that route. > > I'm going to revert the original patch in drm-next now. We've been > getting a bit lax lately on the regressions need to get reverted no > matter what rule, and we are getting close to rc6 and it's Easter, so > I don't want to have to care about this, forget about it, remember it > again, end up at 5.2. > Makes sense, thanks Dave. Feel free to add the following to the revert: Acked-by: Emil Velikov <emil.velikov@collabora.com> > Personally I'm rather in favour of cleaning up the driver ioctls since > I don't like older drivers suddenly having a new ABI without > consideration of side effects. > Fwiw I think we're overloading the meaning of ABI - this is a functional change. Which admittedly I'll have a loser and harder look that it doesn't break things ;-) Thanks Emil
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 8d0d7f3dd5fb..e67bfee8d411 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -409,6 +409,9 @@ struct amdgpu_fpriv { struct mutex bo_list_lock; struct idr bo_list_handles; struct amdgpu_ctx_mgr ctx_mgr; + + /* Part of a backwards compatibility hack */ + bool is_first_ioctl; }; int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 7419ea8a388b..cd438afa7092 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -1128,6 +1128,7 @@ long amdgpu_drm_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct drm_file *file_priv = filp->private_data; + struct amdgpu_fpriv *fpriv = file_priv->driver_priv; struct drm_device *dev; long ret; dev = file_priv->minor->dev; @@ -1136,6 +1137,7 @@ long amdgpu_drm_ioctl(struct file *filp, return ret; ret = drm_ioctl(filp, cmd, arg); + fpriv->is_first_ioctl = false; pm_runtime_mark_last_busy(dev->dev); pm_runtime_put_autosuspend(dev->dev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index e860412043bb..00c0a2fb2a69 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -455,6 +455,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) { struct amdgpu_device *adev = dev->dev_private; + struct amdgpu_fpriv *fpriv = filp->driver_priv; struct drm_amdgpu_info *info = data; struct amdgpu_mode_info *minfo = &adev->mode_info; void __user *out = (void __user *)(uintptr_t)info->return_pointer; @@ -470,6 +471,26 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file switch (info->query) { case AMDGPU_INFO_ACCEL_WORKING: + /* HACK: radv has a fragile open-coded check for drm master + * The pattern for the call is: + * open(DRM_NODE_PRIMARY) + * drmModeCommandWrite( query=AMDGPU_INFO_ACCEL_WORKING ) + * + * After commit 8059add04 this check stopped working due to + * operations on a primary node from non-master clients becoming + * more permissive. + * + * This backwards compatibility hack targets the calling pattern + * above to fail radv from thinking it has master access to the + * display system ( without reverting 8059add04 ). + * + * Users of libdrm will not be affected as some other ioctls are + * performed between open() and the AMDGPU_INFO_ACCEL_WORKING + * query. + */ + if (fpriv->is_first_ioctl && !filp->is_master) + return -EACCES; + ui32 = adev->accel_working; return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0; case AMDGPU_INFO_CRTC_FROM_ID: @@ -987,6 +1008,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) goto error_vm; } + fpriv->is_first_ioctl = true; mutex_init(&fpriv->bo_list_lock); idr_init(&fpriv->bo_list_handles);