Message ID | 20170714191439.31169-1-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Daniel Vetter (2017-07-14 20:14:37) > The legacy plane->fb pointer is refcounted by calling > drm_atomic_clean_old_fb(). > > In practice this isn't a real problem because: > - The caller in the i915 gpu reset code restores the original state > again, which means the plane->fb pointer won't change, hence can't > leak. > - Drivers using drm_atomic_helper_shutdown call the fbdev cleanup > first, and that usually cleans up the fb through > drm_remove_framebuffer, which does this correctly. > - Without fbdev the only framebuffers are from userspace, and those > get cleaned up (again using drm_remove_framebuffer) befor the driver > can even be unloaded. > > But in i915 I've switched the cleanup sequence around so that the > _shutdown() calls happens after the drm_remove_framebuffer(), which is > how I discovered this issue. If only we have refcounted fb now and didn't need every caller manually tracking the old pointers. </s> Given the requirement for the caller to cleanup old_fb around drm_atomic_commit(), this looks correct to me. > Cc: martin.peres@free.fr > Cc: chris@chris-wilson.co.uk > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> -Chris
Quoting Patchwork (2017-07-14 21:18:04) > == Series Details == > > Series: series starting with [1/3] drm/atomic-helper: Fix leak in disable_all > URL : https://patchwork.freedesktop.org/series/27327/ > State : failure > > == Summary == > > Series 27327v1 Series without cover letter > https://patchwork.freedesktop.org/api/1.0/series/27327/revisions/1/mbox/ > > Test gem_exec_fence: > Subgroup await-hang-default: > pass -> INCOMPLETE (fi-blb-e6850) > pass -> INCOMPLETE (fi-pnv-d510) > Test gem_exec_flush: > Subgroup basic-batch-kernel-default-uc: > fail -> PASS (fi-snb-2600) fdo#100007 > Test kms_busy: > Subgroup basic-flip-default-a: > pass -> INCOMPLETE (fi-ilk-650) > pass -> INCOMPLETE (fi-snb-2520m) > pass -> INCOMPLETE (fi-snb-2600) > pass -> INCOMPLETE (fi-ivb-3520m) > pass -> INCOMPLETE (fi-ivb-3770) > pass -> INCOMPLETE (fi-byt-j1900) > pass -> INCOMPLETE (fi-byt-n2820) > pass -> INCOMPLETE (fi-hsw-4770) > pass -> INCOMPLETE (fi-hsw-4770r) > pass -> INCOMPLETE (fi-bdw-5557u) > pass -> INCOMPLETE (fi-bdw-gvtdvm) > pass -> INCOMPLETE (fi-skl-6260u) > pass -> INCOMPLETE (fi-skl-6700hq) fdo#101144 > pass -> INCOMPLETE (fi-skl-6700k) > pass -> INCOMPLETE (fi-skl-6770hq) > pass -> INCOMPLETE (fi-skl-gvtdvm) > none -> INCOMPLETE (fi-skl-x1585l) > pass -> INCOMPLETE (fi-bxt-j4205) > pass -> INCOMPLETE (fi-kbl-7500u) > pass -> INCOMPLETE (fi-kbl-7560u) > pass -> INCOMPLETE (fi-kbl-r) > pass -> INCOMPLETE (fi-glk-2a) > Subgroup basic-flip-default-c: > pass -> INCOMPLETE (fi-bsw-n3050) Looks like we decremented a refcount once too often (after it was freed, poison 0x6b--). That clean up plane->old_fb? -Chris
On Fri, Jul 14, 2017 at 09:32:13PM +0100, Chris Wilson wrote: > Quoting Patchwork (2017-07-14 21:18:04) > > == Series Details == > > > > Series: series starting with [1/3] drm/atomic-helper: Fix leak in disable_all > > URL : https://patchwork.freedesktop.org/series/27327/ > > State : failure > > > > == Summary == > > > > Series 27327v1 Series without cover letter > > https://patchwork.freedesktop.org/api/1.0/series/27327/revisions/1/mbox/ > > > > Test gem_exec_fence: > > Subgroup await-hang-default: > > pass -> INCOMPLETE (fi-blb-e6850) > > pass -> INCOMPLETE (fi-pnv-d510) > > Test gem_exec_flush: > > Subgroup basic-batch-kernel-default-uc: > > fail -> PASS (fi-snb-2600) fdo#100007 > > Test kms_busy: > > Subgroup basic-flip-default-a: > > pass -> INCOMPLETE (fi-ilk-650) > > pass -> INCOMPLETE (fi-snb-2520m) > > pass -> INCOMPLETE (fi-snb-2600) > > pass -> INCOMPLETE (fi-ivb-3520m) > > pass -> INCOMPLETE (fi-ivb-3770) > > pass -> INCOMPLETE (fi-byt-j1900) > > pass -> INCOMPLETE (fi-byt-n2820) > > pass -> INCOMPLETE (fi-hsw-4770) > > pass -> INCOMPLETE (fi-hsw-4770r) > > pass -> INCOMPLETE (fi-bdw-5557u) > > pass -> INCOMPLETE (fi-bdw-gvtdvm) > > pass -> INCOMPLETE (fi-skl-6260u) > > pass -> INCOMPLETE (fi-skl-6700hq) fdo#101144 > > pass -> INCOMPLETE (fi-skl-6700k) > > pass -> INCOMPLETE (fi-skl-6770hq) > > pass -> INCOMPLETE (fi-skl-gvtdvm) > > none -> INCOMPLETE (fi-skl-x1585l) > > pass -> INCOMPLETE (fi-bxt-j4205) > > pass -> INCOMPLETE (fi-kbl-7500u) > > pass -> INCOMPLETE (fi-kbl-7560u) > > pass -> INCOMPLETE (fi-kbl-r) > > pass -> INCOMPLETE (fi-glk-2a) > > Subgroup basic-flip-default-c: > > pass -> INCOMPLETE (fi-bsw-n3050) > > Looks like we decremented a refcount once too often (after it was freed, > poison 0x6b--). That clean up plane->old_fb? I can't even run that testcase without my gpu reset fixes, epic win :( I'll assemble an even bigger pile and see what happens. -Daniel
Quoting Daniel Vetter (2017-07-14 22:03:05) > On Fri, Jul 14, 2017 at 09:32:13PM +0100, Chris Wilson wrote: > > Quoting Patchwork (2017-07-14 21:18:04) > > > == Series Details == > > > > > > Series: series starting with [1/3] drm/atomic-helper: Fix leak in disable_all > > > URL : https://patchwork.freedesktop.org/series/27327/ > > > State : failure > > > > > > == Summary == > > > > > > Series 27327v1 Series without cover letter > > > https://patchwork.freedesktop.org/api/1.0/series/27327/revisions/1/mbox/ > > > > > > Test gem_exec_fence: > > > Subgroup await-hang-default: > > > pass -> INCOMPLETE (fi-blb-e6850) > > > pass -> INCOMPLETE (fi-pnv-d510) > > > Test gem_exec_flush: > > > Subgroup basic-batch-kernel-default-uc: > > > fail -> PASS (fi-snb-2600) fdo#100007 > > > Test kms_busy: > > > Subgroup basic-flip-default-a: > > > pass -> INCOMPLETE (fi-ilk-650) > > > pass -> INCOMPLETE (fi-snb-2520m) > > > pass -> INCOMPLETE (fi-snb-2600) > > > pass -> INCOMPLETE (fi-ivb-3520m) > > > pass -> INCOMPLETE (fi-ivb-3770) > > > pass -> INCOMPLETE (fi-byt-j1900) > > > pass -> INCOMPLETE (fi-byt-n2820) > > > pass -> INCOMPLETE (fi-hsw-4770) > > > pass -> INCOMPLETE (fi-hsw-4770r) > > > pass -> INCOMPLETE (fi-bdw-5557u) > > > pass -> INCOMPLETE (fi-bdw-gvtdvm) > > > pass -> INCOMPLETE (fi-skl-6260u) > > > pass -> INCOMPLETE (fi-skl-6700hq) fdo#101144 > > > pass -> INCOMPLETE (fi-skl-6700k) > > > pass -> INCOMPLETE (fi-skl-6770hq) > > > pass -> INCOMPLETE (fi-skl-gvtdvm) > > > none -> INCOMPLETE (fi-skl-x1585l) > > > pass -> INCOMPLETE (fi-bxt-j4205) > > > pass -> INCOMPLETE (fi-kbl-7500u) > > > pass -> INCOMPLETE (fi-kbl-7560u) > > > pass -> INCOMPLETE (fi-kbl-r) > > > pass -> INCOMPLETE (fi-glk-2a) > > > Subgroup basic-flip-default-c: > > > pass -> INCOMPLETE (fi-bsw-n3050) > > > > Looks like we decremented a refcount once too often (after it was freed, > > poison 0x6b--). That clean up plane->old_fb? > > I can't even run that testcase without my gpu reset fixes, epic win :( It's hard to tell, but considering that the error it is detecting is slab corruption, it is more likely an earlier test that is doing the dirty deed. This is kasan territory. Tomi! -Chris
On Fri, Jul 14, 2017 at 10:17:52PM +0100, Chris Wilson wrote: > Quoting Daniel Vetter (2017-07-14 22:03:05) > > On Fri, Jul 14, 2017 at 09:32:13PM +0100, Chris Wilson wrote: > > > Quoting Patchwork (2017-07-14 21:18:04) > > > > == Series Details == > > > > > > > > Series: series starting with [1/3] drm/atomic-helper: Fix leak in disable_all > > > > URL : https://patchwork.freedesktop.org/series/27327/ > > > > State : failure > > > > > > > > == Summary == > > > > > > > > Series 27327v1 Series without cover letter > > > > https://patchwork.freedesktop.org/api/1.0/series/27327/revisions/1/mbox/ > > > > > > > > Test gem_exec_fence: > > > > Subgroup await-hang-default: > > > > pass -> INCOMPLETE (fi-blb-e6850) > > > > pass -> INCOMPLETE (fi-pnv-d510) > > > > Test gem_exec_flush: > > > > Subgroup basic-batch-kernel-default-uc: > > > > fail -> PASS (fi-snb-2600) fdo#100007 > > > > Test kms_busy: > > > > Subgroup basic-flip-default-a: > > > > pass -> INCOMPLETE (fi-ilk-650) > > > > pass -> INCOMPLETE (fi-snb-2520m) > > > > pass -> INCOMPLETE (fi-snb-2600) > > > > pass -> INCOMPLETE (fi-ivb-3520m) > > > > pass -> INCOMPLETE (fi-ivb-3770) > > > > pass -> INCOMPLETE (fi-byt-j1900) > > > > pass -> INCOMPLETE (fi-byt-n2820) > > > > pass -> INCOMPLETE (fi-hsw-4770) > > > > pass -> INCOMPLETE (fi-hsw-4770r) > > > > pass -> INCOMPLETE (fi-bdw-5557u) > > > > pass -> INCOMPLETE (fi-bdw-gvtdvm) > > > > pass -> INCOMPLETE (fi-skl-6260u) > > > > pass -> INCOMPLETE (fi-skl-6700hq) fdo#101144 > > > > pass -> INCOMPLETE (fi-skl-6700k) > > > > pass -> INCOMPLETE (fi-skl-6770hq) > > > > pass -> INCOMPLETE (fi-skl-gvtdvm) > > > > none -> INCOMPLETE (fi-skl-x1585l) > > > > pass -> INCOMPLETE (fi-bxt-j4205) > > > > pass -> INCOMPLETE (fi-kbl-7500u) > > > > pass -> INCOMPLETE (fi-kbl-7560u) > > > > pass -> INCOMPLETE (fi-kbl-r) > > > > pass -> INCOMPLETE (fi-glk-2a) > > > > Subgroup basic-flip-default-c: > > > > pass -> INCOMPLETE (fi-bsw-n3050) > > > > > > Looks like we decremented a refcount once too often (after it was freed, > > > poison 0x6b--). That clean up plane->old_fb? > > > > I can't even run that testcase without my gpu reset fixes, epic win :( > > It's hard to tell, but considering that the error it is detecting is slab > corruption, it is more likely an earlier test that is doing the dirty > deed. This is kasan territory. Tomi! I did fire up my full pile of stuff, and it indeed goes somehow horribly wrong in the modeset path for the reset code. Looks like my innocently-looking change to disable_all (we call that somewhere in there) to properly refcount ->fb wasn't all that innocent :( It's getting late, I'll try to figure out what's going tomorrow or on Mon. Promises to be another intriguing wild goose chase at least. kasan didn't say anything special, but somehow the atomic commit machinery died completely. -Daniel
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index b07fc30372d3..71b5f71e61e0 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -2726,6 +2726,7 @@ int drm_atomic_helper_disable_all(struct drm_device *dev, struct drm_plane *plane; struct drm_crtc_state *crtc_state; struct drm_crtc *crtc; + unsigned plane_mask = 0; int ret, i; state = drm_atomic_state_alloc(dev); @@ -2768,10 +2769,14 @@ int drm_atomic_helper_disable_all(struct drm_device *dev, goto free; drm_atomic_set_fb_for_plane(plane_state, NULL); + plane_mask |= BIT(drm_plane_index(plane)); + plane->old_fb = plane->fb; } ret = drm_atomic_commit(state); free: + if (plane_mask) + drm_atomic_clean_old_fb(dev, plane_mask, ret); drm_atomic_state_put(state); return ret; }
The legacy plane->fb pointer is refcounted by calling drm_atomic_clean_old_fb(). In practice this isn't a real problem because: - The caller in the i915 gpu reset code restores the original state again, which means the plane->fb pointer won't change, hence can't leak. - Drivers using drm_atomic_helper_shutdown call the fbdev cleanup first, and that usually cleans up the fb through drm_remove_framebuffer, which does this correctly. - Without fbdev the only framebuffers are from userspace, and those get cleaned up (again using drm_remove_framebuffer) befor the driver can even be unloaded. But in i915 I've switched the cleanup sequence around so that the _shutdown() calls happens after the drm_remove_framebuffer(), which is how I discovered this issue. Cc: martin.peres@free.fr Cc: chris@chris-wilson.co.uk Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- drivers/gpu/drm/drm_atomic_helper.c | 5 +++++ 1 file changed, 5 insertions(+)