diff mbox

[git,pull] drm fixes

Message ID CA+55aFz94M0TVdf+jbKs-AHNJq5mZW8oo4pUOAqHS8qb_Cz9sA@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Linus Torvalds March 2, 2015, 1:59 a.m. UTC
On Sun, Mar 1, 2015 at 1:00 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Back to the drawing board.

Ok, many hours later, but I found it.

The bisection was a disaster, having to work around other bugs in this
area, but it ended up getting "close enough" that I figured out what
went wrong.

The "intel_plane_duplicate_state()" is horribly horribly buggy. It
looks at the state->fb pointer, but it may have been free'd already.

This workaround "works for me", but it's really still very
questionable, because while the "kref_get_unless_zero()" works
correctly when the last reference has been dropped, I'm not sure that
there is any guarantee that the whole allocation even exists any more,
so I think the *correct* thing to do would be to clear state->fb when
dropping the kref. But this was the smallest working patch I could
come up with. Somebody who actually knows the code should start
looking at the places that do drm_framebuffer_unreference(), and
actually clear that pointer instead.

Added Matt Roper and Ander Conselvan de Oliveira to the discussion,
since they are the ones git says are involved with the original broken
intel_plane_duplicate_state().

Anyway, attached is

 (a) the patch with a big comment

 (b) the warnings I get on that machine that show where this problem
triggers (and another warning earlier).

Comments? I'm sure this probably only triggers with *old* X servers
that don't do all the modern dri stuff.

                               Linus

Comments

Daniel Vetter March 2, 2015, 9:04 a.m. UTC | #1
On Sun, Mar 01, 2015 at 05:59:53PM -0800, Linus Torvalds wrote:
> On Sun, Mar 1, 2015 at 1:00 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Back to the drawing board.
> 
> Ok, many hours later, but I found it.
> 
> The bisection was a disaster, having to work around other bugs in this
> area, but it ended up getting "close enough" that I figured out what
> went wrong.

Sorry for all the bisect work. Ville dug as far as the load detect being
unhappy due to atomic last week. But since I general don't real mail on
w/e I've only seen this thread now.

> The "intel_plane_duplicate_state()" is horribly horribly buggy. It
> looks at the state->fb pointer, but it may have been free'd already.
> 
> This workaround "works for me", but it's really still very
> questionable, because while the "kref_get_unless_zero()" works
> correctly when the last reference has been dropped, I'm not sure that
> there is any guarantee that the whole allocation even exists any more,
> so I think the *correct* thing to do would be to clear state->fb when
> dropping the kref. But this was the smallest working patch I could
> come up with. Somebody who actually knows the code should start
> looking at the places that do drm_framebuffer_unreference(), and
> actually clear that pointer instead.
> 
> Added Matt Roper and Ander Conselvan de Oliveira to the discussion,
> since they are the ones git says are involved with the original broken
> intel_plane_duplicate_state().
> 
> Anyway, attached is
> 
>  (a) the patch with a big comment
> 
>  (b) the warnings I get on that machine that show where this problem
> triggers (and another warning earlier).
> 
> Comments? I'm sure this probably only triggers with *old* X servers
> that don't do all the modern dri stuff.

It's not old X servers but old machines which don't have hotplug
interrupts (or still have tv-out, which doesn't have hpd support
anywhere).

I'll dig into this now just fyi the rules about how plane->state should be
handled:
- you need plane->lock
- it's invariant once assigned, updates should only be done with a
  duplicate_state(), update state you want to change and the going through
  the atomic commit machinery
- drm_plane_state->fb should always be a full reference

But switching to atomic amounts to a full rewrite of the drm core and
drivers (semantics for modeset updates are totally different). So there's
lots of glue and ducttape going on to keep up the illusion for both old
code and partially transitioned driver. It's all a bit wobbly atm and
don't loook too closely at some of the hacks we have.

And can you please attach a bactrace of the WARN in your patch, just to
double-check you blow up at the same spot?

Thanks, Daniel

>                                Linus

> From c182b15c3abee75cdc9d9564b6ab826403690f4e Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@localhost.localdomain>
> Date: Sat, 28 Feb 2015 21:44:48 -0800
> Subject: [PATCH] Workaround for drm bug
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> ---
>  drivers/gpu/drm/i915/intel_atomic_plane.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_atomic_plane.c b/drivers/gpu/drm/i915/intel_atomic_plane.c
> index 9e6f727..72714d3 100644
> --- a/drivers/gpu/drm/i915/intel_atomic_plane.c
> +++ b/drivers/gpu/drm/i915/intel_atomic_plane.c
> @@ -85,8 +85,23 @@ intel_plane_duplicate_state(struct drm_plane *plane)
>  		return NULL;
>  
>  	state = &intel_state->base;
> -	if (state->fb)
> -		drm_framebuffer_reference(state->fb);
> +
> +	/*
> +	 * We cannot do drm_framebuffer_reference(), because the reference
> +	 * may already have been dropped.
> +	 *
> +	 * So we do what drm_framebuffer_lookup() does, namely do a
> +	 * kref_get_unless_zero(). Even that is somewhat questionable,
> +	 * in that maybe the 'fb' already got free'd. So warn loudly
> +	 * about this.
> +	 *
> +	 * Maybe the base.fb should be cleared by whatever drops the
> +	 * reference?
> +	 */
> +	if (state->fb && !kref_get_unless_zero(&state->fb->refcount)) {
> +		state->fb = NULL;
> +		WARN_ONCE(1, "intel_plane_duplicate_state got plane with dead frame buffer");
> +	}
>  
>  	return state;
>  }
> -- 
> 2.3.1.167.g7f4ba4b
> 


> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
Linus Torvalds March 2, 2015, 4:53 p.m. UTC | #2
On Mon, Mar 2, 2015 at 1:04 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> And can you please attach a bactrace of the WARN in your patch, just to
> double-check you blow up at the same spot?

So the dmesg I attached had a backtrace for the new WARN_ONCE() (in
addition to an unrelated(?) one from i915_gem_free_object()).

Or did you mean a backtrace of the oops when things go wrong, when my
patch is *not* applied? My first email had that with the kref.h
warning from drm_framebuffer_reference, which is otherwise the same
thing.

And after things go wrong, and the plane handling thing uses a
framebuffer that has already been freed, then the resulting oopses end
up being kind of random. Which was partly why it ended up beign so
hard to finally bisect, and I actually eventually gave up on it -
because sometimes things would just happen to work, sometimes things
would blow up and oops when X started (resulting in just a dead
machine), sometimes things would oops at bootup.

The most common oops was something going wrong when the framebuffer
was free'd for the second time, and it would cause a NULL pointer
dereference in drm_framebuffer_free() or one of the routines it called
(so often a NULL pointer dereference in
"mutex_lock(&dev->mode_config.fb_lock)" because "dev" was NULL, or the
call to "fb->funcs->destroy(fb)" would oops because "fb->funcs" was
NULL.

                         Linus
Daniel Vetter March 2, 2015, 5:23 p.m. UTC | #3
On Mon, Mar 2, 2015 at 5:53 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 2, 2015 at 1:04 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> And can you please attach a bactrace of the WARN in your patch, just to
>> double-check you blow up at the same spot?
>
> So the dmesg I attached had a backtrace for the new WARN_ONCE() (in
> addition to an unrelated(?) one from i915_gem_free_object()).
>
> Or did you mean a backtrace of the oops when things go wrong, when my
> patch is *not* applied? My first email had that with the kref.h
> warning from drm_framebuffer_reference, which is otherwise the same
> thing.

I've mixed things up with the other reporter which was full of the
subsequent oopses. But after I've sorted out why drm-intel-next
doesn't blow up the same way I see the bug now. Still baffled that we
underrun the refcount apparently since the same pile of legacy code +
atomic glue is used for the old modeset ioctl. But obviously something
is different, so still digging.

The gem_free_object backtrace is a completely unrelated issue. Fix for
that is in drm-intel-fixes and on the way to you:

commit 62e537f8d568347bbe4e00d7803a838750cdc618
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Tue Feb 24 13:37:54 2015 -0800

    drm/i915: Fix frontbuffer false positve.

If that one doesn't help please scream ;-)
-Daniel
diff mbox

Patch

From c182b15c3abee75cdc9d9564b6ab826403690f4e Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@localhost.localdomain>
Date: Sat, 28 Feb 2015 21:44:48 -0800
Subject: [PATCH] Workaround for drm bug

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

---
 drivers/gpu/drm/i915/intel_atomic_plane.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_atomic_plane.c b/drivers/gpu/drm/i915/intel_atomic_plane.c
index 9e6f727..72714d3 100644
--- a/drivers/gpu/drm/i915/intel_atomic_plane.c
+++ b/drivers/gpu/drm/i915/intel_atomic_plane.c
@@ -85,8 +85,23 @@  intel_plane_duplicate_state(struct drm_plane *plane)
 		return NULL;
 
 	state = &intel_state->base;
-	if (state->fb)
-		drm_framebuffer_reference(state->fb);
+
+	/*
+	 * We cannot do drm_framebuffer_reference(), because the reference
+	 * may already have been dropped.
+	 *
+	 * So we do what drm_framebuffer_lookup() does, namely do a
+	 * kref_get_unless_zero(). Even that is somewhat questionable,
+	 * in that maybe the 'fb' already got free'd. So warn loudly
+	 * about this.
+	 *
+	 * Maybe the base.fb should be cleared by whatever drops the
+	 * reference?
+	 */
+	if (state->fb && !kref_get_unless_zero(&state->fb->refcount)) {
+		state->fb = NULL;
+		WARN_ONCE(1, "intel_plane_duplicate_state got plane with dead frame buffer");
+	}
 
 	return state;
 }
-- 
2.3.1.167.g7f4ba4b