diff mbox

[01/12] drm/i915: Assert mutex_is_locked on context lookup

Message ID 1366784140-2670-2-git-send-email-ben@bwidawsk.net (mailing list archive)
State New, archived
Headers show

Commit Message

Ben Widawsky April 24, 2013, 6:15 a.m. UTC
Because our context refcounting doesn't grab a ref at lookup time, it is
unsafe to do so without the lock.

NOTE: We don't have an easy way to put the assertion in the lookup
function which is where this really belongs. Context switching is good
enough because it actually asserts even more correctness by protecting
the default_context.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Jesse Barnes May 2, 2013, 8:27 p.m. UTC | #1
On Tue, 23 Apr 2013 23:15:29 -0700
Ben Widawsky <ben@bwidawsk.net> wrote:

> Because our context refcounting doesn't grab a ref at lookup time, it is
> unsafe to do so without the lock.
> 
> NOTE: We don't have an easy way to put the assertion in the lookup
> function which is where this really belongs. Context switching is good
> enough because it actually asserts even more correctness by protecting
> the default_context.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index a1e8ecb..411ace0 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring,
>  	if (dev_priv->hw_contexts_disabled)
>  		return 0;
>  
> +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> +
>  	if (ring != &dev_priv->ring[RCS])
>  		return 0;
>  

Simple enough.

Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>

We usually do WARN_ONs for this stuff though, in case a user actually
does hit it, it may not be fatal so why crash the machine?

But that's a minor distinction since we shouldn't hit this except in
development anyway.
Daniel Vetter May 6, 2013, 9:40 a.m. UTC | #2
On Thu, May 02, 2013 at 01:27:32PM -0700, Jesse Barnes wrote:
> On Tue, 23 Apr 2013 23:15:29 -0700
> Ben Widawsky <ben@bwidawsk.net> wrote:
> 
> > Because our context refcounting doesn't grab a ref at lookup time, it is
> > unsafe to do so without the lock.
> > 
> > NOTE: We don't have an easy way to put the assertion in the lookup
> > function which is where this really belongs. Context switching is good
> > enough because it actually asserts even more correctness by protecting
> > the default_context.
> > 
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index a1e8ecb..411ace0 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring,
> >  	if (dev_priv->hw_contexts_disabled)
> >  		return 0;
> >  
> > +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> > +
> >  	if (ring != &dev_priv->ring[RCS])
> >  		return 0;
> >  
> 
> Simple enough.
> 
> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> 
> We usually do WARN_ONs for this stuff though, in case a user actually
> does hit it, it may not be fatal so why crash the machine?
> 
> But that's a minor distinction since we shouldn't hit this except in
> development anyway.

Well, since this is a patch for upstream the focus should very much be on
supporting bug reporters and not developers. And for bug reporters a BUG
is much more annoying than a WARN and greatly reduces the chances that
we'll get a bug report.

There are imo only very few cases where a BUG instead of a WARN is
justified:
- The kernel is _guaranteed_ to oops in the next few lines anyway, so a
  BUG_ON will help in readability of the backtrace. Note that checking 3
  different things for non-NULL in the same BUG actually reduces OOPS
  readability (with an oops you can at least reconstruct the faulting
  address and so probably the pointer). Also, this means the BUG should
  have a neat description of what exactly blew up.

  The "a few lines" part is just a guideline with some big exceptions. A
  prime example is refcount over/underflows since those will blow up, but
  only sometimes later (and usually no one will have a clue why).

- BUGs are justified if there's a potential security hole awaiting, e.g.
  when something in the userspace input validation has gone wrong.

- I'm wary of special error handling for WARNs, but if an early return
  (with an error code if possible) transforms a BUG into a WARN I'm in.
  But trying to fix up e.g. modeset state is usually futile, since we'll
  end up with a black/fuzzy/corrupted screen most likely anyway.

But the most important rule is: In case of doubt, just WARN, don't BUG.

[I know, I violate it sometimes, too.]

Patch applied with the s/BUG/WARN bikeshed.
-Daniel
Daniel Vetter May 6, 2013, 9:44 a.m. UTC | #3
On Mon, May 06, 2013 at 11:40:06AM +0200, Daniel Vetter wrote:
> On Thu, May 02, 2013 at 01:27:32PM -0700, Jesse Barnes wrote:
> > On Tue, 23 Apr 2013 23:15:29 -0700
> > Ben Widawsky <ben@bwidawsk.net> wrote:
> > 
> > > Because our context refcounting doesn't grab a ref at lookup time, it is
> > > unsafe to do so without the lock.
> > > 
> > > NOTE: We don't have an easy way to put the assertion in the lookup
> > > function which is where this really belongs. Context switching is good
> > > enough because it actually asserts even more correctness by protecting
> > > the default_context.
> > > 
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > index a1e8ecb..411ace0 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring,
> > >  	if (dev_priv->hw_contexts_disabled)
> > >  		return 0;
> > >  
> > > +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> > > +
> > >  	if (ring != &dev_priv->ring[RCS])
> > >  		return 0;
> > >  
> > 
> > Simple enough.
> > 
> > Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> > 
> > We usually do WARN_ONs for this stuff though, in case a user actually
> > does hit it, it may not be fatal so why crash the machine?
> > 
> > But that's a minor distinction since we shouldn't hit this except in
> > development anyway.
> 
> Well, since this is a patch for upstream the focus should very much be on
> supporting bug reporters and not developers. And for bug reporters a BUG
> is much more annoying than a WARN and greatly reduces the chances that
> we'll get a bug report.

Some more details why a WARN massively beats a BUG for us: BUG kills the
current process and ensures all locks are stuck. Usually that means X is
dead and you can't vt-switch away to the console to take a quick look at
dmesg.

Now even when all rendering is down the toilet due to the follow-up damage
after the WARN and the gpu a zombie, there's a non-zero chance that
vt-switch (or sw rendering in X) will work long enough to grab log files
and debugfs data.

Hence the first rule to only use a BUG on if we have a guaranteed OOPS
otherwise (which again will kill the process and make all locks stuck).
-Daniel

> 
> There are imo only very few cases where a BUG instead of a WARN is
> justified:
> - The kernel is _guaranteed_ to oops in the next few lines anyway, so a
>   BUG_ON will help in readability of the backtrace. Note that checking 3
>   different things for non-NULL in the same BUG actually reduces OOPS
>   readability (with an oops you can at least reconstruct the faulting
>   address and so probably the pointer). Also, this means the BUG should
>   have a neat description of what exactly blew up.
> 
>   The "a few lines" part is just a guideline with some big exceptions. A
>   prime example is refcount over/underflows since those will blow up, but
>   only sometimes later (and usually no one will have a clue why).
> 
> - BUGs are justified if there's a potential security hole awaiting, e.g.
>   when something in the userspace input validation has gone wrong.
> 
> - I'm wary of special error handling for WARNs, but if an early return
>   (with an error code if possible) transforms a BUG into a WARN I'm in.
>   But trying to fix up e.g. modeset state is usually futile, since we'll
>   end up with a black/fuzzy/corrupted screen most likely anyway.
> 
> But the most important rule is: In case of doubt, just WARN, don't BUG.
> 
> [I know, I violate it sometimes, too.]
> 
> Patch applied with the s/BUG/WARN bikeshed.
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Ben Widawsky May 6, 2013, 5:59 p.m. UTC | #4
On Mon, May 06, 2013 at 11:44:22AM +0200, Daniel Vetter wrote:
> On Mon, May 06, 2013 at 11:40:06AM +0200, Daniel Vetter wrote:
> > On Thu, May 02, 2013 at 01:27:32PM -0700, Jesse Barnes wrote:
> > > On Tue, 23 Apr 2013 23:15:29 -0700
> > > Ben Widawsky <ben@bwidawsk.net> wrote:
> > > 
> > > > Because our context refcounting doesn't grab a ref at lookup time, it is
> > > > unsafe to do so without the lock.
> > > > 
> > > > NOTE: We don't have an easy way to put the assertion in the lookup
> > > > function which is where this really belongs. Context switching is good
> > > > enough because it actually asserts even more correctness by protecting
> > > > the default_context.
> > > > 
> > > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > index a1e8ecb..411ace0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring,
> > > >  	if (dev_priv->hw_contexts_disabled)
> > > >  		return 0;
> > > >  
> > > > +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> > > > +
> > > >  	if (ring != &dev_priv->ring[RCS])
> > > >  		return 0;
> > > >  
> > > 
> > > Simple enough.
> > > 
> > > Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > 
> > > We usually do WARN_ONs for this stuff though, in case a user actually
> > > does hit it, it may not be fatal so why crash the machine?
> > > 
> > > But that's a minor distinction since we shouldn't hit this except in
> > > development anyway.
> > 
> > Well, since this is a patch for upstream the focus should very much be on
> > supporting bug reporters and not developers. And for bug reporters a BUG
> > is much more annoying than a WARN and greatly reduces the chances that
> > we'll get a bug report.
> 
> Some more details why a WARN massively beats a BUG for us: BUG kills the
> current process and ensures all locks are stuck. Usually that means X is
> dead and you can't vt-switch away to the console to take a quick look at
> dmesg.
> 
> Now even when all rendering is down the toilet due to the follow-up damage
> after the WARN and the gpu a zombie, there's a non-zero chance that
> vt-switch (or sw rendering in X) will work long enough to grab log files
> and debugfs data.
> 
> Hence the first rule to only use a BUG on if we have a guaranteed OOPS
> otherwise (which again will kill the process and make all locks stuck).
> -Daniel
> 
> > 
> > There are imo only very few cases where a BUG instead of a WARN is
> > justified:
> > - The kernel is _guaranteed_ to oops in the next few lines anyway, so a
> >   BUG_ON will help in readability of the backtrace. Note that checking 3
> >   different things for non-NULL in the same BUG actually reduces OOPS
> >   readability (with an oops you can at least reconstruct the faulting
> >   address and so probably the pointer). Also, this means the BUG should
> >   have a neat description of what exactly blew up.
> > 
> >   The "a few lines" part is just a guideline with some big exceptions. A
> >   prime example is refcount over/underflows since those will blow up, but
> >   only sometimes later (and usually no one will have a clue why).
> > 
> > - BUGs are justified if there's a potential security hole awaiting, e.g.
> >   when something in the userspace input validation has gone wrong.
> > 
> > - I'm wary of special error handling for WARNs, but if an early return
> >   (with an error code if possible) transforms a BUG into a WARN I'm in.
> >   But trying to fix up e.g. modeset state is usually futile, since we'll
> >   end up with a black/fuzzy/corrupted screen most likely anyway.
> > 
> > But the most important rule is: In case of doubt, just WARN, don't BUG.
> > 
> > [I know, I violate it sometimes, too.]
> > 
> > Patch applied with the s/BUG/WARN bikeshed.
> > -Daniel

Thanks for applying the patch, it's certainly better than what we have
currently.

Why I wanted a BUG: When you get a ref to an object without holding a
lock you get a potentially unsafe pointer (to which we will be writing).
If the context object memory is freed, and we write to it, we have a
potential to late scribble over <insert your file system of choice>
memory. There is probably a similar security implication there as well.
Many of us are used to, and capable of recovering from GPU hangs, but
less of us like to deal with FS recovery.

I actually believe all "get" code like this (backed with refcounts)
should BUG and not WARN.
Daniel Vetter May 6, 2013, 6:35 p.m. UTC | #5
On Mon, May 6, 2013 at 7:59 PM, Ben Widawsky <ben@bwidawsk.net> wrote:
>
> Why I wanted a BUG: When you get a ref to an object without holding a
> lock you get a potentially unsafe pointer (to which we will be writing).
> If the context object memory is freed, and we write to it, we have a
> potential to late scribble over <insert your file system of choice>
> memory. There is probably a similar security implication there as well.
> Many of us are used to, and capable of recovering from GPU hangs, but
> less of us like to deal with FS recovery.
>
> I actually believe all "get" code like this (backed with refcounts)
> should BUG and not WARN.

Historically we've screwed up locking in the driver load/teardown,
suspend/resume and panic paths. Blowing up with a BUG_ON in there
tends to be pretty bad for debugging. And there's pretty much no
chance of a hostile party being able to exploit a race.

So _especially_ for locking checks imo only WARN is acceptable.

Otoh if such a race is indeed expoitable from userspace and it escaped
into a released kernel that means we have a pretty gapping hole in our
test coverage. Fixing that sounds more fruitful to me than making bug
reporter's life harder.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index a1e8ecb..411ace0 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -444,6 +444,8 @@  int i915_switch_context(struct intel_ring_buffer *ring,
 	if (dev_priv->hw_contexts_disabled)
 		return 0;
 
+	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
+
 	if (ring != &dev_priv->ring[RCS])
 		return 0;