diff mbox

drm/i915: Fix use-after-free in do_switch

Message ID 1386254554-26888-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter Dec. 5, 2013, 2:42 p.m. UTC
So apparently under ridiculous amounts of memory pressure we can get
into trouble in do_switch when we try to move the old hw context
backing storage object onto the active lists.

With list debugging enabled that usually results in us chasing a
poisoned pointer - which means we've hit upon a vma that has been
removed from all lrus with list_del (and then deallocated, so it's a
real use-after free).

Ian Lister has done some great callchain chasing and noticed that we
can reenter do_switch:

i915_gem_do_execbuffer()

i915_switch_context()

do_switch()
   from = ring->last_context;
   i915_gem_object_pin()

      i915_gem_object_bind_to_gtt()
         ret = drm_mm_insert_node_in_range_generic();
         // If the above call fails then it will try i915_gem_evict_something()
         // If that fails it will call i915_gem_evict_everything() ...
	 i915_gem_evict_everything()
	    i915_gpu_idle()
	       i915_switch_context(DEFAULT_CONTEXT)

Like with everything else where the shrinker or eviction code can
invalidate pointers we need to reload relevant state.

Note that there's no need to recheck whether a context switch is still
required because:

- Doing a switch to the same context is harmless (besides wasting a
  bit of energy).

- This can only happen with the default context. But since that one's
  pinned we'll never call down into evict_everything under normal
  circumstances. Note that there's a little driver bringup fun
  involved namely that we could recourse into do_switch for the
  initial switch. Atm we're fine since we assign the context pointer
  only after the call to do_switch at driver load or resume time. And
  in the gpu reset case we skip the entire setup sequence (which might
  be a bug on its own, but definitely not this one here).

Cc'ing stable since apparently ChromeOS guys are seeing this in the
wild (and not just on artificial stress tests), see the reference.

Note that in upstream code doesn't calle evict_everything directly
from evict_something, that's an extension in this product branch. But
we can still hit upon this bug (and apparently we do, see the linked
backtraces). I've noticed this while trying to construct a testcase
for this bug and utterly failed to provoke it. It looks like we need
to driver the system squarly into the lowmem wall and provoke the
shrinker to evict the context object by doing the last-ditch
evict_everything call.

Aside: There's currently no means to get a badly-fragmenting hw
context object away from a bad spot in the upstream code. We should
fix this by at least adding some code to evict_something to handle hw
contexts.

References: https://code.google.com/p/chromium/issues/detail?id=248191
Reported-by: Ian Lister <ian.lister@intel.com>
Cc: Ian Lister <ian.lister@intel.com>
Cc: stable@vger.kernel.org
Cc: Ben Widawsky <benjamin.widawsky@intel.com>
Cc: Stéphane Marchesin <marcheu@chromium.org>
Cc: Bloomfield, Jon <jon.bloomfield@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

rafael.barbalho@intel.com Dec. 5, 2013, 4:28 p.m. UTC | #1
> -----Original Message-----

> From: intel-gfx-bounces@lists.freedesktop.org [mailto:intel-gfx-

> bounces@lists.freedesktop.org] On Behalf Of Daniel Vetter

> Sent: Thursday, December 05, 2013 2:43 PM

> To: Intel Graphics Development

> Cc: Lister, Ian; Daniel Vetter; stable@vger.kernel.org; Widawsky, Benjamin

> Subject: [Intel-gfx] [PATCH] drm/i915: Fix use-after-free in do_switch

> 

> So apparently under ridiculous amounts of memory pressure we can get into

> trouble in do_switch when we try to move the old hw context backing

> storage object onto the active lists.

> 


[SNIP]

> 

> Aside: There's currently no means to get a badly-fragmenting hw context

> object away from a bad spot in the upstream code. We should fix this by at

> least adding some code to evict_something to handle hw contexts.

> 

> References: https://code.google.com/p/chromium/issues/detail?id=248191

> Reported-by: Ian Lister <ian.lister@intel.com>

> Cc: Ian Lister <ian.lister@intel.com>

> Cc: stable@vger.kernel.org

> Cc: Ben Widawsky <benjamin.widawsky@intel.com>

> Cc: Stéphane Marchesin <marcheu@chromium.org>

> Cc: Bloomfield, Jon <jon.bloomfield@intel.com>

> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


Tested-by: Rafael Barbalho <rafael.barbalho@intel.com>


> ---

>  drivers/gpu/drm/i915/i915_gem_context.c | 14 ++++++++++++--

>  1 file changed, 12 insertions(+), 2 deletions(-)

> 

> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c

> b/drivers/gpu/drm/i915/i915_gem_context.c

> index 41877045a1a0..2d2877493f61 100644

> --- a/drivers/gpu/drm/i915/i915_gem_context.c

> +++ b/drivers/gpu/drm/i915/i915_gem_context.c

> @@ -421,11 +421,21 @@ static int do_switch(struct i915_hw_context *to)

>  	if (ret)

>  		return ret;

> 

> -	/* Clear this page out of any CPU caches for coherent swap-in/out.

> Note

> +	/*

> +	 * Pin can switch back to the default context if we end up calling into

> +	 * evict_everything - as a last ditch gtt defrag effort that also

> +	 * switches to the default context. Hence we need to reload from

> here.

> +	 */

> +	from = ring->last_context;

> +

> +	/*

> +	 * Clear this page out of any CPU caches for coherent swap-in/out.

> +Note

>  	 * that thanks to write = false in this call and us not setting any gpu

>  	 * write domains when putting a context object onto the active list

>  	 * (when switching away from it), this won't block.

> -	 * XXX: We need a real interface to do this instead of trickery. */

> +	 *

> +	 * XXX: We need a real interface to do this instead of trickery.

> +	 */

>  	ret = i915_gem_object_set_to_gtt_domain(to->obj, false);

>  	if (ret) {

>  		i915_gem_object_unpin(to->obj);

> --

> 1.8.4.3

> 

> _______________________________________________

> Intel-gfx mailing list

> Intel-gfx@lists.freedesktop.org

> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Lister, Ian Dec. 6, 2013, 10:05 a.m. UTC | #2
> -----Original Message-----

> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]

> Sent: Thursday, December 05, 2013 2:43 PM

> To: Intel Graphics Development

> Cc: Daniel Vetter; Lister, Ian; stable@vger.kernel.org; Widawsky, Benjamin;

> Stéphane Marchesin; Bloomfield, Jon

> Subject: [PATCH] drm/i915: Fix use-after-free in do_switch

> 

> So apparently under ridiculous amounts of memory pressure we can get into

> trouble in do_switch when we try to move the old hw context backing

> storage object onto the active lists.

> 

> With list debugging enabled that usually results in us chasing a poisoned

> pointer - which means we've hit upon a vma that has been removed from all

> lrus with list_del (and then deallocated, so it's a real use-after free).

> 

> Ian Lister has done some great callchain chasing and noticed that we can

> reenter do_switch:

> 

> i915_gem_do_execbuffer()

> 

> i915_switch_context()

> 

> do_switch()

>    from = ring->last_context;

>    i915_gem_object_pin()

> 

>       i915_gem_object_bind_to_gtt()

>          ret = drm_mm_insert_node_in_range_generic();

>          // If the above call fails then it will try i915_gem_evict_something()

>          // If that fails it will call i915_gem_evict_everything() ...

> 	 i915_gem_evict_everything()

> 	    i915_gpu_idle()

> 	       i915_switch_context(DEFAULT_CONTEXT)

> 

> Like with everything else where the shrinker or eviction code can invalidate

> pointers we need to reload relevant state.

> 

> Note that there's no need to recheck whether a context switch is still

> required because:

> 

> - Doing a switch to the same context is harmless (besides wasting a

>   bit of energy).

> 

> - This can only happen with the default context. But since that one's

>   pinned we'll never call down into evict_everything under normal

>   circumstances. Note that there's a little driver bringup fun

>   involved namely that we could recourse into do_switch for the

>   initial switch. Atm we're fine since we assign the context pointer

>   only after the call to do_switch at driver load or resume time. And

>   in the gpu reset case we skip the entire setup sequence (which might

>   be a bug on its own, but definitely not this one here).

> 

> Cc'ing stable since apparently ChromeOS guys are seeing this in the wild (and

> not just on artificial stress tests), see the reference.

> 

> Note that in upstream code doesn't calle evict_everything directly from

> evict_something, that's an extension in this product branch. But we can still

> hit upon this bug (and apparently we do, see the linked backtraces). I've

> noticed this while trying to construct a testcase for this bug and utterly failed

> to provoke it. It looks like we need to driver the system squarly into the

> lowmem wall and provoke the shrinker to evict the context object by doing

> the last-ditch evict_everything call.

> 

> Aside: There's currently no means to get a badly-fragmenting hw context

> object away from a bad spot in the upstream code. We should fix this by at

> least adding some code to evict_something to handle hw contexts.

> 

> References: https://code.google.com/p/chromium/issues/detail?id=248191

> Reported-by: Ian Lister <ian.lister@intel.com>

> Cc: Ian Lister <ian.lister@intel.com>

> Cc: stable@vger.kernel.org

> Cc: Ben Widawsky <benjamin.widawsky@intel.com>

> Cc: Stéphane Marchesin <marcheu@chromium.org>

> Cc: Bloomfield, Jon <jon.bloomfield@intel.com>

> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


Reviewed-by: Ian Lister <ian.lister@intel.com>


> ---

>  drivers/gpu/drm/i915/i915_gem_context.c | 14 ++++++++++++--

>  1 file changed, 12 insertions(+), 2 deletions(-)

> 

> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c

> b/drivers/gpu/drm/i915/i915_gem_context.c

> index 41877045a1a0..2d2877493f61 100644

> --- a/drivers/gpu/drm/i915/i915_gem_context.c

> +++ b/drivers/gpu/drm/i915/i915_gem_context.c

> @@ -421,11 +421,21 @@ static int do_switch(struct i915_hw_context *to)

>  	if (ret)

>  		return ret;

> 

> -	/* Clear this page out of any CPU caches for coherent swap-in/out.

> Note

> +	/*

> +	 * Pin can switch back to the default context if we end up calling into

> +	 * evict_everything - as a last ditch gtt defrag effort that also

> +	 * switches to the default context. Hence we need to reload from

> here.

> +	 */

> +	from = ring->last_context;

> +

> +	/*

> +	 * Clear this page out of any CPU caches for coherent swap-in/out.

> +Note

>  	 * that thanks to write = false in this call and us not setting any gpu

>  	 * write domains when putting a context object onto the active list

>  	 * (when switching away from it), this won't block.

> -	 * XXX: We need a real interface to do this instead of trickery. */

> +	 *

> +	 * XXX: We need a real interface to do this instead of trickery.

> +	 */

>  	ret = i915_gem_object_set_to_gtt_domain(to->obj, false);

>  	if (ret) {

>  		i915_gem_object_unpin(to->obj);

> --

> 1.8.4.3
Daniel Vetter Dec. 6, 2013, 12:09 p.m. UTC | #3
On Fri, Dec 06, 2013 at 10:05:51AM +0000, Lister, Ian wrote:
> 
> 
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
> > Sent: Thursday, December 05, 2013 2:43 PM
> > To: Intel Graphics Development
> > Cc: Daniel Vetter; Lister, Ian; stable@vger.kernel.org; Widawsky, Benjamin;
> > Stéphane Marchesin; Bloomfield, Jon
> > Subject: [PATCH] drm/i915: Fix use-after-free in do_switch
> > 
> > So apparently under ridiculous amounts of memory pressure we can get into
> > trouble in do_switch when we try to move the old hw context backing
> > storage object onto the active lists.
> > 
> > With list debugging enabled that usually results in us chasing a poisoned
> > pointer - which means we've hit upon a vma that has been removed from all
> > lrus with list_del (and then deallocated, so it's a real use-after free).
> > 
> > Ian Lister has done some great callchain chasing and noticed that we can
> > reenter do_switch:
> > 
> > i915_gem_do_execbuffer()
> > 
> > i915_switch_context()
> > 
> > do_switch()
> >    from = ring->last_context;
> >    i915_gem_object_pin()
> > 
> >       i915_gem_object_bind_to_gtt()
> >          ret = drm_mm_insert_node_in_range_generic();
> >          // If the above call fails then it will try i915_gem_evict_something()
> >          // If that fails it will call i915_gem_evict_everything() ...
> > 	 i915_gem_evict_everything()
> > 	    i915_gpu_idle()
> > 	       i915_switch_context(DEFAULT_CONTEXT)
> > 
> > Like with everything else where the shrinker or eviction code can invalidate
> > pointers we need to reload relevant state.
> > 
> > Note that there's no need to recheck whether a context switch is still
> > required because:
> > 
> > - Doing a switch to the same context is harmless (besides wasting a
> >   bit of energy).
> > 
> > - This can only happen with the default context. But since that one's
> >   pinned we'll never call down into evict_everything under normal
> >   circumstances. Note that there's a little driver bringup fun
> >   involved namely that we could recourse into do_switch for the
> >   initial switch. Atm we're fine since we assign the context pointer
> >   only after the call to do_switch at driver load or resume time. And
> >   in the gpu reset case we skip the entire setup sequence (which might
> >   be a bug on its own, but definitely not this one here).
> > 
> > Cc'ing stable since apparently ChromeOS guys are seeing this in the wild (and
> > not just on artificial stress tests), see the reference.
> > 
> > Note that in upstream code doesn't calle evict_everything directly from
> > evict_something, that's an extension in this product branch. But we can still
> > hit upon this bug (and apparently we do, see the linked backtraces). I've
> > noticed this while trying to construct a testcase for this bug and utterly failed
> > to provoke it. It looks like we need to driver the system squarly into the
> > lowmem wall and provoke the shrinker to evict the context object by doing
> > the last-ditch evict_everything call.
> > 
> > Aside: There's currently no means to get a badly-fragmenting hw context
> > object away from a bad spot in the upstream code. We should fix this by at
> > least adding some code to evict_something to handle hw contexts.
> > 
> > References: https://code.google.com/p/chromium/issues/detail?id=248191
> > Reported-by: Ian Lister <ian.lister@intel.com>
> > Cc: Ian Lister <ian.lister@intel.com>
> > Cc: stable@vger.kernel.org
> > Cc: Ben Widawsky <benjamin.widawsky@intel.com>
> > Cc: Stéphane Marchesin <marcheu@chromium.org>
> > Cc: Bloomfield, Jon <jon.bloomfield@intel.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Reviewed-by: Ian Lister <ian.lister@intel.com>

Picked up for -fixes, thanks for the review.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 41877045a1a0..2d2877493f61 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -421,11 +421,21 @@  static int do_switch(struct i915_hw_context *to)
 	if (ret)
 		return ret;
 
-	/* Clear this page out of any CPU caches for coherent swap-in/out. Note
+	/*
+	 * Pin can switch back to the default context if we end up calling into
+	 * evict_everything - as a last ditch gtt defrag effort that also
+	 * switches to the default context. Hence we need to reload from here.
+	 */
+	from = ring->last_context;
+
+	/*
+	 * Clear this page out of any CPU caches for coherent swap-in/out. Note
 	 * that thanks to write = false in this call and us not setting any gpu
 	 * write domains when putting a context object onto the active list
 	 * (when switching away from it), this won't block.
-	 * XXX: We need a real interface to do this instead of trickery. */
+	 *
+	 * XXX: We need a real interface to do this instead of trickery.
+	 */
 	ret = i915_gem_object_set_to_gtt_domain(to->obj, false);
 	if (ret) {
 		i915_gem_object_unpin(to->obj);