drm/i915: Workaround incoherence between fences and LLC across multiple CPUs

Message ID 1365107463-1867-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson April 4, 2013, 8:31 p.m. UTC
In order to fully serialize access to the fenced region and the update
to the fence register we need to take extreme measures on SNB+, and
manually flush writes to memory prior to writing the fence register in
conjunction with the memory barriers placed around the register write.

Fixes i-g-t/gem_fence_thrash

v2: Bring a bigger gun
v3: Switch the bigger gun for heavier bullets (Arjan van de Ven)
v4: Remove changes for working generations.
v5: Reduce to a per-cpu wbinvd() call prior to updating the fences.
v6: Rewrite comments to elide forgotten history.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62191
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Tested-by: Jon Bloomfield <jon.bloomfield@intel.com> (v2)
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem.c |   28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)
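
For reference, i-g-t/gem_fence_thrash exercises concurrent CPU access to
tiled buffers through fenced GTT mmaps. The sketch below is a hypothetical
reconstruction of that access pattern, not the actual i-g-t source; the
object size, stride, thread count and omitted error handling are invented:

/* Hypothetical sketch of the gem_fence_thrash access pattern; not the
 * actual i-g-t test. Error checking omitted for brevity. */
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <drm/i915_drm.h>

#define OBJ_SIZE (1024 * 1024)
#define NTHREADS 16

static int drm_fd;

static void *thrash(void *arg)
{
	struct drm_i915_gem_create create = { .size = OBJ_SIZE };
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE, &create);

	/* Tiled CPU access through the GTT requires a fence register. */
	struct drm_i915_gem_set_tiling tiling = {
		.handle = create.handle,
		.tiling_mode = I915_TILING_X,
		.stride = 4096,
	};
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_TILING, &tiling);

	struct drm_i915_gem_mmap_gtt map = { .handle = create.handle };
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &map);
	char *ptr = mmap(NULL, OBJ_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, drm_fd, map.offset);

	/* Many writers, migrating freely between CPUs, force constant
	 * fence stealing - the window in which the corruption appears. */
	for (int i = 0; i < 1024; i++)
		memset(ptr, i, OBJ_SIZE);

	munmap(ptr, OBJ_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	drm_fd = open("/dev/dri/card0", O_RDWR);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, thrash, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

With enough writers competing for the limited pool of fence registers, the
kernel continually steals and rewrites fences while other CPUs may still
hold dirty data in their caches - exactly the window this patch closes.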

Comments

Jesse Barnes April 5, 2013, 4:03 p.m. UTC | #1
On Thu,  4 Apr 2013 21:31:03 +0100
Chris Wilson <chris@chris-wilson.co.uk> wrote:

> In order to fully serialize access to the fenced region and the update
> to the fence register we need to take extreme measures on SNB+, and
> manually flush writes to memory prior to writing the fence register in
> conjunction with the memory barriers placed around the register write.
>
> [full patch quoted; trimmed]

I almost wish x86 just gave up and went fully weakly ordered.  At least
then we'd know we need barriers everywhere, rather than just in
mystical places.

Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Daniel Vetter April 8, 2013, 9:54 a.m. UTC | #2
On Fri, Apr 05, 2013 at 09:03:27AM -0700, Jesse Barnes wrote:
> On Thu,  4 Apr 2013 21:31:03 +0100
> Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> > [quoted patch trimmed]
>
> I almost wish x86 just gave up and went fully weakly ordered.  At least
> then we'd know we need barriers everywhere, rather than just in
> mystical places.
> 
> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>

Queued for -next, thanks for the patch. I am a bit unhappy that the
testcase doesn't work too well and can't reliably reproduce this on all
affected machines. But I've tried a bit to improve things and it's very
fickle. I guess we just have to accept this :(
-Daniel
Carsten Emde June 8, 2013, 8:41 p.m. UTC | #3
On 04/08/2013 11:54 AM, Daniel Vetter wrote:
> On Fri, Apr 05, 2013 at 09:03:27AM -0700, Jesse Barnes wrote:
>> On Thu,  4 Apr 2013 21:31:03 +0100
>> Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>
>>> [quoted patch trimmed]
>>
>> [review comments trimmed]
>
> Queued for -next, thanks for the patch. I am a bit unhappy that the
> testcase doesn't work too well and can't reliably reproduce this on all
> affected machines. But I've tried a bit to improve things and it's very
> fickle. I guess we just have to accept this :(

Under real-time conditions, when we expect a deterministic response to
external and internal events within a couple of microseconds at any
time, invalidating and flushing the entire cache on a running system is
unacceptable. It introduces latencies of several milliseconds, as was
clearly shown in RT regression tests on a kernel with this patch
applied (https://www.osadl.org/?id=1543#c7602). We therefore have to
revert it in the RT patch queue - kind of a workaround for a
workaround.

It would simply be wonderful if we could get rid of the hateful wbinvd().

	-Carsten.
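
The multi-millisecond stalls described above show up readily in a simple
wakeup-latency monitor of the kind the OSADL regression suite is built
around. Below is a minimal sketch of the idea; it is hypothetical, and the
actual OSADL measurements come from cyclictest:

/* Minimal wakeup-latency monitor: sleep for a fixed interval and record
 * how late each wakeup actually was. A wbinvd() IPI hitting every CPU
 * shows up as a multi-millisecond outlier. Hypothetical sketch, not the
 * OSADL test. */
#include <stdio.h>
#include <time.h>

static long long ns_between(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000LL +
	       (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	const long interval_ns = 100000; /* 100 us period */
	long long worst = 0;

	for (;;) {
		struct timespec start, end;
		struct timespec req = { 0, interval_ns };

		clock_gettime(CLOCK_MONOTONIC, &start);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &end);

		long long late = ns_between(start, end) - interval_ns;
		if (late > worst) {
			worst = late;
			printf("worst wakeup latency so far: %lld us\n",
			       worst / 1000);
		}
	}
}

Running it with and without a fence-thrashing workload in the background
makes the cost of the wbinvd() workaround directly visible.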
Daniel Vetter June 8, 2013, 9:22 p.m. UTC | #4
On Sat, Jun 8, 2013 at 10:41 PM, Carsten Emde <C.Emde@osadl.org> wrote:
> [quoted patch and earlier discussion trimmed]
>
> Under real-time conditions, when we expect a deterministic response to
> external and internal events within a couple of microseconds at any
> time, invalidating and flushing the entire cache on a running system
> is unacceptable. It introduces latencies of several milliseconds, as
> was clearly shown in RT regression tests on a kernel with this patch
> applied (https://www.osadl.org/?id=1543#c7602). We therefore have to
> revert it in the RT patch queue - kind of a workaround for a
> workaround.
>
> It would simply be wonderful if we could get rid of the hateful wbinvd().

I'm somewhat surprised people have not started to scream earlier about
this one ;-)

We're trying to figure out whether there's a less costly w/a (we have
some benchmarks where it kills performance, too) but until that's done
we pretty much need to stick with it. If you want to avoid any bad
side-effects due to that w/a, you can alternatively pin all gpu
rendering tasks to the same cpu core/thread.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Bloomfield, Jon June 10, 2013, 1:17 p.m. UTC | #5
> [message headers and quoted thread trimmed]
>
> I'm somewhat surprised people have not started to scream earlier about this
> one ;-)
> 
> We're trying to figure out whether there's a less costly w/a (we have
> some benchmarks where it kills performance, too) but until that's done
> we pretty much need to stick with it. If you want to avoid any bad
> side-effects due to that w/a, you can alternatively pin all gpu
> rendering tasks to the same cpu core/thread.
> -Daniel

This is such a serious bug, I'm surprised it hasn't been seen elsewhere already. We can't leave it unfixed just because the only known fix may hurt performance - there's no point in having fast tiled accesses that can't be relied upon. We need to release an interim fix while we look for better solutions.

Daniel, how does pinning to a single CPU avoid the wbinvd()? Surely the page-fault code doesn't know to skip the workaround in that case?

JonB

Daniel Vetter June 10, 2013, 2:37 p.m. UTC | #6
On Mon, Jun 10, 2013 at 3:17 PM, Bloomfield, Jon
<jon.bloomfield@intel.com> wrote:
> [message headers and quoted thread trimmed]
>
> This is such a serious bug, I'm surprised it hasn't been seen elsewhere
> already. We can't leave it unfixed just because the only known fix may
> hurt performance.
>
> Daniel, how does pinning to a single CPU avoid the wbinvd()? Surely the
> page-fault code doesn't know to skip the workaround in that case?

The incoherency only shows up if you both migrate threads between
different cpus and also need to steal fence registers. If you pin all
threads to the same cpu thread/core (e.g. with the numactl or
sched_setaffinity tools) then we haven't yet seen a corruption.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
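
A minimal sketch of the pinning mitigation Daniel describes, confining a
process (and any threads it later spawns) to one CPU via
sched_setaffinity(2); the choice of CPU 0 is arbitrary, what matters is
that every GPU client agrees on the same one:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set); /* arbitrary - all GPU clients must pick the same CPU */

	/* pid 0 means the calling process; threads created afterwards
	 * inherit the affinity mask. */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ... start GPU rendering work here ... */
	return 0;
}

An existing binary can be wrapped the same way without recompiling, e.g.
with taskset -c 0 <app> or with numactl.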
Bloomfield, Jon June 10, 2013, 2:47 p.m. UTC | #7
> [message headers and quoted thread trimmed]
>
> The incoherency only shows up if you both migrate threads between
> different cpus and also need to steal fence registers. If you pin all
> threads to the same cpu thread/core (e.g. with the numactl or
> sched_setaffinity tools) then we haven't yet seen a corruption.
> -Daniel

Ok, you're saying we could back the fix out and force all apps that map tiled buffers to set their CPU affinity to an agreed single CPU? All apps would have to use the same CPU, otherwise we're back where we started - in which case I think performance would still nose-dive.
Daniel Vetter June 10, 2013, 2:51 p.m. UTC | #8
On Mon, Jun 10, 2013 at 4:47 PM, Bloomfield, Jon
<jon.bloomfield@intel.com> wrote:
> Ok, you're saying we could back the fix out and force all apps that map
> tiled buffers to set their CPU affinity to an agreed single CPU?

Oh, the pinning isn't a suitable w/a for upstream or for general
consumption. But I guess for the realtime linux folks it is suitable
(at least for now) since they tend not to care too much about gfx
throughput anyway.

[Aside: The patch is only for the -rt tree for now, so not something
that will land upstream.]
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fa4ea1a..8f7739e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2689,17 +2689,35 @@  static inline int fence_number(struct drm_i915_private *dev_priv,
 	return fence - dev_priv->fence_regs;
 }
 
+static void i915_gem_write_fence__ipi(void *data)
+{
+	wbinvd();
+}
+
 static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 					 struct drm_i915_fence_reg *fence,
 					 bool enable)
 {
-	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
-	int reg = fence_number(dev_priv, fence);
-
-	i915_gem_write_fence(obj->base.dev, reg, enable ? obj : NULL);
+	struct drm_device *dev = obj->base.dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	int fence_reg = fence_number(dev_priv, fence);
+
+	/* In order to fully serialize access to the fenced region and
+	 * the update to the fence register we need to take extreme
+	 * measures on SNB+. In theory, the write to the fence register
+	 * flushes all memory transactions before, and coupled with the
+	 * mb() placed around the register write we serialise all memory
+	 * operations with respect to the changes in the tiler. Yet, on
+	 * SNB+ we need to take a step further and emit an explicit wbinvd()
+	 * on each processor in order to manually flush all memory
+	 * transactions before updating the fence register.
+	 */
+	if (HAS_LLC(obj->base.dev))
+		on_each_cpu(i915_gem_write_fence__ipi, NULL, 1);
+	i915_gem_write_fence(dev, fence_reg, enable ? obj : NULL);
 
 	if (enable) {
-		obj->fence_reg = reg;
+		obj->fence_reg = fence_reg;
 		fence->obj = obj;
 		list_move_tail(&fence->lru_list, &dev_priv->mm.fence_list);
 	} else {