drm/i915/selftest: If reconfigure_sseu is busy, try again

Message ID	20191126164712.2802694-1-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=chMQ=ZS=lists.freedesktop.org=intel-gfx-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 747142071A From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Date: Tue, 26 Nov 2019 16:47:12 +0000 Message-Id: <20191126164712.2802694-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH] drm/i915/selftest: If reconfigure_sseu is busy, try again Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915/selftest: If reconfigure_sseu is busy, try again \| expand drm/i915/selftest: If reconfigure_sseu is busy, try again

Chris Wilson Nov. 26, 2019, 4:47 p.m. UTC

Following 58b4c1a07ada ("drm/i915: Reduce nested prepare_remote_context()
to a trylock"), we punt to the caller if the local intel_context
happens to be busy as we try to rewrite the sseu (due to retiring in
another thread). As the interlude should be short, spin until the lock
is available.

The regret for using mutex_trylock() and not an atomic insertion of the
barrier is growing...

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Tvrtko Ursulin Nov. 26, 2019, 5 p.m. UTC | #1

On 26/11/2019 16:47, Chris Wilson wrote:
> Following 58b4c1a07ada ("drm/i915: Reduce nested prepare_remote_context()
> to a trylock"), we punt to the caller if the local intel_context
> happens to be busy as we try to rewrite the sseu (due to retiring in
> another thread). As the interlude should be short, spin until the lock
> is available.
> 
> The regret for using mutex_trylock() and not an atomic insertion of the
> barrier is growing...
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> index 2ea4790f3721..571cc996577c 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> @@ -1197,7 +1197,10 @@ __sseu_test(const char *name,
>   	if (ret)
>   		goto out_pm;
>   
> -	ret = intel_context_reconfigure_sseu(ce, sseu);
> +	do {
> +		ret = intel_context_reconfigure_sseu(ce, sseu);
> +		cond_resched();
> +	} while (ret == -EAGAIN);
>   	if (ret)
>   		goto out_spin;
>   
> 

Hm I looked at the selftests, saw error is correctly propagated, and 
concluded it will be fine. I missed the problem selftests will not 
actually retry. But wait, can we even count that userspace will if all 
of a sudden ctx.set_param starts returning -EAGAIN sporadically? Feels 
like we may need to revert.

Regards,

Tvrtko

Chris Wilson Nov. 26, 2019, 5:05 p.m. UTC | #2

Quoting Tvrtko Ursulin (2019-11-26 17:00:53)
> 
> On 26/11/2019 16:47, Chris Wilson wrote:
> > Following 58b4c1a07ada ("drm/i915: Reduce nested prepare_remote_context()
> > to a trylock"), we punt to the caller if the local intel_context
> > happens to be busy as we try to rewrite the sseu (due to retiring in
> > another thread). As the interlude should be short, spin until the lock
> > is available.
> > 
> > The regret for using mutex_trylock() and not an atomic insertion of the
> > barrier is growing...
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > index 2ea4790f3721..571cc996577c 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > @@ -1197,7 +1197,10 @@ __sseu_test(const char *name,
> >       if (ret)
> >               goto out_pm;
> >   
> > -     ret = intel_context_reconfigure_sseu(ce, sseu);
> > +     do {
> > +             ret = intel_context_reconfigure_sseu(ce, sseu);
> > +             cond_resched();
> > +     } while (ret == -EAGAIN);
> >       if (ret)
> >               goto out_spin;
> >   
> > 
> 
> Hm I looked at the selftests, saw error is correctly propagated, and 
> concluded it will be fine. I missed the problem selftests will not 
> actually retry. But wait, can we even count that userspace will if all 
> of a sudden ctx.set_param starts returning -EAGAIN sporadically? Feels 
> like we may need to revert.

We invoke the principle of drmIoctl() catches -EAGAIN.
-Chris

Tvrtko Ursulin Nov. 26, 2019, 5:08 p.m. UTC | #3

On 26/11/2019 17:05, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-26 17:00:53)
>>
>> On 26/11/2019 16:47, Chris Wilson wrote:
>>> Following 58b4c1a07ada ("drm/i915: Reduce nested prepare_remote_context()
>>> to a trylock"), we punt to the caller if the local intel_context
>>> happens to be busy as we try to rewrite the sseu (due to retiring in
>>> another thread). As the interlude should be short, spin until the lock
>>> is available.
>>>
>>> The regret for using mutex_trylock() and not an atomic insertion of the
>>> barrier is growing...
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>> index 2ea4790f3721..571cc996577c 100644
>>> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>> @@ -1197,7 +1197,10 @@ __sseu_test(const char *name,
>>>        if (ret)
>>>                goto out_pm;
>>>    
>>> -     ret = intel_context_reconfigure_sseu(ce, sseu);
>>> +     do {
>>> +             ret = intel_context_reconfigure_sseu(ce, sseu);
>>> +             cond_resched();
>>> +     } while (ret == -EAGAIN);
>>>        if (ret)
>>>                goto out_spin;
>>>    
>>>
>>
>> Hm I looked at the selftests, saw error is correctly propagated, and
>> concluded it will be fine. I missed the problem selftests will not
>> actually retry. But wait, can we even count that userspace will if all
>> of a sudden ctx.set_param starts returning -EAGAIN sporadically? Feels
>> like we may need to revert.
> 
> We invoke the principle of drmIoctl() catches -EAGAIN.

I'm not comfortable with that. :( Not least how we are saying not to use 
libdrm. man 2 ioctl does not mention -EAGAIN. :(

Regards,

Tvrtko

Tvrtko Ursulin Nov. 26, 2019, 5:11 p.m. UTC | #4

On 26/11/2019 17:08, Tvrtko Ursulin wrote:
> 
> On 26/11/2019 17:05, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2019-11-26 17:00:53)
>>>
>>> On 26/11/2019 16:47, Chris Wilson wrote:
>>>> Following 58b4c1a07ada ("drm/i915: Reduce nested 
>>>> prepare_remote_context()
>>>> to a trylock"), we punt to the caller if the local intel_context
>>>> happens to be busy as we try to rewrite the sseu (due to retiring in
>>>> another thread). As the interlude should be short, spin until the lock
>>>> is available.
>>>>
>>>> The regret for using mutex_trylock() and not an atomic insertion of the
>>>> barrier is growing...
>>>>
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
>>>> b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>>> index 2ea4790f3721..571cc996577c 100644
>>>> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>>> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
>>>> @@ -1197,7 +1197,10 @@ __sseu_test(const char *name,
>>>>        if (ret)
>>>>                goto out_pm;
>>>> -     ret = intel_context_reconfigure_sseu(ce, sseu);
>>>> +     do {
>>>> +             ret = intel_context_reconfigure_sseu(ce, sseu);
>>>> +             cond_resched();
>>>> +     } while (ret == -EAGAIN);
>>>>        if (ret)
>>>>                goto out_spin;
>>>>
>>>
>>> Hm I looked at the selftests, saw error is correctly propagated, and
>>> concluded it will be fine. I missed the problem selftests will not
>>> actually retry. But wait, can we even count that userspace will if all
>>> of a sudden ctx.set_param starts returning -EAGAIN sporadically? Feels
>>> like we may need to revert.
>>
>> We invoke the principle of drmIoctl() catches -EAGAIN.
> 
> I'm not comfortable with that. :( Not least how we are saying not to use 
> libdrm. man 2 ioctl does not mention -EAGAIN. :(

Or duct tape by looping in set_sseu as well?

Regards,

Tvrtko

Chris Wilson Nov. 26, 2019, 5:19 p.m. UTC | #5

Quoting Tvrtko Ursulin (2019-11-26 17:11:04)
> 
> On 26/11/2019 17:08, Tvrtko Ursulin wrote:
> > 
> > On 26/11/2019 17:05, Chris Wilson wrote:
> >> Quoting Tvrtko Ursulin (2019-11-26 17:00:53)
> >>>
> >>> On 26/11/2019 16:47, Chris Wilson wrote:
> >>>> Following 58b4c1a07ada ("drm/i915: Reduce nested 
> >>>> prepare_remote_context()
> >>>> to a trylock"), we punt to the caller if the local intel_context
> >>>> happens to be busy as we try to rewrite the sseu (due to retiring in
> >>>> another thread). As the interlude should be short, spin until the lock
> >>>> is available.
> >>>>
> >>>> The regret for using mutex_trylock() and not an atomic insertion of the
> >>>> barrier is growing...
> >>>>
> >>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>> ---
> >>>>    drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 5 ++++-
> >>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
> >>>> b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> >>>> index 2ea4790f3721..571cc996577c 100644
> >>>> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> >>>> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> >>>> @@ -1197,7 +1197,10 @@ __sseu_test(const char *name,
> >>>>        if (ret)
> >>>>                goto out_pm;
> >>>> -     ret = intel_context_reconfigure_sseu(ce, sseu);
> >>>> +     do {
> >>>> +             ret = intel_context_reconfigure_sseu(ce, sseu);
> >>>> +             cond_resched();
> >>>> +     } while (ret == -EAGAIN);
> >>>>        if (ret)
> >>>>                goto out_spin;
> >>>>
> >>>
> >>> Hm I looked at the selftests, saw error is correctly propagated, and
> >>> concluded it will be fine. I missed the problem selftests will not
> >>> actually retry. But wait, can we even count that userspace will if all
> >>> of a sudden ctx.set_param starts returning -EAGAIN sporadically? Feels
> >>> like we may need to revert.
> >>
> >> We invoke the principle of drmIoctl() catches -EAGAIN.
> > 
> > I'm not comfortable with that. :( Not least how we are saying not to use 
> > libdrm. man 2 ioctl does not mention -EAGAIN. :(
> 
> Or duct tape by looping in set_sseu as well?

Could do, the actual atomic insertion doesn't look to horrendous (albeit
yet another atomic op along that path), but will take a fair amount of
rationalising. Certainly another cup of tea worth.

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index dca15ace88f6..181c6a3aaafa 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -753,11 +753,13 @@ __i915_active_fence_set(struct i915_active_fence *active,
        struct dma_fence *prev;
        unsigned long flags;

-       /* NB: must be serialised by an outer timeline mutex (active->lock) */
+       if (fence == rcu_access_pointer(active->fence))
+               return fence;
+
        spin_lock_irqsave(fence->lock, flags);
        GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));

-       prev = rcu_dereference_protected(active->fence, active_is_held(active));
+       prev = xchg((struct dma_fence ** __force)&active->fence, fence);
        if (prev) {
                GEM_BUG_ON(prev == fence);
                spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
@@ -775,12 +777,13 @@ __i915_active_fence_set(struct i915_active_fence *active,
                 * our list_del() [decoupling prev from the callback] or
                 * the callback...
                 */
-               prev = rcu_access_pointer(active->fence);
+               if (!rcu_access_pointer(active->fence)) {
+                       RCU_INIT_POINTER(active->fence, fence);
+                       prev = NULL;
+               }
        }

-       rcu_assign_pointer(active->fence, fence);
        list_add_tail(&active->cb.node, &fence->cb_list);
-
        spin_unlock_irqrestore(fence->lock, flags);

        return prev;

-Chris

drm/i915/selftest: If reconfigure_sseu is busy, try again

Commit Message

Comments

Patch