diff mbox

drm/i915: fix DP AUX errors due to false timeouts when using wait_event_timeout

Message ID 1367485203-4533-1-git-send-email-imre.deak@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Imre Deak May 2, 2013, 9 a.m. UTC
Due to possible scheduling latencies wait_event_timeout doesn't
guarantee a non-zero return value, even if the condition becomes true
before the specified timeout expires. Thus we can incorrectly signal a
timeout and abort a DP AUX transaction.

If wait_event_timeout returns 0, it's guaranteed that at least the
specified timeout (minus one jiffies, see below) had passed, so we can
fix this by checking the condition explicitly in this case.

Also the timeout that wait_event_timeout() is guaranteed to wait if the
condition doesn't become true is one less jiffies than what is passed to
it as a parameter. This is because the absolute expiration time in
schedule_timeout() may be calculated at a moment close to the next
scheduling tick, when jiffies is incremented. So make sure we pass always
a jiffies value of 2 or greater. Here this makes a difference only for
HZ=100.

This fixes DP AUX errors I saw during booting on an ILK.

This should ideally be fixed in wait_event_timeout(), but that can take
a while. Until that's done use this fix as a band-aid.

Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/intel_dp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Chris Wilson May 2, 2013, 9:17 a.m. UTC | #1
On Thu, May 02, 2013 at 12:00:03PM +0300, Imre Deak wrote:
> Due to possible scheduling latencies wait_event_timeout doesn't
> guarantee a non-zero return value, even if the condition becomes true
> before the specified timeout expires. Thus we can incorrectly signal a
> timeout and abort a DP AUX transaction.
> 
> If wait_event_timeout returns 0, it's guaranteed that at least the
> specified timeout (minus one jiffies, see below) had passed, so we can
> fix this by checking the condition explicitly in this case.
> 
> Also the timeout that wait_event_timeout() is guaranteed to wait if the
> condition doesn't become true is one less jiffies than what is passed to
> it as a parameter. This is because the absolute expiration time in
> schedule_timeout() may be calculated at a moment close to the next
> scheduling tick, when jiffies is incremented. So make sure we pass always
> a jiffies value of 2 or greater. Here this makes a difference only for
> HZ=100.
> 
> This fixes DP AUX errors I saw during booting on an ILK.
> 
> This should ideally be fixed in wait_event_timeout(), but that can take
> a while. Until that's done use this fix as a band-aid.

As we have 3 such vulnerable callsite in our driver alone, perhaps we
should push for your general fix.
-Chris
Daniel Vetter May 2, 2013, 9:31 a.m. UTC | #2
On Thu, May 2, 2013 at 11:17 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, May 02, 2013 at 12:00:03PM +0300, Imre Deak wrote:
>> Due to possible scheduling latencies wait_event_timeout doesn't
>> guarantee a non-zero return value, even if the condition becomes true
>> before the specified timeout expires. Thus we can incorrectly signal a
>> timeout and abort a DP AUX transaction.
>>
>> If wait_event_timeout returns 0, it's guaranteed that at least the
>> specified timeout (minus one jiffies, see below) had passed, so we can
>> fix this by checking the condition explicitly in this case.
>>
>> Also the timeout that wait_event_timeout() is guaranteed to wait if the
>> condition doesn't become true is one less jiffies than what is passed to
>> it as a parameter. This is because the absolute expiration time in
>> schedule_timeout() may be calculated at a moment close to the next
>> scheduling tick, when jiffies is incremented. So make sure we pass always
>> a jiffies value of 2 or greater. Here this makes a difference only for
>> HZ=100.
>>
>> This fixes DP AUX errors I saw during booting on an ILK.
>>
>> This should ideally be fixed in wait_event_timeout(), but that can take
>> a while. Until that's done use this fix as a band-aid.
>
> As we have 3 such vulnerable callsite in our driver alone, perhaps we
> should push for your general fix.

Yeah, I think this should be fixed in general, doesn't really make
much sense given that the jiffies value is relative this way. Also,
the actual timeout we need to wait is just 400 us, so I think we're
pretty safe here. Can you please resumbit just the recheck part of
your patch?

Also, can you do the same fix for gmbus_wait_idle in intel_i2c.c please?

Thanks, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter May 2, 2013, 9:33 a.m. UTC | #3
On Thu, May 2, 2013 at 11:31 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, May 2, 2013 at 11:17 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> On Thu, May 02, 2013 at 12:00:03PM +0300, Imre Deak wrote:
>>> Due to possible scheduling latencies wait_event_timeout doesn't
>>> guarantee a non-zero return value, even if the condition becomes true
>>> before the specified timeout expires. Thus we can incorrectly signal a
>>> timeout and abort a DP AUX transaction.
>>>
>>> If wait_event_timeout returns 0, it's guaranteed that at least the
>>> specified timeout (minus one jiffies, see below) had passed, so we can
>>> fix this by checking the condition explicitly in this case.
>>>
>>> Also the timeout that wait_event_timeout() is guaranteed to wait if the
>>> condition doesn't become true is one less jiffies than what is passed to
>>> it as a parameter. This is because the absolute expiration time in
>>> schedule_timeout() may be calculated at a moment close to the next
>>> scheduling tick, when jiffies is incremented. So make sure we pass always
>>> a jiffies value of 2 or greater. Here this makes a difference only for
>>> HZ=100.
>>>
>>> This fixes DP AUX errors I saw during booting on an ILK.
>>>
>>> This should ideally be fixed in wait_event_timeout(), but that can take
>>> a while. Until that's done use this fix as a band-aid.
>>
>> As we have 3 such vulnerable callsite in our driver alone, perhaps we
>> should push for your general fix.
>
> Yeah, I think this should be fixed in general, doesn't really make
> much sense given that the jiffies value is relative this way. Also,
> the actual timeout we need to wait is just 400 us, so I think we're
> pretty safe here. Can you please resumbit just the recheck part of
> your patch?
>
> Also, can you do the same fix for gmbus_wait_idle in intel_i2c.c please?

Oops, just now I've seen your generic patch. Yeah, I like that much better ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Imre Deak May 2, 2013, 9:33 a.m. UTC | #4
On Thu, 2013-05-02 at 10:17 +0100, Chris Wilson wrote:
> On Thu, May 02, 2013 at 12:00:03PM +0300, Imre Deak wrote:
> > Due to possible scheduling latencies wait_event_timeout doesn't
> > guarantee a non-zero return value, even if the condition becomes true
> > before the specified timeout expires. Thus we can incorrectly signal a
> > timeout and abort a DP AUX transaction.
> > 
> > If wait_event_timeout returns 0, it's guaranteed that at least the
> > specified timeout (minus one jiffies, see below) had passed, so we can
> > fix this by checking the condition explicitly in this case.
> > 
> > Also the timeout that wait_event_timeout() is guaranteed to wait if the
> > condition doesn't become true is one less jiffies than what is passed to
> > it as a parameter. This is because the absolute expiration time in
> > schedule_timeout() may be calculated at a moment close to the next
> > scheduling tick, when jiffies is incremented. So make sure we pass always
> > a jiffies value of 2 or greater. Here this makes a difference only for
> > HZ=100.
> > 
> > This fixes DP AUX errors I saw during booting on an ILK.
> > 
> > This should ideally be fixed in wait_event_timeout(), but that can take
> > a while. Until that's done use this fix as a band-aid.
> 
> As we have 3 such vulnerable callsite in our driver alone, perhaps we
> should push for your general fix.

Yes, I've sent a patch for wait_event_timeout() to start the discussion,
I'm not sure how long it will take..

--Imre
Imre Deak May 2, 2013, 9:37 a.m. UTC | #5
On Thu, 2013-05-02 at 11:33 +0200, Daniel Vetter wrote:
> On Thu, May 2, 2013 at 11:31 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Thu, May 2, 2013 at 11:17 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> On Thu, May 02, 2013 at 12:00:03PM +0300, Imre Deak wrote:
> >>> Due to possible scheduling latencies wait_event_timeout doesn't
> >>> guarantee a non-zero return value, even if the condition becomes true
> >>> before the specified timeout expires. Thus we can incorrectly signal a
> >>> timeout and abort a DP AUX transaction.
> >>>
> >>> If wait_event_timeout returns 0, it's guaranteed that at least the
> >>> specified timeout (minus one jiffies, see below) had passed, so we can
> >>> fix this by checking the condition explicitly in this case.
> >>>
> >>> Also the timeout that wait_event_timeout() is guaranteed to wait if the
> >>> condition doesn't become true is one less jiffies than what is passed to
> >>> it as a parameter. This is because the absolute expiration time in
> >>> schedule_timeout() may be calculated at a moment close to the next
> >>> scheduling tick, when jiffies is incremented. So make sure we pass always
> >>> a jiffies value of 2 or greater. Here this makes a difference only for
> >>> HZ=100.
> >>>
> >>> This fixes DP AUX errors I saw during booting on an ILK.
> >>>
> >>> This should ideally be fixed in wait_event_timeout(), but that can take
> >>> a while. Until that's done use this fix as a band-aid.
> >>
> >> As we have 3 such vulnerable callsite in our driver alone, perhaps we
> >> should push for your general fix.
> >
> > Yeah, I think this should be fixed in general, doesn't really make
> > much sense given that the jiffies value is relative this way. Also,
> > the actual timeout we need to wait is just 400 us, so I think we're
> > pretty safe here. Can you please resumbit just the recheck part of
> > your patch?

Well, without the +1 to msecs_to_jiffies(10) part, I still can trigger
the problem on HZ=100. Since w/o that we pass 1 jiffies and wait less
than 400us..

> >
> > Also, can you do the same fix for gmbus_wait_idle in intel_i2c.c please?

Yes, can do this.

--Imre

> 
> Oops, just now I've seen your generic patch. Yeah, I like that much better ;-)
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index 8759fb1..7ce96c5 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -303,10 +303,10 @@  intel_dp_aux_wait_done(struct intel_dp *intel_dp, bool has_aux_irq)
 #define C (((status = I915_READ_NOTRACE(ch_ctl)) & DP_AUX_CH_CTL_SEND_BUSY) == 0)
 	if (has_aux_irq)
 		done = wait_event_timeout(dev_priv->gmbus_wait_queue, C,
-					  msecs_to_jiffies(10));
+					  msecs_to_jiffies(10) + 1);
 	else
 		done = wait_for_atomic(C, 10) == 0;
-	if (!done)
+	if (!done && !(C))
 		DRM_ERROR("dp aux hw did not signal timeout (has irq: %i)!\n",
 			  has_aux_irq);
 #undef C