[2/2] drm/i915: Set guilty-flag on fence after detecting a hang

Message ID 20170103110533.31880-2-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson Jan. 3, 2017, 11:05 a.m. UTC
The struct dma_fence carries a status field exposed to userspace by
sync_file. This is inspected after the fence is signaled and can convey
whether or not the request completed successfully, or in our case if we
detected a hang during the request (signaled via -EIO in
SYNC_IOC_FILE_INFO).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
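
For context: once this lands, a client that has exported the request as a
sync_file can inspect the error after the fence signals. A minimal sketch of
that userspace check (the fd-export step itself is outside this patch, and
check_fence_status is a hypothetical helper name):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* Hypothetical helper: assumes "fd" is a sync_file exported for the
 * request of interest. info.status reads 0 while the fence is still
 * active, 1 once it has signaled successfully, or a negative errno
 * (-EIO here if the kernel found the request guilty of a hang). */
static int check_fence_status(int fd)
{
        struct sync_file_info info = {};

        if (ioctl(fd, SYNC_IOC_FILE_INFO, &info) < 0)
                return -errno;

        return info.status < 0 ? info.status : 0;
}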

Comments

Tvrtko Ursulin Jan. 3, 2017, 11:34 a.m. UTC | #1
On 03/01/2017 11:05, Chris Wilson wrote:
> The struct dma_fence carries a status field exposed to userspace by
> sync_file. This is inspected after the fence is signaled and can convey
> whether or not the request completed successfully, or in our case if we
> detected a hang during the request (signaled via -EIO in
> SYNC_IOC_FILE_INFO).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 204c4a673bf3..bc99c0e292d8 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>  		ring_hung = false;
>  	}
>
> -	if (ring_hung)
> +	if (ring_hung) {
>  		i915_gem_context_mark_guilty(request->ctx);
> -	else
> +		request->fence.status = -EIO;
> +	} else {
>  		i915_gem_context_mark_innocent(request->ctx);
> +	}
>
>  	if (!ring_hung)
>  		return;
>

Reading what happens later in this function, should we set the status of 
all the other requests we are about to clear?

However, one thing I don't understand is how this scheme interacts with 
the current userspace. We will clear/no-nop some of the submitted 
requests since the state is corrupt. But how will userspace notice this 
before it submits more requests?

Regards,

Tvrtko
Chris Wilson Jan. 3, 2017, 11:46 a.m. UTC | #2
On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 11:05, Chris Wilson wrote:
> >[snip]
> 
> Reading what happens later in this function, should we set the
> status of all the other requests we are about to clear?
> 
> However, one thing I don't understand is how this scheme interacts
> with the current userspace. We will clear/no-nop some of the
> submitted requests since the state is corrupt. But how will
> userspace notice this before it submits more requests?

There is no mechanism currently for user space to be able to detect a
hung request. (It can use the uevent for async notification of the
hang/reset, but that will not tell you who caused the hang.) Userspace
can track the number of hangs it caused, but the delay makes any
roundtripping impractical (i.e. you have to synchronise before all
rendering if you must detect the event immediately). Note also that we
do not want to give out interprocess information (i.e. to allow one
client to spy on another), which makes things harder to get right.
-Chris
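
For reference, the reset-count polling described above is the existing
robustness query. A rough sketch of how a client might use it, assuming
libdrm's drmIoctl() and the DRM_IOCTL_I915_GET_RESET_STATS interface
(include paths vary by setup; error handling trimmed):

#include <errno.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Sketch: query whether any reset has touched the given context.
 * batch_active counts hangs this context caused; batch_pending
 * counts innocent batches lost to someone else's hang. It says
 * nothing about *which* request was affected, nor when. */
static int context_was_reset(int fd, __u32 ctx_id)
{
        struct drm_i915_reset_stats stats = { .ctx_id = ctx_id };

        if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
                return -errno;

        return stats.batch_active || stats.batch_pending;
}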
Tvrtko Ursulin Jan. 3, 2017, 11:57 a.m. UTC | #3
On 03/01/2017 11:46, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:05, Chris Wilson wrote:
>>> [snip]
>>
>> Reading what happens later in this function, should we set the
>> status of all the other requests we are about to clear?
>>
>> However, one thing I don't understand is how this scheme interacts
>> with the current userspace. We will clear/no-nop some of the
>> submitted requests since the state is corrupt. But how will
>> userspace notice this before it submits more requests?
>
> There is no mechanism currently for user space to be able to detect a
> hung request. (It can use the uevent for async notification of the
> hang/reset, but that will not tell you who caused the hang.) Userspace
> can track the number of hangs it caused, but the delay makes any
> roundtripping impractical (i.e. you have to synchronise before all
> rendering if you must detect the event immediately). Note also that we
> do not want to give out interprocess information (i.e. to allow one
> client to spy on another), which makes things harder to get right.

So the idea is to clear already submitted requests _if_ the userspace is 
synchronising before all rendering and looking at reset stats, to make 
it theoretically possible to detect the corrupt state?

Still, with the fences, do you agree the error status needs to be set on those 
as well?

Regards,

Tvrtko
Chris Wilson Jan. 3, 2017, 12:13 p.m. UTC | #4
On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 11:46, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>[snip]
> >>
> >>Reading what happens later in this function, should we set the
> >>status of all the other requests we are about to clear?
> >>
> >>However, one thing I don't understand is how this scheme interacts
> >>with the current userspace. We will clear/no-nop some of the
> >>submitted requests since the state is corrupt. But how will
> >>userspace notice this before it submits more requests?
> >
> >There is no mechanism currently for user space to be able to detect a
> >hung request. (It can use the uevent for async notification of the
> >hang/reset, but that will not tell you who caused the hang.) Userspace
> >can track the number of hangs it caused, but the delay makes any
> >roundtripping impractical (i.e. you have to synchronise before all
> >rendering if you must detect the event immediately). Note also that we
> >do not want to give out interprocess information (i.e. to allow one
> >client to spy on another), which makes things harder to get right.
> 
> So the idea is to clear already submitted requests _if_ the userspace is
> synchronising before all rendering and looking at reset stats, to
> make it theoretically possible to detect the corrupt state?

No, I just don't see a way that userspace can detect the hang without
testing after seeing the request signaled (either by waiting on the
batch or by waiting on the fence), i.e. by being completely synchronous
(or at least choosing its synchronous points very carefully, such as
around IPC). It can either poll reset-count or use sync_file (which
requires fence exporting).

The current robustness interface is a basic query on whether any reset
occurred within the context, not when.
 
> Still, with the fences, do you agree the error status needs to be set on
> those as well?

Yes.
-Chris
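
A hedged sketch of that "mark the others too" direction inside
i915_gem_reset_engine(). The helper name is invented, and the list and
field names are assumptions based on the surrounding code of this era,
not the actual follow-up patch:

/* Illustrative only: after flagging the guilty request, also flag
 * every later request on the engine's timeline, i.e. the requests
 * that are about to be cleared/no-oped. */
static void mark_remaining_requests(struct intel_engine_cs *engine,
                                    struct drm_i915_gem_request *hung)
{
        struct drm_i915_gem_request *request = hung;

        list_for_each_entry_continue(request,
                                     &engine->timeline->requests, link)
                request->fence.status = -EIO;
}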
Tvrtko Ursulin Jan. 3, 2017, 12:34 p.m. UTC | #5
On 03/01/2017 12:13, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:46, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>> [snip]
>>>>
>>>> Reading what happens later in this function, should we set the
>>>> status of all the other requests we are about to clear?
>>>>
>>>> However, one thing I don't understand is how this scheme interacts
>>>> with the current userspace. We will clear/no-nop some of the
>>>> submitted requests since the state is corrupt. But how will
>>>> userspace notice this before it submits more requests?
>>>
>>> There is no mechanism currently for user space to be able to detect a
>>> hung request. (It can use the uevent for async notification of the
>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>> can track the number of hangs it caused, but the delay makes any
>>> roundtripping impractical (i.e. you have to synchronise before all
>>> rendering if you must detect the event immediately). Note also that we
>>> do not want to give out interprocess information (i.e. to allow one
>>> client to spy on another), which makes things harder to get right.
>>
>> So the idea is to clear already submitted requests _if_ the userspace is
>> synchronising before all rendering and looking at reset stats, to
>> make it theoretically possible to detect the corrupt state?
>
> No, I just don't see a way that userspace can detect the hang without
> testing after seeing the request signaled (either by waiting on the
> batch or by waiting on the fence), i.e. by being completely synchronous
> (or at least choosing its synchronous points very carefully, such as
> around IPC). It can either poll reset-count or use sync_file (which
> requires fence exporting).
>
> The current robustness interface is a basic query on whether any reset
> occurred within the context, not when.

Why do we bother with clearing the submitted requests then?

Regards,

Tvrtko
Chris Wilson Jan. 3, 2017, 12:38 p.m. UTC | #6
On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 12:13, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 11:46, Chris Wilson wrote:
> >>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>>>[snip]
> >>>>
> >>>>Reading what happens later in this function, should we set the
> >>>>status of all the other requests we are about to clear?
> >>>>
> >>>>However, one thing I don't understand is how this scheme interacts
> >>>>with the current userspace. We will clear/no-nop some of the
> >>>>submitted requests since the state is corrupt. But how will
> >>>>userspace notice this before it submits more requests?
> >>>
> >>>There is no mechanism currently for user space to be able to detect a
> >>>hung request. (It can use the uevent for async notification of the
> >>>hang/reset, but that will not tell you who caused the hang.) Userspace
> >>>can track the number of hangs it caused, but the delay makes any
> >>>roundtripping impractical (i.e. you have to synchronise before all
> >>>rendering if you must detect the event immediately). Note also that we
> >>>do not want to give out interprocess information (i.e. to allow one
> >>>client to spy on another), which makes things harder to get right.
> >>
> >>So the idea is to clear already submitted requests _if_ the userspace is
> >>synchronising before all rendering and looking at reset stats, to
> >>make it theoretically possible to detect the corrupt state?
> >
> >No, I just don't see a way that userspace can detect the hang without
> >testing after seeing the request signaled (either by waiting on the
> >batch or by waiting on the fence), i.e. by being completely synchronous
> >(or at least choosing its synchronous points very carefully, such as
> >around IPC). It can either poll reset-count or use sync_file (which
> >requires fence exporting).
> >
> >The current robustness interface is a basic query on whether any reset
> >occurred within the context, not when.
> 
> Why do we bother with clearing the submitted requests then?

The same reason we ban processes from submitting new requests if they
cause repeated hangs. If, before we ban that client, it has already
submitted 1000 hanging requests, it has successfully locked the machine
up for a couple of hours.
-Chris
Tvrtko Ursulin Jan. 3, 2017, 1:17 p.m. UTC | #7
On 03/01/2017 12:38, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 12:13, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:46, Chris Wilson wrote:
>>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>>>> [snip]
>>>>>>
>>>>>> Reading what happens later in this function, should we set the
>>>>>> status of all the other requests we are about to clear?
>>>>>>
>>>>>> However, one thing I don't understand is how this scheme interacts
>>>>>> with the current userspace. We will clear/no-nop some of the
>>>>>> submitted requests since the state is corrupt. But how will
>>>>>> userspace notice this before it submits more requests?
>>>>>
>>>>> There is no mechanism currently for user space to be able to detect a
>>>>> hung request. (It can use the uevent for async notification of the
>>>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>>>> can track the number of hangs it caused, but the delay makes any
>>>>> roundtripping impractical (i.e. you have to synchronise before all
>>>>> rendering if you must detect the event immediately). Note also that we
>>>>> do not want to give out interprocess information (i.e. to allow one
>>>>> client to spy on another), which makes things harder to get right.
>>>>
>>>> So the idea is to clear already submitted requests _if_ the userspace is
>>>> synchronising before all rendering and looking at reset stats, to
>>>> make it theoretically possible to detect the corrupt state?
>>>
>>> No, I just don't see a way that userspace can detect the hang without
>>> testing after seeing the request signaled (either by waiting on the
>>> batch or by waiting on the fence), i.e. by being completely synchronous
>>> (or at least choosing its synchronous points very carefully, such as
>>> around IPC). It can either poll reset-count or use sync_file (which
>>> requires fence exporting).
>>>
>>> The current robustness interface is a basic query on whether any reset
>>> occurred within the context, not when.
>>
>> Why do we bother with clearing the submitted requests then?
>
> The same reason we ban processes from submitting new requests if they
> cause repeated hangs. If, before we ban that client, it has already
> submitted 1000 hanging requests, it has successfully locked the machine
> up for a couple of hours.

So we would need to gate the clearing on the transition to the banned 
state, I think, because currently it is done unconditionally.

Regards,

Tvrtko
Chris Wilson Jan. 3, 2017, 1:25 p.m. UTC | #8
On Tue, Jan 03, 2017 at 01:17:19PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 12:38, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 12:13, Chris Wilson wrote:
> >>>On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 03/01/2017 11:46, Chris Wilson wrote:
> >>>>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>>>>>
> >>>>>>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>>>>>[snip]
> >>>>>>
> >>>>>>Reading what happens later in this function, should we set the
> >>>>>>status of all the other requests we are about to clear?
> >>>>>>
> >>>>>>However, one thing I don't understand is how this scheme interacts
> >>>>>>with the current userspace. We will clear/no-nop some of the
> >>>>>>submitted requests since the state is corrupt. But how will
> >>>>>>userspace notice this before it submits more requests?
> >>>>>
> >>>>>There is no mechanism currently for user space to be able to detect a
> >>>>>hung request. (It can use the uevent for async notification of the
> >>>>>hang/reset, but that will not tell you who caused the hang.) Userspace
> >>>>>can track the number of hangs it caused, but the delay makes any
> >>>>>roundtripping impractical (i.e. you have to synchronise before all
> >>>>>rendering if you must detect the event immediately). Note also that we
> >>>>>do not want to give out interprocess information (i.e. to allow one
> >>>>>client to spy on another), which makes things harder to get right.
> >>>>
> >>>>So the idea is to clear already submitted requests _if_ the userspace is
> >>>>synchronising before all rendering and looking at reset stats, to
> >>>>make it theoretically possible to detect the corrupt state?
> >>>
> >>>No, I just don't see a way that userspace can detect the hang without
> >>>testing after seeing the request signaled (either by waiting on the
> >>>batch or by waiting on the fence), i.e. by being completely synchronous
> >>>(or at least choosing its synchronous points very carefully, such as
> >>>around IPC). It can either poll reset-count or use sync_file (which
> >>>requires fence exporting).
> >>>
> >>>The current robustness interface is a basic query on whether any reset
> >>>occurred within the context, not when.
> >>
> >>Why do we bother with clearing the submitted requests then?
> >
> >The same reason we ban processes from submitting new requests if they
> >cause repeated hangs. If, before we ban that client, it has already
> >submitted 1000 hanging requests, it has successfully locked the machine
> >up for a couple of hours.
> 
> So we would need to gate the clearing on the transition to the banned
> state, I think, because currently it is done unconditionally.

Yes, see the other patch :)
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 204c4a673bf3..bc99c0e292d8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 		ring_hung = false;
 	}
 
-	if (ring_hung)
+	if (ring_hung) {
 		i915_gem_context_mark_guilty(request->ctx);
-	else
+		request->fence.status = -EIO;
+	} else {
 		i915_gem_context_mark_innocent(request->ctx);
+	}
 
 	if (!ring_hung)
 		return;
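
For reference, sync_file only surfaces this status once the fence has
signaled. A simplified sketch of the per-fence info fill on the sync_file
side (condensed from the sync_file code of this era; fill_fence_status is
an invented name):

/* Simplified: the driver-set status only becomes visible through
 * SYNC_IOC_FILE_INFO after the fence has signaled. */
static void fill_fence_status(struct dma_fence *fence,
                              struct sync_fence_info *info)
{
        if (dma_fence_is_signaled(fence))
                /* 1 = success, negative errno (e.g. -EIO) = error */
                info->status = fence->status >= 0 ? 1 : fence->status;
        else
                info->status = 0; /* still active */
}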