diff mbox

[RFC,01/11] drm/i915: Early exit from semaphore_waits_for for execlist mode.

Message ID 1433783009-17251-2-git-send-email-tomas.elf@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Tomas Elf June 8, 2015, 5:03 p.m. UTC
When submitting semaphores in execlist mode the hang checker crashes in this
function because it is only runnable in ring submission mode. The reason this
is of particular interest to the TDR patch series is because we use semaphores
as a mean to induce hangs during testing (which is the recommended way to
induce hangs for gen8+). It's not clear how this is supposed to work in
execlist mode since:

1. This function requires a ring buffer.

2. Retrieving a ring buffer in execlist mode requires us to retrieve the
corresponding context, which we get from a request.

3. Retieving a request from the hang checker is not straight-forward since that
requires us to grab the struct_mutex in order to synchronize against the
request retirement thread.

4. Grabbing the struct_mutex from the hang checker is nothing that we will do
since that puts us at risk of deadlock since a hung thread might be holding the
struct_mutex already.

Therefore it's not obvious how we're supposed to deal with this. For now, we're
doing an early exit from this function, which avoids any kernel panic situation
when running our own internal TDR ULT.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Chris Wilson June 8, 2015, 5:36 p.m. UTC | #1
On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
> When submitting semaphores in execlist mode the hang checker crashes in this
> function because it is only runnable in ring submission mode. The reason this
> is of particular interest to the TDR patch series is because we use semaphores
> as a mean to induce hangs during testing (which is the recommended way to
> induce hangs for gen8+). It's not clear how this is supposed to work in
> execlist mode since:
> 
> 1. This function requires a ring buffer.
> 
> 2. Retrieving a ring buffer in execlist mode requires us to retrieve the
> corresponding context, which we get from a request.
> 
> 3. Retieving a request from the hang checker is not straight-forward since that
> requires us to grab the struct_mutex in order to synchronize against the
> request retirement thread.
> 
> 4. Grabbing the struct_mutex from the hang checker is nothing that we will do
> since that puts us at risk of deadlock since a hung thread might be holding the
> struct_mutex already.
> 
> Therefore it's not obvious how we're supposed to deal with this. For now, we're
> doing an early exit from this function, which avoids any kernel panic situation
> when running our own internal TDR ULT.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 46bcbff..40c44fc 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
>  	u64 offset = 0;
>  	int i, backwards;
>  
> +	/*
> +	 * This function does not support execlist mode - any attempt to
> +	 * proceed further into this function will result in a kernel panic
> +	 * when dereferencing ring->buffer, which is not set up in execlist
> +	 * mode.
> +	 *
> +	 * The correct way of doing it would be to derive the currently
> +	 * executing ring buffer from the current context, which is derived
> +	 * from the currently running request. Unfortunately, to get the
> +	 * current request we would have to grab the struct_mutex before doing
> +	 * anything else, which would be ill-advised since some other thread
> +	 * might have grabbed it already and managed to hang itself, causing
> +	 * the hang checker to deadlock.
> +	 *
> +	 * Therefore, this function does not support execlist mode in its
> +	 * current form. Just return NULL and move on.
> +	 */
> +	if (i915.enable_execlists)
> +		return NULL;

if (ring->buffer == NULL)
	return NULL;

if (i915.enable_execlists) is a layering violation that needs to be
erradicated.
-Chris
Tomas Elf June 9, 2015, 11:02 a.m. UTC | #2
On 08/06/2015 18:36, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
>> When submitting semaphores in execlist mode the hang checker crashes in this
>> function because it is only runnable in ring submission mode. The reason this
>> is of particular interest to the TDR patch series is because we use semaphores
>> as a mean to induce hangs during testing (which is the recommended way to
>> induce hangs for gen8+). It's not clear how this is supposed to work in
>> execlist mode since:
>>
>> 1. This function requires a ring buffer.
>>
>> 2. Retrieving a ring buffer in execlist mode requires us to retrieve the
>> corresponding context, which we get from a request.
>>
>> 3. Retieving a request from the hang checker is not straight-forward since that
>> requires us to grab the struct_mutex in order to synchronize against the
>> request retirement thread.
>>
>> 4. Grabbing the struct_mutex from the hang checker is nothing that we will do
>> since that puts us at risk of deadlock since a hung thread might be holding the
>> struct_mutex already.
>>
>> Therefore it's not obvious how we're supposed to deal with this. For now, we're
>> doing an early exit from this function, which avoids any kernel panic situation
>> when running our own internal TDR ULT.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index 46bcbff..40c44fc 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
>>   	u64 offset = 0;
>>   	int i, backwards;
>>
>> +	/*
>> +	 * This function does not support execlist mode - any attempt to
>> +	 * proceed further into this function will result in a kernel panic
>> +	 * when dereferencing ring->buffer, which is not set up in execlist
>> +	 * mode.
>> +	 *
>> +	 * The correct way of doing it would be to derive the currently
>> +	 * executing ring buffer from the current context, which is derived
>> +	 * from the currently running request. Unfortunately, to get the
>> +	 * current request we would have to grab the struct_mutex before doing
>> +	 * anything else, which would be ill-advised since some other thread
>> +	 * might have grabbed it already and managed to hang itself, causing
>> +	 * the hang checker to deadlock.
>> +	 *
>> +	 * Therefore, this function does not support execlist mode in its
>> +	 * current form. Just return NULL and move on.
>> +	 */
>> +	if (i915.enable_execlists)
>> +		return NULL;
>
> if (ring->buffer == NULL)
> 	return NULL;
>
> if (i915.enable_execlists) is a layering violation that needs to be
> erradicated.
> -Chris
>

Fair enough.

Thanks,
Tomas
Daniel Vetter June 16, 2015, 1:44 p.m. UTC | #3
On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
> When submitting semaphores in execlist mode the hang checker crashes in this
> function because it is only runnable in ring submission mode. The reason this
> is of particular interest to the TDR patch series is because we use semaphores
> as a mean to induce hangs during testing (which is the recommended way to
> induce hangs for gen8+). It's not clear how this is supposed to work in
> execlist mode since:
> 
> 1. This function requires a ring buffer.
> 
> 2. Retrieving a ring buffer in execlist mode requires us to retrieve the
> corresponding context, which we get from a request.
> 
> 3. Retieving a request from the hang checker is not straight-forward since that
> requires us to grab the struct_mutex in order to synchronize against the
> request retirement thread.
> 
> 4. Grabbing the struct_mutex from the hang checker is nothing that we will do
> since that puts us at risk of deadlock since a hung thread might be holding the
> struct_mutex already.
> 
> Therefore it's not obvious how we're supposed to deal with this. For now, we're
> doing an early exit from this function, which avoids any kernel panic situation
> when running our own internal TDR ULT.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>

We should have a Testcase: line here which mentions the igt testcase which
provoke this bug. Or we need to fill this gap asap.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 46bcbff..40c44fc 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
>  	u64 offset = 0;
>  	int i, backwards;
>  
> +	/*
> +	 * This function does not support execlist mode - any attempt to
> +	 * proceed further into this function will result in a kernel panic
> +	 * when dereferencing ring->buffer, which is not set up in execlist
> +	 * mode.
> +	 *
> +	 * The correct way of doing it would be to derive the currently
> +	 * executing ring buffer from the current context, which is derived
> +	 * from the currently running request. Unfortunately, to get the
> +	 * current request we would have to grab the struct_mutex before doing
> +	 * anything else, which would be ill-advised since some other thread
> +	 * might have grabbed it already and managed to hang itself, causing
> +	 * the hang checker to deadlock.
> +	 *
> +	 * Therefore, this function does not support execlist mode in its
> +	 * current form. Just return NULL and move on.
> +	 */
> +	if (i915.enable_execlists)
> +		return NULL;
> +
>  	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
>  	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
>  		return NULL;
> -- 
> 1.7.9.5
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Tomas Elf June 16, 2015, 3:46 p.m. UTC | #4
On 16/06/2015 14:44, Daniel Vetter wrote:
> On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
>> When submitting semaphores in execlist mode the hang checker crashes in this
>> function because it is only runnable in ring submission mode. The reason this
>> is of particular interest to the TDR patch series is because we use semaphores
>> as a mean to induce hangs during testing (which is the recommended way to
>> induce hangs for gen8+). It's not clear how this is supposed to work in
>> execlist mode since:
>>
>> 1. This function requires a ring buffer.
>>
>> 2. Retrieving a ring buffer in execlist mode requires us to retrieve the
>> corresponding context, which we get from a request.
>>
>> 3. Retieving a request from the hang checker is not straight-forward since that
>> requires us to grab the struct_mutex in order to synchronize against the
>> request retirement thread.
>>
>> 4. Grabbing the struct_mutex from the hang checker is nothing that we will do
>> since that puts us at risk of deadlock since a hung thread might be holding the
>> struct_mutex already.
>>
>> Therefore it's not obvious how we're supposed to deal with this. For now, we're
>> doing an early exit from this function, which avoids any kernel panic situation
>> when running our own internal TDR ULT.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>
> We should have a Testcase: line here which mentions the igt testcase which
> provoke this bug. Or we need to fill this gap asap.
> -Daniel

You know this better than I do: Is there an IGT test that submits a 
semaphore in execlist mode? Because that's all you need to do to 
reproduce this. We could certainly add one if there is none like that 
already.

Thanks,
Tomas

>
>> ---
>>   drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index 46bcbff..40c44fc 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
>>   	u64 offset = 0;
>>   	int i, backwards;
>>
>> +	/*
>> +	 * This function does not support execlist mode - any attempt to
>> +	 * proceed further into this function will result in a kernel panic
>> +	 * when dereferencing ring->buffer, which is not set up in execlist
>> +	 * mode.
>> +	 *
>> +	 * The correct way of doing it would be to derive the currently
>> +	 * executing ring buffer from the current context, which is derived
>> +	 * from the currently running request. Unfortunately, to get the
>> +	 * current request we would have to grab the struct_mutex before doing
>> +	 * anything else, which would be ill-advised since some other thread
>> +	 * might have grabbed it already and managed to hang itself, causing
>> +	 * the hang checker to deadlock.
>> +	 *
>> +	 * Therefore, this function does not support execlist mode in its
>> +	 * current form. Just return NULL and move on.
>> +	 */
>> +	if (i915.enable_execlists)
>> +		return NULL;
>> +
>>   	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
>>   	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
>>   		return NULL;
>> --
>> 1.7.9.5
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
Chris Wilson June 16, 2015, 4:50 p.m. UTC | #5
On Tue, Jun 16, 2015 at 04:46:05PM +0100, Tomas Elf wrote:
> On 16/06/2015 14:44, Daniel Vetter wrote:
> >On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
> >>When submitting semaphores in execlist mode the hang checker crashes in this
> >>function because it is only runnable in ring submission mode. The reason this
> >>is of particular interest to the TDR patch series is because we use semaphores
> >>as a mean to induce hangs during testing (which is the recommended way to
> >>induce hangs for gen8+). It's not clear how this is supposed to work in
> >>execlist mode since:
> >>
> >>1. This function requires a ring buffer.
> >>
> >>2. Retrieving a ring buffer in execlist mode requires us to retrieve the
> >>corresponding context, which we get from a request.
> >>
> >>3. Retieving a request from the hang checker is not straight-forward since that
> >>requires us to grab the struct_mutex in order to synchronize against the
> >>request retirement thread.
> >>
> >>4. Grabbing the struct_mutex from the hang checker is nothing that we will do
> >>since that puts us at risk of deadlock since a hung thread might be holding the
> >>struct_mutex already.
> >>
> >>Therefore it's not obvious how we're supposed to deal with this. For now, we're
> >>doing an early exit from this function, which avoids any kernel panic situation
> >>when running our own internal TDR ULT.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >
> >We should have a Testcase: line here which mentions the igt testcase which
> >provoke this bug. Or we need to fill this gap asap.
> >-Daniel
> 
> You know this better than I do: Is there an IGT test that submits a
> semaphore in execlist mode? Because that's all you need to do to
> reproduce this. We could certainly add one if there is none like
> that already.

No, we don't have anything submitting a hanging semaphore from
userspace or igt specifically.
-Chris
Tomas Elf June 16, 2015, 5:07 p.m. UTC | #6
On 16/06/2015 17:50, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 04:46:05PM +0100, Tomas Elf wrote:
>> On 16/06/2015 14:44, Daniel Vetter wrote:
>>> On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
>>>> When submitting semaphores in execlist mode the hang checker crashes in this
>>>> function because it is only runnable in ring submission mode. The reason this
>>>> is of particular interest to the TDR patch series is because we use semaphores
>>>> as a mean to induce hangs during testing (which is the recommended way to
>>>> induce hangs for gen8+). It's not clear how this is supposed to work in
>>>> execlist mode since:
>>>>
>>>> 1. This function requires a ring buffer.
>>>>
>>>> 2. Retrieving a ring buffer in execlist mode requires us to retrieve the
>>>> corresponding context, which we get from a request.
>>>>
>>>> 3. Retieving a request from the hang checker is not straight-forward since that
>>>> requires us to grab the struct_mutex in order to synchronize against the
>>>> request retirement thread.
>>>>
>>>> 4. Grabbing the struct_mutex from the hang checker is nothing that we will do
>>>> since that puts us at risk of deadlock since a hung thread might be holding the
>>>> struct_mutex already.
>>>>
>>>> Therefore it's not obvious how we're supposed to deal with this. For now, we're
>>>> doing an early exit from this function, which avoids any kernel panic situation
>>>> when running our own internal TDR ULT.
>>>>
>>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>>
>>> We should have a Testcase: line here which mentions the igt testcase which
>>> provoke this bug. Or we need to fill this gap asap.
>>> -Daniel
>>
>> You know this better than I do: Is there an IGT test that submits a
>> semaphore in execlist mode? Because that's all you need to do to
>> reproduce this. We could certainly add one if there is none like
>> that already.
>
> No, we don't have anything submitting a hanging semaphore from
> userspace or igt specifically.
> -Chris
>

At first I thought that it would be ok to just submit any semaphore but 
I guess it would have to be a hanging semaphore specifically. Or at 
least a semaphore that does not progress ACTHD from one hang check 
period to the following (seeing as we check for ACTHD progression in 
ring_stuck() and then call semaphore_passed() that calls 
semaphore_waits_for() if ACTHD hasn't moved).

Fine, we'll have to add that then.

Thanks,
Tomas
Daniel Vetter June 17, 2015, 11:43 a.m. UTC | #7
On Tue, Jun 16, 2015 at 04:46:05PM +0100, Tomas Elf wrote:
> On 16/06/2015 14:44, Daniel Vetter wrote:
> >On Mon, Jun 08, 2015 at 06:03:19PM +0100, Tomas Elf wrote:
> >>When submitting semaphores in execlist mode the hang checker crashes in this
> >>function because it is only runnable in ring submission mode. The reason this
> >>is of particular interest to the TDR patch series is because we use semaphores
> >>as a mean to induce hangs during testing (which is the recommended way to
> >>induce hangs for gen8+). It's not clear how this is supposed to work in
> >>execlist mode since:
> >>
> >>1. This function requires a ring buffer.
> >>
> >>2. Retrieving a ring buffer in execlist mode requires us to retrieve the
> >>corresponding context, which we get from a request.
> >>
> >>3. Retieving a request from the hang checker is not straight-forward since that
> >>requires us to grab the struct_mutex in order to synchronize against the
> >>request retirement thread.
> >>
> >>4. Grabbing the struct_mutex from the hang checker is nothing that we will do
> >>since that puts us at risk of deadlock since a hung thread might be holding the
> >>struct_mutex already.
> >>
> >>Therefore it's not obvious how we're supposed to deal with this. For now, we're
> >>doing an early exit from this function, which avoids any kernel panic situation
> >>when running our own internal TDR ULT.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >
> >We should have a Testcase: line here which mentions the igt testcase which
> >provoke this bug. Or we need to fill this gap asap.
> >-Daniel
> 
> You know this better than I do: Is there an IGT test that submits a
> semaphore in execlist mode? Because that's all you need to do to reproduce
> this. We could certainly add one if there is none like that already.

Not that I know of, so needs to be added. But it's a security fix and
should go to -fixes+cc:stable since any kind of userspace could Oops the
kernel with this.
-Daniel

> 
> Thanks,
> Tomas
> 
> >
> >>---
> >>  drivers/gpu/drm/i915/i915_irq.c |   20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> >>index 46bcbff..40c44fc 100644
> >>--- a/drivers/gpu/drm/i915/i915_irq.c
> >>+++ b/drivers/gpu/drm/i915/i915_irq.c
> >>@@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
> >>  	u64 offset = 0;
> >>  	int i, backwards;
> >>
> >>+	/*
> >>+	 * This function does not support execlist mode - any attempt to
> >>+	 * proceed further into this function will result in a kernel panic
> >>+	 * when dereferencing ring->buffer, which is not set up in execlist
> >>+	 * mode.
> >>+	 *
> >>+	 * The correct way of doing it would be to derive the currently
> >>+	 * executing ring buffer from the current context, which is derived
> >>+	 * from the currently running request. Unfortunately, to get the
> >>+	 * current request we would have to grab the struct_mutex before doing
> >>+	 * anything else, which would be ill-advised since some other thread
> >>+	 * might have grabbed it already and managed to hang itself, causing
> >>+	 * the hang checker to deadlock.
> >>+	 *
> >>+	 * Therefore, this function does not support execlist mode in its
> >>+	 * current form. Just return NULL and move on.
> >>+	 */
> >>+	if (i915.enable_execlists)
> >>+		return NULL;
> >>+
> >>  	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
> >>  	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
> >>  		return NULL;
> >>--
> >>1.7.9.5
> >>
> >>_______________________________________________
> >>Intel-gfx mailing list
> >>Intel-gfx@lists.freedesktop.org
> >>http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> >
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 46bcbff..40c44fc 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2698,6 +2698,26 @@  semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
 	u64 offset = 0;
 	int i, backwards;
 
+	/*
+	 * This function does not support execlist mode - any attempt to
+	 * proceed further into this function will result in a kernel panic
+	 * when dereferencing ring->buffer, which is not set up in execlist
+	 * mode.
+	 *
+	 * The correct way of doing it would be to derive the currently
+	 * executing ring buffer from the current context, which is derived
+	 * from the currently running request. Unfortunately, to get the
+	 * current request we would have to grab the struct_mutex before doing
+	 * anything else, which would be ill-advised since some other thread
+	 * might have grabbed it already and managed to hang itself, causing
+	 * the hang checker to deadlock.
+	 *
+	 * Therefore, this function does not support execlist mode in its
+	 * current form. Just return NULL and move on.
+	 */
+	if (i915.enable_execlists)
+		return NULL;
+
 	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
 	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
 		return NULL;