[RFC,10/11] drm/i915: Debugfs interface for per-engine hang recovery.

Message ID 1433783009-17251-11-git-send-email-tomas.elf@intel.com (mailing list archive)
State New, archived

Commit Message

Tomas Elf June 8, 2015, 5:03 p.m. UTC
1. The i915_wedged_set function allows us to schedule three forms of hang recovery:

	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
	GPU reset recovery path.

	b) Single engine hang recovery: By passing an engine ID in the interval
	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
	engine assuming that the context submission consistency requirements
	are met (otherwise the hang recovery path will simply exit early and
	wait for another hang detection). The values are assumed to use up bits
	3:0 only since we certainly do not support as many as 16 engines.

	This mode is supported since there are several legacy test applications
	that rely on this interface.

	c) Multiple engine hang recovery: By passing in an engine flag mask in
	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
	engine 1 = VCS etc) we can schedule any combination of engine hang
	recoveries as we please. For example, by passing in the value 0x3 << 8
	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
	the same time.

	If bits in fields 3:0 and 31:8 are both used then single engine hang
	recovery mode takes precedence and bits 31:8 are ignored.
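
	As an illustration, a minimal userspace sketch (a hypothetical
	helper, not part of this patch) of how a test could encode the
	two formats before writing the value to the debugfs entry:

		#include <stdint.h>

		/* Single engine mode: engine index in bits 3:0. */
		static uint64_t wedged_single_engine(unsigned int engine_id)
		{
			return engine_id & 0xF;
		}

		/* Mask mode: engine flags in bits 31:8, bit 8 = engine 0 (RCS). */
		static uint64_t wedged_engine_mask(uint32_t engine_flags)
		{
			return (uint64_t)engine_flags << 8;
		}

	Writing wedged_engine_mask(0x3), i.e. 0x3 << 8, to i915_wedged
	would then request recovery of engines 0 and 1 (RCS and VCS)
	together, as in the example above.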

2. The i915_wedged_get function produces a set of statistics related to:

	a) Number of engine hangs detected by periodic hang checker.
	b) Number of watchdog timeout hangs detected.
	c) Number of full GPU resets carried out.
	d) Number of engine resets carried out.

	These statistics are presented in a parser-friendly format and are
	used by the TDR ULT to poll system behaviour and validate test outcomes.
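
	The counters are exposed as a single comma-separated line. An
	illustrative example of the format (counter values made up, list
	abbreviated to the first two engines):

		GPU=0x00000001,RCS=0x00000002,VCS=0x00000000,...,RCS_T=0x00000001,VCS_T=0x00000000,...,RCS_W=0x00000000,VCS_W=0x00000000,...

	Here GPU is the full GPU reset count, <engine> the per-engine
	reset count, <engine>_T the per-engine hang detection count and
	<engine>_W the per-engine watchdog timeout count.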

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c |  146 +++++++++++++++++++++++++++++++++--
 1 file changed, 141 insertions(+), 5 deletions(-)

Comments

Chris Wilson June 8, 2015, 5:45 p.m. UTC | #1
On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
> 1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
> 
> 	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
> 	GPU reset recovery path.
> 
> 	b) Single engine hang recovery: By passing an engine ID in the interval
> 	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
> 	engine assuming that the context submission consistency requirements
> 	are met (otherwise the hang recovery path will simply exit early and
> 	wait for another hang detection). The values are assumed to use up bits
> 	3:0 only since we certainly do not support as many as 16 engines.
> 
> 	This mode is supported since there are several legacy test applications
> 	that rely on this interface.

Are there? I don't see them in igt - and let's not start making debugfs
ABI.
 
> 	c) Multiple engine hang recovery: By passing in an engine flag mask in
> 	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
> 	engine 1 = VCS etc) we can schedule any combination of engine hang
> 	recoveries as we please. For example, by passing in the value 0x3 << 8
> 	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
> 	the same time.

Seems fine. But I don't see the reason for the extra complication.

> 	If bits in fields 3:0 and 31:8 are both used then single engine hang
> 	recovery mode takes precedence and bits 31:8 are ignored.
> 
> 2. The i915_wedged_get function produces a set of statistics related to:

Add it to hangcheck_info instead.

i915_wedged_get could be updated to give the ring mask of wedged rings?
If that concept exists.
-Chris
Tomas Elf June 9, 2015, 11:18 a.m. UTC | #2
On 08/06/2015 18:45, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
>> 1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
>>
>> 	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
>> 	GPU reset recovery path.
>>
>> 	b) Single engine hang recovery: By passing an engine ID in the interval
>> 	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
>> 	engine assuming that the context submission consistency requirements
>> 	are met (otherwise the hang recovery path will simply exit early and
>> 	wait for another hang detection). The values are assumed to use up bits
>> 	3:0 only since we certainly do not support as many as 16 engines.
>>
>> 	This mode is supported since there are several legacy test applications
>> 	that rely on this interface.
>
> Are there? I don't see them in igt - and let's not start making debugfs
> ABI.

They're not in IGT, only internal to VPG. I guess we could limit these
changes and adapt the internal test suite in VPG instead of upstreaming 
changes that only VPG validation cares about.

>
>> 	c) Multiple engine hang recovery: By passing in an engine flag mask in
>> 	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
>> 	engine 1 = VCS etc) we can schedule any combination of engine hang
>> 	recoveries as we please. For example, by passing in the value 0x3 << 8
>> 	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
>> 	the same time.
>
> Seems fine. But I don't see the reason for the extra complication.

I wanted to make sure that we could test multiple concurrent hang 
recoveries, but to be fair nobody is actually using this at the moment 
so unless someone actually _needs_ this we probably don't need to 
upstream it.

I guess we could leave it in its currently upstreamed form where it only 
allows full GPU reset. Or would it be of use to anyone to support 
per-engine recovery?

>
>> 	If bits in fields 3:0 and 31:8 are both used then single engine hang
>> 	recovery mode takes precedence and bits 31:8 are ignored.
>>
>> 2. The i915_wedged_get function produces a set of statistics related to:
>
> Add it to hangcheck_info instead.

Yeah, I considered that but I felt that hangcheck_info had too much text 
and it would be too much of a hassle to parse out the data. But having 
spoken to the validation guys it seems like they're fine with updating 
the parser. So I could update hangcheck_info with this new information.

>
> i915_wedged_get could be updated to give the ring mask of wedged rings?
> If that concept exists.
> -Chris
>

Nah, no need, I'll just add the information to hangcheck_info. Besides, 
wedged_get needs to provide more information than just the current 
wedged state. It also provides information about the number of resets, 
the number of watchdog timeouts etc. So it's not that easy to summarise 
it as a ring mask.

Thanks,
Tomas
Chris Wilson June 9, 2015, 12:27 p.m. UTC | #3
On Tue, Jun 09, 2015 at 12:18:28PM +0100, Tomas Elf wrote:
> On 08/06/2015 18:45, Chris Wilson wrote:
> >On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
> >>1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
> >>
> >>	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
> >>	GPU reset recovery path.
> >>
> >>	b) Single engine hang recovery: By passing an engine ID in the interval
> >>	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
> >>	engine assuming that the context submission consistency requirements
> >>	are met (otherwise the hang recovery path will simply exit early and
> >>	wait for another hang detection). The values are assumed to use up bits
> >>	3:0 only since we certainly do not support as many as 16 engines.
> >>
> >>	This mode is supported since there are several legacy test applications
> >>	that rely on this interface.
> >
> >Are there? I don't see them in igt - and let's not start making debugfs
> >ABI.
> 
> They're not in IGT, only internal to VPG. I guess we could limit
> these changes and adapt the internal test suite in VPG instead of
> upstreaming changes that only VPG validation cares about.

Also note that there are quite a few concurrent hang tests in igt that
this series should aim to fix.

You will be expected to provide basic validation tests for igt as well,
which will be using the debugfs interface I guess.

> >>	c) Multiple engine hang recovery: By passing in an engine flag mask in
> >>	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
> >>	engine 1 = VCS etc) we can schedule any combination of engine hang
> >>	recoveries as we please. For example, by passing in the value 0x3 << 8
> >>	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
> >>	the same time.
> >
> >Seems fine. But I don't see the reason for the extra complication.
> 
> I wanted to make sure that we could test multiple concurrent hang
> recoveries, but to be fair nobody is actually using this at the
> moment so unless someone actually _needs_ this we probably don't
> need to upstream it.
> 
> I guess we could leave it in its currently upstreamed form where it
> only allows full GPU reset. Or would it be of use to anyone to
> support per-engine recovery?

I like the per-engine flags; I was just arguing that having the
interface do both seems overly complicated (when the existing behaviour
can be retained by using -1).

> >>	If bits in fields 3:0 and 31:8 are both used then single engine hang
> >>	recovery mode takes precedence and bits 31:8 are ignored.
> >>
> >>2. The i915_wedged_get function produces a set of statistics related to:
> >
> >Add it to hangcheck_info instead.
> 
> Yeah, I considered that but I felt that hangcheck_info had too much
> text and it would be too much of a hassle to parse out the data. But
> having spoken to the validation guys it seems like they're fine with
> updating the parser. So I could update hangcheck_info with this new
> information.

It can more or less just be a search for the start of your info block.
A quick string search on the new debugfs name, falling back to the old
debugfs name, could even provide backwards compatibility in the test.
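
One way to realize that (a hypothetical test-side sketch; the paths, and
the assumption that the stats end up in i915_hangcheck_info, are
illustrative only):

	#include <stddef.h>
	#include <unistd.h>

	/*
	 * Prefer the new debugfs entry, falling back to the old one so
	 * the test keeps working on older kernels.
	 */
	static const char *hangcheck_stats_path(void)
	{
		static const char * const candidates[] = {
			"/sys/kernel/debug/dri/0/i915_hangcheck_info",
			"/sys/kernel/debug/dri/0/i915_ring_hangcheck",
		};
		unsigned int i;

		for (i = 0; i < 2; i++)
			if (access(candidates[i], R_OK) == 0)
				return candidates[i];

		return NULL;
	}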

> >i915_wedged_get could be updated to give the ring mask of wedged rings?
> >If that concept exists.
> >-Chris
> >
> 
> Nah, no need, I'll just add the information to hangcheck_info.
> Besides, wedged_get needs to provide more information than just the
> current wedged state. It also provides information about the number
> of resets, the number of watchdog timeouts etc. So it's not that
> easy to summarise it as a ring mask.

We are still talking about today's single valued debugfs/i915_wedged
rather than the extended info?

Oh, whilst I am thinking of it, you could also add the reset stats to
the error state.
-Chris
Tomas Elf June 9, 2015, 5:28 p.m. UTC | #4
On 09/06/2015 13:27, Chris Wilson wrote:
> On Tue, Jun 09, 2015 at 12:18:28PM +0100, Tomas Elf wrote:
>> On 08/06/2015 18:45, Chris Wilson wrote:
>>> On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
>>>> 1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
>>>>
>>>> 	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
>>>> 	GPU reset recovery path.
>>>>
>>>> 	b) Single engine hang recovery: By passing an engine ID in the interval
>>>> 	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
>>>> 	engine assuming that the context submission consistency requirements
>>>> 	are met (otherwise the hang recovery path will simply exit early and
>>>> 	wait for another hang detection). The values are assumed to use up bits
>>>> 	3:0 only since we certainly do not support as many as 16 engines.
>>>>
>>>> 	This mode is supported since there are several legacy test applications
>>>> 	that rely on this interface.
>>>
>>> Are there? I don't see them in igt - and let's not start making debugfs
>>> ABI.
>>
>> They're not in IGT, only internal to VPG. I guess we could limit
>> these changes and adapt the internal test suite in VPG instead of
>> upstreaming changes that only VPG validation cares about.
>
> Also note that there are quite a few concurrent hang tests in igt that
> this series should aim to fix.
>
> You will be expected to provide basic validation tests for igt as well,
> which will be using the debugfs interface I guess.
>

Yeah, once we get past the RFC stage I can start looking into the IGTs. 
Obviously, the existing tests must not break and I'll add tests for 
per-engine recovery, full GPU reset promotion and watchdog timeout.

Daniel Vetter has already said that he wants me to add more hang
concurrency tests that run a wider variety of tests with intermittent
hangs triggering different hang recovery modes. Having already dealt with
long-duration operations testing with concurrent rendering for more than
a year now during TDR development, I know what kind of havoc the TDR
mechanism can raise when interacting with things like the display driver
and the shrinker, so I'm hoping that we can distill some of those
system-level tests into a smaller IGT form that can be run more easily
and perhaps more deterministically.

I haven't started looking into that yet, but it will have to be done
once I start submitting the patch series proper.

>>>> 	c) Multiple engine hang recovery: By passing in an engine flag mask in
>>>> 	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
>>>> 	engine 1 = VCS etc) we can schedule any combination of engine hang
>>>> 	recoveries as we please. For example, by passing in the value 0x3 << 8
>>>> 	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
>>>> 	the same time.
>>>
>>> Seems fine. But I don't see the reason for the extra complication.
>>
>> I wanted to make sure that we could test multiple concurrent hang
>> recoveries, but to be fair nobody is actually using this at the
>> moment so unless someone actually _needs_ this we probably don't
>> need to upstream it.
>>
>> I guess we could leave it in its currently upstreamed form where it
>> only allows full GPU reset. Or would it be of use to anyone to
>> support per-engine recovery?
>
> I like the per-engine flags; I was just arguing that having the
> interface do both seems overly complicated (when the existing behaviour
> can be retained by using -1).
>

Sure, we'll just go with the per-engine flags then.

>>>> 	If bits in fields 3:0 and 31:8 are both used then single engine hang
>>>> 	recovery mode takes precedence and bits 31:8 are ignored.
>>>>
>>>> 2. The i915_wedged_get function produces a set of statistics related to:
>>>
>>> Add it to hangcheck_info instead.
>>
>> Yeah, I considered that but I felt that hangcheck_info had too much
>> text and it would be too much of a hassle to parse out the data. But
>> having spoken to the validation guys it seems like they're fine with
>> updating the parser. So I could update hangcheck_info with this new
>> information.
>
> It can more or less just be a search for the start of your info block.
> A quick string search on the new debugfs name, falling back to the old
> debugfs name, could even provide backwards compatibility in the test.
>
>>> i915_wedged_get could be updated to give the ring mask of wedged rings?
>>> If that concept exists.
>>> -Chris
>>>
>>
>> Nah, no need, I'll just add the information to hangcheck_info.
>> Besides, wedged_get needs to provide more information than just the
>> current wedged state. It also provides information about the number
>> of resets, the number of watchdog timeouts etc. So it's not that
>> easy to summarise it as a ring mask.
>
> We are still talking about today's single valued debugfs/i915_wedged
> rather than the extended info?

Gah, you're right, I completely screwed up there (both in the patch and 
in the subsequent discussion): I'm not talking about i915_wedged_get 
(which produces a single value), I'm talking about i915_hangcheck_read 
(which produces extended info). For some reason my brain keeps 
convincing me that I've changed i915_wedged_set and i915_wedged_get, 
though (probably because it's one setter function and one getter 
function of sorts, but they're named differently). So, to make sure that 
we're all on the same page here:

* I've updated i915_wedged_set. This one will be changed to only accept 
engine flags.
* I've added i915_hangcheck_read. This function will be removed and 
replaced by i915_hangcheck_info instead.
* I won't be touching i915_wedged_get. Unless that is something that is 
requested. The VPG tests don't need it at least.

Anything else?

>
> Oh, whilst I am thinking of it, you could also add the reset stats to
> the error state.

Sure.

Thanks,
Tomas

> -Chris
>
Dave Gordon June 11, 2015, 9:32 a.m. UTC | #5
On 08/06/15 18:45, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
>> 1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
>>
>> 	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
>> 	GPU reset recovery path.
>>
>> 	b) Single engine hang recovery: By passing an engine ID in the interval
>> 	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
>> 	engine assuming that the context submission consistency requirements
>> 	are met (otherwise the hang recovery path will simply exit early and
>> 	wait for another hang detection). The values are assumed to use up bits
>> 	3:0 only since we certainly do not support as many as 16 engines.
>>
>> 	This mode is supported since there are several legacy test applications
>> 	that rely on this interface.
> 
> Are there? I don't see them in igt - and let's not start making debugfs
> ABI.

AFAIK there are currently only two meaningful values to write to
i915_wedged: zero, which triggers error capture without resetting, and
any non-zero value, which triggers error capture followed by reset.
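
For completeness, a minimal sketch of driving that interface from a
test (hypothetical code; the debugfs path is illustrative):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	/* Zero => error capture only; non-zero => capture followed by reset. */
	static int write_i915_wedged(const char *val)
	{
		int fd = open("/sys/kernel/debug/dri/0/i915_wedged", O_WRONLY);
		ssize_t ret;

		if (fd < 0)
			return -1;

		ret = write(fd, val, strlen(val));
		close(fd);

		return ret < 0 ? -1 : 0;
	}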

.Dave.

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index a89da48..f3305ed 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2030,7 +2030,7 @@  static int i915_execlists(struct seq_file *m, void *data)
 		seq_printf(m, "%s\n", ring->name);
 
 		status = I915_READ(RING_EXECLIST_STATUS(ring));
-		ctx_id = I915_READ(RING_EXECLIST_STATUS(ring) + 4);
+		ctx_id = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring));
 		seq_printf(m, "\tExeclist status: 0x%08X, context: %u\n",
 			   status, ctx_id);
 
@@ -4164,11 +4164,50 @@  i915_wedged_get(void *data, u64 *val)
 	return 0;
 }
 
+static const char *ringid_to_str(enum intel_ring_id ring_id)
+{
+	switch (ring_id) {
+	case RCS:
+		return "RCS";
+	case VCS:
+		return "VCS";
+	case BCS:
+		return "BCS";
+	case VECS:
+		return "VECS";
+	case VCS2:
+		return "VCS2";
+	}
+
+	return "unknown";
+}
+
 static int
 i915_wedged_set(void *data, u64 val)
 {
 	struct drm_device *dev = data;
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	const u32 engine_mask = ((1 << I915_NUM_RINGS) - 1);
+	const u32 single_engine_reset_mask = 0xF;
+	const u32 bitfield_boundary = 8;
+	u32 val_mask = 0;
+	u32 i;
+#define ENGINE_MSGLEN 64
+	char msg[ENGINE_MSGLEN] = "";
+
+	/*
+	 * val can contain values in one of the following mutually exclusive
+	 * formats:
+	 *
+	 * 1. Bits [3:0] != 0x0 :
+	 *	Index (0 .. I915_NUM_RINGS-1) of engine to be manually reset.
+	 *	Invalid indices translate to full gpu reset.
+	 *
+	 * 2. Bits [(I915_NUM_RINGS-1)+8 : 8] != 0x0 :
+	 *	Bit mask containing the engine flags of all the engines that
+	 *	are to be manually reset.
+	 */
 
 	/*
 	 * There is no safeguard against this debugfs entry colliding
@@ -4177,14 +4216,61 @@  i915_wedged_set(void *data, u64 val)
 	 * test harness is responsible enough not to inject gpu hangs
 	 * while it is writing to 'i915_wedged'
 	 */
-
-	if (i915_reset_in_progress(&dev_priv->gpu_error))
+	if (i915_gem_check_wedge(dev_priv, NULL, true))
 		return -EAGAIN;
 
 	intel_runtime_pm_get(dev_priv);
 
-	i915_handle_error(dev, 0x0, false, val,
-			  "Manually setting wedged to %llu", val);
+	if (!val || (single_engine_reset_mask & val)) {
+		/*
+		 * Single engine hang mode
+		 *
+		 * Bits [3:0] of val contain the index of the engine
+		 * to be manually reset.
+		 */
+		val &= single_engine_reset_mask;
+		if (val == single_engine_reset_mask)
+			val_mask = 0x0;
+		else
+			val_mask = (1 << (val & 0xF));
+
+	} else {
+		/*
+		 * Mask mode
+		 *
+		 * Bits [31:8] of val contain the bit mask of engines to be
+		 * manually reset, engine index 0 at bit 8, engine index 1 at
+		 * bit 9 and so forth.
+		 */
+		val_mask = (val >> bitfield_boundary) & engine_mask;
+	}
+
+
+	if (val_mask) {
+		u32 len;
+
+		len = scnprintf(msg, sizeof(msg), "Manual reset:");
+
+		/* Assemble message string */
+		for_each_ring(engine, dev_priv, i)
+			if (intel_ring_flag(engine) & val_mask) {
+				DRM_INFO("Manual reset: %s\n", engine->name);
+
+				len += scnprintf(msg + len,
+						 sizeof(msg) - len,
+						 " [%s]",
+						 ringid_to_str(i));
+			}
+
+	} else {
+		scnprintf(msg, sizeof(msg), "Manual global reset");
+	}
+
+	i915_handle_error(dev,
+			  val_mask,
+			  false,
+			  true,
+			  msg);
 
 	intel_runtime_pm_put(dev_priv);
 
@@ -4195,6 +4281,55 @@  DEFINE_SIMPLE_ATTRIBUTE(i915_wedged_fops,
 			i915_wedged_get, i915_wedged_set,
 			"%llu\n");
 
+static ssize_t
+i915_ring_hangcheck_read(struct file *filp, char __user *ubuf,
+			 size_t max, loff_t *ppos)
+{
+	int i;
+	int len;
+	char buf[300];
+	struct drm_device *dev = filp->private_data;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	/*
+	 * Returns the total number of times the rings
+	 * have hung and been reset since boot
+	 */
+	len = scnprintf(buf, sizeof(buf), "GPU=0x%08X,",
+			i915_reset_count(&dev_priv->gpu_error));
+	for (i = 0; i < I915_NUM_RINGS; ++i)
+		len += scnprintf(buf + len, sizeof(buf) - len,
+				 "%s=0x%08lX,",
+				 ringid_to_str(i),
+				 (unsigned long)
+				 dev_priv->ring[i].hangcheck.reset_count);
+
+	for (i = 0; i < I915_NUM_RINGS; ++i)
+		len += scnprintf(buf + len, sizeof(buf) - len,
+				 "%s_T=0x%08lX,",
+				 ringid_to_str(i),
+				 (unsigned long)
+				 dev_priv->ring[i].hangcheck.tdr_count);
+
+	for (i = 0; i < I915_NUM_RINGS; ++i)
+		len += scnprintf(buf + len, sizeof(buf) - len,
+				 "%s_W=0x%08lX,",
+				 ringid_to_str(i),
+				 (unsigned long)
+				 dev_priv->ring[i].hangcheck.watchdog_count);
+
+	buf[len - 1] = '\n'; /* overwrite the trailing comma */
+
+	return simple_read_from_buffer(ubuf, max, ppos, buf, len);
+}
+
+static const struct file_operations i915_ring_hangcheck_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = i915_ring_hangcheck_read,
+	.llseek = default_llseek,
+};
+
 static int
 i915_ring_stop_get(void *data, u64 *val)
 {
@@ -4825,6 +4960,7 @@  static const struct i915_debugfs_files {
 	{"i915_ring_missed_irq", &i915_ring_missed_irq_fops},
 	{"i915_ring_test_irq", &i915_ring_test_irq_fops},
 	{"i915_gem_drop_caches", &i915_drop_caches_fops},
+	{"i915_ring_hangcheck", &i915_ring_hangcheck_fops},
 	{"i915_error_state", &i915_error_state_fops},
 	{"i915_next_seqno", &i915_next_seqno_fops},
 	{"i915_display_crc_ctl", &i915_display_crc_ctl_fops},