[i-g-t,v2] tests/drv_hangman: test for acthd increasing through invalid VM space

Message ID 1456396331-27262-1-git-send-email-daniele.ceraolospurio@intel.com
State New

Commit Message

Daniele Ceraolo Spurio Feb. 25, 2016, 10:32 a.m. UTC
From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

The hangcheck logic will not flag a hang as long as acthd keeps increasing.
As a result, if a malformed batch jumps to an invalid offset in the ppgtt,
it can keep executing through the whole address space without ever
triggering the hangcheck mechanism.

This patch adds a test that reproduces the issue. On a BDW, I let the test
run for more than 10 minutes before killing it and no hang was ever
detected. I sampled i915_hangcheck_info a few times during the run and got
the following:

Hangcheck active, fires in 468ms
render ring:
	seqno = fffff55e [current fffff55e]
	ACTHD = 0x47df685ecc [current 0x4926b81d90]
	max ACTHD = 0x47df685ecc
	score = 0
	action = 2
	instdone read = 0xffd7ffff 0xffffffff 0xffffffff 0xffffffff
	instdone accu = 0x00000000 0x00000000 0x00000000 0x00000000

Hangcheck active, fires in 424ms
render ring:
	seqno = fffff55e [current fffff55e]
	ACTHD = 0x6c953d3a34 [current 0x6de5e76fa4]
	max ACTHD = 0x6c953d3a34
	score = 0
	action = 2
	instdone read = 0xffd7ffff 0xffffffff 0xffffffff 0xffffffff
	instdone accu = 0x00000000 0x00000000 0x00000000 0x00000000

Hangcheck active, fires in 1692ms
render ring:
	seqno = fffff55e [current fffff55e]
	ACTHD = 0x1f49b0366dc [current 0x1f4dcbd88ec]
	max ACTHD = 0x1f49b0366dc
	score = 0
	action = 2
	instdone read = 0xffd7ffff 0xffffffff 0xffffffff 0xffffffff
	instdone accu = 0x00000000 0x00000000 0x00000000 0x00000000

v2: use the new gem_wait() function (Chris)

Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 tests/drv_hangman.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

Comments

Chris Wilson Feb. 25, 2016, 10:41 a.m. UTC | #1
On Thu, Feb 25, 2016 at 10:32:11AM +0000, daniele.ceraolospurio@intel.com wrote:
> +/* This test covers the case where we end up in an uninitialised area of the
> + * ppgtt at an offset greater than the one where the last buffer is mapped. This
> + * is particularly relevant if 48b ppgtt is enabled because the ppgtt is
> + * massively bigger compared to the 32b case and it takes a lot more time to
> + * wrap, so the acthd can potentially keep increasing for a long time
> + */
> +#define NSEC_PER_SEC	1000000000L
> +static void ppgtt_walking(void)
> +{
> +	int fd;
> +	int64_t timeout_ns = 100 * NSEC_PER_SEC; /* 100 seconds */

This needs a note that this has to be greater than ~5*hangcheck.

> +	struct drm_i915_gem_execbuffer2 execbuf;
> +	struct drm_i915_gem_exec_object2 gem_exec;
> +	uint32_t handle;
> +	uint32_t batch[4];
> +
> +	fd = drm_open_driver(DRIVER_INTEL);
> +	igt_require(gem_gtt_type(fd) > 2);

Nope, just full-ppgtt is required (and provides a sensible hangcheck
test if !48bit as well).

Note this does require that the hangcheck is enabled, so igt_require().

> +
> +	/* the batch will be mapped to an offset < 4GB because the flag to allow
> +	 * 48b offsets is not specified, so jump to address 0x00000001 00000000
> +	 */
> +	batch[0] = MI_BATCH_BUFFER_START | 1;
> +	batch[1] = 0;
> +	batch[2] = 1;
> +	batch[3] = MI_BATCH_BUFFER_END;

The vm is entirely empty. Just submit an unterminated (empty) batch, and
it will walk from 0 to 1<<48bit and around and around and around and
around...

> +
> +	handle = gem_create(fd, 4096);
> +	gem_write(fd, handle, 0, batch, sizeof(batch));
> +
> +	memset(&gem_exec, 0, sizeof(gem_exec));
> +	gem_exec.handle = handle;
> +
> +	memset(&execbuf, 0, sizeof(execbuf));
> +	execbuf.buffers_ptr = (uintptr_t)&gem_exec;
> +	execbuf.buffer_count = 1;
> +	execbuf.batch_len = 16;
> +
> +	gem_execbuf(fd, &execbuf);
> +
> +	igt_assert(gem_wait(fd, handle, &timeout_ns) == 0);

igt_assert_eq(gem_wait(), 0); so you get the information about the
failure.

> +	igt_assert(timeout_ns > 0);

Redundant. gem_wait() returns ETIME if we wait for timeout_ns without
completion.

> +
> +	gem_close(fd, handle);

Irrelevant, it will be closed with close(fd).

> +	close(fd);
> +}
> +
>  igt_main
>  {
>  	const struct intel_execution_engine *e;
> @@ -314,4 +361,7 @@ igt_main
>  			test_error_state_capture(e->exec_id | e->flags,
>  						 e->full_name);
>  	}
> +
> +	igt_subtest("ppgtt-walking")
> +		ppgtt_walking();

This is a hangcheck test, "hangcheck-unterminated"
-Chris
Daniele Ceraolo Spurio Feb. 25, 2016, 11:12 a.m. UTC | #2
On 25/02/16 10:41, Chris Wilson wrote:
> On Thu, Feb 25, 2016 at 10:32:11AM +0000, daniele.ceraolospurio@intel.com wrote:
>> +/* This test covers the case where we end up in an uninitialised area of the
>> + * ppgtt at an offset greater than the one where the last buffer is mapped. This
>> + * is particularly relevant if 48b ppgtt is enabled because the ppgtt is
>> + * massively bigger compared to the 32b case and it takes a lot more time to
>> + * wrap, so the acthd can potentially keep increasing for a long time
>> + */
>> +#define NSEC_PER_SEC	1000000000L
>> +static void ppgtt_walking(void)
>> +{
>> +	int fd;
>> +	int64_t timeout_ns = 100 * NSEC_PER_SEC; /* 100 seconds */
> This needs a note that this has to be greater than ~5*hangcheck.
>
>> +	struct drm_i915_gem_execbuffer2 execbuf;
>> +	struct drm_i915_gem_exec_object2 gem_exec;
>> +	uint32_t handle;
>> +	uint32_t batch[4];
>> +
>> +	fd = drm_open_driver(DRIVER_INTEL);
>> +	igt_require(gem_gtt_type(fd) > 2);
> Nope, just full-ppgtt is required (and provides a sensible hangcheck
> test if !48bit as well).
>
> Note this does require that the hangcheck is enabled, so igt_require().
>
>> +
>> +	/* the batch will be mapped to an offset < 4GB because the flag to allow
>> +	 * 48b offsets is not specified, so jump to address 0x00000001 00000000
>> +	 */
>> +	batch[0] = MI_BATCH_BUFFER_START | 1;
>> +	batch[1] = 0;
>> +	batch[2] = 1;
>> +	batch[3] = MI_BATCH_BUFFER_END;
> The vm is entirely empty. Just submit an unterminated (empty) batch, and
> it will walk from 0 to 1<<48bit and around and around and around and
> around...

I chose to jump instead of just leaving the batch unterminated to cover
the (rare) case where the rest of the allocated 4k of the batch contains
some random values, which could cause a hang and thus falsely pass the
test. I'll respin with a memset to 0 of the batch (plus all the other
suggested changes).

Thanks,
Daniele

>
>> +
>> +	handle = gem_create(fd, 4096);
>> +	gem_write(fd, handle, 0, batch, sizeof(batch));
>> +
>> +	memset(&gem_exec, 0, sizeof(gem_exec));
>> +	gem_exec.handle = handle;
>> +
>> +	memset(&execbuf, 0, sizeof(execbuf));
>> +	execbuf.buffers_ptr = (uintptr_t)&gem_exec;
>> +	execbuf.buffer_count = 1;
>> +	execbuf.batch_len = 16;
>> +
>> +	gem_execbuf(fd, &execbuf);
>> +
>> +	igt_assert(gem_wait(fd, handle, &timeout_ns) == 0);
> igt_assert_eq(gem_wait(), 0); so you get the information about the
> failure.
>
>> +	igt_assert(timeout_ns > 0);
> Redundant. gem_wait() returns ETIME if we wait for timeout_ns without
> completion.
>
>> +
>> +	gem_close(fd, handle);
> Irrelevant, it will be closed with close(fd).
>
>> +	close(fd);
>> +}
>> +
>>   igt_main
>>   {
>>   	const struct intel_execution_engine *e;
>> @@ -314,4 +361,7 @@ igt_main
>>   			test_error_state_capture(e->exec_id | e->flags,
>>   						 e->full_name);
>>   	}
>> +
>> +	igt_subtest("ppgtt-walking")
>> +		ppgtt_walking();
> This is a hangcheck test, "hangcheck-unterminated"
> -Chris
>
Chris Wilson Feb. 25, 2016, 11:32 a.m. UTC | #3
On Thu, Feb 25, 2016 at 11:12:06AM +0000, Daniele Ceraolo Spurio wrote:
> 
> 
> On 25/02/16 10:41, Chris Wilson wrote:
> >On Thu, Feb 25, 2016 at 10:32:11AM +0000, daniele.ceraolospurio@intel.com wrote:
> >>+/* This test covers the case where we end up in an uninitialised area of the
> >>+ * ppgtt at an offset greater than the one where the last buffer is mapped. This
> >>+ * is particularly relevant if 48b ppgtt is enabled because the ppgtt is
> >>+ * massively bigger compared to the 32b case and it takes a lot more time to
> >>+ * wrap, so the acthd can potentially keep increasing for a long time
> >>+ */
> >>+#define NSEC_PER_SEC	1000000000L
> >>+static void ppgtt_walking(void)
> >>+{
> >>+	int fd;
> >>+	int64_t timeout_ns = 100 * NSEC_PER_SEC; /* 100 seconds */
> >This needs a note that this has to be greater than ~5*hangcheck.
> >
> >>+	struct drm_i915_gem_execbuffer2 execbuf;
> >>+	struct drm_i915_gem_exec_object2 gem_exec;
> >>+	uint32_t handle;
> >>+	uint32_t batch[4];
> >>+
> >>+	fd = drm_open_driver(DRIVER_INTEL);
> >>+	igt_require(gem_gtt_type(fd) > 2);
> >Nope, just full-ppgtt is required (and provides a sensible hangcheck
> >test if !48bit as well).
> >
> >Note this does require that the hangcheck is enabled, so igt_require().
> >
> >>+
> >>+	/* the batch will be mapped to an offset < 4GB because the flag to allow
> >>+	 * 48b offsets is not specified, so jump to address 0x00000001 00000000
> >>+	 */
> >>+	batch[0] = MI_BATCH_BUFFER_START | 1;
> >>+	batch[1] = 0;
> >>+	batch[2] = 1;
> >>+	batch[3] = MI_BATCH_BUFFER_END;
> >The vm is entirely empty. Just submit an unterminated (empty) batch, and
> >it will walk from 0 to 1<<48bit and around and around and around and
> >around...
> 
> I chose to jump instead of just leaving the batch unterminated to
> cover the (rare) case where the rest of the allocated 4k of the
> batch contains some random values, which could cause a hang and thus
> falsely pass the test.

That would be a huge kernel bug. Freshly allocated buffers have to be
zero to avoid information leaks. I hope you are confusing allocating
from the userspace buffer cache with a fresh kernel allocation...
-Chris
Daniele Ceraolo Spurio Feb. 25, 2016, 12:04 p.m. UTC | #4
On 25/02/16 11:32, Chris Wilson wrote:
> On Thu, Feb 25, 2016 at 11:12:06AM +0000, Daniele Ceraolo Spurio wrote:
>>
>> On 25/02/16 10:41, Chris Wilson wrote:
>>> On Thu, Feb 25, 2016 at 10:32:11AM +0000, daniele.ceraolospurio@intel.com wrote:
>>>> +/* This test covers the case where we end up in an uninitialised area of the
>>>> + * ppgtt at an offset greater than the one where the last buffer is mapped. This
>>>> + * is particularly relevant if 48b ppgtt is enabled because the ppgtt is
>>>> + * massively bigger compared to the 32b case and it takes a lot more time to
>>>> + * wrap, so the acthd can potentially keep increasing for a long time
>>>> + */
>>>> +#define NSEC_PER_SEC	1000000000L
>>>> +static void ppgtt_walking(void)
>>>> +{
>>>> +	int fd;
>>>> +	int64_t timeout_ns = 100 * NSEC_PER_SEC; /* 100 seconds */
>>> This needs a note that this has to be greater than ~5*hangcheck.
>>>
>>>> +	struct drm_i915_gem_execbuffer2 execbuf;
>>>> +	struct drm_i915_gem_exec_object2 gem_exec;
>>>> +	uint32_t handle;
>>>> +	uint32_t batch[4];
>>>> +
>>>> +	fd = drm_open_driver(DRIVER_INTEL);
>>>> +	igt_require(gem_gtt_type(fd) > 2);
>>> Nope, just full-ppgtt is required (and provides a sensible hangcheck
>>> test if !48bit as well).
>>>
>>> Note this does require that the hangcheck is enabled, so igt_require().
>>>
>>>> +
>>>> +	/* the batch will be mapped to an offset < 4GB because the flag to allow
>>>> +	 * 48b offsets is not specified, so jump to address 0x00000001 00000000
>>>> +	 */
>>>> +	batch[0] = MI_BATCH_BUFFER_START | 1;
>>>> +	batch[1] = 0;
>>>> +	batch[2] = 1;
>>>> +	batch[3] = MI_BATCH_BUFFER_END;
>>> The vm is entirely empty. Just submit an unterminated (empty) batch, and
>>> it will walk from 0 to 1<<48bit and around and around and around and
>>> around...
>> I chose to jump instead of just leaving the batch unterminated to
>> cover the (rare) case where the rest of the allocated 4k of the
>> batch contains some random values, which could cause a hang and thus
>> falsely pass the test.
> That would be a huge kernel bug. Freshly allocated buffers have to be
> zero to avoid information leaks. I hope you are confusing allocating
> from the userspace buffer cache with a fresh kernel allocation...
> -Chris
>

Apologies for the confusion. You're correct: I was thinking about it at
the libdrm level, not at the bare kernel level.

Daniele

Patch

diff --git a/tests/drv_hangman.c b/tests/drv_hangman.c
index 8a465cf..4f396b9 100644
--- a/tests/drv_hangman.c
+++ b/tests/drv_hangman.c
@@ -288,6 +288,53 @@  static void test_error_state_capture(unsigned ring_id,
 	check_error_state(gen, cmd_parser, ring_name, offset);
 }
 
+/* This test covers the case where we end up in an uninitialised area of the
+ * ppgtt at an offset greater than the one where the last buffer is mapped. This
+ * is particularly relevant if 48b ppgtt is enabled because the ppgtt is
+ * massively bigger compared to the 32b case and it takes a lot more time to
+ * wrap, so the acthd can potentially keep increasing for a long time
+ */
+#define NSEC_PER_SEC	1000000000L
+static void ppgtt_walking(void)
+{
+	int fd;
+	int64_t timeout_ns = 100 * NSEC_PER_SEC; /* 100 seconds */
+	struct drm_i915_gem_execbuffer2 execbuf;
+	struct drm_i915_gem_exec_object2 gem_exec;
+	uint32_t handle;
+	uint32_t batch[4];
+
+	fd = drm_open_driver(DRIVER_INTEL);
+	igt_require(gem_gtt_type(fd) > 2);
+
+	/* the batch will be mapped to an offset < 4GB because the flag to allow
+	 * 48b offsets is not specified, so jump to address 0x00000001 00000000
+	 */
+	batch[0] = MI_BATCH_BUFFER_START | 1;
+	batch[1] = 0;
+	batch[2] = 1;
+	batch[3] = MI_BATCH_BUFFER_END;
+
+	handle = gem_create(fd, 4096);
+	gem_write(fd, handle, 0, batch, sizeof(batch));
+
+	memset(&gem_exec, 0, sizeof(gem_exec));
+	gem_exec.handle = handle;
+
+	memset(&execbuf, 0, sizeof(execbuf));
+	execbuf.buffers_ptr = (uintptr_t)&gem_exec;
+	execbuf.buffer_count = 1;
+	execbuf.batch_len = 16;
+
+	gem_execbuf(fd, &execbuf);
+
+	igt_assert(gem_wait(fd, handle, &timeout_ns) == 0);
+	igt_assert(timeout_ns > 0);
+
+	gem_close(fd, handle);
+	close(fd);
+}
+
 igt_main
 {
 	const struct intel_execution_engine *e;
@@ -314,4 +361,7 @@  igt_main
 			test_error_state_capture(e->exec_id | e->flags,
 						 e->full_name);
 	}
+
+	igt_subtest("ppgtt-walking")
+		ppgtt_walking();
 }