[i-g-t,v2,2/2] tests/gem_reset_stats: Add client ban test

Message ID 20171013204929.1748-2-antonio.argenziano@intel.com (mailing list archive)
State New, archived

Commit Message

Antonio Argenziano Oct. 13, 2017, 8:49 p.m. UTC
A client that submits 'bad' contexts will be banned eventually while
other clients are not affected. Add a test for this.

v2
	- Do not use a fixed number of contexts to ban client (Chris)

Signed-off-by: Antonio Argenziano <antonio.argenziano@intel.com>

Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 tests/gem_reset_stats.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

Comments

Chris Wilson Oct. 13, 2017, 9:23 p.m. UTC | #1
Quoting Antonio Argenziano (2017-10-13 21:49:29)
> A client that submits 'bad' contexts will be banned eventually while
> other clients are not affected. Add a test for this.
> 
> v2
>         - Do not use a fixed number of contexts to ban client (Chris)
> 
> Signed-off-by: Antonio Argenziano <antonio.argenziano@intel.com>
> 
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  tests/gem_reset_stats.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
> index edc40767..da309237 100644
> --- a/tests/gem_reset_stats.c
> +++ b/tests/gem_reset_stats.c
> @@ -438,6 +438,61 @@ static void test_ban_ctx(const struct intel_execution_engine *e)
>         close(fd);
>  }
>  
> +static void test_client_ban(const struct intel_execution_engine *e)
> +{
> +       int fd_bad, fd_good;
> +       struct local_drm_i915_reset_stats rs_bad, rs_good;
> +       int ban, ctx_bans_retry = 10;
> +       int client_ban, client_bans_retry = 10;
> +       uint32_t ctx_bad;
> +       uint32_t test_ctx;
> +       int active_count = 0;

/*
 * We try to prevent one client's DoS GPU hang attacks from affecting the
 * service of other clients, by banning the malicious client.
 *
 * To simulate this, we have one client submit a sequence of GPU hangs
 * that we expect to be banned (the client will no longer be allowed
 * to submit new work). Meanwhile, a second client should be unaffected
 * and allowed to continue.
 *
 * Note this is an extremely simplistic mechanism. For example, each
 * context inside an fd may represent a separate client (e.g. WebGL)
 * in which case the per-fd ban allows one client to DoS a second.
 * On the other hand, a malicious client may be creating a new context
 * for each hang to circumvent per-context restrictions. It may even be
 * creating a new fd each time to circumvent the per-fd restrictions.
 * Finally, the most malicious of all clients may succeed in wedging
 * the driver, bringing everyone to a standstill. In short, this DoS
 * prevention is itself suspect.
 *
 * So given the above, what exactly do we want to define as the
 * expected behaviour? The answer is none of the above; use a
 * timeslicing scheduler and no banning. Any context is then allowed to
 * use as much of its time slice spinning as it wants; it won't be
 * allowed to steal GPU time from other processes. DoS averted.
 */

I keep coming to the conclusion that banning is a temporary hack, and
not mandatory uABI, like hangcheck. The usual answer is that we then
describe the implementation's behaviour through say a CONTEXT_GETPARAM
and let userspace take that into account. But again, I have to ask:
what requirements does userspace have?
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_robustness.txt

(And how are the requirements going to change?)

The implication from that is that we could ban a context after the first
hang, but mesa isn't ready for that!

Maybe you should try writing an explanation of exactly what behaviour
you are trying to elicit here and why? :)

One thing that seems nice on the surface for execlists is reprioritising
after a hang to allow all innocent parties to make progress at the
expense of the guilty. You will probably think of a few other approaches
that make sense as you think about what userspace expects.

> +       fd_bad = drm_open_driver(DRIVER_INTEL);
> +       fd_good = drm_open_driver(DRIVER_INTEL);
> +
> +       assert_reset_status(fd_bad, fd_bad, 0, RS_NO_ERROR);
> +       assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
> +
> +       while (client_bans_retry--) {
> +               client_ban = __gem_context_create(fd_bad, &ctx_bad);
> +               if (client_ban == -EIO)
> +                       break;
> +
> +               noop(fd_bad, ctx_bad, e);
> +               assert_reset_status(fd_bad, fd_bad, ctx_bad, RS_NO_ERROR);
> +
> +               ctx_bans_retry = 10;
> +               active_count = 0;
> +               while (ctx_bans_retry--) {
> +                       inject_hang(fd_bad, ctx_bad, e, BAN);
> +                       active_count++;
> +
> +                       ban = noop(fd_bad, ctx_bad, e);
> +                       if (ban == -EIO)
> +                               break;

There's a major variation in submitting a queue of hangs that also needs
to be taken into account (with different code paths in the kernel).
-Chris
Patch

diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
index edc40767..da309237 100644
--- a/tests/gem_reset_stats.c
+++ b/tests/gem_reset_stats.c
@@ -438,6 +438,61 @@  static void test_ban_ctx(const struct intel_execution_engine *e)
 	close(fd);
 }
 
+static void test_client_ban(const struct intel_execution_engine *e)
+{
+	int fd_bad, fd_good;
+	struct local_drm_i915_reset_stats rs_bad, rs_good;
+	int ban, ctx_bans_retry = 10;
+	int client_ban, client_bans_retry = 10;
+	uint32_t ctx_bad;
+	uint32_t test_ctx;
+	int active_count = 0;
+
+	fd_bad = drm_open_driver(DRIVER_INTEL);
+	fd_good = drm_open_driver(DRIVER_INTEL);
+
+	assert_reset_status(fd_bad, fd_bad, 0, RS_NO_ERROR);
+	assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
+
+	while (client_bans_retry--) {
+		client_ban = __gem_context_create(fd_bad, &ctx_bad);
+		if (client_ban == -EIO)
+			break;
+
+		noop(fd_bad, ctx_bad, e);
+		assert_reset_status(fd_bad, fd_bad, ctx_bad, RS_NO_ERROR);
+
+		ctx_bans_retry = 10;
+		active_count = 0;
+		while (ctx_bans_retry--) {
+			inject_hang(fd_bad, ctx_bad, e, BAN);
+			active_count++;
+
+			ban = noop(fd_bad, ctx_bad, e);
+			if (ban == -EIO)
+				break;
+		}
+
+		igt_assert_eq(ban, -EIO);
+
+		assert_reset_status(fd_bad, fd_bad, ctx_bad, RS_BATCH_ACTIVE);
+		igt_assert_eq(gem_reset_stats(fd_bad, ctx_bad, &rs_bad), 0);
+		igt_assert_eq(rs_bad.batch_active, active_count);
+
+		igt_debug("retrying for client ban (%d)\n", client_bans_retry);
+	}
+
+	igt_assert_eq(__gem_context_create(fd_bad, &test_ctx), -EIO);
+
+	igt_assert_lt(0, noop(fd_good, 0, e));
+	assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
+	igt_assert_eq(gem_reset_stats(fd_good, 0, &rs_good), 0);
+	igt_assert_eq(rs_good.batch_active, 0);
+
+	close(fd_bad);
+	close(fd_good);
+}
+
 static void test_unrelated_ctx(const struct intel_execution_engine *e)
 {
 	int fd1,fd2;
@@ -817,6 +872,9 @@  igt_main
 		igt_subtest_f("ban-ctx-%s", e->name)
 			RUN_CTX_TEST(test_ban_ctx(e));
 
+		igt_subtest_f("ban-client-%s", e->name)
+			RUN_CTX_TEST(test_client_ban(e));
+
 		igt_subtest_f("reset-count-%s", e->name)
 			RUN_TEST(test_reset_count(e, false));