| Message ID | 20180912111151.21147-1-chris@chris-wilson.co.uk (mailing list archive) |
|---|---|
| State | New, archived |
| Series | drm/i915: Limit the backpressure for i915_request allocation |
On 12/09/2018 12:11, Chris Wilson wrote:
> If we try and fail to allocate an i915_request, we apply some
> backpressure on the clients to throttle the memory allocations coming
> from i915.ko. Currently, we wait until completely idle, but this is far
> too heavy and leads to some situations where the only escape is to
> declare a client hung and reset the GPU. The intent is to only ratelimit
> the allocation requests, so we need only wait for a jiffie before using
> the normal direct reclaim.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_request.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 09ed48833b54..588bc5a4d18b 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
>  		ret = i915_gem_wait_for_idle(i915,
>  					     I915_WAIT_LOCKED |
>  					     I915_WAIT_INTERRUPTIBLE,
> -					     MAX_SCHEDULE_TIMEOUT);
> +					     1);
>  	if (ret)
>  		goto err_unreserve;

What is the remaining value of even trying to wait for idle, instead of
maybe just i915_request_retire and sleep for a jiffie? The intention
would potentially read clearer since it is questionable there is any
relationship with idle and rate limiting clients. In fact, now that I
think of it, waiting for idle is a nice way to starve an unlucky client
forever.

Regards,

Tvrtko
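[For concreteness, a minimal sketch of the retire-and-sleep alternative
suggested above; i915_retire_requests() and the ENOMEM error path are
assumed from the i915 code of this era, and this is an illustration of
the suggestion, not the posted patch:]

	if (ret == -ENOMEM) {
		/*
		 * Reap whatever has already completed so freed requests
		 * make their way back to the slab...
		 */
		i915_retire_requests(i915);

		/* ...then simply back off for one jiffy before retrying. */
		schedule_timeout_interruptible(1);
		if (signal_pending(current))
			ret = -EINTR;
	}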
Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> On 12/09/2018 12:11, Chris Wilson wrote:
> > If we try and fail to allocate an i915_request, we apply some
> > backpressure on the clients to throttle the memory allocations coming
> > from i915.ko. [snip]
>
> What is the remaining value of even trying to wait for idle, instead of
> maybe just i915_request_retire and sleep for a jiffie?

There's no point in the wait if actually idle? We just want the retire
and kicking of slabs.

> The intention would potentially read clearer since it is questionable
> there is any relationship with idle and rate limiting clients. In fact,
> now that I think of it, waiting for idle is a nice way to starve an
> unlucky client forever.

See i915_gem_shrink. It's a seriously moot point. You can define any
pathology you'd like; someone has to lose.
-Chris
Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> On 12/09/2018 12:11, Chris Wilson wrote:
> > If we try and fail to allocate an i915_request, we apply some
> > backpressure on the clients to throttle the memory allocations coming
> > from i915.ko. [snip]
>
> What is the remaining value of even trying to wait for idle, instead of
> maybe just i915_request_retire and sleep for a jiffie? The intention
> would potentially read clearer since it is questionable there is any
> relationship with idle and rate limiting clients. In fact, now that I
> think of it, waiting for idle is a nice way to starve an unlucky client
> forever.

Better to starve the unlucky client than to allow the entire system to
grind to a halt.

One caveat to using RCU is that it is our responsibility to apply
backpressure, as none is applied by the vm.
-Chris
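[Context for that caveat: requests come from a typesafe-by-RCU slab, so
freed objects only return to the page allocator after an RCU grace
period, and direct reclaim cannot immediately recover the memory. A
sketch of the cache setup, assuming the KMEM_CACHE call from the i915
of this period:]

	/*
	 * SLAB_TYPESAFE_BY_RCU defers reuse of freed requests until an
	 * RCU grace period has elapsed -- which is why the vm/shrinker
	 * cannot throttle these allocations and the driver must do so
	 * itself.
	 */
	i915->requests = KMEM_CACHE(i915_request,
				    SLAB_HWCACHE_ALIGN |
				    SLAB_RECLAIM_ACCOUNT |
				    SLAB_TYPESAFE_BY_RCU);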
On 12/09/2018 14:38, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
>> On 12/09/2018 12:11, Chris Wilson wrote:
>>> If we try and fail to allocate an i915_request, we apply some
>>> backpressure on the clients to throttle the memory allocations coming
>>> from i915.ko. [snip]
>>
>> What is the remaining value of even trying to wait for idle, instead of
>> maybe just i915_request_retire and sleep for a jiffie?
>
> There's no point in the wait if actually idle? We just want the retire
> and kicking of slabs.

Why is there no point? If the allocation failed and we want to apply
backpressure, let's apply it. Why would it matter which client first
spotted that the GPU is idle and so skipped the sleep? What would happen
if we added a random back-off in this case, between some bounds, like
Ethernet frames if my memory serves me well?

Don't get me wrong, in practice it is a minor point; I am purely
thinking that the presence of i915_gem_wait_for_idle here is maybe not
clear or is misleading.

Regards,

Tvrtko

>> The intention would potentially read clearer since it is questionable
>> there is any relationship with idle and rate limiting clients. In fact,
>> now that I think of it, waiting for idle is a nice way to starve an
>> unlucky client forever.
>
> See i915_gem_shrink. It's a seriously moot point. You can define any
> pathology you'd like; someone has to lose.
> -Chris
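[A hypothetical form of the randomised back-off floated above, loosely
modelled on Ethernet's binary exponential backoff; 'attempts' is an
invented counter of consecutive allocation failures, not a field in the
driver:]

	/*
	 * Each failed attempt doubles the window (capped at 64 jiffies)
	 * from which a random delay is drawn, so colliding clients
	 * spread out instead of all retrying at once.
	 */
	unsigned int window = 1u << min(attempts, 6u);

	schedule_timeout_interruptible(1 + prandom_u32() % window);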
On Wed, Sep 12, 2018 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
>> On 12/09/2018 12:11, Chris Wilson wrote:
>> > If we try and fail to allocate an i915_request, we apply some
>> > backpressure on the clients to throttle the memory allocations coming
>> > from i915.ko. [snip]
>>
>> What is the remaining value of even trying to wait for idle, instead of
>> maybe just i915_request_retire and sleep for a jiffie? The intention
>> would potentially read clearer since it is questionable there is any
>> relationship with idle and rate limiting clients. In fact, now that I
>> think of it, waiting for idle is a nice way to starve an unlucky client
>> forever.
>
> Better to starve the unlucky client than to allow the entire system to
> grind to a halt.
>
> One caveat to using RCU is that it is our responsibility to apply
> backpressure, as none is applied by the vm.

So instead of 1 jiffie, should we wait for 1 RCU grace period? My
understanding is that under very heavy load these can be extended
(since batching is more effective for throughput, if you don't run out
of memory). Just a random comment from the sidelines really :-)
-Daniel
Quoting Daniel Vetter (2018-09-12 15:47:21)
> On Wed, Sep 12, 2018 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> >> What is the remaining value of even trying to wait for idle, instead
> >> of maybe just i915_request_retire and sleep for a jiffie? [snip]
> >
> > Better to starve the unlucky client than to allow the entire system to
> > grind to a halt.
> >
> > One caveat to using RCU is that it is our responsibility to apply
> > backpressure, as none is applied by the vm.
>
> So instead of 1 jiffie, should we wait for 1 RCU grace period? My
> understanding is that under very heavy load these can be extended
> (since batching is more effective for throughput, if you don't run out
> of memory). Just a random comment from the sidelines really :-)

There's (essentially) a wait for that later ;)

But a long time ago Paul did write a missive explaining that there
should be some use of cond_synchronize_rcu() to provide the
rate-limiting, but I've never been sure of a way to plug in the right
figures. Do we say that if the timeline does more than two RCU
allocations within the same grace period it should be throttled? A
light allocation failure has always seemed to be a sensible trigger to
start worrying.
-Chris
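[A sketch of how cond_synchronize_rcu() could be plugged in along the
lines mentioned above: record a grace-period cookie at allocation, and
on the next allocation failure wait only if that grace period has not
yet expired. The per-timeline 'rcustate' field is an assumption for
illustration, not something in the posted patch:]

	/* on successful allocation: note the current RCU grace period */
	timeline->rcustate = get_state_synchronize_rcu();

	/*
	 * On a later allocation failure: if that grace period has already
	 * completed this returns immediately and we fall through to direct
	 * reclaim; otherwise it sleeps, throttling the timeline to roughly
	 * one RCU-backed allocation per grace period.
	 */
	cond_synchronize_rcu(timeline->rcustate);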
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 09ed48833b54..588bc5a4d18b 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
 		ret = i915_gem_wait_for_idle(i915,
 					     I915_WAIT_LOCKED |
 					     I915_WAIT_INTERRUPTIBLE,
-					     MAX_SCHEDULE_TIMEOUT);
+					     1);
 	if (ret)
 		goto err_unreserve;
If we try and fail to allocate an i915_request, we apply some
backpressure on the clients to throttle the memory allocations coming
from i915.ko. Currently, we wait until completely idle, but this is far
too heavy and leads to some situations where the only escape is to
declare a client hung and reset the GPU. The intent is to only ratelimit
the allocation requests, so we need only wait for a jiffie before using
the normal direct reclaim.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)