drm/i915: Limit the backpressure for i915_request allocation

Message ID 20180912111151.21147-1-chris@chris-wilson.co.uk
State New, archived
Series drm/i915: Limit the backpressure for i915_request allocation

Commit Message

Chris Wilson Sept. 12, 2018, 11:11 a.m. UTC
If we try and fail to allocate an i915_request, we apply some
backpressure on the clients to throttle the memory allocations coming
from i915.ko. Currently, we wait until completely idle, but this is far
too heavy and leads to some situations where the only escape is to
declare a client hung and reset the GPU. The intent is to only ratelimit
the allocation requests, so we need only wait for a jiffie before using
the normal direct reclaim.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
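
For context, the failure path in i915_request_alloc() around this hunk
read approximately as follows before the patch (paraphrased from the
code of the era; treat the details as approximate):

	rq = kmem_cache_alloc(i915->requests,
			      GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
	if (unlikely(!rq)) {
		/* Ratelimit ourselves before falling back to direct reclaim */
		ret = i915_gem_wait_for_idle(i915,
					     I915_WAIT_LOCKED |
					     I915_WAIT_INTERRUPTIBLE,
					     MAX_SCHEDULE_TIMEOUT);
		if (ret)
			goto err_unreserve;

		/* Retry with the blessing of direct reclaim (GFP_KERNEL) */
		rq = kmem_cache_alloc(i915->requests, GFP_KERNEL);
		if (!rq) {
			ret = -ENOMEM;
			goto err_unreserve;
		}
	}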

Comments

Tvrtko Ursulin Sept. 12, 2018, 1:34 p.m. UTC | #1
On 12/09/2018 12:11, Chris Wilson wrote:
> If we try and fail to allocate an i915_request, we apply some
> backpressure on the clients to throttle the memory allocations coming
> from i915.ko. Currently, we wait until completely idle, but this is far
> too heavy and leads to some situations where the only escape is to
> declare a client hung and reset the GPU. The intent is to only ratelimit
> the allocation requests, so we need only wait for a jiffie before using
> the normal direct reclaim.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_request.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 09ed48833b54..588bc5a4d18b 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
>   		ret = i915_gem_wait_for_idle(i915,
>   					     I915_WAIT_LOCKED |
>   					     I915_WAIT_INTERRUPTIBLE,
> -					     MAX_SCHEDULE_TIMEOUT);
> +					     1);
>   		if (ret)
>   			goto err_unreserve;
>   
> 

What is the remaining value of even trying to wait for idle, instead of
maybe just calling i915_request_retire and sleeping for a jiffie? The
intention would potentially read more clearly, since it is questionable
whether there is any relationship between idling and rate-limiting
clients. In fact, now that I think of it, waiting for idle is a nice
way to starve an unlucky client forever.
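
In sketch form, something like the following (with i915_retire_requests()
assumed as the batched form of the retire step; illustration only,
untested):

	/* Retire whatever has already completed, then sleep for a
	 * jiffie to throttle this client, with no dependence on the
	 * GPU ever reaching idle.
	 */
	i915_retire_requests(i915);
	if (schedule_timeout_interruptible(1))
		return ERR_PTR(-EINTR); /* woken early by a signal */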

Regards,

Tvrtko
Chris Wilson Sept. 12, 2018, 1:38 p.m. UTC | #2
Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> 
> On 12/09/2018 12:11, Chris Wilson wrote:
> > If we try and fail to allocate an i915_request, we apply some
> > backpressure on the clients to throttle the memory allocations coming
> > from i915.ko. Currently, we wait until completely idle, but this is far
> > too heavy and leads to some situations where the only escape is to
> > declare a client hung and reset the GPU. The intent is to only ratelimit
> > the allocation requests, so we need only wait for a jiffie before using
> > the normal direct reclaim.
> > 
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 09ed48833b54..588bc5a4d18b 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
> >               ret = i915_gem_wait_for_idle(i915,
> >                                            I915_WAIT_LOCKED |
> >                                            I915_WAIT_INTERRUPTIBLE,
> > -                                          MAX_SCHEDULE_TIMEOUT);
> > +                                          1);
> >               if (ret)
> >                       goto err_unreserve;
> >   
> > 
> 
> What is the remaining value of even trying to wait for idle, instead of
> maybe just calling i915_request_retire and sleeping for a jiffie?

There's no point in the wait if actually idle? We just want the retire
and kicking of slabs.

> The intention would potentially read more clearly, since it is
> questionable whether there is any relationship between idling and
> rate-limiting clients. In fact, now that I think of it, waiting for
> idle is a nice way to starve an unlucky client forever.

See i915_gem_shrink. It's a seriously moot point. You can define any
pathology you'd like; someone has to lose.
-Chris
Chris Wilson Sept. 12, 2018, 1:42 p.m. UTC | #3
Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> 
> On 12/09/2018 12:11, Chris Wilson wrote:
> > If we try and fail to allocate an i915_request, we apply some
> > backpressure on the clients to throttle the memory allocations coming
> > from i915.ko. Currently, we wait until completely idle, but this is far
> > too heavy and leads to some situations where the only escape is to
> > declare a client hung and reset the GPU. The intent is to only ratelimit
> > the allocation requests, so we need only wait for a jiffie before using
> > the normal direct reclaim.
> > 
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 09ed48833b54..588bc5a4d18b 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
> >               ret = i915_gem_wait_for_idle(i915,
> >                                            I915_WAIT_LOCKED |
> >                                            I915_WAIT_INTERRUPTIBLE,
> > -                                          MAX_SCHEDULE_TIMEOUT);
> > +                                          1);
> >               if (ret)
> >                       goto err_unreserve;
> >   
> > 
> 
> What is the remaining value of even trying to wait for idle, instead of
> maybe just calling i915_request_retire and sleeping for a jiffie? The
> intention would potentially read more clearly, since it is questionable
> whether there is any relationship between idling and rate-limiting
> clients. In fact, now that I think of it, waiting for idle is a nice
> way to starve an unlucky client forever.

Better to starve the unlucky client, than to allow the entire system to
grind to a halt.

One caveat to using RCU is that it is our responsibility to apply
backpressure as none is applied by the vm.
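
For reference, the requests slab is created along these lines (quoted
from memory, so treat the exact flags as approximate):

	/*
	 * SLAB_TYPESAFE_BY_RCU defers reuse of freed requests past an
	 * RCU grace period, so the vm sees no immediately reclaimable
	 * memory here -- the driver has to throttle its own clients.
	 */
	i915->requests = KMEM_CACHE(i915_request,
				    SLAB_HWCACHE_ALIGN |
				    SLAB_RECLAIM_ACCOUNT |
				    SLAB_TYPESAFE_BY_RCU);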
-Chris
Tvrtko Ursulin Sept. 12, 2018, 1:53 p.m. UTC | #4
On 12/09/2018 14:38, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
>>
>> On 12/09/2018 12:11, Chris Wilson wrote:
>>> If we try and fail to allocate an i915_request, we apply some
>>> backpressure on the clients to throttle the memory allocations coming
>>> from i915.ko. Currently, we wait until completely idle, but this is far
>>> too heavy and leads to some situations where the only escape is to
>>> declare a client hung and reset the GPU. The intent is to only ratelimit
>>> the allocation requests, so we need only wait for a jiffie before using
>>> the normal direct reclaim.
>>>
>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_request.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>>> index 09ed48833b54..588bc5a4d18b 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
>>>                ret = i915_gem_wait_for_idle(i915,
>>>                                             I915_WAIT_LOCKED |
>>>                                             I915_WAIT_INTERRUPTIBLE,
>>> -                                          MAX_SCHEDULE_TIMEOUT);
>>> +                                          1);
>>>                if (ret)
>>>                        goto err_unreserve;
>>>    
>>>
>>
>> What is the remaining value of even trying to wait for idle, instead of
>> maybe just calling i915_request_retire and sleeping for a jiffie?
> 
> There's no point in the wait if actually idle? We just want the retire
> and kicking of slabs.

Why is there no point? If the allocation failed and we want to apply
backpressure, let's apply it.

Why would it matter which client first spotted that the GPU is idle and
skipped the sleep? What would happen if we added a random back-off in
this case between some bounds, as Ethernet frames do, if my memory
serves me well?
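
Purely as an illustration (the attempts counter below is made up), the
Ethernet-style back-off could be:

	/*
	 * Truncated binary exponential back-off: each successive
	 * allocation failure doubles the window from which a random
	 * sleep is drawn, capped at 64 jiffies.
	 */
	unsigned int shift = min(attempts, 6u);

	schedule_timeout_interruptible(prandom_u32_max(BIT(shift)) + 1);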

Don't get me wrong, in practice it is a minor point; I am purely
thinking that the message sent by the presence of i915_gem_wait_for_idle
is maybe unclear, or misleading.

Regards,

Tvrtko

>> The intention would potentially read more clearly, since it is
>> questionable whether there is any relationship between idling and
>> rate-limiting clients. In fact, now that I think of it, waiting for
>> idle is a nice way to starve an unlucky client forever.
> 
> See i915_gem_shrink. It's a seriously moot point. You can define any
> pathology you'd like; someone has to lose.
> -Chris
>
Daniel Vetter Sept. 12, 2018, 2:47 p.m. UTC | #5
On Wed, Sep 12, 2018 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
>>
>> On 12/09/2018 12:11, Chris Wilson wrote:
>> > If we try and fail to allocate an i915_request, we apply some
>> > backpressure on the clients to throttle the memory allocations coming
>> > from i915.ko. Currently, we wait until completely idle, but this is far
>> > too heavy and leads to some situations where the only escape is to
>> > declare a client hung and reset the GPU. The intent is to only ratelimit
>> > the allocation requests, so we need only wait for a jiffie before using
>> > the normal direct reclaim.
>> >
>> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> > ---
>> >   drivers/gpu/drm/i915/i915_request.c | 2 +-
>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>> > index 09ed48833b54..588bc5a4d18b 100644
>> > --- a/drivers/gpu/drm/i915/i915_request.c
>> > +++ b/drivers/gpu/drm/i915/i915_request.c
>> > @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
>> >               ret = i915_gem_wait_for_idle(i915,
>> >                                            I915_WAIT_LOCKED |
>> >                                            I915_WAIT_INTERRUPTIBLE,
>> > -                                          MAX_SCHEDULE_TIMEOUT);
>> > +                                          1);
>> >               if (ret)
>> >                       goto err_unreserve;
>> >
>> >
>>
>> What is the remaining value of even trying to wait for idle, instead of
>> maybe just calling i915_request_retire and sleeping for a jiffie? The
>> intention would potentially read more clearly, since it is questionable
>> whether there is any relationship between idling and rate-limiting
>> clients. In fact, now that I think of it, waiting for idle is a nice
>> way to starve an unlucky client forever.
>
> Better to starve the unlucky client, than to allow the entire system to
> grind to a halt.
>
> One caveat to using RCU is that it is our responsibility to apply
> backpressure as none is applied by the vm.

So instead of 1 jiffie, should we wait for 1 RCU grace period? My
understanding is that under very heavy load these can be extended
(since batching is more effective for throughput, if you don't run out
of memory). Just a random comment from the sidelines really :-)
-Daniel
Chris Wilson Sept. 12, 2018, 2:54 p.m. UTC | #6
Quoting Daniel Vetter (2018-09-12 15:47:21)
> On Wed, Sep 12, 2018 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> >>
> >> On 12/09/2018 12:11, Chris Wilson wrote:
> >> > If we try and fail to allocate an i915_request, we apply some
> >> > backpressure on the clients to throttle the memory allocations coming
> >> > from i915.ko. Currently, we wait until completely idle, but this is far
> >> > too heavy and leads to some situations where the only escape is to
> >> > declare a client hung and reset the GPU. The intent is to only ratelimit
> >> > the allocation requests, so we need only wait for a jiffie before using
> >> > the normal direct reclaim.
> >> >
> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >> > ---
> >> >   drivers/gpu/drm/i915/i915_request.c | 2 +-
> >> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> >> > index 09ed48833b54..588bc5a4d18b 100644
> >> > --- a/drivers/gpu/drm/i915/i915_request.c
> >> > +++ b/drivers/gpu/drm/i915/i915_request.c
> >> > @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
> >> >               ret = i915_gem_wait_for_idle(i915,
> >> >                                            I915_WAIT_LOCKED |
> >> >                                            I915_WAIT_INTERRUPTIBLE,
> >> > -                                          MAX_SCHEDULE_TIMEOUT);
> >> > +                                          1);
> >> >               if (ret)
> >> >                       goto err_unreserve;
> >> >
> >> >
> >>
> >> What is the remaining value of even trying to wait for idle, instead of
> >> maybe just calling i915_request_retire and sleeping for a jiffie? The
> >> intention would potentially read more clearly, since it is questionable
> >> whether there is any relationship between idling and rate-limiting
> >> clients. In fact, now that I think of it, waiting for idle is a nice
> >> way to starve an unlucky client forever.
> >
> > Better to starve the unlucky client, than to allow the entire system to
> > grind to a halt.
> >
> > One caveat to using RCU is that it is our responsibility to apply
> > backpressure as none is applied by the vm.
> 
> So instead of 1 jiffie, should we wait for 1 RCU grace period? My
> understanding is that under very heavy load these can be extended
> (since batching is more effective for throughput, if you don't run out
> of memory). Just a random comment from the sidelines really :-)

There's (essentially) a wait for that later ;)

But a long time ago Paul did write a missive explaining that there
should be some use of cond_synchronize_rcu() to provide the
rate-limiting, though I've never been sure how to plug in the right
figures. Do we say that if the timeline does more than two RCU
allocations within the same grace period, it should be throttled? A
light allocation failure has always seemed a sensible trigger to start
worrying.
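
In sketch form it might look something like this (tl->rcu_cookie is an
invented per-timeline field; nothing below exists in the tree):

	/*
	 * Remember an RCU grace-period cookie at each request allocation;
	 * if the client comes back before that grace period has elapsed,
	 * cond_synchronize_rcu() blocks until it has.
	 */
	cond_synchronize_rcu(tl->rcu_cookie);
	tl->rcu_cookie = get_state_synchronize_rcu();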
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 09ed48833b54..588bc5a4d18b 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -736,7 +736,7 @@  i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
 		ret = i915_gem_wait_for_idle(i915,
 					     I915_WAIT_LOCKED |
 					     I915_WAIT_INTERRUPTIBLE,
-					     MAX_SCHEDULE_TIMEOUT);
+					     1);
 		if (ret)
 			goto err_unreserve;