drm/i915: Make the GEM reclaim workqueue high priority

Message ID 20201013103256.31446-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series drm/i915: Make the GEM reclaim workqueue high priority

Commit Message

Chris Wilson Oct. 13, 2020, 10:32 a.m. UTC
Since removing dev->struct_mutex usage, we only use i915->wq for batched
freeing of GEM objects and ppGTT, so it is essential for memory reclaim.
If we let the workqueue dawdle, we trap excess amounts of memory, so give
it a priority boost. Since we no longer depend on a singular mutex, we
could run unbounded, but first let's try to keep some constraint upon the
worker.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: CQ Tang <cq.tang@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

Comments

Tang, CQ Oct. 13, 2020, 4:19 p.m. UTC | #1
Chris,
    I tested this patch. It is still not enough; I keep running out of lmem. Every worker invocation has a larger and larger count of objects to free.

Here is my debugging code:

+/* Debug: count how many objects accumulate between worker invocations */
+static int obj_count = 0;
+
......
+       if (llist_add(&obj->freed, &i915->mm.free_list)) {
+               bool b;
+
+               /* list was empty: kick the free worker and log the backlog */
+               b = queue_work(i915->wq, &i915->mm.free_work);
+               pr_err("queue_work: %d, %d; %d\n",
+                      atomic_read(&i915->mm.free_count), obj_count, b);
+               obj_count = 1;
+       } else {
+               obj_count++;
+       }

And here is the output:

[  821.213680] queue_work: 108068, 105764; 1
[  823.309468] queue_work: 148843, 147948; 1
[  826.453132] queue_work: 220000, 218123; 1
[  831.522506] queue_work: 334812, 333773; 1
[  840.040571] queue_work: 539650, 538922; 1
[  860.337644] queue_work: 960811, 1017158; 1

The second number, 'obj_count', is the number of objects the last worker invocation had to free.
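
(For reference, the consumer side of this pattern is the deferred-free worker. Below is a simplified sketch, not the verbatim i915 worker; free_object_now() is a hypothetical stand-in for whatever actually releases the object.)

/*
 * Simplified sketch: producers push objects onto the lock-free
 * free_list via llist_add() and queue the work only on the
 * empty->non-empty transition; the worker drains the whole list
 * in one pass.
 */
static void free_work_sketch(struct work_struct *work)
{
	struct drm_i915_private *i915 =
		container_of(work, struct drm_i915_private, mm.free_work);
	struct llist_node *freed = llist_del_all(&i915->mm.free_list);
	struct drm_i915_gem_object *obj, *on;

	llist_for_each_entry_safe(obj, on, freed, freed) {
		free_object_now(obj);		/* hypothetical helper */
		atomic_dec(&i915->mm.free_count);
	}
}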

--CQ


> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Tuesday, October 13, 2020 3:33 AM
> To: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>; Tang, CQ <cq.tang@intel.com>
> Subject: [PATCH] drm/i915: Make the GEM reclaim workqueue high priority
> 
> Since removing dev->struct_mutex usage, we only use i915->wq for batch
> freeing of GEM objects and ppGTT, it is essential for memory reclaim. If we
> let the workqueue dawdle, we trap excess amounts of memory, so give it a
> priority boost. Although since we no longer depend on a singular mutex, we
> could run unbounded, but first lets try to keep some constraint upon the
> worker.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: CQ Tang <cq.tang@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 16 +++-------------
>  1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c
> b/drivers/gpu/drm/i915/i915_drv.c index 8bb7e2dcfaaa..8c9198f0d2ad
> 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -219,20 +219,10 @@ intel_teardown_mchbar(struct drm_i915_private
> *dev_priv)  static int i915_workqueues_init(struct drm_i915_private
> *dev_priv)  {
>  	/*
> -	 * The i915 workqueue is primarily used for batched retirement of
> -	 * requests (and thus managing bo) once the task has been
> completed
> -	 * by the GPU. i915_retire_requests() is called directly when we
> -	 * need high-priority retirement, such as waiting for an explicit
> -	 * bo.
> -	 *
> -	 * It is also used for periodic low-priority events, such as
> -	 * idle-timers and recording error state.
> -	 *
> -	 * All tasks on the workqueue are expected to acquire the dev mutex
> -	 * so there is no point in running more than one instance of the
> -	 * workqueue at any time.  Use an ordered one.
> +	 * The i915 workqueue is primarily used for batched freeing of
> +	 * GEM objects and ppGTT, and is essential for memory reclaim.
>  	 */
> -	dev_priv->wq = alloc_ordered_workqueue("i915", 0);
> +	dev_priv->wq = alloc_ordered_workqueue("i915", WQ_HIGHPRI);
>  	if (dev_priv->wq == NULL)
>  		goto out_err;
> 
> --
> 2.20.1
Chris Wilson Oct. 13, 2020, 4:24 p.m. UTC | #2
Quoting Tang, CQ (2020-10-13 17:19:27)
> Chris,
>     I tested this patch. It is still not enough, I keep catch running out of lmem.  Every worker invocation takes larger and larger freeing object count.
> 

Was that with the immediate call (not via call_rcu) to
__i915_gem_free_object_rcu?

If this brings the freelist under control, the next item is judicious
use of cond_synchronize_rcu(). We just have to make sure we penalize the
right hog.
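
(A minimal sketch of what that could look like, assuming a hypothetical cookie stored next to the free_list; get_state_synchronize_rcu()/cond_synchronize_rcu() are the stock kernel RCU APIs, everything else here is illustrative rather than the actual i915 code:)

/* Producer side: record an RCU cookie when the list turns non-empty. */
if (llist_add(&obj->freed, &i915->mm.free_list)) {
	WRITE_ONCE(i915->mm.free_rcu_cookie,	/* hypothetical field */
		   get_state_synchronize_rcu());
	queue_work(i915->wq, &i915->mm.free_work);
}

/* Hog side: only pay for a grace period if one has not elapsed yet. */
static void throttle_free_hog(struct drm_i915_private *i915)
{
	cond_synchronize_rcu(READ_ONCE(i915->mm.free_rcu_cookie));
	i915_gem_flush_free_objects(i915);
}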

Otherwise, we have to shotgun apply i915_gem_flush_free_objects() and
still find somewhere to put the rcu sync.
-Chris
Tang, CQ Oct. 13, 2020, 4:40 p.m. UTC | #3
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Tuesday, October 13, 2020 9:25 AM
> To: Tang, CQ <cq.tang@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim workqueue
> high priority
> 
> Quoting Tang, CQ (2020-10-13 17:19:27)
> > Chris,
> >     I tested this patch. It is still not enough, I keep catch running out of lmem.
> Every worker invocation takes larger and larger freeing object count.
> >
> 
> Was that with the immediate call (not via call_rcu) to
> __i915_gem_free_object_rcu?
> 
> If this brings the freelist under control, the next item is judicious use of
> cond_synchronize_rcu(). We just have to make sure we penalize the right
> hog.
> 
> Otherwise, we have to shotgun apply i915_gem_flush_free_objects() and
> still find somewhere to put the rcu sync.

This is with call_rcu().

Then I removed cond_resched(); it did not help. Next I called __i915_gem_free_object_rcu() directly and still hit the same error.
However, I noticed that 'queue_work()' sometimes returns false, which means the work is already queued. How can that be? The worker had already been called, so 'free_list' should be empty:

[  117.381888] queue_work: 107967, 107930; 1
[  119.180230] queue_work: 125531, 125513; 1
[  121.349308] queue_work: 155017, 154996; 1
[  124.214885] queue_work: 193918, 193873; 1
[  127.967260] queue_work: 256838, 256776; 1
[  133.281045] queue_work: 345753, 345734; 1
[  141.457995] queue_work: 516943, 516859; 1
[  156.264420] queue_work: 863622, 863516; 1
[  156.322619] queue_work: 865849, 3163; 0
[  156.448551] queue_work: 865578, 7141; 0
[  156.882985] queue_work: 866984, 24138; 0
[  157.952163] queue_work: 862902, 53365; 0
[  159.838412] queue_work: 842522, 95504; 0
[  174.321508] queue_work: 937179, 657323; 0

--CQ

> -Chris
Tang, CQ Oct. 13, 2020, 11:29 p.m. UTC | #4
i915_gem_free_object() is called by multiple threads/processes, and they all add objects onto the same free_list. The worker thread that processes the free_list becomes the bottleneck. I see that the worker is mostly a single thread (with a particular thread ID), though sometimes multiple threads are launched to process the 'free_list' work concurrently. Even so, the processing speed is still slower than the rate at which the multiple processes feed the list, and 'free_list' is holding more and more memory.

The worker launch is delayed a lot: we call queue_work() when we add the first object onto the empty 'free_list', but by the time the worker actually runs, the 'free_list' has sometimes accumulated 1M objects. Maybe it is because it has to wait for the currently running worker to finish?

This happens with a direct call to __i915_gem_free_object_rcu() and no cond_resched().

--CQ

> -----Original Message-----
> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Tang, CQ
> Sent: Tuesday, October 13, 2020 9:41 AM
> To: Chris Wilson <chris@chris-wilson.co.uk>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim workqueue
> high priority
> 
> 
> 
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Tuesday, October 13, 2020 9:25 AM
> > To: Tang, CQ <cq.tang@intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim
> > workqueue high priority
> >
> > Quoting Tang, CQ (2020-10-13 17:19:27)
> > > Chris,
> > >     I tested this patch. It is still not enough, I keep catch running out of
> lmem.
> > Every worker invocation takes larger and larger freeing object count.
> > >
> >
> > Was that with the immediate call (not via call_rcu) to
> > __i915_gem_free_object_rcu?
> >
> > If this brings the freelist under control, the next item is judicious
> > use of cond_synchronize_rcu(). We just have to make sure we penalize
> > the right hog.
> >
> > Otherwise, we have to shotgun apply i915_gem_flush_free_objects() and
> > still find somewhere to put the rcu sync.
> 
> This is with call_rcu().
> 
> Then I removed cond_resched(), it does not help, and then I call
> __i915_gem_free_object_rcu() directly, still the same error, However, I
> noticed that sometimes 'queue_work()' return false, which means the work
> is already queued, how? The worker had been called so 'free_list' is empty:
> 
> [  117.381888] queue_work: 107967, 107930; 1
> [  119.180230] queue_work: 125531, 125513; 1
> [  121.349308] queue_work: 155017, 154996; 1
> [  124.214885] queue_work: 193918, 193873; 1
> [  127.967260] queue_work: 256838, 256776; 1
> [  133.281045] queue_work: 345753, 345734; 1
> [  141.457995] queue_work: 516943, 516859; 1
> [  156.264420] queue_work: 863622, 863516; 1
> [  156.322619] queue_work: 865849, 3163; 0
> [  156.448551] queue_work: 865578, 7141; 0
> [  156.882985] queue_work: 866984, 24138; 0
> [  157.952163] queue_work: 862902, 53365; 0
> [  159.838412] queue_work: 842522, 95504; 0
> [  174.321508] queue_work: 937179, 657323; 0
> 
> --CQ
> 
> > -Chris
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Oct. 15, 2020, 3:06 p.m. UTC | #5
Quoting Tang, CQ (2020-10-14 00:29:13)
> i915_gem_free_object() is called by multiple threads/processes, they all add objects onto the same free_list. The free_list processing worker thread becomes bottle-neck. I see that the worker is mostly a single thread (with particular thread ID), but sometimes multiple threads are launched to process the 'free_list' work concurrently. But the processing speed is still slower than the multiple process's feeding speed, and 'free_list' is holding more and more memory.

We can also prune the free_list immediately, if we know we are outside
of any critical section. (We do this before create ioctls, and I thought
upon close(device), but I see that's just contexts.)
 
> The worker launching time is delayed a lot, we call queue_work() when we add the first object onto the empty 'free_list', but when the worker is launched, the 'free_list' has sometimes accumulated 1M objects. Maybe it is because of waiting currently running worker to finish?

1M is a lot more than is comfortable, and that's even with a high-priority
worker. The problem with objects being freed from any context is that we can't
simply put a flush_work around there. (Not without ridding ourselves of
a few mutexes at least.) We could try more than one worker, but it's no
more effort to starve 2 cpus than it is to starve 1.

No, with that much pressure the only option is to apply the backpressure
at the point of allocation, a la create_ioctl: find the hog, and look
to see if there's a convenient spot before/after to call
i915_gem_flush_free_objects(). Since you highlight the vma-stash as the
likely culprit, and the free_pt_stash is unlikely to be inside any
critical section, we might as well try flushing from there for starters.

Hmm, actually we are tantalizingly close to having dropped all mutexes
(and similar global lock-like effects) from free_objects. That would be
a nice victory.
-Chris
Tang, CQ Oct. 15, 2020, 8:09 p.m. UTC | #6
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Thursday, October 15, 2020 8:07 AM
> To: Tang, CQ <cq.tang@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim workqueue
> high priority
> 
> Quoting Tang, CQ (2020-10-14 00:29:13)
> > i915_gem_free_object() is called by multiple threads/processes, they all
> add objects onto the same free_list. The free_list processing worker thread
> becomes bottle-neck. I see that the worker is mostly a single thread (with
> particular thread ID), but sometimes multiple threads are launched to
> process the 'free_list' work concurrently. But the processing speed is still
> slower than the multiple process's feeding speed, and 'free_list' is holding
> more and more memory.
> 
> We can also prune the free_list immediately, if we know we are outside of
> any critical section. (We do this before create ioctls, and I thought upon
> close(device), but I see that's just contexts.)
> 
> > The worker launching time is delayed a lot, we call queue_work() when we
> add the first object onto the empty 'free_list', but when the worker is
> launched, the 'free_list' has sometimes accumulated 1M objects. Maybe it is
> because of waiting currently running worker to finish?
> 
> 1M is a lot more than is comfortable, and that's even with a high-priority
> worker.  The problem with objects being freed from any context is that we
> can't simply put a flush_work around there. (Not without ridding ourselves of
> a few mutexes at least.) We could try more than worker, but it's no more
> more effort to starve 2 cpus than it is to starve 1.
> 
> No, with that much pressure the only option is to apply the backpressure at
> the point of allocation ala create_ioctl. i.e. find the hog, and look to see if
> there's a convenient spot before/after to call
> i915_gem_flush_free_objects(). Since you highlight the vma-stash as the
> likely culprit, and the free_pt_stash is unlikely to be inside any critical section,
> might as well try flushing from there for starters.

I have not yet tested it, but I guess calling i915_gem_flush_free_objects() inside free_pt_stash() will solve the problem that gem_exec_gttfill hits, because it will apply some backpressure to the system traffic.

But this only covers the 4K lmem page-table objects allocated/freed by the vma-stash. We might encounter the same situation with userspace-allocated objects.

--CQ

> 
> Hmm, actually we are tantalizing close to having dropped all mutexes (and
> similar global lock-like effects) from free_objects. That would be a nice
> victory.
> -Chris
Chris Wilson Oct. 15, 2020, 8:32 p.m. UTC | #7
Quoting Tang, CQ (2020-10-15 21:09:32)
> 
> 
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Thursday, October 15, 2020 8:07 AM
> > To: Tang, CQ <cq.tang@intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim workqueue
> > high priority
> > 
> > Quoting Tang, CQ (2020-10-14 00:29:13)
> > > i915_gem_free_object() is called by multiple threads/processes, they all
> > add objects onto the same free_list. The free_list processing worker thread
> > becomes bottle-neck. I see that the worker is mostly a single thread (with
> > particular thread ID), but sometimes multiple threads are launched to
> > process the 'free_list' work concurrently. But the processing speed is still
> > slower than the multiple process's feeding speed, and 'free_list' is holding
> > more and more memory.
> > 
> > We can also prune the free_list immediately, if we know we are outside of
> > any critical section. (We do this before create ioctls, and I thought upon
> > close(device), but I see that's just contexts.)
> > 
> > > The worker launching time is delayed a lot, we call queue_work() when we
> > add the first object onto the empty 'free_list', but when the worker is
> > launched, the 'free_list' has sometimes accumulated 1M objects. Maybe it is
> > because of waiting currently running worker to finish?
> > 
> > 1M is a lot more than is comfortable, and that's even with a high-priority
> > worker.  The problem with objects being freed from any context is that we
> > can't simply put a flush_work around there. (Not without ridding ourselves of
> > a few mutexes at least.) We could try more than worker, but it's no more
> > more effort to starve 2 cpus than it is to starve 1.
> > 
> > No, with that much pressure the only option is to apply the backpressure at
> > the point of allocation ala create_ioctl. i.e. find the hog, and look to see if
> > there's a convenient spot before/after to call
> > i915_gem_flush_free_objects(). Since you highlight the vma-stash as the
> > likely culprit, and the free_pt_stash is unlikely to be inside any critical section,
> > might as well try flushing from there for starters.
> 
> I have not yet tested, but I guess calling i915_gem_flush_free_objects() inside free_pt_stash() will solve the problem that gem_exec_gttfill has, because it will give some back pressure on the system traffic.

Still, I'm slightly concerned that so many PD objects are being created;
it's not something that shows up in the smem ppgtt tests (or at least
it's been dwarfed by other bottlenecks), and the set of vma (and so the
PD) is meant to reach a steady state. You would need to be using a
constant set of objects and recycling the vma so as not to hit the
create_ioctl flush. However, it points back to the pressure point being
around the vma bind.

> But this is only for the page table 4K lmem objects allocated/freed by vma-stash. We might encounter the same situation with user space allocated objects.

See gem_exec_create; its mission is to cause memory starvation by
creating as many new objects as it can and releasing them after a nop
batch. That's why we have the freelist flush from create_ioctl.
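
(Roughly, that flush sits at the top of the create path; the sketch below is simplified and create_object() is a stand-in for the real object-creation helper, so treat it as illustrative rather than the exact code:)

int i915_gem_create_ioctl(struct drm_device *dev, void *data,
			  struct drm_file *file)
{
	struct drm_i915_private *i915 = to_i915(dev);
	struct drm_i915_gem_create *args = data;

	/* Make the allocator pay for the objects it has already dropped. */
	i915_gem_flush_free_objects(i915);

	return create_object(i915, args, file);	/* stand-in helper */
}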

Now I need to add a pass that tries to create as many vma from a few
objects as possible.

(And similarly why we try to free requests as they are created.)

One problem is that they will catch the client running after the hog, not
necessarily the hog itself.

I'm optimistic we can make freeing the object atomic, even if that means
pushing the pages onto some reclaim list. (Which is currently a really
nasty drawback of the free worker, a trick lost with the removal of
struct_mutex.)
-Chris
Tang, CQ Oct. 15, 2020, 10:25 p.m. UTC | #8
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Thursday, October 15, 2020 1:33 PM
> To: Tang, CQ <cq.tang@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Make the GEM reclaim workqueue
> high priority
> > > We can also prune the free_list immediately, if we know we are
> > > outside of any critical section. (We do this before create ioctls,
> > > and I thought upon close(device), but I see that's just contexts.)
> > >
> > > > The worker launching time is delayed a lot, we call queue_work()
> > > > when we
> > > add the first object onto the empty 'free_list', but when the worker
> > > is launched, the 'free_list' has sometimes accumulated 1M objects.
> > > Maybe it is because of waiting currently running worker to finish?
> > >
> > > 1M is a lot more than is comfortable, and that's even with a
> > > high-priority worker.  The problem with objects being freed from any
> > > context is that we can't simply put a flush_work around there. (Not
> > > without ridding ourselves of a few mutexes at least.) We could try
> > > more than worker, but it's no more more effort to starve 2 cpus than it is
> to starve 1.
> > >
> > > No, with that much pressure the only option is to apply the
> > > backpressure at the point of allocation ala create_ioctl. i.e. find
> > > the hog, and look to see if there's a convenient spot before/after
> > > to call i915_gem_flush_free_objects(). Since you highlight the
> > > vma-stash as the likely culprit, and the free_pt_stash is unlikely
> > > to be inside any critical section, might as well try flushing from there for
> starters.
> >
> > I have not yet tested, but I guess calling i915_gem_flush_free_objects()
> inside free_pt_stash() will solve the problem that gem_exec_gttfill has,
> because it will give some back pressure on the system traffic.
> 
> Still I'm slightly concerned that so many PD objects are being created; it's not
> something that shows up in the smem ppgtt tests (or at least it's been
> dwarfed by other bottlenecks), and the set of vma (and so the
> PD) are meant to reach a steady state. You would need to be using a
> constant set of objects and recycling the vma, not to hit the create_ioctl flush.
> However, it points back to the pressure point being around the vma bind.
> 
> > But this is only for the page table 4K lmem objects allocated/freed by vma-
> stash. We might encounter the same situation with user space allocated
> objects.
> 
> See gem_exec_create, it's mission is to cause memory starvation by creating
> as many new objects as it can and releasing them after a nop batch. That's
> why we have the freelist flush from create_ioctl.
> 
> Now I need to add a pass that tries to create as many vma from a few objects
> as is possible.
> 
> (And similarly why we try to free requests as they are created.)
> 
> One problem is that they will catch the client after the hog, not necessarily
> the hog themselves.
> 
> I'm optimistic we can make freeing the object atomic, even if that means
> pushing the pages onto some reclaim list. (Which is currently a really nasty
> drawback of the free worker, a trick lost with the removal of
> struct_mutex.)

On a DG1 system with 4GB lmem and 7.7GB system memory, I can easily hit a system OOM when running "gem_exec_gttfill --r all".

When I call i915_gem_flush_free_objects() inside i915_vm_free_pt_stash(), the test code passes without any problem.
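
(For reference, a minimal sketch of that experiment; the exact placement inside i915_vm_free_pt_stash() is an assumption:)

void i915_vm_free_pt_stash(struct i915_address_space *vm,
			   struct i915_vm_pt_stash *stash)
{
	/* ... free the stashed page-table objects as before ... */

	/*
	 * Experimental backpressure: drain the deferred free_list so the
	 * binding path also pays for the objects it has queued for freeing.
	 */
	i915_gem_flush_free_objects(vm->i915);
}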

With the simple test 'gem_exec_gttfill --r all', I am also concerned about why the driver creates/frees so many vma-stash objects; when the system OOM happened, a total of 1.5M stash objects had been freed.

--CQ

> -Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 8bb7e2dcfaaa..8c9198f0d2ad 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -219,20 +219,10 @@  intel_teardown_mchbar(struct drm_i915_private *dev_priv)
 static int i915_workqueues_init(struct drm_i915_private *dev_priv)
 {
 	/*
-	 * The i915 workqueue is primarily used for batched retirement of
-	 * requests (and thus managing bo) once the task has been completed
-	 * by the GPU. i915_retire_requests() is called directly when we
-	 * need high-priority retirement, such as waiting for an explicit
-	 * bo.
-	 *
-	 * It is also used for periodic low-priority events, such as
-	 * idle-timers and recording error state.
-	 *
-	 * All tasks on the workqueue are expected to acquire the dev mutex
-	 * so there is no point in running more than one instance of the
-	 * workqueue at any time.  Use an ordered one.
+	 * The i915 workqueue is primarily used for batched freeing of
+	 * GEM objects and ppGTT, and is essential for memory reclaim.
 	 */
-	dev_priv->wq = alloc_ordered_workqueue("i915", 0);
+	dev_priv->wq = alloc_ordered_workqueue("i915", WQ_HIGHPRI);
 	if (dev_priv->wq == NULL)
 		goto out_err;