diff mbox

[2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker

Message ID 20170406232347.988-3-aarcange@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Andrea Arcangeli April 6, 2017, 11:23 p.m. UTC
Waiting a RCU grace period only guarantees the work gets queued, but
until after the queued workqueue returns, there's no guarantee the
memory was actually freed. So flush the work to provide better
guarantees to the reclaim code in addition of waiting a RCU grace
period to pass.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem.c          | 2 ++
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 1 +
 2 files changed, 3 insertions(+)

Comments

Chris Wilson April 7, 2017, 10:02 a.m. UTC | #1
On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> Waiting a RCU grace period only guarantees the work gets queued, but
> until after the queued workqueue returns, there's no guarantee the
> memory was actually freed. So flush the work to provide better
> guarantees to the reclaim code in addition of waiting a RCU grace
> period to pass.

We are not allowed to call flush_work() from the shrinker, the workqueue
doesn't have and can't have the right reclaim flags.
-Chris
Andrea Arcangeli April 7, 2017, 1:06 p.m. UTC | #2
On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > Waiting a RCU grace period only guarantees the work gets queued, but
> > until after the queued workqueue returns, there's no guarantee the
> > memory was actually freed. So flush the work to provide better
> > guarantees to the reclaim code in addition of waiting a RCU grace
> > period to pass.
> 
> We are not allowed to call flush_work() from the shrinker, the workqueue
> doesn't have and can't have the right reclaim flags.

I figured the flush_work had to be conditional to "unlock" being true
too in the i915 shrinker (not only synchronize_rcu_expedited()), and I
already fixed that bit, but I didn't think it would be a problem to
wait for the workqueue as long as reclaim didn't recurse on the
struct_mutex (it is a problem if unlock is false of course as we would
be back to square one). I didn't get further hangs and I assume I've
been running a couple of synchronize_rcu_expedited() and flush_work (I
should add dynamic tracing to be sure).

Also note, I didn't get any lockdep warning when I reproduced the
workqueue hang in 4.11-rc5 so at least as far as lockdep is concerned
there's no problem to call synchronize_rcu_expedited and it couldn't
notice we were holding the struct_mutex while waiting for the new
workqueue to run.

Also note recursing on the lock (unlock false case) is something
nothing else does, I'm not sure if it's worth the risk and if you
shouldn't just call mutex_trylock in the shrinker instead of
mutex_trylock_recursive. One thing was to recurse on the lock
internally in the same context, but recursing through the whole
reclaim is more dubious as safe.

You could start dropping objects and wiping vmas and stuff in the
middle of some kmalloc/alloc_pages that doesn't expect it and then
crash for other reasons. So this reclaim recursion model of the
shinker is quite unique and quite challenging to proof as safe if you
keep using mutex_trylock_recursive in i915_gem_shrinker_scan.

Lock recursion in all other places could be dropped without runtime
downsides, the only place mutex_trylock_recursive makes a design
difference and makes sense to be used is in i915_gem_shrinker_scan,
the rest are implementation issues not fundamental shrinker design and
it'd be nice if those other mutex_trylock_recursive would all be
removed and the only one that is left is in i915_gem_shrinker_scan and
nowhere else (or to drop it also from i915_gem_shrinker_scan).

mutex_trylock_recursive() should also be patched to use
READ_ONCE(__mutex_owner(lock)) because currently it breaks C.

In the whole kernel i915 and msm drm are the only two users of such
function in fact.

Another thing is what value return from i915_gem_shrinker_scan when
unlock is false, and we can't possibly wait for the memory to be freed
let alone for a rcu grace period. For various reasons I think it's
safer to return the current "free" even if we could also return "0" in
such case. There are different tradeoffs, returning "free" is less
likely to trigger an early OOM as the VM thinks it's still making
progress and in fact it will get more free memory shortly, while
returning SHRINK_STOP would also be an option and it would insist more
on the other slabs so it would be more reliable at freeing memory
timely, but it would be more at risk of early OOM. I think returning
"free" is the better tradeoff of the two, but I suggest to add a
comment as it's not exactly obvious what is better.

Thanks,
Andrea
Chris Wilson April 7, 2017, 3:30 p.m. UTC | #3
On Fri, Apr 07, 2017 at 03:06:00PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> > On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > > Waiting a RCU grace period only guarantees the work gets queued, but
> > > until after the queued workqueue returns, there's no guarantee the
> > > memory was actually freed. So flush the work to provide better
> > > guarantees to the reclaim code in addition of waiting a RCU grace
> > > period to pass.
> > 
> > We are not allowed to call flush_work() from the shrinker, the workqueue
> > doesn't have and can't have the right reclaim flags.
> 
> I figured the flush_work had to be conditional to "unlock" being true
> too in the i915 shrinker (not only synchronize_rcu_expedited()), and I
> already fixed that bit, but I didn't think it would be a problem to
> wait for the workqueue as long as reclaim didn't recurse on the
> struct_mutex (it is a problem if unlock is false of course as we would
> be back to square one). I didn't get further hangs and I assume I've
> been running a couple of synchronize_rcu_expedited() and flush_work (I
> should add dynamic tracing to be sure).

Not getting hangs is a good sign, but lockdep doesn't like it:

[  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
[  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]

If I allocated the workqueue with WQ_MEM_RELCAIM, it complains bitterly
as well.

> Also note, I didn't get any lockdep warning when I reproduced the
> workqueue hang in 4.11-rc5 so at least as far as lockdep is concerned
> there's no problem to call synchronize_rcu_expedited and it couldn't
> notice we were holding the struct_mutex while waiting for the new
> workqueue to run.

Yes, that is concerning. I think it's due to the coupling via
the struct completion that is not being picked up lockdep, and I hope
the "crossrelease" patches would fix the lack of warnings.

> Also note recursing on the lock (unlock false case) is something
> nothing else does, I'm not sure if it's worth the risk and if you
> shouldn't just call mutex_trylock in the shrinker instead of
> mutex_trylock_recursive. One thing was to recurse on the lock
> internally in the same context, but recursing through the whole
> reclaim is more dubious as safe.

We know. We don't use trylock in order to reduce the frequency of users'
oom. Peter added mutex_trylock_recursive() because we already were doing
recursive locking in the shrinker and although we know we shouldn't,
getting rid of the recursion is something we are doing, but slowly.

> You could start dropping objects and wiping vmas and stuff in the
> middle of some kmalloc/alloc_pages that doesn't expect it and then
> crash for other reasons. So this reclaim recursion model of the
> shinker is quite unique and quite challenging to proof as safe if you
> keep using mutex_trylock_recursive in i915_gem_shrinker_scan.

I know. Trying to stay on top of all the kmallocs under the struct_mutex
and being aware that the shrinker can and will undo your objects as you
work is a continual battle. And catches everyone working on i915.ko by
surprise. Our policy to avoid surprises is based around pin before alloc.
 
> Lock recursion in all other places could be dropped without runtime
> downsides, the only place mutex_trylock_recursive makes a design
> difference and makes sense to be used is in i915_gem_shrinker_scan,
> the rest are implementation issues not fundamental shrinker design and
> it'd be nice if those other mutex_trylock_recursive would all be
> removed and the only one that is left is in i915_gem_shrinker_scan and
> nowhere else (or to drop it also from i915_gem_shrinker_scan).

We do need it for shrinker_count as well. If we just report 0 objects,
the shrinker_scan callback will be skipped, iirc. All we do need it for
direct calls to i915_gem_shrink() which themselves may or may not be
underneath the struct_mutex at the time.

> mutex_trylock_recursive() should also be patched to use
> READ_ONCE(__mutex_owner(lock)) because currently it breaks C.

I don't follow,

static inline struct task_struct *__mutex_owner(struct mutex *lock)
{
        return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
}

The atomic read is equivalent to READ_ONCE(). What's broken here? (I
guess strict aliasing and pointer cast?)

> In the whole kernel i915 and msm drm are the only two users of such
> function in fact.

Yes, Peter will continue to remind us to fix our code and complain until
it is.

> Another thing is what value return from i915_gem_shrinker_scan when
> unlock is false, and we can't possibly wait for the memory to be freed
> let alone for a rcu grace period. For various reasons I think it's
> safer to return the current "free" even if we could also return "0" in
> such case. There are different tradeoffs, returning "free" is less
> likely to trigger an early OOM as the VM thinks it's still making
> progress and in fact it will get more free memory shortly, while
> returning SHRINK_STOP would also be an option and it would insist more
> on the other slabs so it would be more reliable at freeing memory
> timely, but it would be more at risk of early OOM. I think returning
> "free" is the better tradeoff of the two, but I suggest to add a
> comment as it's not exactly obvious what is better.

Ah. The RCU freeing is only for the small fry, the slabs from which
requests and objects are allocated. It's the gigabytes of pages we have
pinned that can be released by i915_gem_shrink() are what we count as
freed, even though we only return them to the system and hope the lru
anon page scanner will make available for reuse (if they are dirty, they
will need to be swapped out, but some will be returned to the system
directly by truncating the shmemfs filp backing an object.)
-Chris
Andrea Arcangeli April 7, 2017, 4:48 p.m. UTC | #4
On Fri, Apr 07, 2017 at 04:30:11PM +0100, Chris Wilson wrote:
> Not getting hangs is a good sign, but lockdep doesn't like it:
> 
> [  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
> [  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]
> 
> If I allocated the workqueue with WQ_MEM_RELCAIM, it complains bitterly
> as well.

So in PF_MEMALLOC context we can't flush a workqueue with
!WQ_MEM_RECLAIM.

	system_wq = alloc_workqueue("events", 0, 0);

My point is synchronize_rcu_expedited will still push its work in
the same system_wq workqueue...

		/* Marshall arguments & schedule the expedited grace period. */
		rew.rew_func = func;
		rew.rew_rsp = rsp;
		rew.rew_s = s;
		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
		schedule_work(&rew.rew_work);

It's also using schedule_work, so either the above is a false
positive, or we've still a problem with synchronize_rcu_expedited,
just a reproducible issue anymore after we stop running it under the
struct_mutex.

Even synchronize_sched will wait on the system_wq if
synchronize_rcu_expedited has been issued in parallel by some other
code, it's just there's no check for it because it's not invoking
flush_work to wait.

The deadlock happens if we flush_work() while holding any lock that
may be taken by any of the workqueues that could be queued there. i915
makes sure to flush_work only if the struct_mutex was released (not
my initial version) but we're not checking if any of the other
system_wq workqueues could possibly taking a lock that may have been
hold during the allocation that invoked reclaim, I suppose that is the
problem left, but I don't see how flush_work is special about it,
synchronize_rcu_expedited would still have the same issue no? (despite
no lockdep warning)

I suspect this means synchronize_rcu_expedited() is not usable in
reclaim context and lockdep should warn if PF_MEMALLOC is set when
synchronize_rcu_expedited/synchronize_sched/synchronize_rcu are
called.

Probably to fix this we should create a private workqueue for both RCU
and i915 and stop sharing the system_wq with the rest of the system
(and of course set WQ_MEM_RECLAIM in both workqueues). This makes sure
when we call synchronize_rcu_expedited; flush_work from the shrinker,
we won't risk waiting on other random work that may be taking locks
that are hold by the code that invoked reclaim during an allocation.

The macro bug of waiting on system_wq 100% of the time while always
holding the struct_mutex is gone, but we need to perfect this further
and stop using the system_wq for RCU and i915 shrinker work.

> We do need it for shrinker_count as well. If we just report 0 objects,

Yes the shrinker_count too.

> the shrinker_scan callback will be skipped, iirc. All we do need it for
> direct calls to i915_gem_shrink() which themselves may or may not be
> underneath the struct_mutex at the time.

Yes.

> I don't follow,
> 
> static inline struct task_struct *__mutex_owner(struct mutex *lock)
> {
>         return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
> }
> 
> The atomic read is equivalent to READ_ONCE(). What's broken here? (I
> guess strict aliasing and pointer cast?)

That was an oversight, atomic64_read does READ_ONCE internally (as it
should of course being an atomic read). I didn't recall it uses
atomic_long_read.

Thanks,
Andrea
Chris Wilson April 10, 2017, 9:39 a.m. UTC | #5
On Fri, Apr 07, 2017 at 06:48:58PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 07, 2017 at 04:30:11PM +0100, Chris Wilson wrote:
> > Not getting hangs is a good sign, but lockdep doesn't like it:
> > 
> > [  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
> > [  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]
> > 
> > If I allocated the workqueue with WQ_MEM_RELCAIM, it complains bitterly
> > as well.
> 
> So in PF_MEMALLOC context we can't flush a workqueue with
> !WQ_MEM_RECLAIM.
> 
> 	system_wq = alloc_workqueue("events", 0, 0);
> 
> My point is synchronize_rcu_expedited will still push its work in
> the same system_wq workqueue...
> 
> 		/* Marshall arguments & schedule the expedited grace period. */
> 		rew.rew_func = func;
> 		rew.rew_rsp = rsp;
> 		rew.rew_s = s;
> 		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> 		schedule_work(&rew.rew_work);
> 
> It's also using schedule_work, so either the above is a false
> positive, or we've still a problem with synchronize_rcu_expedited,
> just a reproducible issue anymore after we stop running it under the
> struct_mutex.

We still do have a problem with using synchronize_rcu_expedited() from
the shrinker as we maybe under someone else's mutex is that involved in
its own RCU dance.

> Even synchronize_sched will wait on the system_wq if
> synchronize_rcu_expedited has been issued in parallel by some other
> code, it's just there's no check for it because it's not invoking
> flush_work to wait.

Right.
 
> The deadlock happens if we flush_work() while holding any lock that
> may be taken by any of the workqueues that could be queued there. i915
> makes sure to flush_work only if the struct_mutex was released (not
> my initial version) but we're not checking if any of the other
> system_wq workqueues could possibly taking a lock that may have been
> hold during the allocation that invoked reclaim, I suppose that is the
> problem left, but I don't see how flush_work is special about it,
> synchronize_rcu_expedited would still have the same issue no? (despite
> no lockdep warning)
> 
> I suspect this means synchronize_rcu_expedited() is not usable in
> reclaim context and lockdep should warn if PF_MEMALLOC is set when
> synchronize_rcu_expedited/synchronize_sched/synchronize_rcu are
> called.

Yes.

> Probably to fix this we should create a private workqueue for both RCU
> and i915 and stop sharing the system_wq with the rest of the system
> (and of course set WQ_MEM_RECLAIM in both workqueues). This makes sure
> when we call synchronize_rcu_expedited; flush_work from the shrinker,
> we won't risk waiting on other random work that may be taking locks
> that are hold by the code that invoked reclaim during an allocation.

We simply do not need to do our own synchronize_rcu* -- it's only used
to flush our slab frees on the off chance that (a) we have any and (b)
we do manage to free a whole slab. It is not the bulk of the memory that
we return to the system from the shrinker.

In the other thread, I stated that we should simply remove it. The
kswapd reclaim path should try to reclaim RCU slabs (by doing a
synhronize_sched or equivalent).

> The macro bug of waiting on system_wq 100% of the time while always
> holding the struct_mutex is gone, but we need to perfect this further
> and stop using the system_wq for RCU and i915 shrinker work.

Agreed. My preference is to simply not do it and leave the dangling RCU
to the core reclaim paths.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 3982489..612fde3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4748,6 +4748,7 @@  int i915_gem_freeze(struct drm_i915_private *dev_priv)
 	 * running workqueue may wait on the struct_mutex.
 	 */
 	synchronize_rcu(); /* wait for our earlier RCU delayed slab frees */
+	flush_work(&dev_priv->mm.free_work);
 
 	intel_runtime_pm_put(dev_priv);
 
@@ -4789,6 +4790,7 @@  int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
 	synchronize_rcu_expedited();
+	flush_work(&dev_priv->mm.free_work);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index fea1454..30f79af 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -329,6 +329,7 @@  i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 		 * blocked waiting on us to release struct_mutex.
 		 */
 		synchronize_rcu_expedited();
+	flush_work(&dev_priv->mm.free_work);
 
 	return freed;
 }