
[v2,6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()

Message ID 20180222235025.28662-7-john.ogness@linutronix.de (mailing list archive)
State New, archived

Commit Message

John Ogness Feb. 22, 2018, 11:50 p.m. UTC
shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher-priority task. If that task executes the trylock loop
     it will spin forever and live-lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") address
     exactly those symptoms.
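
For reference, the loop in question looks like this, condensed from the
shrink_dentry_list() hunk removed in the patch below:

	inode = dentry->d_inode;
	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
		/* Lost the race: requeue the dentry and move on. */
		d_shrink_add(dentry, list);
		spin_unlock(&dentry->d_lock);
		if (parent)
			spin_unlock(&parent->d_lock);
		continue;
	}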

Avoid the trylock loop by using dentry_kill(). Killing dentries
from the dispose list is very similar to killing a dentry in
dput(). The difference is that dput() expects to be the last user of
the dentry (refcount=1) and will deref, whereas shrink_dentry_list()
expects there to be no user (refcount=0). In order to handle both
situations with the same code, move the deref code from dentry_kill()
into a new wrapper function dentry_put_kill(), which can be used
by the previous dentry_kill() callers. Then shrink_dentry_list() can
use dentry_kill() to clean up the dispose list.

This also has the benefit that the locking order is now the same as
the regular lock order: first the inode is locked, then the parent.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 58 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

Comments

Al Viro Feb. 23, 2018, 3:58 a.m. UTC | #1
On Fri, Feb 23, 2018 at 12:50:25AM +0100, John Ogness wrote:

> Avoid the trylock loop by using dentry_kill(). Killing dentries
> from the dispose list is very similar to killing a dentry in
> dput(). The difference is that dput() expects to be the last user of
> the dentry (refcount=1) and will deref, whereas shrink_dentry_list()
> expects there to be no user (refcount=0). In order to handle both
> situations with the same code, move the deref code from dentry_kill()
> into a new wrapper function dentry_put_kill(), which can be used
> by the previous dentry_kill() callers. Then shrink_dentry_list() can
> use dentry_kill() to clean up the dispose list.
> 
> This also has the benefit that the locking order is now the same as
> the regular lock order: first the inode is locked, then the parent.

Current code moves the sucker to the end of the list in that case; I'm not
at all sure that what you are doing will improve the situation at all...

You *still* have a trylock loop there - only it keeps banging at the
same dentry instead of going through the rest first...
Al Viro Feb. 23, 2018, 4:08 a.m. UTC | #2
On Fri, Feb 23, 2018 at 03:58:14AM +0000, Al Viro wrote:
> On Fri, Feb 23, 2018 at 12:50:25AM +0100, John Ogness wrote:
> 
> > Avoid the trylock loop by using dentry_kill(). Killing dentries
> > from the dispose list is very similar to killing a dentry in
> > dput(). The difference is that dput() expects to be the last user of
> > the dentry (refcount=1) and will deref, whereas shrink_dentry_list()
> > expects there to be no user (refcount=0). In order to handle both
> > situations with the same code, move the deref code from dentry_kill()
> > into a new wrapper function dentry_put_kill(), which can be used
> > by the previous dentry_kill() callers. Then shrink_dentry_list() can
> > use dentry_kill() to clean up the dispose list.
> > 
> > This also has the benefit that the locking order is now the same as
> > the regular lock order: first the inode is locked, then the parent.
> 
> Current code moves the sucker to the end of the list in that case; I'm not
> at all sure that what you are doing will improve the situation at all...
> 
> You *still* have a trylock loop there - only it keeps banging at the
> same dentry instead of going through the rest first...

Actually, it's even worse - _here_ you are dealing with something that
really can change the inode under you.  This is the one and only case
where we are kicking out a zero-refcount dentry without having already
held ->i_lock.  At the very least, it's bloody different from regular
dentry_kill().  In this case, the dentry itself is protected from
freeing by being on the shrink list - that's what makes __dentry_kill()
leave the sucker allocated.  We are not holding references, it is
hashed and anybody could come, pick it, d_delete() it, etc.
John Ogness Feb. 23, 2018, 1:57 p.m. UTC | #3
Hi Al,

I am responding to these comments first because they touch on some
fundamental reasons behind the motivation of the patchset.

On 2018-02-23, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>>> Avoid the trylock loop by using dentry_kill(). Killing dentries
>>> from the dispose list is very similar to killing a dentry in
>>> dput(). The difference is that dput() expects to be the last user of
>>> the dentry (refcount=1) and will deref, whereas shrink_dentry_list()
>>> expects there to be no user (refcount=0). In order to handle both
>>> situations with the same code, move the deref code from dentry_kill()
>>> into a new wrapper function dentry_put_kill(), which can be used
>>> by the previous dentry_kill() callers. Then shrink_dentry_list() can
>>> use dentry_kill() to clean up the dispose list.
>>> 
>>> This also has the benefit that the locking order is now the same as
>>> the regular lock order: first the inode is locked, then the parent.
>> 
>> Current code moves the sucker to the end of the list in that case; I'm not
>> at all sure that what you are doing will improve the situation at all...
>> 
>> You *still* have a trylock loop there - only it keeps banging at the
>> same dentry instead of going through the rest first...

What I am doing is a loop, but it is not a trylock loop. The trylocks
in the code only exist as a likely() optimization. What I am doing is
spinning on each lock I need, in the correct locking order, until I
get them all. This is quite different from looping with only trylocks,
in the incorrect locking order, until I happen to get all the locks.
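
The pattern, as a minimal sketch (simplified, not the helpers' exact
code), is:

	if (unlikely(!spin_trylock(&inode->i_lock))) {
		/* The likely() fast path failed: take both locks in
		 * the regular order instead of looping on trylocks. */
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		/* The locks were dropped, so the caller revalidates:
		 * the refcount and dentry->d_inode may have changed. */
	}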

> Actually, it's even worse - _here_ you are dealing with something that
> really can change the inode under you.  This is the one and only case
> where we are kicking out a zero-refcount dentry without having already
> held ->i_lock.  At the very least, it's bloody different from regular
> dentry_kill().  In this case, the dentry itself is protected from
> freeing by being on the shrink list - that's what makes __dentry_kill()
> leave the sucker allocated.  We are not holding references, it is
> hashed and anybody could come, pick it, d_delete() it, etc.

Yes, and that is why the new dentry_lock_inode() and dentry_kill()
functions react to any changes in refcount and check for inode
changes. Obviously for d_delete() the helper functions are checking way
more than they need to. But if we've missed the trylock optimization
we're already in the unlikely case, so the extra checks _may_ be
acceptable in order to have simplified code. As Linus already pointed
out, the cost of spinning will likely overshadow the cost of a few
compares.

These patches consolidate the 4 trylock loops into 3 helper
functions: dentry_lock_inode(), dentry_kill(), dentry_put_kill(). These
helper functions build on one another. Through that consolidation,
all former trylock loops now spin on locks in the correct locking
order. This directly addresses the two problems mentioned in the
commit messages: PREEMPT_RT's preemptible sleeping spinlocks and the
current cpu_relax()/cond_resched() efforts to handle bad luck in
trylock loops.

Do you recommend I avoid consolidating the 4 trylock loops into the same
set of helper functions and instead handle them all separately (as is
the case in mainline)?

Or maybe the problem is how my patchset is assembling the final
result. If patches 3 and 4 were refined to address your concerns
about them, but by the end of the 6th patch we still end up where
we are now, is that palatable?

IOW, do the patches only need (possibly a lot of) refinement or do you
consider this approach fundamentally flawed?

John Ogness
Al Viro Feb. 23, 2018, 3:09 p.m. UTC | #4
On Fri, Feb 23, 2018 at 02:57:23PM +0100, John Ogness wrote:

> > Actually, it's even worse - _here_ you are dealing with something that
> > really can change the inode under you.  This is the one and only case
> > where we are kicking out a zero-refcount dentry without having already
> > held ->i_lock.  At the very least, it's bloody different from regular
> > dentry_kill().  In this case, the dentry itself is protected from
> > freeing by being on the shrink list - that's what makes __dentry_kill()
> > leave the sucker allocated.  We are not holding references, it is
> > hashed and anybody could come, pick it, d_delete() it, etc.
> 
> Yes, and that is why the new dentry_lock_inode() and dentry_kill()
> functions react to any changes in refcount and check for inode
> changes. Obviously for d_delete() the helper functions are checking way
> more than they need to. But if we've missed the trylock optimization
> we're already in the unlikely case, so the extra checks _may_ be
> acceptable in order to have simplified code. As Linus already pointed
> out, the cost of spinning will likely overshadow the cost of a few
> compares.

It's not that you are checking extra things - you are checking the wrong
things.  "Refcount has returned to original" is useless.

> Do you recommend I avoid consolidating the 4 trylock loops into the same
> set of helper functions and instead handle them all separately (as is
> the case in mainline)?
> 
> Or maybe the problem is how my patchset is assembling the final
> result. If patches 3 and 4 were refined to address your concerns
> about them, but by the end of the 6th patch we still end up where
> we are now, is that palatable?

No.  The place where you end up with dput() is flat-out wrong.

> IOW, do the patches only need (possibly a lot of) refinement or do you
> consider this approach fundamentally flawed?

You are conflating the "we have a reference" cases with this one, and
they are very different.  Note, BTW, that had we raced with somebody
else grabbing a reference, we would've quietly dropped dentry from
the shrink list; what if we do the following: just after checking that
refcount is not positive, do
	inode = dentry->d_inode;
	if unlikely(inode && !spin_trylock...)
		rcu_read_lock
		drop ->d_lock
		grab inode->i_lock
		grab ->d_lock
		if unlikely(dentry->d_inode != inode)
			drop inode->i_lock
			rcu_read_unlock
			if !killed
				drop ->d_lock
				drop parent's ->d_lock
				continue;
		else
			rcu_read_unlock
*before* going into
                if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
                        bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
                        spin_unlock(&dentry->d_lock);
			...
part?
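
In C, the sketch above reads roughly as follows (a loose transcription,
not tested, assuming the shrink_dentry_list() context with the parent
already locked):

	inode = dentry->d_inode;
	if (unlikely(inode && !spin_trylock(&inode->i_lock))) {
		rcu_read_lock();	/* hold RCU while ->d_lock is dropped */
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (unlikely(dentry->d_inode != inode)) {
			/* the inode changed under us, e.g. via d_delete() */
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			if (!(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
				spin_unlock(&dentry->d_lock);
				if (parent)
					spin_unlock(&parent->d_lock);
				continue;
			}
		} else {
			rcu_read_unlock();
		}
	}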
Al Viro Feb. 23, 2018, 5:42 p.m. UTC | #5
On Fri, Feb 23, 2018 at 03:09:28PM +0000, Al Viro wrote:
> You are conflating the "we have a reference" cases with this one, and
> they are very different.  Note, BTW, that had we raced with somebody
> else grabbing a reference, we would've quietly dropped dentry from
> the shrink list; what if we do the following: just after checking that
> refcount is not positive, do
> 	inode = dentry->d_inode;
> 	if unlikely(inode && !spin_trylock...)
> 		rcu_read_lock
> 		drop ->d_lock
> 		grab inode->i_lock
> 		grab ->d_lock
> 		if unlikely(dentry->d_inode != inode)
> 			drop inode->i_lock
> 			rcu_read_unlock
> 			if !killed
> 				drop ->d_lock
> 				drop parent's ->d_lock
> 				continue;
> 		else
> 			rcu_read_unlock
> *before* going into
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
>                         spin_unlock(&dentry->d_lock);
> 			...
> part?

Owww....  It's actually even nastier than I realized - dropping ->d_lock
opens us to having the sucker freed by dput() from another thread here.
IOW, between d_shrink_del(dentry) and __dentry_kill(dentry) dropping ->d_lock
is dangerous...

It's really very different from all other cases, and the trickiest by far.

FWIW, my impression from the series:
	1) dentry_kill() should deal with trylock failures on its own, leaving
the callers only the real "we need to drop the parent" case.  See upthread for
one variant of doing that.
	2) switching parent eviction in shrink_dentry_list() to dentry_kill()
is fine.
	3) for d_delete() the trylock loop is wrong; however, it does not need
anything more elaborate than
{
        struct inode *inode;
        int isdir = d_is_dir(dentry);
        /*
         * Are we the only user?
         */
        spin_lock(&dentry->d_lock);
        if (dentry->d_lockref.count != 1)
		goto Shared;

        inode = dentry->d_inode;
	if (unlikely(!spin_trylock(&inode->i_lock))) {
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (dentry->d_lockref.count != 1) {
			spin_unlock(&inode->i_lock);
			goto Shared;
		}
	}

	dentry->d_flags &= ~DCACHE_CANT_MOUNT;
	dentry_unlink_inode(dentry);
	fsnotify_nameremove(dentry, isdir);
	return;

Shared:	/* can't make it negative, must unhash */
        if (!d_unhashed(dentry))
                __d_drop(dentry);
        spin_unlock(&dentry->d_lock);

        fsnotify_nameremove(dentry, isdir);
}

If not an outright "lock inode first from the very beginning" - note that
inode is stable (and non-NULL) here.  IOW, that needs to be compared with
{
        struct inode *inode = dentry->d_inode;
        int isdir = d_is_dir(dentry);
        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
        /*
         * Are we the only user?
         */
        if (dentry->d_lockref.count == 1) {
		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
		dentry_unlink_inode(dentry);
	} else {
		if (!d_unhashed(dentry))
			__d_drop(dentry);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
	fsnotify_nameremove(dentry, isdir);
}

That costs an extra boinking of ->i_lock in case the dentry is shared, but
it's much shorter and simpler that way.  Needs profiling; if the second
variant does not give worse performance, I would definitely prefer that one.
	4) the nasty one - shrink_dentry_list() evictions of zero-count dentries.
_That_ calls for careful use of RCU, etc. - none of the others need that.  Need
to think how to deal with that sucker; in any case, I do not believe that sharing
said RCU use, etc. with any other cases would do anything other than obfuscating
the rest.

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index e470d49daa54..23a90a64d74f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -690,11 +690,17 @@  static bool dentry_lock_inode(struct dentry *dentry)
 
 /*
  * Finish off a dentry we've decided to kill.
- * dentry->d_lock must be held, returns with it unlocked.
- * Returns dentry requiring refcount drop, or NULL if we're done.
+ * dentry->d_lock must be held and returns with it unlocked if the
+ * dentry was killed.
+ *
+ * Returns parent dentry requiring refcount drop or NULL if we're done
+ * or ERR_PTR(-EINTR) if the dentry was not killed because its refcount
+ * changed while preparing to kill.
+ *
+ * Note, if the dentry was not killed, i.e. ERR_PTR(-EINTR) returned,
+ * dentry->d_lock is left locked!
  */
 static struct dentry *dentry_kill(struct dentry *dentry)
-	__releases(dentry->d_lock)
 {
 	int saved_count = dentry->d_lockref.count;
 	struct dentry *parent;
@@ -741,6 +747,27 @@  static struct dentry *dentry_kill(struct dentry *dentry)
 	return parent;
 
 out_ref_changed:
+	/* dentry->d_lock still locked! */
+	return ERR_PTR(-EINTR);
+}
+
+/*
+ * Finish off a dentry where we are the last user (refcount=1) and
+ * we've decided to kill it.
+ * dentry->d_lock must be held, returns with it unlocked.
+ * Returns dentry requiring refcount drop, or NULL if we're done.
+ */
+static struct dentry *dentry_put_kill(struct dentry *dentry)
+	__releases(dentry->d_lock)
+{
+	struct dentry *parent;
+
+	parent = dentry_kill(dentry);
+	if (likely(!IS_ERR(parent)))
+		return parent;
+
+	/* dentry->d_lock still held */
+
 	/*
 	 * The refcount was incremented while dentry->d_lock was dropped.
 	 * Just decrement the refcount, unlock, and tell the caller to
@@ -927,7 +954,7 @@  void dput(struct dentry *dentry)
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry);
+	dentry = dentry_put_kill(dentry);
 	if (dentry)
 		goto repeat;
 }
@@ -1078,10 +1105,8 @@  static void shrink_dentry_list(struct list_head *list)
 	struct dentry *dentry, *parent;
 
 	while (!list_empty(list)) {
-		struct inode *inode;
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
-		parent = lock_parent(dentry);
 
 		/*
 		 * The dispose list is isolated and dentries are not accounted
@@ -1089,15 +1114,13 @@  static void shrink_dentry_list(struct list_head *list)
 		 * here regardless of whether it is referenced or not.
 		 */
 		d_shrink_del(dentry);
-
+again:
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_lockref.count > 0) {
 			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
 			continue;
 		}
 
@@ -1105,23 +1128,14 @@  static void shrink_dentry_list(struct list_head *list)
 		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
 			bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
 			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
 			if (can_free)
 				dentry_free(dentry);
 			continue;
 		}
 
-		inode = dentry->d_inode;
-		if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
-			d_shrink_add(dentry, list);
-			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
-			continue;
-		}
-
-		__dentry_kill(dentry);
+		parent = dentry_kill(dentry);
+		if (unlikely(IS_ERR(parent)))
+			goto again;
 
 		/*
 		 * We need to prune ancestors too. This is necessary to prevent
@@ -1131,7 +1145,7 @@  static void shrink_dentry_list(struct list_head *list)
 		 */
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
+			dentry = dentry_put_kill(dentry);
 	}
 }