[17/22] don't try to cut corners in shrink_lock_dentry()

Message ID	20231109062056.3181775-17-viro@zeniv.linux.org.uk (mailing list archive)
State	New
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D496BFBF1 for <linux-fsdevel@vger.kernel.org>; Thu, 9 Nov 2023 06:21:03 +0000 (UTC) From: Al Viro <viro@zeniv.linux.org.uk> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org, Christian Brauner <brauner@kernel.org> Subject: [PATCH 17/22] don't try to cut corners in shrink_lock_dentry() Date: Thu, 9 Nov 2023 06:20:51 +0000 Message-Id: <20231109062056.3181775-17-viro@zeniv.linux.org.uk> In-Reply-To: <20231109062056.3181775-1-viro@zeniv.linux.org.uk> References: <20231109061932.GA3181489@ZenIV> <20231109062056.3181775-1-viro@zeniv.linux.org.uk> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: Al Viro <viro@ftp.linux.org.uk>
Series	[01/22] struct dentry: get rid of randomize_layout idiocy \| expand [01/22] struct dentry: get rid of randomize_layout idiocy [02/22] switch nfsd_client_rmdir() to use of simple_recursive_removal() [03/22] coda_flag_children(): cope with dentries turning negative [04/22] dentry: switch the lists of children to hlist [05/22] centralize killing dentry from shrink list [06/22] get rid of __dget() [07/22] shrink_dentry_list(): no need to check that dentry refcount is marked dead [08/22] fast_dput(): having ->d_delete() is not reason to delay refcount decrement [09/22] fast_dput(): handle underflows gracefully [10/22] fast_dput(): new rules for refcount [11/22] __dput_to_list(): do decrement of refcount in the callers [12/22] Make retain_dentry() neutral with respect to refcounting [13/22] __dentry_kill(): get consistent rules for victim's refcount [14/22] dentry_kill(): don't bother with retain_dentry() on slow path [15/22] Call retain_dentry() with refcount 0 [16/22] fold the call of retain_dentry() into fast_dput() [17/22] don't try to cut corners in shrink_lock_dentry() [18/22] fold dentry_kill() into dput() [19/22] to_shrink_list(): call only if refcount is 0 [20/22] switch select_collect{,2}() to use of to_shrink_list() [21/22] d_prune_aliases(): use a shrink list [22/22] __dentry_kill(): new locking scheme

Al Viro Nov. 9, 2023, 6:20 a.m. UTC

That is to say, do *not* treat the ->d_inode or ->d_parent changes
as "it's hard, return false; somebody must have grabbed it, so
even if has zero refcount, we don't need to bother killing it -
final dput() from whoever grabbed it would've done everything".

First of all, that is not guaranteed.  It might have been dropped
by shrink_kill() handling of victim's parent, which would've found
it already on a shrink list (ours) and decided that they don't need
to put it on their shrink list.

What's more, dentry_kill() is doing pretty much the same thing,
cutting its own set of corners (it assumes that dentry can't
go from positive to negative, so its inode can change but only once
and only in one direction).

Doing that right allows to get rid of that not-quite-duplication
and removes the only reason for re-incrementing refcount before
the call of dentry_kill().

Replacement is called lock_for_kill(); called under rcu_read_lock
and with ->d_lock held.  If it returns false, dentry has non-zero
refcount and the same locks are held.  If it returns true,
dentry has zero refcount and all locks required by __dentry_kill()
are taken.

Part of __lock_parent() had been lifted into lock_parent() to
allow its reuse.  Now it's called with rcu_read_lock already
held and dentry already unlocked.

Note that this is not the final change - locking requirements for
__dentry_kill() are going to change later in the series and the
set of locks taken by lock_for_kill() will be adjusted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c | 159 ++++++++++++++++++++++------------------------------
 1 file changed, 66 insertions(+), 93 deletions(-)

Christian Brauner Nov. 9, 2023, 5:20 p.m. UTC | #1

On Thu, Nov 09, 2023 at 06:20:51AM +0000, Al Viro wrote:
> That is to say, do *not* treat the ->d_inode or ->d_parent changes
> as "it's hard, return false; somebody must have grabbed it, so
> even if has zero refcount, we don't need to bother killing it -
> final dput() from whoever grabbed it would've done everything".
> 
> First of all, that is not guaranteed.  It might have been dropped
> by shrink_kill() handling of victim's parent, which would've found
> it already on a shrink list (ours) and decided that they don't need
> to put it on their shrink list.
> 
> What's more, dentry_kill() is doing pretty much the same thing,
> cutting its own set of corners (it assumes that dentry can't
> go from positive to negative, so its inode can change but only once
> and only in one direction).
> 
> Doing that right allows to get rid of that not-quite-duplication
> and removes the only reason for re-incrementing refcount before
> the call of dentry_kill().
> 
> Replacement is called lock_for_kill(); called under rcu_read_lock
> and with ->d_lock held.  If it returns false, dentry has non-zero
> refcount and the same locks are held.  If it returns true,
> dentry has zero refcount and all locks required by __dentry_kill()
> are taken.
> 
> Part of __lock_parent() had been lifted into lock_parent() to
> allow its reuse.  Now it's called with rcu_read_lock already
> held and dentry already unlocked.
> 
> Note that this is not the final change - locking requirements for
> __dentry_kill() are going to change later in the series and the
> set of locks taken by lock_for_kill() will be adjusted.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

It's a bit unfortunate that __lock_parent() locks the parent *and* may
lock the child which isn't really obvious from the name. It just becomes
clear that this is assumed by how callers release the child's lock.

>  fs/dcache.c | 159 ++++++++++++++++++++++------------------------------
>  1 file changed, 66 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 23afcd48c1a9..801502871671 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -625,8 +625,6 @@ static void __dentry_kill(struct dentry *dentry)
>  static struct dentry *__lock_parent(struct dentry *dentry)
>  {
>  	struct dentry *parent;
> -	rcu_read_lock();
> -	spin_unlock(&dentry->d_lock);
>  again:
>  	parent = READ_ONCE(dentry->d_parent);
>  	spin_lock(&parent->d_lock);
> @@ -642,7 +640,6 @@ static struct dentry *__lock_parent(struct dentry *dentry)
>  		spin_unlock(&parent->d_lock);
>  		goto again;
>  	}
> -	rcu_read_unlock();
>  	if (parent != dentry)
>  		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>  	else
> @@ -657,7 +654,64 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
>  		return NULL;
>  	if (likely(spin_trylock(&parent->d_lock)))
>  		return parent;
> -	return __lock_parent(dentry);
> +	rcu_read_lock();
> +	spin_unlock(&dentry->d_lock);
> +	parent = __lock_parent(dentry);
> +	rcu_read_unlock();
> +	return parent;
> +}
> +
> +/*
> + * Lock a dentry for feeding it to __dentry_kill().
> + * Called under rcu_read_lock() and dentry->d_lock; the former
> + * guarantees that nothing we access will be freed under us.
> + * Note that dentry is *not* protected from concurrent dentry_kill(),
> + * d_delete(), etc.
> + *
> + * Return false if dentry is busy.  Otherwise, return true and have
> + * that dentry's inode and parent both locked.
> + */
> +
> +static bool lock_for_kill(struct dentry *dentry)
> +{
> +	struct inode *inode = dentry->d_inode;
> +	struct dentry *parent = dentry->d_parent;
> +
> +	if (unlikely(dentry->d_lockref.count))
> +		return false;
> +
> +	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
> +		goto slow;
> +	if (dentry == parent)
> +		return true;
> +	if (likely(spin_trylock(&parent->d_lock)))
> +		return true;
> +
> +	if (inode)
> +		spin_unlock(&inode->i_lock);
> +slow:
> +	spin_unlock(&dentry->d_lock);
> +
> +	for (;;) {
> +		if (inode)
> +			spin_lock(&inode->i_lock);
> +		parent = __lock_parent(dentry);

We're under rcu here. Are we sure that this can't trigger rcu timeouts
because we're spinning? Maybe there's a reason that's not an issue here.

That spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED) in
__lock_parent() is there for the sake of lockdep to verify that the
parent lock is always aqcuired before the child lock?

Linus Torvalds Nov. 9, 2023, 5:39 p.m. UTC | #2

On Wed, 8 Nov 2023 at 22:23, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>  static struct dentry *__lock_parent(struct dentry *dentry)
>  {
>         struct dentry *parent;
> -       rcu_read_lock();
> -       spin_unlock(&dentry->d_lock);
>  again:
>         parent = READ_ONCE(dentry->d_parent);
>         spin_lock(&parent->d_lock);

Can we rename this while at it?

That name *used* to make sense, in that the function was entered with
the dentry lock held, and then it returned with the dentry lock *and*
the parent lock held.

But now you've changed the rules so that the dentry lock is *not* held
at entry, so now the semantics of that function is essentially "lock
dentry and parent". Which I think means that the name should change to
reflect that.

Finally: it does look like most callers actually did hold the dentry
lock, and that you just moved the

        spin_unlock(&dentry->d_lock);

from inside that function to the caller. I don't hate that, but now
that I look at it, I get the feeling that what we *should* have done
is

  static struct dentry *__lock_parent(struct dentry *dentry)
  {
        struct dentry *parent = dentry->d_parent;
        if (try_spin_lock(&parent->d_lock))
                return parent;
        /* Uhhuh - need to get the parent lock first */
        .. old code goes here ..

but that won't work with the new world order.

So I get the feeling that maybe instead of renaming it for the new
semantics, maybe the old semantics of "called with the dentry lock
held" were simply better"

                  Linus

Linus Torvalds Nov. 9, 2023, 6:11 p.m. UTC | #3

On Thu, 9 Nov 2023 at 09:39, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Can we rename this while at it?

Never mind. I didn't notice that the thing disappears entirely in 22/22.

Just ignore my blind ass.

                Linus

Al Viro Nov. 9, 2023, 6:20 p.m. UTC | #4

On Thu, Nov 09, 2023 at 09:39:09AM -0800, Linus Torvalds wrote:
> On Wed, 8 Nov 2023 at 22:23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >  static struct dentry *__lock_parent(struct dentry *dentry)
> >  {
> >         struct dentry *parent;
> > -       rcu_read_lock();
> > -       spin_unlock(&dentry->d_lock);
> >  again:
> >         parent = READ_ONCE(dentry->d_parent);
> >         spin_lock(&parent->d_lock);
> 
> Can we rename this while at it?
> 
> That name *used* to make sense, in that the function was entered with
> the dentry lock held, and then it returned with the dentry lock *and*
> the parent lock held.
> 
> But now you've changed the rules so that the dentry lock is *not* held
> at entry, so now the semantics of that function is essentially "lock
> dentry and parent". Which I think means that the name should change to
> reflect that.
> 
> Finally: it does look like most callers actually did hold the dentry
> lock, and that you just moved the
> 
>         spin_unlock(&dentry->d_lock);
> 
> from inside that function to the caller. I don't hate that, but now
> that I look at it, I get the feeling that what we *should* have done
> is
> 
>   static struct dentry *__lock_parent(struct dentry *dentry)
>   {
>         struct dentry *parent = dentry->d_parent;
>         if (try_spin_lock(&parent->d_lock))
>                 return parent;
>         /* Uhhuh - need to get the parent lock first */
>         .. old code goes here ..
> 
> but that won't work with the new world order.

Can't - currently lock_for_kill() uses it in a loop.  Can't have trylocks
in there, or realtime setups will get unhappy.  More to the point, the whole
function is gone by the end of the series.  Along with lock_parent().

The only reason why we needed that thing is that we lock the parent too
early; that's where the last commit in the series is a big win.  There
we remove from the parent's list of children in the very end, when we'd
already made the victim negative (and unlocked it); there ->d_parent
is stable and we can simply lock that, then lock dentry.

We still need a loop in lock_for_kill() to get the inode locked along
with dentry, but that's less convoluted (the ordering between two
->d_lock can change; ->i_lock is always safe to take before ->d_lock).

> So I get the feeling that maybe instead of renaming it for the new
> semantics, maybe the old semantics of "called with the dentry lock
> held" were simply better"

lock_parent() goes aways when d_prune_alias() is switched to shrink list;
after that __lock_parent() is used only in that loop in lock_for_kill()
and only until (22/22) when lock_for_kill() stops touching the parent.
After that it's simply gone.

Al Viro Nov. 9, 2023, 9:45 p.m. UTC | #5

On Thu, Nov 09, 2023 at 06:20:08PM +0100, Christian Brauner wrote:

> It's a bit unfortunate that __lock_parent() locks the parent *and* may
> lock the child which isn't really obvious from the name. It just becomes
> clear that this is assumed by how callers release the child's lock.

__lock_parent() is gone by the end of the series.

> We're under rcu here. Are we sure that this can't trigger rcu timeouts
> because we're spinning? Maybe there's a reason that's not an issue here.

Spinning happens only if somebody is busy moving that dentry from
directory to directory or back-and-forth turning it negative/positive
with different inode.  It's not a livelock situation - for each
iteration you need a successful rename() and/or unlink()/creat() pair
on the dentry in question.  Coming rapidly enough to cause you
spinning there...

Note that lock_parent() had that loop under rcu for a long time;
so did dget_parent().  I don't remember seeing rcu timeout warnings
about either...

> That spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED) in
> __lock_parent() is there for the sake of lockdep to verify that the
> parent lock is always aqcuired before the child lock?

Yes.

Christian Brauner Nov. 10, 2023, 9:07 a.m. UTC | #6

On Thu, Nov 09, 2023 at 09:45:37PM +0000, Al Viro wrote:
> On Thu, Nov 09, 2023 at 06:20:08PM +0100, Christian Brauner wrote:
> 
> > It's a bit unfortunate that __lock_parent() locks the parent *and* may
> > lock the child which isn't really obvious from the name. It just becomes
> > clear that this is assumed by how callers release the child's lock.
> 
> __lock_parent() is gone by the end of the series.

Yes, I saw that once I got to the end of the series. Thanks.

[17/22] don't try to cut corners in shrink_lock_dentry()

Commit Message

Comments

Patch