diff mbox

[BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

Message ID 20180223201317.GG30522@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro Feb. 23, 2018, 8:13 p.m. UTC
On Fri, Feb 23, 2018 at 05:42:16PM +0000, Al Viro wrote:

> 	4) the nasty one - shrink_dentry_list() evictions of zero-count dentries.
> _That_ calls for careful use of RCU, etc. - none of the others need that.  Need
> to think how to deal with that sucker; in any case, I do not believe that sharing
> said RCU use, etc. with any other cases would do anything other than obfuscating
> the rest.

Arrrgh...  Actually, there's a nasty corner case where the variant in mainline is
broken.  Look:
	dentry placed on a shrink list
	we pick the fucker from the list and lock it.
	we call lock_parent() on it.
		dentry is not a root and it's not deleted, so we proceed.
		trylock fails.
		we grab rcu_read_lock()
		we drop dentry->d_lock
	on another CPU, something does e.g. d_prune_aliases() (or finds the
	sucker in hash and proceeds to unhash and dput(), etc.) - anything
	that evicts that dentry.  It is marked with DCACHE_MAY_FREE and left
	alone.  The parent, OTOH, is dropped and freeing gets scheduled as
	soon as RCU allows.
		we grab parent->d_lock
		we verify that dentry->d_parent is still the same (it is)
		we do rcu_read_unlock()
		we grab dentry->d_lock
		we return parent
At that point we are fucked - there's nothing to prevent parent from being
freed at any point.  And we assume that its ->d_lock is held and needs to
be dropped.

The call site in d_prune_aliases() avoids the same scenario, since there we
are already holding ->i_lock and another thread won't get to __dentry_kill()
until we are done with lock_parent().

Unless I'm missing something, that's a (narrow) memory corruptor.  The window is
narrow, but not impossibly so - if that other thread had been spinning on attempt
to grab dentry->d_lock in d_prune_alias(), it has to squeeze through
                if (!dentry->d_lockref.count) {
and then in lock_parent() called there - through
        if (IS_ROOT(dentry))
                return NULL;
        if (unlikely(dentry->d_lockref.count < 0))
                return NULL;
        if (likely(spin_trylock(&parent->d_lock)))
before the first CPU gets through
        parent = READ_ONCE(dentry->d_parent);
        spin_lock(&parent->d_lock);

The first CPU can't be preempted, but there's nothing to prevent an IRQ arriving
at that point, letting the second one win the race.

Comments?

I think the (untested) patch below is -stable fodder:

lock_parent() needs to recheck if dentry got __dentry_kill'ed under it

In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent.  We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point.  Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Linus Torvalds Feb. 23, 2018, 9:35 p.m. UTC | #1
On Fri, Feb 23, 2018 at 12:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Look:
>         dentry placed on a shrink list
>         we pick the fucker from the list and lock it.
>         we call lock_parent() on it.
>                 dentry is not a root and it's not deleted, so we proceed.
>                 trylock fails.
>                 we grab rcu_read_lock()
>                 we drop dentry->d_lock

[ deleted the bad things ]

Should we just instead get the ref to the dentry before dropping the
lock, so that nobody else can get to dentry_kill?

This is too subtle, and your fix to check d_lockref.count < 0 sounds
wrong to me. If it's really gone, maybe it has been reused and the
refcount is positive again, but it's something else than a dentry
entirely?

Hmm.

No, you extended the rcu read section, so I guess your patch is fine.
And lock_parent already has that pattern, soiit's not new.

Ok, I agree, looks like lock_parent should just re-check that thing
that it already checked earler, but that now might be true again
because of we dropped d_lock.

              Linus
Al Viro Feb. 24, 2018, 12:22 a.m. UTC | #2
On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:

> This is too subtle, and your fix to check d_lockref.count < 0 sounds
> wrong to me. If it's really gone, maybe it has been reused and the
> refcount is positive again, but it's something else than a dentry
> entirely?
> 
> Hmm.
> 
> No, you extended the rcu read section, so I guess your patch is fine.
> And lock_parent already has that pattern, soiit's not new.
> 
> Ok, I agree, looks like lock_parent should just re-check that thing
> that it already checked earler, but that now might be true again
> because of we dropped d_lock.

IMO that's the right thing for backports; whether we keep it after
the getting rid of trylock loops is a different question.  Note that
the only case where we do not have __dentry_kill() prevention
guaranteed by the caller (either by holding a reference, or by
holding onto ->i_lock all along) is in shrink_dentry_list().
And there we have more than enough of other subtle crap.

Moreover, there we have a good reason to treat "it had been moved"
as "kick it off the shrink list and free if it's already dead",
which might simplify the things.  Below is a stab at that:

/*
 * ONLY for shrink_dentry_list() - it returns false if it finds
 * dentry grabbed, moved or killed, which is fine there but not
 * anywhere else.  OTOH, nobody else needs to deal with dentries
 * getting killed under them.
 */
static bool shrink_lock_for_kill(struct dentry *dentry)
{
	if (dentry->d_lockref.count)
		return false;

	inode = dentry->d_inode;
	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
		rcu_read_lock();	/* to protect inode */
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (unlikely(dentry->d_lockref.count))
			goto out;
		/* changed inode means that somebody had grabbed it */
		if (unlikely(inode != dentry->d_inode))
			goto out;
		rcu_read_unlock();
	}

	parent = dentry->d_parent;
	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
		return true;
	
	rcu_read_lock();		/* to protect parent */
	spin_unlock(&dentry->d_lock);
	parent = READ_ONCE(dentry->d_parent);
	spin_lock(&parent->d_lock);
	if (unlikely(parent != dentry->d_parent)) {
		spin_unlock(&parent->d_lock);
		spin_lock(&dentry->d_lock);
		goto out;
	}
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	if (likely(!dentry->d_lockref.count)) {
		rcu_read_unlock();
		return true;
	}
	spin_unlock(&parent->d_lock);
out:
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();
	return false;
}

static void shrink_dentry_list(struct list_head *list)
{
	struct dentry *dentry, *parent;

	while (!list_empty(list)) {
		struct inode *inode;
		dentry = list_entry(list->prev, struct dentry, d_lru);
		spin_lock(&dentry->d_lock);
		if (!shrink_lock_for_kill(dentry)) {
			bool can_free = false;
			d_shrink_del(dentry);
			if (dentry->d_lockref.count < 0)
				can_free = dentry->d_flags & DCACHE_MAY_FREE;
			spin_unlock(&dentry->d_lock);
			if (can_free)
				dentry_free(dentry);
			continue;
		}
		d_shrink_del(dentry);
		parent = dentry->d_parent;
		__dentry_kill(dentry);
		if (dentry == parent)
			continue;
		dentry = parent;
		.... 
		same as now
		....
	}
}
Al Viro Feb. 25, 2018, 7:40 a.m. UTC | #3
On Sat, Feb 24, 2018 at 12:22:48AM +0000, Al Viro wrote:
> On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:
> 
> > This is too subtle, and your fix to check d_lockref.count < 0 sounds
> > wrong to me. If it's really gone, maybe it has been reused and the
> > refcount is positive again, but it's something else than a dentry
> > entirely?
> > 
> > Hmm.
> > 
> > No, you extended the rcu read section, so I guess your patch is fine.
> > And lock_parent already has that pattern, soiit's not new.
> > 
> > Ok, I agree, looks like lock_parent should just re-check that thing
> > that it already checked earler, but that now might be true again
> > because of we dropped d_lock.
> 
> IMO that's the right thing for backports; whether we keep it after
> the getting rid of trylock loops is a different question.  Note that
> the only case where we do not have __dentry_kill() prevention
> guaranteed by the caller (either by holding a reference, or by
> holding onto ->i_lock all along) is in shrink_dentry_list().
> And there we have more than enough of other subtle crap.
> 
> Moreover, there we have a good reason to treat "it had been moved"
> as "kick it off the shrink list and free if it's already dead",
> which might simplify the things.  Below is a stab at that:

FWIW, a variant of that series is in #work.dcache; it's
almost certainly not the final (I want to clean the things up
and probably reorder them as well) and it needs one hell of
profiling and review.  It seems to work, and I don't see any
obvious performance regressions (everything seems to be within
normal noise), but I'd like it beaten up a lot more.

Parts are from John's series, parts are rewritten as discussed
upthread.  As the result, trylock loops are gone and no new
retry loops had been added in their place - lock_parent() still
has one, but that's it.

I'll play with cleaning up and reordering tomorrow after I get
some sleep.  In the meanwhile, the current state of that set is at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache
John Ogness Feb. 27, 2018, 5:16 a.m. UTC | #4
On 2018-02-25, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> I'll play with cleaning up and reordering tomorrow after I get some
> sleep.  In the meanwhile, the current state of that set is at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

I have one comment on your new dentry_kill()...

> From e77db95d1ad189c76bcda7b90ac2225e98d776a2 Mon Sep 17 00:00:00 2001
> From: Al Viro <viro@zeniv.linux.org.uk>
> Date: Fri, 23 Feb 2018 21:25:42 -0500
> Subject: [PATCH] get rid of trylock loop around dentry_kill()
>
> In case when trylock in there fails, deal with it directly in
> dentry_kill().  Note that in cases when we drop and retake
> ->d_lock, we need to recheck whether to retain the dentry.
> Another thing is that dropping/retaking ->d_lock might have
> ended up with negative dentry turning into positive; that,
> of course, can happen only once...
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 449c1a5..c1f082d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -657,23 +657,43 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>  	struct dentry *parent = NULL;
>  
>  	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
> -		goto failed;
> +		goto slow_positive;
>  
>  	if (!IS_ROOT(dentry)) {
>  		parent = dentry->d_parent;
>  		if (unlikely(!spin_trylock(&parent->d_lock))) {
> -			if (inode)
> -				spin_unlock(&inode->i_lock);
> -			goto failed;
> +			parent = __lock_parent(dentry);
> +			if (likely(inode || !dentry->d_inode))
> +				goto got_locks;
> +			/* negative that became positive */
> +			if (parent)
> +				spin_unlock(&parent->d_lock);
> +			inode = dentry->d_inode;
> +			goto slow_positive;
>  		}
>  	}
> -
>  	__dentry_kill(dentry);
>  	return parent;
>  
> -failed:
> +slow_positive:
> +	spin_unlock(&dentry->d_lock);
> +	spin_lock(&inode->i_lock);
> +	spin_lock(&dentry->d_lock);
> +	parent = lock_parent(dentry);
> +got_locks:
> +	if (likely(dentry->d_lockref.count == 1 && !retain_dentry(dentry))) {
> +		__dentry_kill(dentry);
> +		return parent;
> +	}
> +	/* we are keeping it, after all */
> +	if (inode)
> +		spin_unlock(&inode->i_lock);
> +	if (parent)
> +		spin_unlock(&parent->d_lock);

If someone else has grabbed a reference, it shouldn't be added to the
lru list. Only decremented.

if (entry->d_lockref.count == 1)

> +	dentry_lru_add(dentry);
> +	dentry->d_lockref.count--;
>  	spin_unlock(&dentry->d_lock);
> -	return dentry; /* try again with same dentry */
> +	return NULL;
>  }
>  
>  /*
Sebastian Andrzej Siewior March 2, 2018, 9:04 a.m. UTC | #5
On 2018-02-25 07:40:05 [+0000], Al Viro wrote:
> FWIW, a variant of that series is in #work.dcache; it's
> almost certainly not the final (I want to clean the things up
> and probably reorder them as well) and it needs one hell of
> profiling and review.  It seems to work, and I don't see any
> obvious performance regressions (everything seems to be within
> normal noise), but I'd like it beaten up a lot more.

If you have something in mind, I could fire up something.

> I'll play with cleaning up and reordering tomorrow after I get
> some sleep.  In the meanwhile, the current state of that set is at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

I've put it in -RT, let it do some upgrades and other things and nothing
bad happened so far.

Sebastian
Al Viro March 12, 2018, 7:13 p.m. UTC | #6
On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:

> If someone else has grabbed a reference, it shouldn't be added to the
> lru list. Only decremented.
> 
> if (entry->d_lockref.count == 1)

Nah, better handle that in retain_dentry() itself.  See updated
#work.dcache.

Note: another potentially fun thing in that branch is that I've
finally decided to bite the bullet and make __d_move() preserve
->d_parent of target.

Mainline:
al@sonny:/tmp$ touch d
al@sonny:/tmp$ sleep 100 >/tmp/a/b/c &
[1] 16487
al@sonny:/tmp$ ls -l /proc/16487/fd
total 0
lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c
lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c 
al@sonny:/tmp$ ls -l /proc/16487/fd
total 0
lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted)
lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13

With that branch:
root@kvm1:/tmp# touch d
root@kvm1:/tmp# sleep 100 >/tmp/a/b/c &
[1] 2263
root@kvm1:/tmp# ls -l /proc/2263/fd
total 0
lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c
lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c
root@kvm1:/tmp# ls -l /proc/2263/fd
total 0
lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)'
lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0

It doesn't come quite for free; cross-directory d_move()
and d_exchange() callers are responsible for having both
parents pinned (all of them do that, mostly since the usual
sequence is "look parents up, lock_rename(), *then* look
children up, then do renaming"; those that are not part of
rename(2) are also OK) and d_splice_alias() has become potentially
blocking in one case.  AFAICS, none of the callers is in
locking environment that would not allow that.  Survives
the local beating and doesn't seem to cause any performance
regressions.

What we get out of that is
	a) much saner semantics for d_move() et.al.
	b) saner behaviour of d_path() (see above)
	c) dentry can be IS_ROOT only if it has been
such all along; that simplifies the hell out of analysis.

FWIW, there's another trylock loop on dentries - one in
autofs get_next_positive_dentry().  Any plans re dealing
with that one?

I'd spent the last couple of weeks (when not being too sick
for any work) going through dcache.c and related code; hopefully
this time I will get the documentation into postable shape ;-/

There's an unpleasant area around the ->s_root vs. NFS.  There's
code that makes assumptions about ->s_root that are simply not true
for NFS.  Is path_connected() correct wrt NFS multiple imports from
the same server?  Ditto for mnt_already_visible() (that one might
be mitigated at the moment, but probably won't last).  Eric, am
I missing something subtle in there?
Al Viro March 12, 2018, 8:05 p.m. UTC | #7
On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote:

> There's an unpleasant area around the ->s_root vs. NFS.  There's
> code that makes assumptions about ->s_root that are simply not true
> for NFS.  Is path_connected() correct wrt NFS multiple imports from
> the same server?  Ditto for mnt_already_visible() (that one might
> be mitigated at the moment, but probably won't last).  Eric, am
> I missing something subtle in there?

BTW, another thing spotted while trying to document the stuff:
Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle
side effect I hadn't spotted when applying.  Namely, for
DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have
d_unhashed() true.  That makes d_find_alias() logics around
discon_alias dead code:
                if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
                        if (IS_ROOT(alias) &&
                            (alias->d_flags & DCACHE_DISCONNECTED)) {
                                discon_alias = alias;
                        } else {
                                __dget_dlock(alias);
                                spin_unlock(&alias->d_lock);
                                return alias;
                        }
                }

Directory case is a red herring - we'll end up picking one and only
alias, whether it's IS_ROOT/disconnected or not.  For non-directories,
though, IS_ROOT and disconnected implies d_unhashed() now, so we might
as well turn that into
	if directory
		return d_find_any_alias
	else
		return the first hashed alias

Does any of the d_find_alias() callers want those dentries?  That is,
non-directories from d_obtain_alias() still not attached to a parent;
note that exportfs_decode_fh() is *not* one of such places - we don't
use d_find_alias() there at all.  If there's no such caller, we can
bloody well just drop the discon_alias logics and be done with that;
if there is, that commit has introduced a bug.  I might have missed
a part of threads related to that patch, so my apologies if it had
been discussed.

Neil, what's the situation there?  A lot of those callers clearly treat
the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
it looks like the change might have been the right thing, but it sure
as hell shouldn't be an undocumented side effect...
Eric W. Biederman March 12, 2018, 8:23 p.m. UTC | #8
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:
>
>> If someone else has grabbed a reference, it shouldn't be added to the
>> lru list. Only decremented.
>> 
>> if (entry->d_lockref.count == 1)
>
> Nah, better handle that in retain_dentry() itself.  See updated
> #work.dcache.
>
> Note: another potentially fun thing in that branch is that I've
> finally decided to bite the bullet and make __d_move() preserve
> ->d_parent of target.
>
> Mainline:
> al@sonny:/tmp$ touch d
> al@sonny:/tmp$ sleep 100 >/tmp/a/b/c &
> [1] 16487
> al@sonny:/tmp$ ls -l /proc/16487/fd
> total 0
> lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
> l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c
> lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
> al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c 
> al@sonny:/tmp$ ls -l /proc/16487/fd
> total 0
> lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
> l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted)
> lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
>
> With that branch:
> root@kvm1:/tmp# touch d
> root@kvm1:/tmp# sleep 100 >/tmp/a/b/c &
> [1] 2263
> root@kvm1:/tmp# ls -l /proc/2263/fd
> total 0
> lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
> l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c
> lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
> root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c
> root@kvm1:/tmp# ls -l /proc/2263/fd
> total 0
> lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
> l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)'
> lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
>
> It doesn't come quite for free; cross-directory d_move()
> and d_exchange() callers are responsible for having both
> parents pinned (all of them do that, mostly since the usual
> sequence is "look parents up, lock_rename(), *then* look
> children up, then do renaming"; those that are not part of
> rename(2) are also OK) and d_splice_alias() has become potentially
> blocking in one case.  AFAICS, none of the callers is in
> locking environment that would not allow that.  Survives
> the local beating and doesn't seem to cause any performance
> regressions.
>
> What we get out of that is
> 	a) much saner semantics for d_move() et.al.
> 	b) saner behaviour of d_path() (see above)
> 	c) dentry can be IS_ROOT only if it has been
> such all along; that simplifies the hell out of analysis.
>
> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?
>
> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/
>
> There's an unpleasant area around the ->s_root vs. NFS.  There's
> code that makes assumptions about ->s_root that are simply not true
> for NFS.  Is path_connected() correct wrt NFS multiple imports from
> the same server? Ditto for mnt_already_visible() (that one might
> be mitigated at the moment, but probably won't last).  Eric, am
> I missing something subtle in there?

I don't have the entire context in my head.  But I don't think we
have problems today.

NFS before it uses paths from an unconnected root in the rest of
the vfs walks those paths backwards and makes the paths connect.
I don't remember where all of that code that performs those connections
but I do remember the code in fs/fhandle.c shares that code with nfs, to
perform the same operation in open_by_handle_at.

So I don't think the nfs peculiarities are actually relevant to
anything on an ordinary code path.


Of the two code paths you are concert about:

For path path_connected looking at s_root is a heuristic to avoid
calling is_subdir every time we need to do that check.  If the heuristic
fails we still have is_subdir which should remain accurate.  If
is_subdir fails the path is genuinely not connected at that moment
and failing is the correct thing to do.


For mnt_too_revealing the only filesystems under consideration are
proc and sysfs.   So nfs oddities are of no consequence.
mnt_too_revealing probably won't be extended to other filesystems.
Certainly nfs is not a candidate for having setting SB_I_USERNS_VISIBLE.


Al is that sufficient to address your concerns?

Eric
Al Viro March 12, 2018, 8:33 p.m. UTC | #9
On Mon, Mar 12, 2018 at 08:05:41PM +0000, Al Viro wrote:

> Does any of the d_find_alias() callers want those dentries?  That is,
> non-directories from d_obtain_alias() still not attached to a parent;
> note that exportfs_decode_fh() is *not* one of such places - we don't
> use d_find_alias() there at all.  If there's no such caller, we can
> bloody well just drop the discon_alias logics and be done with that;
> if there is, that commit has introduced a bug.  I might have missed
> a part of threads related to that patch, so my apologies if it had
> been discussed.
> 
> Neil, what's the situation there?  A lot of those callers clearly treat
> the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
> it looks like the change might have been the right thing, but it sure
> as hell shouldn't be an undocumented side effect...

FWIW, callers are
drivers/staging/lustre/lustre/llite/llite_lib.c:2448:           dentry = d_find_alias(page->mapping->host);
	definitely doesn't want them
fs/ceph/caps.c:2945:    while ((dn = d_find_alias(inode))) {
fs/ceph/mds_client.c:1930:      dentry = d_find_alias(inode);
	ditto - it's building a pathname.
fs/ceph/mds_client.c:2910:      dentry = d_find_alias(inode);
	ditto.
fs/ceph/mds_client.c:3374:      parent = d_find_alias(inode);
	clearly at least expects a directory (feeds to d_lookup(), etc.)
fs/coda/cache.c:112:    alias_de = d_find_alias(inode);
	directory-only
fs/fat/namei_vfat.c:736:        alias = d_find_alias(inode);
	we will reject any IS_ROOT there anyway
fs/fs-writeback.c:2071:         dentry = d_find_alias(inode);
	wants a name, these don't have one.
fs/fuse/dir.c:966:      dir = d_find_alias(parent);
	directory-only
fs/hostfs/hostfs_kern.c:129:    dentry = d_find_alias(ino);
	wants a pathname
fs/nilfs2/super.c:931:          dentry = d_find_alias(inode);
	directory-only
fs/nilfs2/super.c:1025:                 dentry = d_find_alias(inode);
	bloody better be a directory (root one, at that)
fs/xfs/xfs_filestream.c:293:    dentry = d_find_alias(inode);
	no parent, no love.
mm/shmem.c:3435:                dentry = d_find_alias(inode);
	no d_obtain_alias() on that one
security/commoncap.c:391:       dentry = d_find_alias(inode);
security/lsm_audit.c:295:               dentry = d_find_alias(inode);
	wants a name
security/selinux/hooks.c:1536:                  dentry = d_find_alias(inode);
security/selinux/hooks.c:1646:                          dentry = d_find_alias(inode);

So all but four of them clearly do *not* want those suckers.  And remaining
ones are in
	* ceph invalidate_aliases()
	* cap_inode_getsecurity()
	* selinux inode_doinit_with_dentry() (two call sites, very alike)
Do any of those want that stuff?
Al Viro March 12, 2018, 8:39 p.m. UTC | #10
On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote:

> Of the two code paths you are concert about:
> 
> For path path_connected looking at s_root is a heuristic to avoid
> calling is_subdir every time we need to do that check.  If the heuristic
> fails we still have is_subdir which should remain accurate.  If
> is_subdir fails the path is genuinely not connected at that moment
> and failing is the correct thing to do.
 
Umm...  That might be not good enough - the logics is "everything's
reachable from ->s_root anyway, so we might as well not bother checking".
For NFS it's simply not true.

We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b
and we'll have ->s_root pointing to a subtree of what's reachable at
/tmp/b.  Play with renames under /tmp/b and you just might end up with
a problem.  And mount on /tmp/a will be (mistakenly) considered to
be safe, since it satisfies the heuristics in path_connected().
Thomas Gleixner March 12, 2018, 10:14 p.m. UTC | #11
On Mon, 12 Mar 2018, Al Viro wrote:
> On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:
> What we get out of that is
> 	a) much saner semantics for d_move() et.al.
> 	b) saner behaviour of d_path() (see above)
> 	c) dentry can be IS_ROOT only if it has been
> such all along; that simplifies the hell out of analysis.

Nice.

> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?

It's on our list and we wanted to wait for the final resolution of dcache
before tackling it. Do you have any immediate ide about that?

> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/

Documentation is surely welcome. Paging that code back in is a mindboggling
exercise.

Thanks,

	tglx
Eric W. Biederman March 12, 2018, 11:28 p.m. UTC | #12
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote:
>
>> Of the two code paths you are concert about:
>> 
>> For path path_connected looking at s_root is a heuristic to avoid
>> calling is_subdir every time we need to do that check.  If the heuristic
>> fails we still have is_subdir which should remain accurate.  If
>> is_subdir fails the path is genuinely not connected at that moment
>> and failing is the correct thing to do.
>  
> Umm...  That might be not good enough - the logics is "everything's
> reachable from ->s_root anyway, so we might as well not bother checking".
> For NFS it's simply not true.

If I am parsing the code correctly path_connected is broken for nfsv2
and nfsv3 when NFS_MOUNT_UNSHARED is not set.   nfsv4 appears to make
a kernel mount of the real root of the filesystem properly setting
s_root and then finds the child it is mounting.

> We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b
> and we'll have ->s_root pointing to a subtree of what's reachable at
> /tmp/b.  Play with renames under /tmp/b and you just might end up with
> a problem.  And mount on /tmp/a will be (mistakenly) considered to
> be safe, since it satisfies the heuristics in path_connected().

Agreed.

Which means that if you mount server:/foo/bar/baz first and then
mount server:/foo with an appropriate rename you might be able
to see all of server:/foo or possibly server:/

Hmm..

Given that nfs_kill_super calls generic_shutdown_super and
generic_shutdown_super calls shrink_dcache_for_umount
I would argue that nfsv2 and nfsv3 are buggy in the same
case, as shrink_dcache_for_umount is called on something
that is not the root of the filesystem's dentry tree.

Eric
Al Viro March 13, 2018, 12:36 a.m. UTC | #13
On Mon, Mar 12, 2018 at 06:28:30PM -0500, Eric W. Biederman wrote:

> Given that nfs_kill_super calls generic_shutdown_super and
> generic_shutdown_super calls shrink_dcache_for_umount
> I would argue that nfsv2 and nfsv3 are buggy in the same
> case, as shrink_dcache_for_umount is called on something
> that is not the root of the filesystem's dentry tree.

That's not a problem - it will simply evict a subtree first,
then go through ->s_roots and kill each piece.
NeilBrown March 13, 2018, 1:12 a.m. UTC | #14
On Mon, Mar 12 2018, Al Viro wrote:

> On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote:
>
>> There's an unpleasant area around the ->s_root vs. NFS.  There's
>> code that makes assumptions about ->s_root that are simply not true
>> for NFS.  Is path_connected() correct wrt NFS multiple imports from
>> the same server?  Ditto for mnt_already_visible() (that one might
>> be mitigated at the moment, but probably won't last).  Eric, am
>> I missing something subtle in there?
>
> BTW, another thing spotted while trying to document the stuff:
> Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle
> side effect I hadn't spotted when applying.  Namely, for
> DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have
> d_unhashed() true.  That makes d_find_alias() logics around
> discon_alias dead code:
>                 if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>                         if (IS_ROOT(alias) &&
>                             (alias->d_flags & DCACHE_DISCONNECTED)) {
>                                 discon_alias = alias;
>                         } else {
>                                 __dget_dlock(alias);
>                                 spin_unlock(&alias->d_lock);
>                                 return alias;
>                         }
>                 }

I agree that this code is now more complex than needed.

>
> Directory case is a red herring - we'll end up picking one and only
> alias, whether it's IS_ROOT/disconnected or not.  For non-directories,
> though, IS_ROOT and disconnected implies d_unhashed() now, so we might
> as well turn that into
> 	if directory
> 		return d_find_any_alias
> 	else
> 		return the first hashed alias

Yes, this looks like a good simplification.

>
> Does any of the d_find_alias() callers want those dentries?  That is,
> non-directories from d_obtain_alias() still not attached to a parent;
> note that exportfs_decode_fh() is *not* one of such places - we don't
> use d_find_alias() there at all.  If there's no such caller, we can
> bloody well just drop the discon_alias logics and be done with that;
> if there is, that commit has introduced a bug.  I might have missed
> a part of threads related to that patch, so my apologies if it had
> been discussed.
>
> Neil, what's the situation there?  A lot of those callers clearly treat
> the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
> it looks like the change might have been the right thing, but it sure
> as hell shouldn't be an undocumented side effect...

Many of the callers really want a name, which these disconnected
dentries didn't have, so the new code is more correct in those cases.

d_find_alias() is documented as returning a hashed dentry (for
non-directories) and I suspect most people would be surprised to find
that a disconnected dentry with no name was considered to be "hashed",
so I think the patch could be seen as fixing the documentation.

Of the three users that you didn't classify as "clearly" not wanting the
disconnected dentries:

>	     * ceph invalidate_aliases()

This wants to ensure the dentries are discarded on final dput().  This
is already the case for disconnected dentries, so it doesn't need to be
told about them.  Code is correct.


>	     * cap_inode_getsecurity()

If this gets invoked when the nfsd server calls vfs_getxattr() on a
disconnected dentry, it will return -EINVAL.  This looks like a
potential bug.
nfsd only calls vfs_getxattr() in nfsd4_is_junction() so this bug would
not affect many use cases.
I think cap_inode_getsecurity() should really be calling
d_find_any_alias().

>	     * selinux inode_doinit_with_dentry() (two call sites, very alike)

I'm less sure about this one, but I think it probably wants
d_find_any_alias() as well.  Like cap_inode_getsecurity() it only wants
a dentry so that it can pass something to __vfs_getxattr(),
and that only wants a dentry so it can pass something to ->get.

Possibly we should rename d_find_alias() to d_find_hashed_alias() so that
people need to make a conscious choice between d_find_hashed_alias() and
d_find_any_alias() ??

Thanks,
NeilBrown
John Ogness March 13, 2018, 8:46 p.m. UTC | #15
On 2018-03-12, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>> If someone else has grabbed a reference, it shouldn't be added to the
>> lru list. Only decremented.
>> 
>> if (entry->d_lockref.count == 1)
>
> Nah, better handle that in retain_dentry() itself.  See updated
> #work.dcache.
>
> +	if (unlikely(dentry->d_lockref.count != 1)) {
> +		dentry->d_lockref.count--;
> +	} else if (likely(!retain_dentry(dentry))) {
> +		__dentry_kill(dentry);
> +		return parent;
> +	}

Although the updated version is correct (and saves on lines of code), I
find putting the deref and lru_add code in the "true" case of
retain_dentry() to be pretty tricky. IMHO the code is easier to
understand if it looks like this:

	if (unlikely(dentry->d_lockref.count != 1)) {
		dentry->d_lockref.count--;
	} else if (likely(!retain_dentry(dentry))) {
		__dentry_kill(dentry);
		return parent;
	} else {
		dentry->d_lockref.count--;
		dentry_lru_add(dentry);
	}

This is what your version is doing, but that final else is hiding in the
retain_dentry() "true" case.

My suggestion is to revert 7479f57fecd2a4837b5c79ce1cf0dcf284db54be (and
then fixup dput() to deref before calling dentry_lru_add()).

> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?

I will need to dig into it a bit deeper (I am unfamiliar with autofs),
but it looks like it is trying to do basically the same thing as the
ascend loop in d_walk().

> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/

Thank you for all your help in getting these changes cleaned and
correctly implemented so quickly.

I've reviewed your latest trylock loop removal patches and found only 1
minor issue. I'll post that separately.

John Ogness
John Ogness March 13, 2018, 9:05 p.m. UTC | #16
Hi Al,

1 minor issue on the new shrink_lock_dentry()...

> From 121a8e0834862d2c5d88c95f8e6bc8984f195abf Mon Sep 17 00:00:00 2001
> From: Al Viro <viro@zeniv.linux.org.uk>
> Date: Fri, 23 Feb 2018 21:54:18 -0500
> Subject: [PATCH] get rid of trylock loop in locking dentries on shrink
>  list
>
> In case of trylock failure don't re-add to the list - drop the locks
> and carefully get them in the right order.  For shrink_dentry_list(),
> somebody having grabbed a reference to dentry means that we can
> kick it off-list, so if we find dentry being modified under us we
> don't need to play silly buggers with retries anyway - off the list
> it is.
>
> The locking logics taken out into a helper of its own; lock_parent()
> is no longer used for dentries that can be killed under us.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c | 103 ++++++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 66 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 1684b6b..58097fd 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -974,56 +974,85 @@ void d_prune_aliases(struct inode *inode)
>  }
>  EXPORT_SYMBOL(d_prune_aliases);
>  
> -static void shrink_dentry_list(struct list_head *list)
> +/*
> + * Lock a dentry from shrink list.
> + * Note that dentry is *not* protected from concurrent dentry_kill(),
> + * d_delete(), etc.  It is protected from freeing (by the fact of
> + * being on a shrink list), but everything else is fair game.
> + * Return false if dentry has been disrupted or grabbed, leaving
> + * the caller to kick it off-list.  Otherwise, return true and have
> + * that dentry's inode and parent both locked.
> + */
> +static bool shrink_lock_dentry(struct dentry *dentry)
>  {
> -	struct dentry *dentry, *parent;
> +	struct inode *inode;
> +	struct dentry *parent;
>  
> -	while (!list_empty(list)) {
> -		struct inode *inode;
> -		dentry = list_entry(list->prev, struct dentry, d_lru);
> +	if (dentry->d_lockref.count)
> +		return false;
> +
> +	inode = dentry->d_inode;
> +	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
> +		rcu_read_lock();	/* to protect inode */
> +		spin_unlock(&dentry->d_lock);
> +		spin_lock(&inode->i_lock);
>  		spin_lock(&dentry->d_lock);
> -		parent = lock_parent(dentry);
> +		if (unlikely(dentry->d_lockref.count))
> +			goto out;
> +		/* changed inode means that somebody had grabbed it */
> +		if (unlikely(inode != dentry->d_inode))
> +			goto out;
> +		rcu_read_unlock();
> +	}
>  
> -		/*
> -		 * The dispose list is isolated and dentries are not accounted
> -		 * to the LRU here, so we can simply remove it from the list
> -		 * here regardless of whether it is referenced or not.
> -		 */
> -		d_shrink_del(dentry);
> +	parent = dentry->d_parent;
> +	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
> +		return true;
>  
> -		/*
> -		 * We found an inuse dentry which was not removed from
> -		 * the LRU because of laziness during lookup. Do not free it.
> -		 */
> -		if (dentry->d_lockref.count > 0) {
> -			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
> -			continue;
> -		}
> +	rcu_read_lock();		/* to protect parent */
> +	spin_unlock(&dentry->d_lock);
> +	parent = READ_ONCE(dentry->d_parent);

The preceeding line should be removed. We already have a "parent" from
before we did the most recent trylock().

> +	spin_lock(&parent->d_lock);
> +	if (unlikely(parent != dentry->d_parent)) {
> +		spin_unlock(&parent->d_lock);
> +		spin_lock(&dentry->d_lock);
> +		goto out;
> +	}
> +	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> +	if (likely(!dentry->d_lockref.count)) {
> +		rcu_read_unlock();
> +		return true;
> +	}
> +	spin_unlock(&parent->d_lock);
> +out:
> +	spin_unlock(&inode->i_lock);
> +	rcu_read_unlock();
> +	return false;
> +}
>  
> +static void shrink_dentry_list(struct list_head *list)
> +{
> +	while (!list_empty(list)) {
> +		struct dentry *dentry, *parent;
> +		struct inode *inode;
>  
> -		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
> -			bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
> +		dentry = list_entry(list->prev, struct dentry, d_lru);
> +		spin_lock(&dentry->d_lock);
> +		if (!shrink_lock_dentry(dentry)) {
> +			bool can_free = false;
> +			d_shrink_del(dentry);
> +			if (dentry->d_lockref.count < 0)
> +				can_free = dentry->d_flags & DCACHE_MAY_FREE;
>  			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
>  			if (can_free)
>  				dentry_free(dentry);
>  			continue;
>  		}
> -
> -		inode = dentry->d_inode;
> -		if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
> -			d_shrink_add(dentry, list);
> -			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
> -			continue;
> -		}
> -
> +		d_shrink_del(dentry);
> +		parent = dentry->d_parent;
>  		__dentry_kill(dentry);
> -
> +		if (parent == dentry)
> +			continue;
>  		/*
>  		 * We need to prune ancestors too. This is necessary to prevent
>  		 * quadratic behavior of shrink_dcache_parent(), but is also

John Ogness
Al Viro April 28, 2018, 12:10 a.m. UTC | #17
On Tue, Mar 13, 2018 at 12:12:51PM +1100, NeilBrown wrote:

> >	     * selinux inode_doinit_with_dentry() (two call sites, very alike)
> 
> I'm less sure about this one, but I think it probably wants
> d_find_any_alias() as well.  Like cap_inode_getsecurity() it only wants
> a dentry so that it can pass something to __vfs_getxattr(),
> and that only wants a dentry so it can pass something to ->get.
> 
> Possibly we should rename d_find_alias() to d_find_hashed_alias() so that
> people need to make a conscious choice between d_find_hashed_alias() and
> d_find_any_alias() ??

FWIW, it *is* a bug; this
                        /*
                         * this is can be hit on boot when a file is accessed
                         * before the policy is loaded.  When we load policy we
                         * may find inodes that have no dentry on the
                         * sbsec->isec_head list.  No reason to complain as these
                         * will get fixed up the next time we go through
                         * inode_doinit with a dentry, before these inodes could
                         * be used again by userspace.
                         */
in selinux/hooks.c is flat-out wrong now.  Sure, if you first load selinux
policy after exporting something over NFS or letting attacker play with
open-by-fhandle, you are past any help, but still...

I disagree about going for d_find_any_alias() from the very beginning, BTW -
we need to try it in case of d_find_alias() failure, but sufficiently crappy
filesystem can bloody well fail to access xattrs via disconnected dentry.
diff mbox

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 7c38f39958bc..32aaab21e648 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -647,11 +647,16 @@  static inline struct dentry *lock_parent(struct dentry *dentry)
 		spin_unlock(&parent->d_lock);
 		goto again;
 	}
-	rcu_read_unlock();
-	if (parent != dentry)
+	if (parent != dentry) {
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-	else
+		if (unlikely(dentry->d_lockref.count < 0)) {
+			spin_unlock(&parent->d_lock);
+			parent = NULL;
+		}
+	} else {
 		parent = NULL;
+	}
+	rcu_read_unlock();
 	return parent;
 }