diff mbox series

[01/10] VFS: support parallel updates in the one directory.

Message ID 166147984370.25420.13019217727422217511.stgit@noble.brown (mailing list archive)
State New, archived
Headers show
Series Improve scalability of directory operations | expand

Commit Message

NeilBrown Aug. 26, 2022, 2:10 a.m. UTC
From: NeilBrown <neilb@suse.com>

Some filesystems can support parallel modifications to a directory,
either because the modification happen on a remote server which does its
own locking (e.g.  NFS) or because they can internally lock just a part
of a directory (e.g.  many local filesystems, with a bit of work - the
lustre project has patches for ext4 to support concurrent updates).

To allow this, we introduce VFS support for parallel modification:
unlink (including rmdir) and create.  Parallel rename is added in a
later patch.

If a filesystem supports parallel modification in directories, it sets
FS_PAR_DIR_UPDATE on the file_system_type.  lookup_open() and the new
lookup_hash_update() notice the flag and take a shared lock on the
directory, and rely on a lock-bit in d_flags, much like parallel lookup
relies on DCACHE_PAR_LOOKUP.  The lock bit is always set even if an
exclusive lock was taken, though in that case it will always be
uncontended.

__lookup_hash() is enhanced to use d_alloc_parallel() if it was told
that a shared lock was taken - by being given a non-NULL *wq.  This is
needed as the shared lock does not prevent races with lookups.

The new DCACHE_PAR_UPDATE is the most significant bit in d_flags, so we
have nearly run out of flags.  0x08000000 is still unused ....  for now.

Once a dentry for the target name has been obtained, DCACHE_PAR_UPDATE
is set on it, waiting if necessary if it is already set.  Once this is
set, the thread has exclusive access to the name and can call into the
filesystem to perform the required action.

We use wait_var_event() to wait for DCACHE_PAR_UPDATE to be cleared rather
than trying to overload the wq used for DCACHE_PAR_LOOKUP.

Some filesystems do *not* complete the lookup that precedes a create,
but leave the dentry d_in_lookup() and unhashed, so often a dentry will
have both DCACHE_PAR_LOOKUP and DCACHE_PAR_UPDATE set at the same time.
To allow for this, we need the 'wq' that is passed to d_alloc_parallel()
(and used when DCACHE_PAR_LOOKUP is cleared), to continue to exist until
the creation is complete, which may be after filename_create()
completes.  So a wq now needs to be passed to filename_create() when
parallel modification is attempted.

Some callers, such as kern_path(), perform a lookup for create but don't
pass in a wq and changing all call sites to do so would be churn for
little gain.  So if no wq is provided an exclusive lock is taken and
d_alloc() is used to allocate the dentry.  A new done_path_create_wq()
can be told if a wq was provided for create, so the correct unlock can
be used.

Waiting for DCACHE_PAR_UPDATE can sometimes result in the dentry
changing - either through unlink or rename.  This requires that the
lookup be retried.  This in turn requires that the wq be reused.  The
use of DECLARE_WAITQUEUE() in d_wait_lookup() and the omission of
remove_wait_queue() means that after the wake_up_all() in
__d_lookup_down(), the wait_queue_head list will be undefined.  So we
change d_wait_lookup() to use DEFINE_WAIT(), so that
autoremove_wake_function() is used for wakeups, and the wait_queue_head
remains well defined.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c            |   66 ++++++++++++
 fs/namei.c             |  268 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/dcache.h |   28 +++++
 include/linux/fs.h     |    5 +
 include/linux/namei.h  |   16 +++
 5 files changed, 321 insertions(+), 62 deletions(-)

Comments

Linus Torvalds Aug. 26, 2022, 7:06 p.m. UTC | #1
On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote:
>
> If a filesystem supports parallel modification in directories, it sets
> FS_PAR_DIR_UPDATE on the file_system_type.  lookup_open() and the new
> lookup_hash_update() notice the flag and take a shared lock on the
> directory, and rely on a lock-bit in d_flags, much like parallel lookup
> relies on DCACHE_PAR_LOOKUP.

Ugh.

I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
updates" being important, but I *despise* locking code like this

+       if (wq && IS_PAR_UPDATE(dir))
+               inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+       else
+               inode_lock_nested(dir, I_MUTEX_PARENT);

and I really really hope there's some better model for this.

That "wq" test in particular is just absolutely disgusting. So now it
doesn't just depend on whether the directory supports parallel
updates, now the caller can choose whether to do the parallel thing or
not, and that's how "create" is different from "rename".

And that last difference is, I think, the one that makes me go "No. HELL NO".

Instead of it being up to the filesystem to say "I can do parallel
creates, but I need to serialize renames", this whole thing has been
set up to be about the caller making that decision.

That's just feels horribly horribly wrong.

Yes, I realize that to you that's just a "later patches will do
renames", but what if it really is a filesystem issue where the
filesystem can easily handle new names, but needs something else for
renames because it has various cross-directory issues, perhaps?

So I feel this is fundamentally wrong, and this ugliness is a small
effect of that wrongness.

I think we should strive for

 (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

 (b) aim for the inode lock being taken *after* the _lookup_hash(),
since the VFS layer side has to be able to handle the concurrency on
the dcache side anyway

 (c) at that point, the serialization really ends up being about the
call into the filesystem, and aim to simply move the
inode_lock{_shared]_nested() into the filesystem so that there's no
need for a flag and related conditional locking at all.

Because right now I think the main reason we cannot move the lock into
the filesystem is literally that we've made the lock cover not just
the filesystem part, but the "lookup and create dentry" part too.

But once you have that "DCACHE_PAR_LOOKUP" bit and the
d_alloc_parallel() logic to serialize a _particular_ dentry being
created (as opposed to serializing all the sleeping ops to that
directly), I really think we should strive to move the locking - that
no longer helps the VFS dcache layer - closer to the filesystem call
and eventually into it.

Please? I think these patches are "mostly going in the right
direction", but at the same time I feel like there's some serious
mis-design going on.

                Linus
NeilBrown Aug. 26, 2022, 11:06 p.m. UTC | #2
On Sat, 27 Aug 2022, Linus Torvalds wrote:
> On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote:
> >
> > If a filesystem supports parallel modification in directories, it sets
> > FS_PAR_DIR_UPDATE on the file_system_type.  lookup_open() and the new
> > lookup_hash_update() notice the flag and take a shared lock on the
> > directory, and rely on a lock-bit in d_flags, much like parallel lookup
> > relies on DCACHE_PAR_LOOKUP.
> 
> Ugh.

Thanks :-) - no, really - thanks for the high-level review!

> 
> I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
> updates" being important, but I *despise* locking code like this
> 
> +       if (wq && IS_PAR_UPDATE(dir))
> +               inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> +       else
> +               inode_lock_nested(dir, I_MUTEX_PARENT);
> 
> and I really really hope there's some better model for this.
> 
> That "wq" test in particular is just absolutely disgusting. So now it
> doesn't just depend on whether the directory supports parallel
> updates, now the caller can choose whether to do the parallel thing or
> not, and that's how "create" is different from "rename".

As you note, by the end of the series "create" is not more different
from "rename" than it already is.  I only broke up the patches to make
review more manageable.

The "wq" can be removed.  There are two options.
One is to change every kern_path_create() or user_path_create() caller
to passed in a wq.  Then we can assume that a wq is always available.
There are about a dozen of these calls, so not an enormous change, but
one that I didn't want to think about just yet.  I could add a patch at
the front of the series which did this.

Alternate option is to never pass in a wq for create operation, and use
var_waitqueue() (or something similar) to provide a global shared wait
queue (which is essentially what I am using to wait for
DCACHE_PAR_UPDATE to clear).
The more I think about it, the more I think this is the best way
forward.   Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how
to measure how much contention there is on these shared queues.

> 
> And that last difference is, I think, the one that makes me go "No. HELL NO".
> 
> Instead of it being up to the filesystem to say "I can do parallel
> creates, but I need to serialize renames", this whole thing has been
> set up to be about the caller making that decision.

I think that is a misunderstanding.  The caller isn't making a decision
- except the IS_PAR_UPDATE() test which is simply acting on the fs
request.  What you are seeing is a misguided attempt to leave in place
some existing interfaces which assumed exclusive locking and didn't
provide wqs.

> 
> That's just feels horribly horribly wrong.
> 
> Yes, I realize that to you that's just a "later patches will do
> renames", but what if it really is a filesystem issue where the
> filesystem can easily handle new names, but needs something else for
> renames because it has various cross-directory issues, perhaps?

Obviously a filesystem can add its own locking - and they will have to,
though at a finer grain that the VFS can do.


> 
> So I feel this is fundamentally wrong, and this ugliness is a small
> effect of that wrongness.
> 
> I think we should strive for
> 
>  (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I
realised that you wouldn't like that :-)

> 
>  (b) aim for the inode lock being taken *after* the _lookup_hash(),
> since the VFS layer side has to be able to handle the concurrency on
> the dcache side anyway

I think you are suggesting that we change ->lookup call to NOT
require i_rwsem be held.  That is not a small change.
I agree that it makes sense in the long term.  Getting there ....  won't
be a quick as I'd hoped.

> 
>  (c) at that point, the serialization really ends up being about the
> call into the filesystem, and aim to simply move the
> inode_lock{_shared]_nested() into the filesystem so that there's no
> need for a flag and related conditional locking at all.

It might be nice to take a shared lock in VFS, and let the FS upgrade it
to exclusive if needed, but we don't have upgrade_read() ...  maybe it
would be deadlock-prone.

> 
> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.
> 
> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.
> 
> Please? I think these patches are "mostly going in the right
> direction", but at the same time I feel like there's some serious
> mis-design going on.

Hmmm....  I'll dig more deeply into ->lookup and see if I can understand
the locking well enough to feel safe removing i_rwsem from it.

Thanks,
NeilBrown
Linus Torvalds Aug. 27, 2022, 12:13 a.m. UTC | #3
On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@suse.de> wrote:
>
> As you note, by the end of the series "create" is not more different
> from "rename" than it already is.  I only broke up the patches to make
> review more manageable.

Yes, I understand. But I'm saying that maybe a filesystem actually
might want to treat them differently.

That said, the really nasty part was that 'wq' thing that meant that
different paths had different directory locking not because of
low-level filesystem issues, but because of caller issues.

So that's the one I _really_ disliked, and that I don't think should
exist even as a partial first step.

The "tie every operation together with one flag" I can live with, in
case it turns out that yes, that one flag is all anybody ever really
wants.

> Alternate option is to never pass in a wq for create operation, and use
> var_waitqueue() (or something similar) to provide a global shared wait
> queue (which is essentially what I am using to wait for
> DCACHE_PAR_UPDATE to clear).

I _think_ this is what I would prefer.

I say that I _think_ I prefer that, because maybe there are issues
with it. But since you basically do that DCACHE_PAR_UPDATE thing
anyway, and it's one of the main users of this var_waitqueue, it feels
right to me.

But then if it just end sup not working well for some practical
reason, at that point maybe I'd just say "I was wrong, I thought it
would work, but it's better to spread it out to be a per-thread
wait-queue on the stack".

IOW, my preference would be to simply just try it, knowing that you
*can* do the "pass explicit wait-queue down" thing if we need to.

Hmm?

> > Instead of it being up to the filesystem to say "I can do parallel
> > creates, but I need to serialize renames", this whole thing has been
> > set up to be about the caller making that decision.
>
> I think that is a misunderstanding.  The caller isn't making a decision
> - except the IS_PAR_UPDATE() test which is simply acting on the fs
> request.  What you are seeing is a misguided attempt to leave in place
> some existing interfaces which assumed exclusive locking and didn't
> provide wqs.

Ok. I still would prefer to have unified locking, not that "do this
for one filesystem, do that for another" conditional one.

> >  (b) aim for the inode lock being taken *after* the _lookup_hash(),
> > since the VFS layer side has to be able to handle the concurrency on
> > the dcache side anyway
>
> I think you are suggesting that we change ->lookup call to NOT
> require i_rwsem be held.

Yes and no.

One issue for me is that with your change as-is, then 99% of all
people who don't happen to use NFS, the inode lock gives all that VFS
code mutual exclusion.

Take that lookup_hash_update() function as a practical case: all the
*common* filesystems will be running with that function basically 100%
serialized per directory, because they'll be doing that

        inode_lock_nested(dir);
        ...
        inode_unlock(dir);

around it all.

At the same time, all that code is supposed to work even *without* the
lock, because once it's a IS_PAR_UPDATE() filesystem, there's
effectively no locking at all. What exclusive directory locks even
remain at that point?

IOW, to me it feels like you are trying to make the code go towards a
future with basically no locking at all as far as the VFS layer is
concerned (because once all the directory modifications take the inode
lock as shared, *all* the inode locking is shared, and is basically a
no-op).

BUT you are doing so while not having most people even *test* that situation.

See what I'm trying to say (but possibly expressing very badly)?

So I feel like if the VFS code cannot rely on locking *anyway* in the
general case, and should work without it, then we really shouldn't
have any locking around any of the VFS operations.

The logical conclusion of that would be to push it all down into the
filesystem (possibly with the help of a coccinelle script).

Now it doesn't have to go that far - at least not initially - but I do
think we should at least make sure that as much as possible of the
actual VFS code sees that "new world order" of no directory locking,
so that that situation gets *tested* as widely as possible.

> That is not a small change.

Now, that I agree with. I guss we won't get there soon (or ever). But
see above what I dislike about the directory locking model change.

> It might be nice to take a shared lock in VFS, and let the FS upgrade it
> to exclusive if needed, but we don't have upgrade_read() ...  maybe it
> would be deadlock-prone.

Yes, upgrading a read lock is fundamentally impossible and will
deadlock trivially (think just two readers that both want to do the
upgrade - they'll block each other from doing so).

So it's not actually a possible operation.

                    Linus
Al Viro Aug. 27, 2022, 12:17 a.m. UTC | #4
On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote:

> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.

How about rename loop prevention?  mount vs. rmdir?  d_splice_alias()
fun on tree reconnects?

> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.

See above.
Al Viro Aug. 27, 2022, 12:23 a.m. UTC | #5
On Fri, Aug 26, 2022 at 05:13:38PM -0700, Linus Torvalds wrote:

> So I feel like if the VFS code cannot rely on locking *anyway* in the
> general case, and should work without it, then we really shouldn't
> have any locking around any of the VFS operations.

That's a really big if.  There's a bunch of places where we rely upon
->i_rwsem on directories and _not_ to provide exclusion for fs methods.
I'm still halfway through the Neil's patchset, but verifying correctness
won't be easy and I'm not optimistic about getting rid of those uses...
Al Viro Aug. 27, 2022, 3:43 a.m. UTC | #6
On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:

> +/**
> + * d_lock_update_nested - lock a dentry before updating
> + * @dentry: the dentry to be locked
> + * @base:   the parent, or %NULL
> + * @name:   the name in that parent, or %NULL
> + * @subclass: lockdep locking class.
> + *
> + * Lock a dentry in a directory on which a shared-lock may be held, and
> + * on which parallel updates are permitted.
> + * If the base and name are given, then on success the dentry will still
> + * have that base and name - it will not have raced with rename.
> + * On success, a positive dentry will still be hashed, ensuring there
> + * was no race with unlink.
> + * If they are not given, then on success the dentry will be negative,
> + * which again ensures no race with rename, or unlink.

I'm not sure it's a good idea to have that in one function, TBH.
Looking at the callers, there are
	* lookup_hash_update()
		lookup_hash_update_len()
			nfsd shite
		filename_create_one()
			filename_create_one_len()
				nfsd shite
			filename_create()
				kern_path_create()
				user_path_create()
				do_mknodat()
				do_mkdirat()
				do_symlinkat()
				do_linkat()
		do_rmdir()
		do_unlinkat()
	* fuckloads of callers in lock_rename_lookup()
	* d_lock_update()
		atomic_open()
		lookup_open()

Only the last two can get NULL base or name.  Already interesting,
isn't it?  What's more, splitup between O_CREATE open() on one
side and everything else that might create, remove or rename on
the other looks really weird.

> +	rcu_read_lock(); /* for d_same_name() */
> +	if (d_unhashed(dentry) && d_is_positive(dentry)) {
> +		/* name was unlinked while we waited */
> +		ret = false;

BTW, what happens if somebody has ->lookup() returning a positive
unhashed?  Livelock on attempt to hit it with any directory-modifying
syscall?  Right now such behaviour is permitted; I don't know if
anything in the tree actually does it, but at the very least it
would need to be documented.

Note that *negative* unhashed is not just permitted, it's
actively used e.g. by autofs.  That really needs to be well
commented...

> +	} else if (base &&
> +		 (dentry->d_parent != base ||
> +		  dentry->d_name.hash != name->hash ||
> +		  !d_same_name(dentry, base, name))) {
> +		/* dentry was renamed - possibly silly-rename */
> +		ret = false;
> +	} else if (!base && d_is_positive(dentry)) {
> +		ret = false;
> +	} else {
> +		dentry->d_flags |= DCACHE_PAR_UPDATE;
> +	}

> + * Parent directory has inode locked exclusive, or possibly shared if wq
> + * is given.  In the later case the fs has explicitly allowed concurrent
> + * updates in this directory.  This is the one and only case
> + * when ->lookup() may be called on a non in-lookup dentry.

What Linus already said about wq...  To add a reason he hadn't mentioned,
the length of call chain one needs to track to figure out whether it's
NULL or not is... excessive.  And I don't mean just "greater than 0".
We have places like that, and sometimes we have to, but it's never a good
thing...

>  static struct dentry *__lookup_hash(const struct qstr *name,
> -		struct dentry *base, unsigned int flags)
> +				    struct dentry *base, unsigned int flags,
> +				    wait_queue_head_t *wq)

> -	dentry = d_alloc(base, name);
> +	if (wq)
> +		dentry = d_alloc_parallel(base, name, wq);
> +	else
> +		dentry = d_alloc(base, name);
>  	if (unlikely(!dentry))
>  		return ERR_PTR(-ENOMEM);
> +	if (IS_ERR(dentry))
> +		return dentry;

	BTW, considering the fact that we have 12 callers of d_alloc()
(including this one) and 28 callers of its wrapper (d_alloc_name()), I
would seriously consider converting d_alloc() from "NULL or new dentry"
to "ERR_PTR(-ENOMEM) or new dentry".  Especially since quite a few of
those callers will be happier that way.  Grep and you'll see...  As a
side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).

> +static struct dentry *lookup_hash_update(
> +	const struct qstr *name,
> +	struct dentry *base, unsigned int flags,
> +	wait_queue_head_t *wq)
> +{
> +	struct dentry *dentry;
> +	struct inode *dir = base->d_inode;
> +	int err;
> +
> +	if (wq && IS_PAR_UPDATE(dir))
> +		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> +	else
> +		inode_lock_nested(dir, I_MUTEX_PARENT);
> +
> +retry:
> +	dentry = __lookup_hash(name, base, flags, wq);
> +	if (IS_ERR(dentry)) {
> +		err = PTR_ERR(dentry);
> +		goto out_err;
> +	}
> +	if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> +		/*
> +		 * Failed to get lock due to race with unlink or rename
> +		 * - try again
> +		 */
> +		d_lookup_done(dentry);

When would we get out of __lookup_hash() with in-lookup dentry?
Confused...

> +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> +				      struct path *path, unsigned int flags,

	const struct path *, damnit...

> +				      wait_queue_head_t *wq)
> +{
> +	struct qstr this;
> +	int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> +				    path->dentry, nlen, &this);
> +	if (err)
> +		return ERR_PTR(err);
> +	return lookup_hash_update(&this, path->dentry, flags, wq);
> +}
> +EXPORT_SYMBOL(lookup_hash_update_len);

Frankly, the calling conventions of the "..._one_len" family is something
I've kept regretting for a long time.  Oh, well...

> +static void done_path_update(struct path *path, struct dentry *dentry,
> +			     bool with_wq)
> +{
> +	struct inode *dir = path->dentry->d_inode;
> +
> +	d_lookup_done(dentry);
> +	d_unlock_update(dentry);
> +	if (IS_PAR_UPDATE(dir) && with_wq)
> +		inode_unlock_shared(dir);
> +	else
> +		inode_unlock(dir);
> +}

const struct path *, again...

> @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  			dentry = res;
>  		}
>  	}
> +	/*
> +	 * If dentry is negative and this is a create we need to get
> +	 * DCACHE_PAR_UPDATE.
> +	 */
> +	if ((open_flag & O_CREAT) && !dentry->d_inode)
> +		have_par_update = d_lock_update(dentry, NULL, NULL);
>  
>  	/* Negative dentry, just create the file */
>  	if (!dentry->d_inode && (open_flag & O_CREAT)) {

Fold the above here, please.  What's more, losing the flag would've
made it much easier to follow...

> @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  		error = create_error;
>  		goto out_dput;
>  	}
> +	if (have_par_update)
> +		d_unlock_update(dentry);
>  	return dentry;
>  
>  out_dput:
> +	if (have_par_update)
> +		d_unlock_update(dentry);


> @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
>  				struct path *path, unsigned int lookup_flags)

BTW, there's 9 callers of that sucker in the entire tree, along with
3 callers of user_path_create() and 14 callers of done_path_create().
Not a big deal to add the wq in those, especially since it can be easily
split into preparatory patch (with wq passed, but being unused).

> -void done_path_create(struct path *path, struct dentry *dentry)
> +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)

Why "with_wq" and not the wq itself?

> - * The caller must hold dir->i_mutex.
> + * The caller must either hold a write-lock on dir->i_rwsem, or
> + * a have atomically set DCACHE_PAR_UPDATE, or both.

???

> + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.

The latter happens unconditionally here, unless I'm misreading the code (as well
as your cover letter).  It does *NOT* happen on rename(), though, contrary to
the same.  And while your later patch adds it in lock_rename_lookup(), existing
lock_rename() callers do not get that at all.  Likely to be a problem...

> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -13,7 +13,9 @@
>  #include <linux/rcupdate.h>
>  #include <linux/lockref.h>
>  #include <linux/stringhash.h>
> +#include <linux/sched.h>

Bloody hell, man...

> +static inline void d_unlock_update(struct dentry *dentry)
> +{
> +	if (IS_ERR_OR_NULL(dentry))
> +		return;

Do explain...  When could we ever get NULL or ERR_PTR() passed to that?


BTW, I would seriously look into splitting the "let's add a helper
that combines locking parent with __lookup_hash()" into a preliminary
patch.  Would be easier to follow.
Al Viro Aug. 27, 2022, 9:14 p.m. UTC | #7
On Fri, Aug 26, 2022 at 05:13:38PM -0700, Linus Torvalds wrote:
> On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@suse.de> wrote:
> >
> > As you note, by the end of the series "create" is not more different
> > from "rename" than it already is.  I only broke up the patches to make
> > review more manageable.
> 
> Yes, I understand. But I'm saying that maybe a filesystem actually
> might want to treat them differently.
> 
> That said, the really nasty part was that 'wq' thing that meant that
> different paths had different directory locking not because of
> low-level filesystem issues, but because of caller issues.
> 
> So that's the one I _really_ disliked, and that I don't think should
> exist even as a partial first step.
> 
> The "tie every operation together with one flag" I can live with, in
> case it turns out that yes, that one flag is all anybody ever really
> wants.

FWIW, what's really missing is the set of rules describing what the
methods can expect from their arguments.

Things like "oh, we can safely use ->d_parent here - we know that
foo_rmdir(dir, child) is called only with dir held exclusive and
child that had been observed to be a child of dentry alias of
dir after dir had been locked, while all places that might change
child->d_parent will be doing that only with child->d_parent->d_inode
held at least shared" rely upon the current locking scheme.

Change that 'held exclusive' to 'held shared' and we need something
different, presumably 'this new bitlock on the child is held by the caller'.
That's nice, but...  What's to guarantee that we won't be hit by
__d_unalias()?  It won't care about the bitlock on existing alias,
would it?  And it only holds the old parent shared, so...

My comments had been along the lines of "doing that would make the
series easier to reason about"; I don't hate the approach, but
	* in the current form it's hard to read; there might be
problems I hadn't even noticed yet
	* it's much easier to verify that stated assertions are
guaranteed by the callers and sufficient for safety of callees
if they *ARE* stated.  Spelling them out is on the patch series
authors, and IME doing that helps a lot when writing a series
like that.  At least on the level of internal notes...  Especially
since NFS is... special (or, as they say in New York, "sophisticated" -
sorry).  There's a plenty of things that are true for it, but do
not hold for filesystems in general.  And without an explicitly
spelled out warranties it's very easy to end up with a mess that
would be hell to apply to other filesystems.  I really don't want
to see an explosion of cargo-culted logics that might or might
not remain valid for NFS by the time it gets copied around...
NeilBrown Aug. 29, 2022, 1:59 a.m. UTC | #8
On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> 
> > +/**
> > + * d_lock_update_nested - lock a dentry before updating
> > + * @dentry: the dentry to be locked
> > + * @base:   the parent, or %NULL
> > + * @name:   the name in that parent, or %NULL
> > + * @subclass: lockdep locking class.
> > + *
> > + * Lock a dentry in a directory on which a shared-lock may be held, and
> > + * on which parallel updates are permitted.
> > + * If the base and name are given, then on success the dentry will still
> > + * have that base and name - it will not have raced with rename.
> > + * On success, a positive dentry will still be hashed, ensuring there
> > + * was no race with unlink.
> > + * If they are not given, then on success the dentry will be negative,
> > + * which again ensures no race with rename, or unlink.
> 
> I'm not sure it's a good idea to have that in one function, TBH.
> Looking at the callers, there are
> 	* lookup_hash_update()
> 		lookup_hash_update_len()
> 			nfsd shite
> 		filename_create_one()
> 			filename_create_one_len()
> 				nfsd shite
> 			filename_create()
> 				kern_path_create()
> 				user_path_create()
> 				do_mknodat()
> 				do_mkdirat()
> 				do_symlinkat()
> 				do_linkat()
> 		do_rmdir()
> 		do_unlinkat()
> 	* fuckloads of callers in lock_rename_lookup()
> 	* d_lock_update()
> 		atomic_open()
> 		lookup_open()
> 
> Only the last two can get NULL base or name.  Already interesting,
> isn't it?  What's more, splitup between O_CREATE open() on one
> side and everything else that might create, remove or rename on
> the other looks really weird.

Well, O_CREATE is a bit weird.
But I can see that it would be cleaner to pass the dir/name into those
two calls that currently get NULL - then remove the handling of NULL.
Thanks.

> 
> > +	rcu_read_lock(); /* for d_same_name() */
> > +	if (d_unhashed(dentry) && d_is_positive(dentry)) {
> > +		/* name was unlinked while we waited */
> > +		ret = false;
> 
> BTW, what happens if somebody has ->lookup() returning a positive
> unhashed?  Livelock on attempt to hit it with any directory-modifying
> syscall?  Right now such behaviour is permitted; I don't know if
> anything in the tree actually does it, but at the very least it
> would need to be documented.
> 
> Note that *negative* unhashed is not just permitted, it's
> actively used e.g. by autofs.  That really needs to be well
> commented...

I hadn't thought that ->lookup() could return anything unhashed.  I'll
look into that - thanks.

> 
> > +	} else if (base &&
> > +		 (dentry->d_parent != base ||
> > +		  dentry->d_name.hash != name->hash ||
> > +		  !d_same_name(dentry, base, name))) {
> > +		/* dentry was renamed - possibly silly-rename */
> > +		ret = false;
> > +	} else if (!base && d_is_positive(dentry)) {
> > +		ret = false;
> > +	} else {
> > +		dentry->d_flags |= DCACHE_PAR_UPDATE;
> > +	}
> 
> > + * Parent directory has inode locked exclusive, or possibly shared if wq
> > + * is given.  In the later case the fs has explicitly allowed concurrent
> > + * updates in this directory.  This is the one and only case
> > + * when ->lookup() may be called on a non in-lookup dentry.
> 
> What Linus already said about wq...  To add a reason he hadn't mentioned,
> the length of call chain one needs to track to figure out whether it's
> NULL or not is... excessive.  And I don't mean just "greater than 0".
> We have places like that, and sometimes we have to, but it's never a good
> thing...
> 
> >  static struct dentry *__lookup_hash(const struct qstr *name,
> > -		struct dentry *base, unsigned int flags)
> > +				    struct dentry *base, unsigned int flags,
> > +				    wait_queue_head_t *wq)
> 
> > -	dentry = d_alloc(base, name);
> > +	if (wq)
> > +		dentry = d_alloc_parallel(base, name, wq);
> > +	else
> > +		dentry = d_alloc(base, name);
> >  	if (unlikely(!dentry))
> >  		return ERR_PTR(-ENOMEM);
> > +	if (IS_ERR(dentry))
> > +		return dentry;
> 
> 	BTW, considering the fact that we have 12 callers of d_alloc()
> (including this one) and 28 callers of its wrapper (d_alloc_name()), I
> would seriously consider converting d_alloc() from "NULL or new dentry"
> to "ERR_PTR(-ENOMEM) or new dentry".  Especially since quite a few of
> those callers will be happier that way.  Grep and you'll see...  As a
> side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).
> 
> > +static struct dentry *lookup_hash_update(
> > +	const struct qstr *name,
> > +	struct dentry *base, unsigned int flags,
> > +	wait_queue_head_t *wq)
> > +{
> > +	struct dentry *dentry;
> > +	struct inode *dir = base->d_inode;
> > +	int err;
> > +
> > +	if (wq && IS_PAR_UPDATE(dir))
> > +		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> > +	else
> > +		inode_lock_nested(dir, I_MUTEX_PARENT);
> > +
> > +retry:
> > +	dentry = __lookup_hash(name, base, flags, wq);
> > +	if (IS_ERR(dentry)) {
> > +		err = PTR_ERR(dentry);
> > +		goto out_err;
> > +	}
> > +	if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> > +		/*
> > +		 * Failed to get lock due to race with unlink or rename
> > +		 * - try again
> > +		 */
> > +		d_lookup_done(dentry);
> 
> When would we get out of __lookup_hash() with in-lookup dentry?
> Confused...

Whenever wq is passed in and ->lookup() decides, based on the flags, to do
nothing.
NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET

> 
> > +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> > +				      struct path *path, unsigned int flags,
> 
> 	const struct path *, damnit...

Sorry....

> 
> > +				      wait_queue_head_t *wq)
> > +{
> > +	struct qstr this;
> > +	int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> > +				    path->dentry, nlen, &this);
> > +	if (err)
> > +		return ERR_PTR(err);
> > +	return lookup_hash_update(&this, path->dentry, flags, wq);
> > +}
> > +EXPORT_SYMBOL(lookup_hash_update_len);
> 
> Frankly, the calling conventions of the "..._one_len" family is something
> I've kept regretting for a long time.  Oh, well...
> 
> > +static void done_path_update(struct path *path, struct dentry *dentry,
> > +			     bool with_wq)
> > +{
> > +	struct inode *dir = path->dentry->d_inode;
> > +
> > +	d_lookup_done(dentry);
> > +	d_unlock_update(dentry);
> > +	if (IS_PAR_UPDATE(dir) && with_wq)
> > +		inode_unlock_shared(dir);
> > +	else
> > +		inode_unlock(dir);
> > +}
> 
> const struct path *, again...
> 
> > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >  			dentry = res;
> >  		}
> >  	}
> > +	/*
> > +	 * If dentry is negative and this is a create we need to get
> > +	 * DCACHE_PAR_UPDATE.
> > +	 */
> > +	if ((open_flag & O_CREAT) && !dentry->d_inode)
> > +		have_par_update = d_lock_update(dentry, NULL, NULL);
> >  
> >  	/* Negative dentry, just create the file */
> >  	if (!dentry->d_inode && (open_flag & O_CREAT)) {
> 
> Fold the above here, please.  What's more, losing the flag would've
> made it much easier to follow...

Yes, that can certainly be tided up - thanks.

> 
> > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >  		error = create_error;
> >  		goto out_dput;
> >  	}
> > +	if (have_par_update)
> > +		d_unlock_update(dentry);
> >  	return dentry;
> >  
> >  out_dput:
> > +	if (have_par_update)
> > +		d_unlock_update(dentry);
> 
> 
> > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
> >  				struct path *path, unsigned int lookup_flags)
> 
> BTW, there's 9 callers of that sucker in the entire tree, along with
> 3 callers of user_path_create() and 14 callers of done_path_create().
> Not a big deal to add the wq in those, especially since it can be easily
> split into preparatory patch (with wq passed, but being unused).
> 
> > -void done_path_create(struct path *path, struct dentry *dentry)
> > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
> 
> Why "with_wq" and not the wq itself?
> 

I did that at first.  However when I did silly rename in NFS I found
that I wanted to call the 'done' thing when I didn't still have the wq.
...which might mean I have a bug.

And the done_path_create_wq() doesn't do anything with the wq so ...

I'll look at that again - thanks.


> > - * The caller must hold dir->i_mutex.
> > + * The caller must either hold a write-lock on dir->i_rwsem, or
> > + * a have atomically set DCACHE_PAR_UPDATE, or both.
> 
> ???

That's a hangover from an earlier version where I didn't set
DCACHE_PAR_UPDATE when we had an exclusive lock.  Will fix.


> 
> > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.
> 
> The latter happens unconditionally here, unless I'm misreading the code (as well
> as your cover letter).  It does *NOT* happen on rename(), though, contrary to
> the same.  And while your later patch adds it in lock_rename_lookup(), existing
> lock_rename() callers do not get that at all.  Likely to be a problem...

Between the patch were DCACHE_PAR_UPDATE is added and the patch were
lock_rename_lookup() is added, all filesystems use exclusive locks and
none of them check DCACHE_PAR_UPDATE.  So how can there be a problem?


> 
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -13,7 +13,9 @@
> >  #include <linux/rcupdate.h>
> >  #include <linux/lockref.h>
> >  #include <linux/stringhash.h>
> > +#include <linux/sched.h>
> 
> Bloody hell, man...
> 

I wonder what that was for .... removed.


> > +static inline void d_unlock_update(struct dentry *dentry)
> > +{
> > +	if (IS_ERR_OR_NULL(dentry))
> > +		return;
> 
> Do explain...  When could we ever get NULL or ERR_PTR() passed to that?

Another hangover from earlier iterations - removed.  Thanks.

> 
> 
> BTW, I would seriously look into splitting the "let's add a helper
> that combines locking parent with __lookup_hash()" into a preliminary
> patch.  Would be easier to follow.
> 
Will look into that.

Thanks a lot for the thorough review!

NeilBrown
NeilBrown Sept. 1, 2022, 12:31 a.m. UTC | #9
On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote:
> 
> > Because right now I think the main reason we cannot move the lock into
> > the filesystem is literally that we've made the lock cover not just
> > the filesystem part, but the "lookup and create dentry" part too.
> 
> How about rename loop prevention?  mount vs. rmdir?  d_splice_alias()
> fun on tree reconnects?

Thanks for this list.

I think the "mount vs. rmdir" usage of inode_lock() is independent of
the usage for directory operations, so we can change the latter as much
as we like without materially affecting the former.

The lock we take on the directory being removed DOES ensure no new
objects are linked into the directory, so for that reason we still need
at least a shared lock when adding links to a directory.
Moving that lock into the filesystem would invert the locking order in
rmdir between the child being removed and the parent being locked.  That
would require some consideration.

d_splice_alias() happens at ->lookup time so it is already under a
shared lock.  I don't see that it depends on i_rwsem - it uses i_lock
for the important locking.

Rename loop prevention is largely managed by s_vfs_rename_mutex.  Once
that is taken, nothing can be moved to a different directory.  That
means 'trap' will keep any relationship it had to new_path and old_path.
It could be renamed within it's parent, but as long as it isn't removed
the comparisons with old_dentry and new_dentry should still be reliable.
As 'trap' clearly isn't empty, we trust that the filesystem won't allow
an rmdir to succeed.

What have I missed?

Thanks,
NeilBrown

> 
> > But once you have that "DCACHE_PAR_LOOKUP" bit and the
> > d_alloc_parallel() logic to serialize a _particular_ dentry being
> > created (as opposed to serializing all the sleeping ops to that
> > directly), I really think we should strive to move the locking - that
> > no longer helps the VFS dcache layer - closer to the filesystem call
> > and eventually into it.
> 
> See above.
>
Al Viro Sept. 1, 2022, 3:44 a.m. UTC | #10
On Thu, Sep 01, 2022 at 10:31:10AM +1000, NeilBrown wrote:

> Thanks for this list.

Keep in mind that this list is just off the top of my head - it's
nowhere near complete.

> d_splice_alias() happens at ->lookup time so it is already under a
> shared lock.  I don't see that it depends on i_rwsem - it uses i_lock
> for the important locking.

Nope.

Process 1:
rmdir foo/bar
	foo found and locked exclusive [*]
	dentry of bar found
	->rmdir() instance called
Process 2:
stat foo/splat
	foo found and locked shared [*]
	dentry of splat does not exist anywhere
	dentry allocated, marked in-lookup
	->lookup() instance called
	inode found and passed to d_splice_alias() ...
	... which finds that it's a directory inode ...
	... and foo/bar refers to it.  E.g. it's on NFS and another
client has just done mv bar splat
	__d_unalias() is called, to try and move existing alias (foo/bar)
into the right place.  It sees that no change of parent is involved,
	so it can just proceed to __d_move().
Process 1:
	forms an rmdir request to server, using ->d_name (and possibly
->d_parent) of dentry of foo/bar.  It knows that ->d_name is stable,
since the caller holds foo locked exclusive and all callers of __d_move()
hold the old parent at least shared.

In mainline process 2 will block (or, in case if it deals with different
parent, try to grab the old parent of the existing alias shared and fail
and with -ESTALE).  With your changes process 1 will be holding
foo/ locked shared, so process 2 will succeed and proceed to __d_move(),
right under the nose of process 1 accessing ->d_name.  If the names involved
had been longer than 32 characters, it would risk accessing kfreed memory.
Or fetching the length from old name and pointer from new one, walking
past the end of kmalloc'ed object, etc.

Sure, assuming that we are talking about NFS, server would have probably
failed the RMDIR request - if you managed to form that request without
oopsing, that is.
Al Viro Sept. 3, 2022, 12:06 a.m. UTC | #11
On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote:

> > When would we get out of __lookup_hash() with in-lookup dentry?
> > Confused...
> 
> Whenever wq is passed in and ->lookup() decides, based on the flags, to do
> nothing.
> NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET

Frankly, I would rather do what all other callers of ->lookup() do and
just follow it with d_lookup_done(dentry), no matter what it returns.
It's cheap enough...
NeilBrown Sept. 3, 2022, 1:40 a.m. UTC | #12
On Sat, 03 Sep 2022, Al Viro wrote:
> On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote:
> 
> > > When would we get out of __lookup_hash() with in-lookup dentry?
> > > Confused...
> > 
> > Whenever wq is passed in and ->lookup() decides, based on the flags, to do
> > nothing.
> > NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET
> 
> Frankly, I would rather do what all other callers of ->lookup() do and
> just follow it with d_lookup_done(dentry), no matter what it returns.
> It's cheap enough...
> 

I don't think that is a good idea.  Once you call d_lookup_done()
(without having first called d_add() or similar) the dentry becomes
invisible to normal path lookup, so another might be created.  But the
dentry will still be used for the 'create' or 'rename' and may then be
added to the dcache - at which point you could have two dentries with the
same name.

When ->lookup() returns success without d_add()ing the dentry, that
means that something else will complete the d_add() if/when necessary.
For NFS, it specifically means that the lookup is effectively being
combined with the following CREATE or RENAME.  In this case there is no
d_lookup_done() until the full operation is complete.

For autofs (thanks for pointing me to that) the operation is completed
when d_automount() signals the daemon to create the directory or
symlink.  In that case there IS a d_lookup_done() call and autofs needs
some extra magic (the internal 'active' list) to make sure subsequent
->lookup requests can see that dentry which is still in the process of
being set up.

It might be nice if the dentry passed to autofs_lookup() could remain
"d_inlookup()" until after d_automount has completed.  Then autofs
wouldn't need that active list.  However I haven't yet looked at how
disruptive such a change might be.

Thanks,
NeilBrown
Al Viro Sept. 3, 2022, 2:12 a.m. UTC | #13
On Sat, Sep 03, 2022 at 11:40:44AM +1000, NeilBrown wrote:

> I don't think that is a good idea.  Once you call d_lookup_done()
> (without having first called d_add() or similar) the dentry becomes
> invisible to normal path lookup, so another might be created.  But the
> dentry will still be used for the 'create' or 'rename' and may then be
> added to the dcache - at which point you could have two dentries with the
> same name.
> 
> When ->lookup() returns success without d_add()ing the dentry, that
> means that something else will complete the d_add() if/when necessary.
> For NFS, it specifically means that the lookup is effectively being
> combined with the following CREATE or RENAME.  In this case there is no
> d_lookup_done() until the full operation is complete.
>
> For autofs (thanks for pointing me to that) the operation is completed
> when d_automount() signals the daemon to create the directory or
> symlink.  In that case there IS a d_lookup_done() call and autofs needs
> some extra magic (the internal 'active' list) to make sure subsequent
> ->lookup requests can see that dentry which is still in the process of
> being set up.
> 
> It might be nice if the dentry passed to autofs_lookup() could remain
> "d_inlookup()" until after d_automount has completed.  Then autofs
> wouldn't need that active list.  However I haven't yet looked at how
> disruptive such a change might be.

Very much so.  You are starting to invent new rules for ->lookup() that
just never had been there, basing on nothing better than a couple of
examples.  They are nowhere near everything there is.

And you can't rely upon d_add() done by a method, for very obvious
reasons.  They are out of your control, they might very well decide
that object creation has failed and drop the damn thing.  Which is
not allowed for in-lookup dentries without d_lookup_done().

Neil, *IF* you are introducing new rules like that, the absolutely minimal
requirement is having them in Documentation/filesystems/porting.rst.
And that includes "such-and-such method might be called with parent
locked only shared; in that case it's guaranteed such-and-such things
about its arguments (bitlocks held, etc.)".

One thing we really need to avoid is that thing coming undocumented, with
"NFS copes, nobody else has it enabled, whoever does it for other
filesystems will just have to RTFS".  I hope it's obvious that this
is not an option.  Because I can bloody guarantee that it will be
cargo-culted over to other filesystems, with nobody (you and me included)
understanding the resulting code.
Al Viro Sept. 3, 2022, 5:52 p.m. UTC | #14
On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote:

> Very much so.  You are starting to invent new rules for ->lookup() that
> just never had been there, basing on nothing better than a couple of
> examples.  They are nowhere near everything there is.

A few examples besides NFS and autofs:

ext4, f2fs and xfs might bloody well return NULL without hashing - happens
on negative lookups with 'casefolding' crap.

kernfs - treatment of inactive nodes.

afs_dynroot_lookup() treatment of @cell... names.

afs_lookup() treatment of @sys... names.

There might very well be more - both merged into mainline and in
development trees of various filesystems (including devel branches
of in-tree ones - I'm not talking about out-of-tree projects).

Note, BTW, that with the current rules it's perfectly possible to
have this kind of fun:
	a name that resolves to different files for different processes
	unlink(2) is allowed and results depend upon the calling process

All it takes is ->lookup() deliberately *NOT* hashing the sucker and
->unlink() acting according to dentry it has gotten from the caller.
unlink(2) from different callers are serialized and none of that
stuff is ever going to be hashed.  d_alloc_parallel() might pick an
in-lookup dentry from another caller of e.g. stat(2), but it will
wait for in-lookup state ending, notice that the sucker is not hashed,
drop it and retry.  Suboptimal, but it works.

Nothing in the mainline currently does that.  Nothing that I know of,
that is.  Sure, it could be made work with the changes you seem to
imply (if I'm not misreading you) - all it takes is lookup
calling d_lookup_done() on its argument before returning NULL.
But that's subtle, non-obvious and not documented anywhere...

Another interesting question is the rules for unhashing dentries.
What is needed for somebody to do temporary unhash, followed by
rehashing?
NeilBrown Sept. 4, 2022, 11:33 p.m. UTC | #15
On Sun, 04 Sep 2022, Al Viro wrote:
> On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote:
> 
> > Very much so.  You are starting to invent new rules for ->lookup() that
> > just never had been there, basing on nothing better than a couple of
> > examples.  They are nowhere near everything there is.
> 
> A few examples besides NFS and autofs:

Hi Al,
 thanks for these - very helpful.  I will give them due consideration
 when I write relevant documentation to include in the next posting of
 the series.

Thanks a lot,
NeilBrown


> 
> ext4, f2fs and xfs might bloody well return NULL without hashing - happens
> on negative lookups with 'casefolding' crap.
> 
> kernfs - treatment of inactive nodes.
> 
> afs_dynroot_lookup() treatment of @cell... names.
> 
> afs_lookup() treatment of @sys... names.
> 
> There might very well be more - both merged into mainline and in
> development trees of various filesystems (including devel branches
> of in-tree ones - I'm not talking about out-of-tree projects).
> 
> Note, BTW, that with the current rules it's perfectly possible to
> have this kind of fun:
> 	a name that resolves to different files for different processes
> 	unlink(2) is allowed and results depend upon the calling process
> 
> All it takes is ->lookup() deliberately *NOT* hashing the sucker and
> ->unlink() acting according to dentry it has gotten from the caller.
> unlink(2) from different callers are serialized and none of that
> stuff is ever going to be hashed.  d_alloc_parallel() might pick an
> in-lookup dentry from another caller of e.g. stat(2), but it will
> wait for in-lookup state ending, notice that the sucker is not hashed,
> drop it and retry.  Suboptimal, but it works.
> 
> Nothing in the mainline currently does that.  Nothing that I know of,
> that is.  Sure, it could be made work with the changes you seem to
> imply (if I'm not misreading you) - all it takes is lookup
> calling d_lookup_done() on its argument before returning NULL.
> But that's subtle, non-obvious and not documented anywhere...
> 
> Another interesting question is the rules for unhashing dentries.
> What is needed for somebody to do temporary unhash, followed by
> rehashing?
>
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index bb0c4d0038db..d6bfa49b143b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1765,6 +1765,7 @@  static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;
 	int err;
+	static struct lock_class_key __key;
 
 	dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru,
 				      GFP_KERNEL);
@@ -1820,6 +1821,8 @@  static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_child);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
 
+	lockdep_init_map(&dentry->d_update_map, "DENTRY_PAR_UPDATE", &__key, 0);
+
 	if (dentry->d_op && dentry->d_op->d_init) {
 		err = dentry->d_op->d_init(dentry);
 		if (err) {
@@ -2626,7 +2629,7 @@  static inline void end_dir_add(struct inode *dir, unsigned int n,
 static void d_wait_lookup(struct dentry *dentry)
 {
 	if (d_in_lookup(dentry)) {
-		DECLARE_WAITQUEUE(wait, current);
+		DEFINE_WAIT(wait);
 		add_wait_queue(dentry->d_wait, &wait);
 		do {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -3274,6 +3277,67 @@  void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/**
+ * d_lock_update_nested - lock a dentry before updating
+ * @dentry: the dentry to be locked
+ * @base:   the parent, or %NULL
+ * @name:   the name in that parent, or %NULL
+ * @subclass: lockdep locking class.
+ *
+ * Lock a dentry in a directory on which a shared-lock may be held, and
+ * on which parallel updates are permitted.
+ * If the base and name are given, then on success the dentry will still
+ * have that base and name - it will not have raced with rename.
+ * On success, a positive dentry will still be hashed, ensuring there
+ * was no race with unlink.
+ * If they are not given, then on success the dentry will be negative,
+ * which again ensures no race with rename, or unlink.
+ *
+ * @subclass uses the I_MUTEX_ classes.
+ * If the parent directory is locked with I_MUTEX_NORMAL, use I_MUTEX_NORMAL.
+ * If the parent is locked with I_MUTEX_PARENT, I_MUTEX_PARENT2 or
+ * I_MUTEX_CHILD, use I_MUTEX_PARENT or, for the second in a rename,
+ * I_MUTEX_PARENT2.
+ */
+bool d_lock_update_nested(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *name,
+			  int subclass)
+{
+	bool ret = true;
+
+	lock_acquire_exclusive(&dentry->d_update_map, subclass,
+			       0, NULL, _THIS_IP_);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_PAR_UPDATE)
+		___wait_var_event(&dentry->d_flags,
+				  !(dentry->d_flags & DCACHE_PAR_UPDATE),
+				  TASK_UNINTERRUPTIBLE, 0, 0,
+				  (spin_unlock(&dentry->d_lock),
+				   schedule(),
+				   spin_lock(&dentry->d_lock))
+			);
+	rcu_read_lock(); /* for d_same_name() */
+	if (d_unhashed(dentry) && d_is_positive(dentry)) {
+		/* name was unlinked while we waited */
+		ret = false;
+	} else if (base &&
+		 (dentry->d_parent != base ||
+		  dentry->d_name.hash != name->hash ||
+		  !d_same_name(dentry, base, name))) {
+		/* dentry was renamed - possibly silly-rename */
+		ret = false;
+	} else if (!base && d_is_positive(dentry)) {
+		ret = false;
+	} else {
+		dentry->d_flags |= DCACHE_PAR_UPDATE;
+	}
+	rcu_read_unlock();
+	spin_unlock(&dentry->d_lock);
+	if (!ret)
+		lock_map_release(&dentry->d_update_map);
+	return ret;
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
diff --git a/fs/namei.c b/fs/namei.c
index 53b4bc094db2..c008dfd01e30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1574,14 +1574,14 @@  static struct dentry *lookup_dcache(const struct qstr *name,
 }
 
 /*
- * Parent directory has inode locked exclusive.  This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked exclusive, or possibly shared if wq
+ * is given.  In the later case the fs has explicitly allowed concurrent
+ * updates in this directory.  This is the one and only case
+ * when ->lookup() may be called on a non in-lookup dentry.
  */
 static struct dentry *__lookup_hash(const struct qstr *name,
-		struct dentry *base, unsigned int flags)
+				    struct dentry *base, unsigned int flags,
+				    wait_queue_head_t *wq)
 {
 	struct dentry *dentry = lookup_dcache(name, base, flags);
 	struct dentry *old;
@@ -1594,18 +1594,103 @@  static struct dentry *__lookup_hash(const struct qstr *name,
 	if (unlikely(IS_DEADDIR(dir)))
 		return ERR_PTR(-ENOENT);
 
-	dentry = d_alloc(base, name);
+	if (wq)
+		dentry = d_alloc_parallel(base, name, wq);
+	else
+		dentry = d_alloc(base, name);
 	if (unlikely(!dentry))
 		return ERR_PTR(-ENOMEM);
+	if (IS_ERR(dentry))
+		return dentry;
+
+	if (wq && !d_in_lookup(dentry))
+		/* Must have raced with another thread doing the lookup */
+		return dentry;
 
 	old = dir->i_op->lookup(dir, dentry, flags);
 	if (unlikely(old)) {
+		d_lookup_done(dentry);
 		dput(dentry);
 		dentry = old;
 	}
 	return dentry;
 }
 
+/*
+ * Parent directory (base) is not locked.  We take either an exclusive
+ * or shared lock depending on the fs preference, then do a lookup,
+ * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
+ * was taken on the parent.
+ */
+static struct dentry *lookup_hash_update(
+	const struct qstr *name,
+	struct dentry *base, unsigned int flags,
+	wait_queue_head_t *wq)
+{
+	struct dentry *dentry;
+	struct inode *dir = base->d_inode;
+	int err;
+
+	if (wq && IS_PAR_UPDATE(dir))
+		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+	else
+		inode_lock_nested(dir, I_MUTEX_PARENT);
+
+retry:
+	dentry = __lookup_hash(name, base, flags, wq);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto out_err;
+	}
+	if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
+		/*
+		 * Failed to get lock due to race with unlink or rename
+		 * - try again
+		 */
+		d_lookup_done(dentry);
+		dput(dentry);
+		goto retry;
+	}
+	return dentry;
+
+out_err:
+	if (wq && IS_PAR_UPDATE(dir))
+		inode_unlock_shared(dir);
+	else
+		inode_unlock(dir);
+	return ERR_PTR(err);
+}
+
+static int lookup_one_common(struct user_namespace *mnt_userns,
+			     const char *name, struct dentry *base, int len,
+			     struct qstr *this);
+
+struct dentry *lookup_hash_update_len(const char *name, int nlen,
+				      struct path *path, unsigned int flags,
+				      wait_queue_head_t *wq)
+{
+	struct qstr this;
+	int err = lookup_one_common(mnt_user_ns(path->mnt), name,
+				    path->dentry, nlen, &this);
+	if (err)
+		return ERR_PTR(err);
+	return lookup_hash_update(&this, path->dentry, flags, wq);
+}
+EXPORT_SYMBOL(lookup_hash_update_len);
+
+static void done_path_update(struct path *path, struct dentry *dentry,
+			     bool with_wq)
+{
+	struct inode *dir = path->dentry->d_inode;
+
+	d_lookup_done(dentry);
+	d_unlock_update(dentry);
+	if (IS_PAR_UPDATE(dir) && with_wq)
+		inode_unlock_shared(dir);
+	else
+		inode_unlock(dir);
+}
+
 static struct dentry *lookup_fast(struct nameidata *nd)
 {
 	struct dentry *dentry, *parent = nd->path.dentry;
@@ -2570,7 +2655,7 @@  static struct dentry *__kern_path_locked(struct filename *name, struct path *pat
 		return ERR_PTR(-EINVAL);
 	}
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	d = __lookup_hash(&last, path->dentry, 0);
+	d = __lookup_hash(&last, path->dentry, 0, NULL);
 	if (IS_ERR(d)) {
 		inode_unlock(path->dentry->d_inode);
 		path_put(path);
@@ -3271,11 +3356,24 @@  static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
 	if (nd->flags & LOOKUP_DIRECTORY)
 		open_flag |= O_DIRECTORY;
 
+	/*
+	 * dentry is negative or d_in_lookup().  In case this is a
+	 * shared-lock create we need to set DCACHE_PAR_UPDATE to ensure
+	 * exclusion.
+	 */
+	if (open_flag & O_CREAT) {
+		if (!d_lock_update(dentry, NULL, NULL))
+			/* already exists, non-atomic open */
+			return dentry;
+	}
+
 	file->f_path.dentry = DENTRY_NOT_SET;
 	file->f_path.mnt = nd->path.mnt;
 	error = dir->i_op->atomic_open(dir, dentry, file,
 				       open_to_namei_flags(open_flag), mode);
 	d_lookup_done(dentry);
+	if ((open_flag & O_CREAT))
+		d_unlock_update(dentry);
 	if (!error) {
 		if (file->f_mode & FMODE_OPENED) {
 			if (unlikely(dentry != file->f_path.dentry)) {
@@ -3327,6 +3425,7 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	int error, create_error = 0;
 	umode_t mode = op->mode;
 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+	bool have_par_update = false;
 
 	if (unlikely(IS_DEADDIR(dir_inode)))
 		return ERR_PTR(-ENOENT);
@@ -3400,6 +3499,12 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 			dentry = res;
 		}
 	}
+	/*
+	 * If dentry is negative and this is a create we need to get
+	 * DCACHE_PAR_UPDATE.
+	 */
+	if ((open_flag & O_CREAT) && !dentry->d_inode)
+		have_par_update = d_lock_update(dentry, NULL, NULL);
 
 	/* Negative dentry, just create the file */
 	if (!dentry->d_inode && (open_flag & O_CREAT)) {
@@ -3419,9 +3524,13 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		error = create_error;
 		goto out_dput;
 	}
+	if (have_par_update)
+		d_unlock_update(dentry);
 	return dentry;
 
 out_dput:
+	if (have_par_update)
+		d_unlock_update(dentry);
 	dput(dentry);
 	return ERR_PTR(error);
 }
@@ -3434,6 +3543,7 @@  static const char *open_last_lookups(struct nameidata *nd,
 	bool got_write = false;
 	struct dentry *dentry;
 	const char *res;
+	bool shared;
 
 	nd->flags |= op->intent;
 
@@ -3474,14 +3584,15 @@  static const char *open_last_lookups(struct nameidata *nd,
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
+	shared = !(open_flag & O_CREAT) || IS_PAR_UPDATE(dir->d_inode);
+	if (!shared)
 		inode_lock(dir->d_inode);
 	else
 		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
 		fsnotify_create(dir->d_inode, dentry);
-	if (open_flag & O_CREAT)
+	if (!shared)
 		inode_unlock(dir->d_inode);
 	else
 		inode_unlock_shared(dir->d_inode);
@@ -3750,41 +3861,29 @@  struct file *do_file_open_root(const struct path *root,
 	return file;
 }
 
-static struct dentry *filename_create(int dfd, struct filename *name,
-				      struct path *path, unsigned int lookup_flags)
+static struct dentry *filename_create_one(struct qstr *last, struct path *path,
+					  unsigned int lookup_flags,
+					  wait_queue_head_t *wq)
 {
-	struct dentry *dentry = ERR_PTR(-EEXIST);
-	struct qstr last;
+	struct dentry *dentry;
 	bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
 	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
 	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
-	int type;
 	int err2;
 	int error;
 
-	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
-	if (error)
-		return ERR_PTR(error);
-
-	/*
-	 * Yucky last component or no last component at all?
-	 * (foo/., foo/.., /////)
-	 */
-	if (unlikely(type != LAST_NORM))
-		goto out;
-
 	/* don't fail immediately if it's r/o, at least try to report other errors */
 	err2 = mnt_want_write(path->mnt);
 	/*
 	 * Do the final lookup.  Suppress 'create' if there is a trailing
 	 * '/', and a directory wasn't requested.
 	 */
-	if (last.name[last.len] && !want_dir)
+	if (last->name[last->len] && !want_dir)
 		create_flags = 0;
-	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path->dentry, reval_flag | create_flags);
+	dentry = lookup_hash_update(last, path->dentry,
+				    reval_flag | create_flags,  wq);
 	if (IS_ERR(dentry))
-		goto unlock;
+		goto drop_write;
 
 	error = -EEXIST;
 	if (d_is_positive(dentry))
@@ -3806,14 +3905,56 @@  static struct dentry *filename_create(int dfd, struct filename *name,
 	}
 	return dentry;
 fail:
+	done_path_update(path, dentry, !!wq);
 	dput(dentry);
 	dentry = ERR_PTR(error);
-unlock:
-	inode_unlock(path->dentry->d_inode);
+drop_write:
 	if (!err2)
 		mnt_drop_write(path->mnt);
+	return dentry;
+}
+
+struct dentry *filename_create_one_len(const char *name, int nlen,
+				       struct path *path,
+				       unsigned int lookup_flags,
+				       wait_queue_head_t *wq)
+{
+	struct qstr this;
+	int err;
+
+	err = lookup_one_common(mnt_user_ns(path->mnt), name,
+				path->dentry, nlen, &this);
+	if (err)
+		return ERR_PTR(err);
+	return filename_create_one(&this, path, lookup_flags, wq);
+}
+EXPORT_SYMBOL(filename_create_one_len);
+
+static struct dentry *filename_create(int dfd, struct filename *name,
+				      struct path *path, unsigned int lookup_flags,
+				      wait_queue_head_t *wq)
+{
+	struct dentry *dentry = ERR_PTR(-EEXIST);
+	struct qstr last;
+	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
+	int type;
+	int error;
+
+	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
+	if (error)
+		return ERR_PTR(error);
+
+	/*
+	 * Yucky last component or no last component at all?
+	 * (foo/., foo/.., /////)
+	 */
+	if (unlikely(type != LAST_NORM))
+		goto out;
+
+	dentry = filename_create_one(&last, path, lookup_flags, wq);
 out:
-	path_put(path);
+	if (IS_ERR(dentry))
+		path_put(path);
 	return dentry;
 }
 
@@ -3821,27 +3962,28 @@  struct dentry *kern_path_create(int dfd, const char *pathname,
 				struct path *path, unsigned int lookup_flags)
 {
 	struct filename *filename = getname_kernel(pathname);
-	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+	struct dentry *res = filename_create(dfd, filename, path, lookup_flags,
+					     NULL);
 
 	putname(filename);
 	return res;
 }
 EXPORT_SYMBOL(kern_path_create);
 
-void done_path_create(struct path *path, struct dentry *dentry)
+void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
 {
+	done_path_update(path, dentry, with_wq);
 	dput(dentry);
-	inode_unlock(path->dentry->d_inode);
 	mnt_drop_write(path->mnt);
 	path_put(path);
 }
-EXPORT_SYMBOL(done_path_create);
+EXPORT_SYMBOL(done_path_create_wq);
 
 inline struct dentry *user_path_create(int dfd, const char __user *pathname,
-				struct path *path, unsigned int lookup_flags)
+				       struct path *path, unsigned int lookup_flags)
 {
 	struct filename *filename = getname(pathname);
-	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+	struct dentry *res = filename_create(dfd, filename, path, lookup_flags, NULL);
 
 	putname(filename);
 	return res;
@@ -3921,12 +4063,13 @@  static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	error = may_mknod(mode);
 	if (error)
 		goto out1;
 retry:
-	dentry = filename_create(dfd, name, &path, lookup_flags);
+	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out1;
@@ -3954,7 +4097,7 @@  static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 			break;
 	}
 out2:
-	done_path_create(&path, dentry);
+	done_path_create_wq(&path, dentry, true);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -4023,9 +4166,10 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 retry:
-	dentry = filename_create(dfd, name, &path, lookup_flags);
+	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
@@ -4038,7 +4182,7 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
 	}
-	done_path_create(&path, dentry);
+	done_path_create_wq(&path, dentry, true);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -4122,6 +4266,7 @@  int do_rmdir(int dfd, struct filename *name)
 	struct qstr last;
 	int type;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
@@ -4143,8 +4288,7 @@  int do_rmdir(int dfd, struct filename *name)
 	if (error)
 		goto exit2;
 
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+	dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit3;
@@ -4158,9 +4302,9 @@  int do_rmdir(int dfd, struct filename *name)
 	mnt_userns = mnt_user_ns(path.mnt);
 	error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
 exit4:
+	done_path_update(&path, dentry, true);
 	dput(dentry);
 exit3:
-	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit2:
 	path_put(&path);
@@ -4185,13 +4329,14 @@  SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
  * @dentry:	victim
  * @delegated_inode: returns victim inode, if the inode is delegated.
  *
- * The caller must hold dir->i_mutex.
+ * The caller must either hold a write-lock on dir->i_rwsem, or
+ * a have atomically set DCACHE_PAR_UPDATE, or both.
  *
  * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
  * return a reference to the inode in delegated_inode.  The caller
  * should then break the delegation on that inode and retry.  Because
  * breaking a delegation may take a long time, the caller should drop
- * dir->i_mutex before doing so.
+ * dir->i_rwsem before doing so.
  *
  * Alternatively, a caller may pass NULL for delegated_inode.  This may
  * be appropriate for callers that expect the underlying filesystem not
@@ -4250,9 +4395,11 @@  EXPORT_SYMBOL(vfs_unlink);
 
 /*
  * Make sure that the actual truncation of the file will occur outside its
- * directory's i_mutex.  Truncate can take a long time if there is a lot of
+ * directory's i_rwsem.  Truncate can take a long time if there is a lot of
  * writeout happening, and we don't want to prevent access to the directory
  * while waiting on the I/O.
+ * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
+ * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.
  */
 int do_unlinkat(int dfd, struct filename *name)
 {
@@ -4264,6 +4411,7 @@  int do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
@@ -4276,9 +4424,9 @@  int do_unlinkat(int dfd, struct filename *name)
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto exit2;
+
 retry_deleg:
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+	dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
 		struct user_namespace *mnt_userns;
@@ -4297,9 +4445,9 @@  int do_unlinkat(int dfd, struct filename *name)
 		error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
 				   &delegated_inode);
 exit3:
+		done_path_update(&path, dentry, true);
 		dput(dentry);
 	}
-	inode_unlock(path.dentry->d_inode);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
 	inode = NULL;
@@ -4388,13 +4536,14 @@  int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if (IS_ERR(from)) {
 		error = PTR_ERR(from);
 		goto out_putnames;
 	}
 retry:
-	dentry = filename_create(newdfd, to, &path, lookup_flags);
+	dentry = filename_create(newdfd, to, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putnames;
@@ -4407,7 +4556,7 @@  int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 		error = vfs_symlink(mnt_userns, path.dentry->d_inode, dentry,
 				    from->name);
 	}
-	done_path_create(&path, dentry);
+	done_path_create_wq(&path, dentry, true);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -4536,6 +4685,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
 		error = -EINVAL;
@@ -4559,7 +4709,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 		goto out_putnames;
 
 	new_dentry = filename_create(newdfd, new, &new_path,
-					(how & LOOKUP_REVAL));
+				     (how & LOOKUP_REVAL), &wq);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out_putpath;
@@ -4577,7 +4727,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
 			 new_dentry, &delegated_inode);
 out_dput:
-	done_path_create(&new_path, new_dentry);
+	done_path_create_wq(&new_path, new_dentry, true);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
 		if (!error) {
@@ -4847,7 +4997,8 @@  int do_renameat2(int olddfd, struct filename *from, int newdfd,
 retry_deleg:
 	trap = lock_rename(new_path.dentry, old_path.dentry);
 
-	old_dentry = __lookup_hash(&old_last, old_path.dentry, lookup_flags);
+	old_dentry = __lookup_hash(&old_last, old_path.dentry,
+				   lookup_flags, NULL);
 	error = PTR_ERR(old_dentry);
 	if (IS_ERR(old_dentry))
 		goto exit3;
@@ -4855,7 +5006,8 @@  int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -ENOENT;
 	if (d_is_negative(old_dentry))
 		goto exit4;
-	new_dentry = __lookup_hash(&new_last, new_path.dentry, lookup_flags | target_flags);
+	new_dentry = __lookup_hash(&new_last, new_path.dentry,
+				   lookup_flags | target_flags, NULL);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto exit4;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 92c78ed02b54..d71c12907617 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -13,7 +13,9 @@ 
 #include <linux/rcupdate.h>
 #include <linux/lockref.h>
 #include <linux/stringhash.h>
+#include <linux/sched.h>
 #include <linux/wait.h>
+#include <linux/wait_bit.h>
 
 struct path;
 struct vfsmount;
@@ -96,6 +98,8 @@  struct dentry {
 	unsigned long d_time;		/* used by d_revalidate */
 	void *d_fsdata;			/* fs-specific data */
 
+	/* lockdep tracking of DCACHE_PAR_UPDATE locks */
+	struct lockdep_map		d_update_map;
 	union {
 		struct list_head d_lru;		/* LRU list */
 		wait_queue_head_t *d_wait;	/* in-lookup ones only */
@@ -211,6 +215,7 @@  struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_PAR_UPDATE		0x80000000 /* Being created or unlinked with shared lock */
 
 extern seqlock_t rename_lock;
 
@@ -598,4 +603,27 @@  struct name_snapshot {
 void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
 void release_dentry_name_snapshot(struct name_snapshot *);
 
+bool d_lock_update_nested(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *name,
+			  int subclass);
+static inline bool d_lock_update(struct dentry *dentry,
+				 struct dentry *base, const struct qstr *name)
+{
+	/* 0 is I_MUTEX_NORMAL, but fs.h might not be included */
+	return d_lock_update_nested(dentry, base, name, 0);
+}
+
+static inline void d_unlock_update(struct dentry *dentry)
+{
+	if (IS_ERR_OR_NULL(dentry))
+		return;
+	if (dentry->d_flags & DCACHE_PAR_UPDATE) {
+		lock_map_release(&dentry->d_update_map);
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+		spin_unlock(&dentry->d_lock);
+		wake_up_var(&dentry->d_flags);
+	}
+}
+
 #endif	/* __LINUX_DCACHE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..a9a48306806a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2524,12 +2524,13 @@  int sync_inode_metadata(struct inode *inode, int wait);
 struct file_system_type {
 	const char *name;
 	int fs_flags;
-#define FS_REQUIRES_DEV		1 
+#define FS_REQUIRES_DEV		1
 #define FS_BINARY_MOUNTDATA	2
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
-#define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
+#define FS_ALLOW_IDMAP		32	/* FS has been updated to handle vfs idmappings. */
+#define FS_PAR_DIR_UPDATE	BIT(6)	/* Allow parallel directory updates */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index caeb08a98536..b7a123b489b1 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -61,7 +61,14 @@  extern int kern_path(const char *, unsigned, struct path *);
 
 extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int);
 extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int);
-extern void done_path_create(struct path *, struct dentry *);
+extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
+					     struct path *path, unsigned int flags,
+					     wait_queue_head_t *wq);
+extern void done_path_create_wq(struct path *, struct dentry *, bool);
+static inline void done_path_create(struct path *path, struct dentry *dentry)
+{
+	done_path_create_wq(path, dentry, false);
+}
 extern struct dentry *kern_path_locked(const char *, struct path *);
 
 extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
@@ -75,6 +82,13 @@  struct dentry *lookup_one_unlocked(struct user_namespace *mnt_userns,
 struct dentry *lookup_one_positive_unlocked(struct user_namespace *mnt_userns,
 					    const char *name,
 					    struct dentry *base, int len);
+struct dentry *filename_create_one_len(const char *name, int nlen,
+				       struct path *path,
+				       unsigned int lookup_flags,
+				       wait_queue_head_t *wq);
+static inline bool IS_PAR_UPDATE(struct inode *dir) {
+	return !!(dir->i_sb->s_type->fs_flags & FS_PAR_DIR_UPDATE);
+}
 
 extern int follow_down_one(struct path *);
 extern int follow_down(struct path *);