[PATCH/RFC] VFS: support parallel updates in the one directory.

Message ID 164549669043.5153.2021348013072574365@noble.neil.brown.name (mailing list archive)
State New, archived
Series [PATCH/RFC] VFS: support parallel updates in the one directory.

Commit Message

NeilBrown Feb. 22, 2022, 2:24 a.m. UTC
Hi Al,
 I wonder if you might find time to have a look at this patch.  It
 allows concurrent updates to a single directory.  This can result in
 substantial throughput improvements when the application uses multiple
 threads to create lots of files in the one directory, and there is
 noticeable per-create latency, as there can be with NFS to a remote
 server.
Thanks,
NeilBrown

Some filesystems can support parallel modifications to a directory,
either because the modifications happen on a remote server which does its
own locking (e.g.  NFS) or because they can internally lock just a part
of a directory (e.g.  many local filesystems, with a bit of work - the
Lustre project has patches for ext4 to support concurrent updates).

To allow this, we introduce VFS support for parallel modification:
unlink (including rmdir) and create.  Parallel rename is not (yet)
supported.

If a filesystem supports parallel modification in a given directory, it
sets S_PAR_UPDATE on the inode for that directory.  lookup_open() and
the new lookup_hash_update() (similar to __lookup_hash()) notice the
flag and take a shared lock on the directory, and rely on a lock-bit in
d_flags, much like parallel lookup relies on DCACHE_PAR_LOOKUP.
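
In outline (a simplified sketch of the behaviour of this patch - see
lookup_hash_update() in the diff below):

	/* filesystem opts in, per directory inode: */
	inode->i_flags |= S_PAR_UPDATE;

	/* VFS, before modifying a name in 'dir': */
	if (dir->i_flags & S_PAR_UPDATE)
		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
	else
		inode_lock_nested(dir, I_MUTEX_PARENT);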

The new DCACHE_PAR_UPDATE is the most significant bit in d_flags, so we have
nearly run out of flags.  0x08000000 is still unused .... for now.

Once a dentry for the target name has been obtained, DCACHE_PAR_UPDATE
is set on it, waiting first if it is already set.  Once this is
set, the thread has exclusive access to the name and can call into the
filesystem to perform the required action.

We use wait_var_event() to wait for DCACHE_PAR_UPDATE to be cleared rather
than trying to overload the wq used for DCACHE_PAR_LOOKUP.
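
The lock/unlock pairing is, in essence (a simplified sketch of
d_lock_update_nested() and d_unlock_update() below, eliding the
re-checks of name, parent and hash state that the real code performs):

	/* acquire: */
	spin_lock(&dentry->d_lock);
	/* ... wait_var_event()-style wait for the flag to clear ... */
	dentry->d_flags |= DCACHE_PAR_UPDATE;
	spin_unlock(&dentry->d_lock);

	/* release: */
	spin_lock(&dentry->d_lock);
	dentry->d_flags &= ~DCACHE_PAR_UPDATE;
	spin_unlock(&dentry->d_lock);
	wake_up_var(&dentry->d_flags);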

Some filesystems do *not* complete the lookup that precedes a create,
but leave the dentry d_in_lookup() and unhashed, so often a dentry will
have both DCACHE_PAR_LOOKUP and DCACHE_PAR_UPDATE set at the same time.
To allow for this, we need the 'wq' that is passed to d_alloc_parallel()
(and used when DCACHE_PAR_LOOKUP is cleared), to continue to exist until
the creation is complete, which may be after filename_create()
completes.  So a wq now needs to be passed to filename_create() when
parallel modification is attempted.
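
The caller pattern then becomes (a condensed sketch of do_mkdirat()
from the diff below):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
	if (!IS_ERR(dentry)) {
		error = vfs_mkdir(mnt_userns, path.dentry->d_inode,
				  dentry, mode);
		/* wq must still be alive here: */
		__done_path_create(&path, dentry, &wq);
	}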

Waiting for DCACHE_PAR_UPDATE can sometimes result in the dentry
changing - either through unlink or rename.  This requires that the
lookup be retried.  This in turn requires that the wq be reused.
The use of DECLARE_WAITQUEUE() in d_wait_lookup() and the omission of
remove_wait_queue() mean that after the wake_up_all() in
__d_lookup_down(), the wait_queue_head list will be undefined.  So we change
d_wait_lookup() to use DEFINE_WAIT(), so that autoremove_wake_function()
is used for wakeups, and the wait_queue_head remains well defined.
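
The practical difference is only the wake function attached to the
wait entry:

	DECLARE_WAITQUEUE(wait, current);	/* default_wake_function:
						 * entry stays queued after
						 * wake_up_all() */
	DEFINE_WAIT(wait);			/* autoremove_wake_function:
						 * the waker dequeues the
						 * entry */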

NFS can easily support parallel updates as the locking is done on the
server, so this patch enables parallel updates for NFS.  This would be a
separate patch in the final submission.

NFS unlink needs to block concurrent open()s once it decides to actually
unlink the file, rather than rename it to .nfsXXXX (aka sillyrename).
It currently does this by temporarily unhashing the dentry and relying
on the exclusive lock on the directory to block a ->lookup().  That
doesn't work now that unlink uses a shared lock, so an alternate
approach is needed.

__nfs_lookup_revalidate (->d_revalidate) now blocks if DCACHE_PAR_UPDATE
is set, and if nfs_unlink() happens to be called with an exclusive lock
and DCACHE_PAR_UPDATE is not set, it sets it during the potential race window.

I'd rather use some other indicator in the dentry to tell
__nfs_lookup_revalidate() to wait - we are nearly out of d_flags bits,
and NFS does have a general-purpose d_fsdata that could be used instead.

NFS "silly-rename" may now be called with only a shared lock on the
directory, so it needs a bit of extra care to get exclusive access to
the new name. d_lock_update_nested() and d_unlock_update() help here.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c            |  59 ++++++++++++-
 fs/namei.c             | 193 ++++++++++++++++++++++++++++++++++-------
 fs/nfs/dir.c           |  31 +++++--
 fs/nfs/inode.c         |   2 +
 fs/nfs/unlink.c        |   5 +-
 include/linux/dcache.h |  27 ++++++
 include/linux/fs.h     |   1 +
 7 files changed, 278 insertions(+), 40 deletions(-)

Comments

J. Bruce Fields Feb. 22, 2022, 7:07 p.m. UTC | #1
For what it's worth, I applied this to recent upstream (038101e6b2cd)
and fed it through my usual scripts--tests all passed, but I did see
this lockdep warning.

I'm not actually sure what was running at the time--probably just cthon.

--b.

[  142.679891] ======================================================
[  142.680883] WARNING: possible circular locking dependency detected
[  142.681999] 5.17.0-rc5-00005-g64e79f877311 #1778 Not tainted
[  142.682970] ------------------------------------------------------
[  142.684059] test1/4557 is trying to acquire lock:
[  142.684881] ffff888023d85398 (DENTRY_PAR_UPDATE){+.+.}-{0:0}, at: d_lock_update_nested+0x5/0x6a0
[  142.686421] 
               but task is already holding lock:
[  142.687171] ffff88801f618bd0 (&type->i_mutex_dir_key#6){++++}-{3:3}, at: path_openat+0x7cb/0x24a0
[  142.689098] 
               which lock already depends on the new lock.

[  142.690045] 
               the existing dependency chain (in reverse order) is:
[  142.691171] 
               -> #1 (&type->i_mutex_dir_key#6){++++}-{3:3}:
[  142.692285]        down_write+0x82/0x130
[  142.692844]        vfs_rmdir+0xbd/0x560
[  142.693351]        do_rmdir+0x33d/0x400
[  142.693830]        __x64_sys_unlinkat+0xaf/0xe0
[  142.694391]        do_syscall_64+0x43/0x90
[  142.694900]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  142.695609] 
               -> #0 (DENTRY_PAR_UPDATE){+.+.}-{0:0}:
[  142.696357]        __lock_acquire+0x2bfd/0x61b0
[  142.696965]        lock_acquire+0x1a6/0x4b0
[  142.697489]        d_lock_update_nested+0x8d/0x6a0
[  142.698082]        lookup_open.isra.0+0x1305/0x1bf0
[  142.698685]        path_openat+0x833/0x24a0
[  142.699205]        do_filp_open+0x197/0x3c0
[  142.699725]        do_sys_openat2+0xef/0x3d0
[  142.700345]        __x64_sys_creat+0xb4/0xf0
[  142.700915]        do_syscall_64+0x43/0x90
[  142.701431]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  142.702112] 
               other info that might help us debug this:

[  142.702872]  Possible unsafe locking scenario:

[  142.703492]        CPU0                    CPU1
[  142.704115]        ----                    ----
[  142.704704]   lock(&type->i_mutex_dir_key#6);
[  142.705296]                                lock(DENTRY_PAR_UPDATE);
[  142.706065]                                lock(&type->i_mutex_dir_key#6);
[  142.706911]   lock(DENTRY_PAR_UPDATE);
[  142.707382] 
                *** DEADLOCK ***

[  142.707978] 2 locks held by test1/4557:
[  142.708485]  #0: ffff888011ef0438 (sb_writers#14){.+.+}-{0:0}, at: path_openat+0x1408/0x24a0
[  142.709562]  #1: ffff88801f618bd0 (&type->i_mutex_dir_key#6){++++}-{3:3}, at: path_openat+0x7cb/0x24a0
[  142.710757] 
               stack backtrace:
[  142.711187] CPU: 0 PID: 4557 Comm: test1 Not tainted 5.17.0-rc5-00005-g64e79f877311 #1778
[  142.712275] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
[  142.713409] Call Trace:
[  142.713717]  <TASK>
[  142.713986]  dump_stack_lvl+0x45/0x59
[  142.714449]  check_noncircular+0x23e/0x2e0
[  142.714960]  ? print_circular_bug+0x450/0x450
[  142.715502]  ? is_dynamic_key+0x1a0/0x1a0
[  142.716061]  ? _raw_spin_unlock_irqrestore+0x38/0x50
[  142.716706]  __lock_acquire+0x2bfd/0x61b0
[  142.717246]  ? do_syscall_64+0x43/0x90
[  142.717712]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[  142.718361]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  142.719590]  ? mark_lock.part.0+0xe7/0x2e00
[  142.720217]  lock_acquire+0x1a6/0x4b0
[  142.720706]  ? d_lock_update_nested+0x5/0x6a0
[  142.721293]  ? lock_release+0x6d0/0x6d0
[  142.721774]  ? lock_downgrade+0x690/0x690
[  142.722280]  ? do_raw_spin_unlock+0x54/0x220
[  142.722814]  d_lock_update_nested+0x8d/0x6a0
[  142.723350]  ? d_lock_update_nested+0x5/0x6a0
[  142.723959]  ? d_alloc_anon+0x10/0x10
[  142.724443]  ? kfree+0x110/0x250
[  142.724887]  ? lockdep_hardirqs_on+0x79/0x100
[  142.725454]  ? kfree+0x110/0x250
[  142.725859]  ? nfs_lookup+0x55a/0xa80 [nfs]
[  142.726410]  lookup_open.isra.0+0x1305/0x1bf0
[  142.726952]  ? lookup_positive_unlocked+0x80/0x80
[  142.727546]  ? rwsem_down_read_slowpath+0xaa0/0xaa0
[  142.728239]  path_openat+0x833/0x24a0
[  142.728700]  ? path_lookupat+0x6b0/0x6b0
[  142.729233]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  142.729867]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  142.730502]  do_filp_open+0x197/0x3c0
[  142.730955]  ? lock_is_held_type+0xd7/0x130
[  142.731477]  ? may_open_dev+0xd0/0xd0
[  142.731994]  ? do_raw_spin_lock+0x11e/0x240
[  142.732545]  ? do_raw_spin_unlock+0x54/0x220
[  142.733115]  ? getname_flags.part.0+0x8e/0x450
[  142.733670]  do_sys_openat2+0xef/0x3d0
[  142.734140]  ? mntput_no_expire+0x10f/0xad0
[  142.734661]  ? build_open_flags+0x450/0x450
[  142.735179]  ? dput+0x30/0xa10
[  142.735565]  __x64_sys_creat+0xb4/0xf0
[  142.736091]  ? __ia32_sys_openat2+0x260/0x260
[  142.736656]  ? syscall_enter_from_user_mode+0x1d/0x50
[  142.737325]  ? lockdep_hardirqs_on+0x79/0x100
[  142.737885]  do_syscall_64+0x43/0x90
[  142.738333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  142.738959] RIP: 0033:0x7f4c0f358e07
[  142.739408] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 55 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 74 24 0c 48 89 3c 24 e8
[  142.741911] RSP: 002b:00007ffe31b33e48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[  142.742843] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c0f358e07
[  142.743718] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 00007ffe31b33e80
[  142.744675] RBP: 00007ffe31b34e90 R08: 0000000000000000 R09: 00007ffe31b33be7
[  142.745589] R10: fffffffffffff934 R11: 0000000000000246 R12: 0000000000401190
[  142.746464] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  142.747343]  </TASK>
J. Bruce Fields Feb. 22, 2022, 9:11 p.m. UTC | #2
On Tue, Feb 22, 2022 at 02:07:51PM -0500, bfields wrote:
> For what it's worth, I applied this to recent upstream (038101e6b2cd)
> and fed it through my usual scripts--tests all passed, but I did see
> this lockdep warning.
> 
> I'm not actually sure what was running at the time--probably just cthon.

Yeah, reproducible just by running "cthon -b -f" over an NFS mount.

--b.
Dave Chinner Feb. 22, 2022, 10:45 p.m. UTC | #3
On Tue, Feb 22, 2022 at 01:24:50PM +1100, NeilBrown wrote:
> 
> Hi Al,
>  I wonder if you might find time to have a look at this patch.  It
>  allows concurrent updates to a single directory.  This can result in
>  substantial throughput improvements when the application uses multiple
>  threads to create lots of files in the one directory, and there is
>  noticeable per-create latency, as there can be with NFS to a remote
>  server.
> Thanks,
> NeilBrown
> 
> Some filesystems can support parallel modifications to a directory,
> either because the modification happen on a remote server which does its
> own locking (e.g.  NFS) or because they can internally lock just a part
> of a directory (e.g.  many local filesystems, with a bit of work - the
> lustre project has patches for ext4 to support concurrent updates).
> 
> To allow this, we introduce VFS support for parallel modification:
> unlink (including rmdir) and create.  Parallel rename is not (yet)
> supported.

Yay!

> If a filesystem supports parallel modification in a given directory, it
> sets S_PAR_UNLINK on the inode for that directory.  lookup_open() and
> the new lookup_hash_modify() (similar to __lookup_hash()) notice the
> flag and take a shared lock on the directory, and rely on a lock-bit in
> d_flags, much like parallel lookup relies on DCACHE_PAR_LOOKUP.

I suspect that you could enable this for XFS right now. XFS has internal
directory inode locking that should serialise all reads and writes
correctly regardless of what the VFS does. So while the VFS might
use concurrent updates (e.g. inode_lock_shared() instead of
inode_lock() on the dir inode), XFS has an internal metadata lock
that will then serialise the concurrent VFS directory modifications
correctly....

Yeah, I know, this isn't true concurrent dir updates, but it should
allow multiple implementations of the concurrent dir update VFS APIs
across multiple filesystems and shake out any assumptions that might
arise from a single implementation target (e.g. silly rename
quirks).

Cheers,

Dave.
Darrick J. Wong Feb. 24, 2022, 4:43 a.m. UTC | #4
On Wed, Feb 23, 2022 at 09:45:46AM +1100, Dave Chinner wrote:
> On Tue, Feb 22, 2022 at 01:24:50PM +1100, NeilBrown wrote:
> > 
> > Hi Al,
> >  I wonder if you might find time to have a look at this patch.  It
> >  allows concurrent updates to a single directory.  This can result in
> >  substantial throughput improvements when the application uses multiple
> >  threads to create lots of files in the one directory, and there is
> >  noticeable per-create latency, as there can be with NFS to a remote
> >  server.
> > Thanks,
> > NeilBrown
> > 
> > Some filesystems can support parallel modifications to a directory,
> > either because the modification happen on a remote server which does its
> > own locking (e.g.  NFS) or because they can internally lock just a part
> > of a directory (e.g.  many local filesystems, with a bit of work - the
> > lustre project has patches for ext4 to support concurrent updates).
> > 
> > To allow this, we introduce VFS support for parallel modification:
> > unlink (including rmdir) and create.  Parallel rename is not (yet)
> > supported.
> 
> Yay!
> 
> > If a filesystem supports parallel modification in a given directory, it
> > sets S_PAR_UNLINK on the inode for that directory.  lookup_open() and
> > the new lookup_hash_modify() (similar to __lookup_hash()) notice the
> > flag and take a shared lock on the directory, and rely on a lock-bit in
> > d_flags, much like parallel lookup relies on DCACHE_PAR_LOOKUP.
> 
> I suspect that you could enable this for XFS right now. XFS has internal
> directory inode locking that should serialise all reads and writes
> correctly regardless of what the VFS does. So while the VFS might
> use concurrent updates (e.g. inode_lock_shared() instead of
> inode_lock() on the dir inode), XFS has an internal metadata lock
> that will then serialise the concurrent VFS directory modifications
> correctly....

I don't think that will work because xfs_readdir doesn't hold the
directory ILOCK while it runs, which means that readdir will see garbage
if other threads now only hold inode_lock_shared while they update the
directory.

--D

> Yeah, I know, this isn't true concurrent dir updates, but it should
> allow multiple implementations of the concurrent dir update VFS APIs
> across multiple filesystems and shake out any assumptions that might
> arise from a single implementation target (e.g. silly rename
> quirks).
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
NeilBrown Feb. 24, 2022, 5:08 a.m. UTC | #5
On Wed, 23 Feb 2022, J. Bruce Fields wrote:
> For what it's worth, I applied this to recent upstream (038101e6b2cd)
> and fed it through my usual scripts--tests all passed, but I did see
> this lockdep warning.
> 
> I'm not actually sure what was running at the time--probably just cthon.
> 
> --b.
> 
> [  142.679891] ======================================================
> [  142.680883] WARNING: possible circular locking dependency detected
> [  142.681999] 5.17.0-rc5-00005-g64e79f877311 #1778 Not tainted
> [  142.682970] ------------------------------------------------------
> [  142.684059] test1/4557 is trying to acquire lock:
> [  142.684881] ffff888023d85398 (DENTRY_PAR_UPDATE){+.+.}-{0:0}, at: d_lock_update_nested+0x5/0x6a0
> [  142.686421] 
>                but task is already holding lock:
> [  142.687171] ffff88801f618bd0 (&type->i_mutex_dir_key#6){++++}-{3:3}, at: path_openat+0x7cb/0x24a0
> [  142.689098] 
>                which lock already depends on the new lock.
> 
> [  142.690045] 
>                the existing dependency chain (in reverse order) is:
> [  142.691171] 
>                -> #1 (&type->i_mutex_dir_key#6){++++}-{3:3}:
> [  142.692285]        down_write+0x82/0x130
> [  142.692844]        vfs_rmdir+0xbd/0x560
> [  142.693351]        do_rmdir+0x33d/0x400

Thanks.  I hadn't tested rmdir :-)

"rmdir" and "open(O_CREATE)" take these locks in the opposite order.

I think the simplest fix might be to change the inode_lock(_shared) taken
on the dir in open_last_lookups() to use I_MUTEX_PARENT.  That is
consistent with unlink and rmdir etc which use I_MUTEX_PARENT on the
parent.

open() doesn't currently use I_MUTEX_PARENT because it never needs to
lock the child.  But as it *is* a parent that is being locked, using
I_MUTEX_PARENT probably makes more sense.

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3513,9 +3513,9 @@ static const char *open_last_lookups(struct nameidata *nd,
 	}
 	shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
 	if ((open_flag & O_CREAT) && !shared)
-		inode_lock(dir->d_inode);
+		inode_lock_nested(dir->d_inode, I_MUTEX_PARENT);
 	else
-		inode_lock_shared(dir->d_inode);
+		inode_lock_shared_nested(dir->d_inode, I_MUTEX_PARENT);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
 		fsnotify_create(dir->d_inode, dentry);

Thanks,
NeilBrown
NeilBrown Feb. 24, 2022, 5:56 a.m. UTC | #6
On Thu, 24 Feb 2022, Darrick J. Wong wrote:
> On Wed, Feb 23, 2022 at 09:45:46AM +1100, Dave Chinner wrote:
> > On Tue, Feb 22, 2022 at 01:24:50PM +1100, NeilBrown wrote:
> > > 
> > > Hi Al,
> > >  I wonder if you might find time to have a look at this patch.  It
> > >  allows concurrent updates to a single directory.  This can result in
> > >  substantial throughput improvements when the application uses multiple
> > >  threads to create lots of files in the one directory, and there is
> > >  noticeable per-create latency, as there can be with NFS to a remote
> > >  server.
> > > Thanks,
> > > NeilBrown
> > > 
> > > Some filesystems can support parallel modifications to a directory,
> > > either because the modification happen on a remote server which does its
> > > own locking (e.g.  NFS) or because they can internally lock just a part
> > > of a directory (e.g.  many local filesystems, with a bit of work - the
> > > lustre project has patches for ext4 to support concurrent updates).
> > > 
> > > To allow this, we introduce VFS support for parallel modification:
> > > unlink (including rmdir) and create.  Parallel rename is not (yet)
> > > supported.
> > 
> > Yay!
> > 
> > > If a filesystem supports parallel modification in a given directory, it
> > > sets S_PAR_UNLINK on the inode for that directory.  lookup_open() and
> > > the new lookup_hash_modify() (similar to __lookup_hash()) notice the
> > > flag and take a shared lock on the directory, and rely on a lock-bit in
> > > d_flags, much like parallel lookup relies on DCACHE_PAR_LOOKUP.
> > 
> > I suspect that you could enable this for XFS right now. XFS has internal
> > directory inode locking that should serialise all reads and writes
> > correctly regardless of what the VFS does. So while the VFS might
> > use concurrent updates (e.g. inode_lock_shared() instead of
> > inode_lock() on the dir inode), XFS has an internal metadata lock
> > that will then serialise the concurrent VFS directory modifications
> > correctly....
> 
> I don't think that will work because xfs_readdir doesn't hold the
> directory ILOCK while it runs, which means that readdir will see garbage
> if other threads now only hold inode_lock_shared while they update the
> directory.

I added this:
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -87,6 +87,7 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
+	VFS_I(ip)->i_flags |= S_PAR_UPDATE;
 	mapping_set_large_folios(VFS_I(ip)->i_mapping);
 
 	XFS_STATS_INC(mp, vn_active);

and ran my highly sophisticated test in an XFS directory:

 for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done

This doesn't crash - which is a good sign.
While that was going I tried
 while : ; do ls -l ; done

it sometimes reports garbage for the stat info:

total 0
-????????? ? ?    ?    ?            ? 1749
-????????? ? ?    ?    ?            ? 1764
-????????? ? ?    ?    ?            ? 1765
-rw-r--r-- 1 root root 0 Feb 24 16:47 1768
-rw-r--r-- 1 root root 0 Feb 24 16:47 1770
-rw-r--r-- 1 root root 0 Feb 24 16:47 1772
....

I *think* that is bad - probably the "garbage" that you referred to?

Obviously I get lots of 
ls: cannot access '1764': No such file or directory
ls: cannot access '1749': No such file or directory
ls: cannot access '1780': No such file or directory
ls: cannot access '1765': No such file or directory

but that is normal and expected when you are creating and deleting
files during the ls.

NeilBrown




> 
> --D
> 
> > Yeah, I know, this isn't true concurrent dir updates, but it should
> > allow multiple implementations of the concurrent dir update VFS APIs
> > across multiple filesystems and shake out any assumptions that might
> > arise from a single implementation target (e.g. silly rename
> > quirks).
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 
>
Andreas Dilger Feb. 24, 2022, 4:31 p.m. UTC | #7
On Feb 23, 2022, at 22:57, NeilBrown <neilb@suse.de> wrote:
> 
> 
> I added this:
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -87,6 +87,7 @@ xfs_inode_alloc(
>    /* VFS doesn't initialise i_mode or i_state! */
>    VFS_I(ip)->i_mode = 0;
>    VFS_I(ip)->i_state = 0;
> +    VFS_I(ip)->i_flags |= S_PAR_UPDATE;
>    mapping_set_large_folios(VFS_I(ip)->i_mapping);
> 
>    XFS_STATS_INC(mp, vn_active);
> 
> and ran my highly sophisticated test in an XFS directory:
> 
> for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done
> 
> This doesn't crash - which is a good sign.
> While that was going I tried
> while : ; do ls -l ; done
> 
> it sometimes reports garbage for the stat info:
> 
> total 0
> -????????? ? ?    ?    ?            ? 1749
> -????????? ? ?    ?    ?            ? 1764
> -????????? ? ?    ?    ?            ? 1765
> -rw-r--r-- 1 root root 0 Feb 24 16:47 1768
> -rw-r--r-- 1 root root 0 Feb 24 16:47 1770
> -rw-r--r-- 1 root root 0 Feb 24 16:47 1772
> ....
> 
> I *think* that is bad - probably the "garbage" that you referred to?
> 
> Obviously I gets lots of 
> ls: cannot access '1764': No such file or directory
> ls: cannot access '1749': No such file or directory
> ls: cannot access '1780': No such file or directory
> ls: cannot access '1765': No such file or directory
> 
> but that is normal and expected when you are creating and deleting
> files during the ls.

The "ls -l" output with "???" is exactly the case where the filename is
in readdir() but stat() on a file fails due to an unavoidable userspace 
race between the two syscalls and the concurrent unlink(). This is
probably visible even without the concurrent dirops patch. 

The list of affected filenames even correlates with the reported errors:
1749, 1764, 1765

It looks like everything is working as expected. 

Cheers, Andreas
J. Bruce Fields Feb. 24, 2022, 9:23 p.m. UTC | #8
On Thu, Feb 24, 2022 at 04:08:36PM +1100, NeilBrown wrote:
> On Wed, 23 Feb 2022, J. Bruce Fields wrote:
> > For what it's worth, I applied this to recent upstream (038101e6b2cd)
> > and fed it through my usual scripts--tests all passed, but I did see
> > this lockdep warning.
> > 
> > I'm not actually sure what was running at the time--probably just cthon.
> > 
> > --b.
> > 
> > [  142.679891] ======================================================
> > [  142.680883] WARNING: possible circular locking dependency detected
> > [  142.681999] 5.17.0-rc5-00005-g64e79f877311 #1778 Not tainted
> > [  142.682970] ------------------------------------------------------
> > [  142.684059] test1/4557 is trying to acquire lock:
> > [  142.684881] ffff888023d85398 (DENTRY_PAR_UPDATE){+.+.}-{0:0}, at: d_lock_update_nested+0x5/0x6a0
> > [  142.686421] 
> >                but task is already holding lock:
> > [  142.687171] ffff88801f618bd0 (&type->i_mutex_dir_key#6){++++}-{3:3}, at: path_openat+0x7cb/0x24a0
> > [  142.689098] 
> >                which lock already depends on the new lock.
> > 
> > [  142.690045] 
> >                the existing dependency chain (in reverse order) is:
> > [  142.691171] 
> >                -> #1 (&type->i_mutex_dir_key#6){++++}-{3:3}:
> > [  142.692285]        down_write+0x82/0x130
> > [  142.692844]        vfs_rmdir+0xbd/0x560
> > [  142.693351]        do_rmdir+0x33d/0x400
> 
> Thanks.  I hadn't tested rmdir :-)

OK.  I tested with this applied and didn't see any issues.

--b.

> 
> "rmdir" and "open(O_CREATE)" take these locks in the opposite order.
> 
> I think the simplest fix might be to change the inode_lock(_shared) taken
> on the dir in open_last_Lookups() to use I_MUTEX_PARENT.  That is
> consistent with unlink and rmdir etc which use I_MUTEX_PARENT on the
> parent.
> 
> open() doesn't currently use I_MUTEX_PARENT because it never needs to
> lock the child.  But as it *is* a parent that is being locked, using
> I_MUTEX_PARENT probably make more sense.
> 
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3513,9 +3513,9 @@ static const char *open_last_lookups(struct nameidata *nd,
>  	}
>  	shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
>  	if ((open_flag & O_CREAT) && !shared)
> -		inode_lock(dir->d_inode);
> +		inode_lock_nested(dir->d_inode, I_MUTEX_PARENT);
>  	else
> -		inode_lock_shared(dir->d_inode);
> +		inode_lock_shared_nested(dir->d_inode, I_MUTEX_PARENT);
>  	dentry = lookup_open(nd, file, op, got_write);
>  	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
>  		fsnotify_create(dir->d_inode, dentry);
> 
> Thanks,
> NeilBrown
Darrick J. Wong Feb. 24, 2022, 11:38 p.m. UTC | #9
On Thu, Feb 24, 2022 at 09:31:28AM -0700, Andreas Dilger wrote:
> On Feb 23, 2022, at 22:57, NeilBrown <neilb@suse.de> wrote:
> > 
> > 
> > I added this:
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> >    /* VFS doesn't initialise i_mode or i_state! */
> >    VFS_I(ip)->i_mode = 0;
> >    VFS_I(ip)->i_state = 0;
> > +    VFS_I(ip)->i_flags |= S_PAR_UPDATE;
> >    mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > 
> >    XFS_STATS_INC(mp, vn_active);
> > 
> > and ran my highly sophisticated test in an XFS directory:
> > 
> > for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done

I think you want something faster here, like ln to hardlink an existing
file into the directory.

> > This doesn't crash - which is a good sign.
> > While that was going I tried
> > while : ; do ls -l ; done
> > 
> > it sometimes reports garbage for the stat info:
> > 
> > total 0
> > -????????? ? ?    ?    ?            ? 1749
> > -????????? ? ?    ?    ?            ? 1764
> > -????????? ? ?    ?    ?            ? 1765
> > -rw-r--r-- 1 root root 0 Feb 24 16:47 1768
> > -rw-r--r-- 1 root root 0 Feb 24 16:47 1770
> > -rw-r--r-- 1 root root 0 Feb 24 16:47 1772
> > ....
> > 
> > I *think* that is bad - probably the "garbage" that you referred to?
> > 
> > Obviously I gets lots of 
> > ls: cannot access '1764': No such file or directory
> > ls: cannot access '1749': No such file or directory
> > ls: cannot access '1780': No such file or directory
> > ls: cannot access '1765': No such file or directory
> > 
> > but that is normal and expected when you are creating and deleting
> > files during the ls.
> 
> The "ls -l" output with "???" is exactly the case where the filename is
> in readdir() but stat() on a file fails due to an unavoidable userspace 
> race between the two syscalls and the concurrent unlink(). This is
> probably visible even without the concurrent dirops patch. 
> 
> The list of affected filenames even correlates with the reported errors:
> 1764, 1765, 1769
> 
> It looks like everything is working as expected. 

Here, yes.

A problem that I saw a week or two ago with online fsck is that an evil
thread repeatedly link()ing and unlink()ing a file into an otherwise
empty directory while racing a thread calling readdir() in a loop will
eventually trigger a corruption report on the directory namecheck
because the loop in xfs_dir2_sf_getdents that uses sfp->count as a loop
counter will race with the unlink decrementing sfp->count and run off
the end of the inline directory data buffer.

https://elixir.bootlin.com/linux/latest/source/fs/xfs/xfs_dir2_readdir.c#L121

The solution in that case was a forgotten acquisition of the directory
IOLOCK, but I don't see why the same principle wouldn't apply here.
It's probably not so hard to fix it (rewrite readdir to take the ILOCK
once, format the dirents to a buffer until it's full, save cursor, drop
ILOCK, copy buffer to userspace) but it's not as easy as setting
PAR_UPDATE.
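
Roughly (an outline only - the placement here is hypothetical, though
the lock helpers are the real ones):

	lock_mode = xfs_ilock_data_map_shared(dp);
	/* format dirents into a private kernel buffer until it fills,
	 * remembering the next offset as the resume cursor */
	xfs_iunlock(dp, lock_mode);
	/* then emit the buffered entries via dir_emit(), which can
	 * fault, with no XFS locks held */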

(I am also not a fan of "PAR_UPDATE", since 'par' is already an English
word that doesn't mean 'parallel'.)

--D

> 
> Cheers, Andreas
>
Dave Chinner Feb. 25, 2022, 11:07 p.m. UTC | #10
On Wed, Feb 23, 2022 at 08:43:28PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 23, 2022 at 09:45:46AM +1100, Dave Chinner wrote:
> > On Tue, Feb 22, 2022 at 01:24:50PM +1100, NeilBrown wrote:
> > > 
> > > Hi Al,
> > >  I wonder if you might find time to have a look at this patch.  It
> > >  allows concurrent updates to a single directory.  This can result in
> > >  substantial throughput improvements when the application uses multiple
> > >  threads to create lots of files in the one directory, and there is
> > >  noticeable per-create latency, as there can be with NFS to a remote
> > >  server.
> > > Thanks,
> > > NeilBrown
> > > 
> > > Some filesystems can support parallel modifications to a directory,
> > > either because the modification happen on a remote server which does its
> > > own locking (e.g.  NFS) or because they can internally lock just a part
> > > of a directory (e.g.  many local filesystems, with a bit of work - the
> > > lustre project has patches for ext4 to support concurrent updates).
> > > 
> > > To allow this, we introduce VFS support for parallel modification:
> > > unlink (including rmdir) and create.  Parallel rename is not (yet)
> > > supported.
> > 
> > Yay!
> > 
> > > If a filesystem supports parallel modification in a given directory, it
> > > sets S_PAR_UNLINK on the inode for that directory.  lookup_open() and
> > > the new lookup_hash_modify() (similar to __lookup_hash()) notice the
> > > flag and take a shared lock on the directory, and rely on a lock-bit in
> > > d_flags, much like parallel lookup relies on DCACHE_PAR_LOOKUP.
> > 
> > I suspect that you could enable this for XFS right now. XFS has internal
> > directory inode locking that should serialise all reads and writes
> > correctly regardless of what the VFS does. So while the VFS might
> > use concurrent updates (e.g. inode_lock_shared() instead of
> > inode_lock() on the dir inode), XFS has an internal metadata lock
> > that will then serialise the concurrent VFS directory modifications
> > correctly....
> 
> I don't think that will work because xfs_readdir doesn't hold the
> directory ILOCK while it runs, which means that readdir will see garbage
> if other threads now only hold inode_lock_shared while they update the
> directory.

It repeatedly picks up and drops the ILOCK as it maps buffers. IOWs,
the ILOCK serialises the lookup of the buffer at the next offset in
the readdir process and then reads the data out while the buffer is
locked. Hence we'll always serialise the buffer lookup and read
against concurrent modifications so we'll always get the next
directory buffer in ascending offset order. We then hold the buffer locked
while we read all the dirents out of it into the user buffer, so
that's also serialised against concurrent modifications.

Also, remember that readdir does not guarantee that it returns all
entries in the face of concurrent modifications that remove entries.
Because the offset of dirents never changes in the XFS data segment,
the only time we might skip an entry is when it has been removed
and it was the last entry in a data block so the entire data block
goes away between readdir buffer lookups. In that case, we just get
the next highest offset buffer returned, and we continue onwards.

If a hole is filled while we are walking, then we'll see the buffer
that was added into the hole. That new buffer is now at the next
highest offset, so readdir finding it is correct and valid
behaviour...

Cheers,

Dave.
Dave Chinner Feb. 25, 2022, 11:16 p.m. UTC | #11
On Thu, Feb 24, 2022 at 03:38:48PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 24, 2022 at 09:31:28AM -0700, Andreas Dilger wrote:
> > On Feb 23, 2022, at 22:57, NeilBrown <neilb@suse.de> wrote:
> > > 
> > > 
> > > I added this:
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > >    /* VFS doesn't initialise i_mode or i_state! */
> > >    VFS_I(ip)->i_mode = 0;
> > >    VFS_I(ip)->i_state = 0;
> > > +    VFS_I(ip)->i_flags |= S_PAR_UPDATE;
> > >    mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > 
> > >    XFS_STATS_INC(mp, vn_active);
> > > 
> > > and ran my highly sophisticated test in an XFS directory:
> > > 
> > > for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done
> 
> I think you want something faster here, like ln to hardlink an existing
> file into the directory.
> 
> > > This doesn't crash - which is a good sign.
> > > While that was going I tried
> > > while : ; do ls -l ; done
> > > 
> > > it sometimes reports garbage for the stat info:
> > > 
> > > total 0
> > > -????????? ? ?    ?    ?            ? 1749
> > > -????????? ? ?    ?    ?            ? 1764
> > > -????????? ? ?    ?    ?            ? 1765
> > > -rw-r--r-- 1 root root 0 Feb 24 16:47 1768
> > > -rw-r--r-- 1 root root 0 Feb 24 16:47 1770
> > > -rw-r--r-- 1 root root 0 Feb 24 16:47 1772
> > > ....
> > > 
> > > I *think* that is bad - probably the "garbage" that you referred to?
> > > 
> > > Obviously I gets lots of 
> > > ls: cannot access '1764': No such file or directory
> > > ls: cannot access '1749': No such file or directory
> > > ls: cannot access '1780': No such file or directory
> > > ls: cannot access '1765': No such file or directory
> > > 
> > > but that is normal and expected when you are creating and deleting
> > > files during the ls.
> > 
> > The "ls -l" output with "???" is exactly the case where the filename is
> > in readdir() but stat() on a file fails due to an unavoidable userspace 
> > race between the two syscalls and the concurrent unlink(). This is
> > probably visible even without the concurrent dirops patch. 
> > 
> > The list of affected filenames even correlates with the reported errors:
> > 1764, 1765, 1769
> > 
> > It looks like everything is working as expected. 
> 
> Here, yes.
> 
> A problem that I saw a week or two ago with online fsck is that an evil
> thread repeatedly link()ing and unlink()ing a file into an otherwise
> empty directory while racing a thread calling readdir() in a loop will
> eventually trigger a corruption report on the directory namecheck
> because the loop in xfs_dir2_sf_getdents that uses sfp->count as a loop
> counter will race with the unlink decrementing sfp->count and run off
> the end of the inline directory data buffer.

Ah, shortform dirs might need the readdir moved inside the

	lock_mode = xfs_ilock_data_map_shared(dp);

section so that the ILOCK is held while readdir is pulling the
dirents out of the inode - there's no buffer lock to serialise that
against concurrent modifications like there are for block/leaf/node
formats.
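
i.e. something like (sketch only):

	lock_mode = xfs_ilock_data_map_shared(dp);
	error = xfs_dir2_sf_getdents(&args, ctx);
	xfs_iunlock(dp, lock_mode);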

Cheers,

Dave.
NeilBrown Feb. 28, 2022, 12:55 a.m. UTC | #12
On Fri, 25 Feb 2022, Darrick J. Wong wrote:
> On Thu, Feb 24, 2022 at 09:31:28AM -0700, Andreas Dilger wrote:
> > On Feb 23, 2022, at 22:57, NeilBrown <neilb@suse.de> wrote:
> > > 
> > > for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done
> 
> I think you want something faster here, like ln to hardlink an existing
> file into the directory.
> 

And probably written in C too..
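
Something like this, perhaps (untested sketch - it hardlinks a
pre-created "seed" file, per your suggestion above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char name[32];
	int i, j, fd;

	fd = open("seed", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("seed");
		return 1;
	}
	close(fd);
	for (i = 0; i < 70; i++) {
		if (fork() == 0) {
			for (j = 1000; j <= 8000; j++) {
				snprintf(name, sizeof(name), "%d", j);
				/* racing with the other workers is
				 * the whole point: */
				link("seed", name);
				unlink(name);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}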

 
> (I am also not a fan of "PAR_UPDATE", since 'par' is already an English
> word that doesn't mean 'parallel'.)

:-)
We already have DCACHE_PAR_LOOKUP for parallel lookups in a directory
(though it is really a lock bit and I think should be named as such).
So it made sense to use DCACHE_PAR_UPDATE for parallel updates.
And then S_PAR_UPDATE for the inode flag to enable this seemed logical.

But I agree that these names are sub-par :-)

NeilBrown
Daire Byrne March 10, 2022, 11:10 a.m. UTC | #13
Just to add that I have also been testing this patch with heavy
production workloads over high latency NFS clients (200ms) and have
not seen any issues so far.

As the latency increases, parallel operations for multiple client
processes becomes ever more important for maintaining good aggregate
throughputs, be it reads, writes or metadata.

With 1000 client processes/threads we see file creates in a single
directory increase from 3 per second to 1200 per second with this
patch.

It solves a real-world problem for us (high latency NFS clients)
without having to redesign our workflows around deeper (hashed)
directory structures.

Much appreciated Neil.

Daire


On Mon, 28 Feb 2022 at 00:56, NeilBrown <neilb@suse.de> wrote:
>
> On Fri, 25 Feb 2022, Darrick J. Wong wrote:
> > On Thu, Feb 24, 2022 at 09:31:28AM -0700, Andreas Dilger wrote:
> > > On Feb 23, 2022, at 22:57, NeilBrown <neilb@suse.de> wrote:
> > > >
> > > > for i in {1..70}; do ( for j in {1000..8000}; do touch $j; rm -f $j ; done ) & done
> >
> > I think you want something faster here, like ln to hardlink an existing
> > file into the directory.
> >
>
> And probably written in C too..
>
>
> > (I am also not a fan of "PAR_UPDATE", since 'par' is already an English
> > word that doesn't mean 'parallel'.)
>
> :-)
> We already have DCACHE_PAR_LOOKUP for parallel lookups in a directory
> (though it is really a lock bit and I think should be named as soch).
> So it made sense to use DCACHE_PAR_UPDATE for parallel updates.
> And then S_PAR_UPDATE for the inode flag to enable this seemed logical.
>
> But I agree that these names are sub-par :-)
>
> NeilBrown

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index c84269c6e8bf..1ca63abbe4b2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1765,6 +1765,7 @@  static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;
 	int err;
+	static struct lock_class_key __key;
 
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
 	if (!dentry)
@@ -1819,6 +1820,8 @@  static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_child);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
 
+	lockdep_init_map(&dentry->d_update_map, "DENTRY_PAR_UPDATE", &__key, 0);
+
 	if (dentry->d_op && dentry->d_op->d_init) {
 		err = dentry->d_op->d_init(dentry);
 		if (err) {
@@ -2579,7 +2582,7 @@  static inline void end_dir_add(struct inode *dir, unsigned n)
 static void d_wait_lookup(struct dentry *dentry)
 {
 	if (d_in_lookup(dentry)) {
-		DECLARE_WAITQUEUE(wait, current);
+		DEFINE_WAIT(wait);
 		add_wait_queue(dentry->d_wait, &wait);
 		do {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -3208,6 +3211,60 @@  void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/**
+ * d_lock_update_nested - lock a dentry before updating
+ * @dentry: the dentry to be locked
+ * @base:   the parent, or %NULL
+ * @name:   the name in that parent, or %NULL
+ * @subclass: lockdep locking class.
+ *
+ * Lock a dentry in a directory on which a shared-lock is held, and
+ * on which parallel updates are permitted.
+ * If the base and name are given, then on success the dentry will still
+ * have that base and name - it will not have raced with rename.
+ * On success, a positive dentry will still be hashed, ensuring there
+ * was no race with unlink.
+ * If they are not given, then on success the dentry will be negative,
+ * which again ensures no race with rename, or unlink.
+ */
+bool d_lock_update_nested(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *name,
+			  int subclass)
+{
+	bool ret = true;
+
+	lock_acquire_exclusive(&dentry->d_update_map, subclass,
+			       0, NULL, _THIS_IP_);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_PAR_UPDATE)
+		___wait_var_event(&dentry->d_flags,
+				  !(dentry->d_flags & DCACHE_PAR_UPDATE),
+				  TASK_UNINTERRUPTIBLE, 0, 0,
+				  (spin_unlock(&dentry->d_lock),
+				   schedule(),
+				   spin_lock(&dentry->d_lock))
+			);
+	if (d_unhashed(dentry) && d_is_positive(dentry)) {
+		/* name was unlinked while we waited */
+		ret = false;
+	} else if (base &&
+		 (dentry->d_parent != base ||
+		  dentry->d_name.hash != name->hash ||
+		  !d_same_name(dentry, base, name))) {
+		/* dentry was renamed - possibly silly-rename */
+		ret = false;
+	} else if (!base && d_is_positive(dentry)) {
+		ret = false;
+	} else {
+		dentry->d_flags |= DCACHE_PAR_UPDATE;
+	}
+	spin_unlock(&dentry->d_lock);
+	if (!ret)
+		lock_map_release(&dentry->d_update_map);
+	return ret;
+}
+EXPORT_SYMBOL(d_lock_update_nested);
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
diff --git a/fs/namei.c b/fs/namei.c
index 3f1829b3ab5b..f8cdbabc23c9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1613,6 +1613,94 @@  static struct dentry *__lookup_hash(const struct qstr *name,
 	return dentry;
 }
 
+/*
+ * Parent directory (base) is not locked.  We take either an exclusive
+ * or shared lock depending on the fs preference, then do a lookup,
+ * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
+ * was taken on the parent.
+ */
+static struct dentry *lookup_hash_update(const struct qstr *name,
+					 struct dentry *base, unsigned int flags,
+					 wait_queue_head_t *wq)
+{
+	struct dentry *dentry;
+	struct inode *dir = base->d_inode;
+	bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
+	int err;
+
+	if (shared)
+		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+	else
+		inode_lock_nested(dir, I_MUTEX_PARENT);
+
+retry:
+	dentry = lookup_dcache(name, base, flags);
+	if (!dentry) {
+		/* Don't create child dentry for a dead directory. */
+		err = -ENOENT;
+		if (unlikely(IS_DEADDIR(dir)))
+			goto out_err;
+
+		if (shared)
+			dentry = d_alloc_parallel(base, name, wq);
+		else
+			dentry = d_alloc(base, name);
+
+		if (!IS_ERR(dentry) &&
+		    (!shared || d_in_lookup(dentry))) {
+			struct dentry *old;
+
+			old = dir->i_op->lookup(dir, dentry, flags);
+			/*
+			 * Note: dentry might still be d_unhashed() and
+			 * d_in_lookup() if the fs will do the lookup
+			 * at 'create' time.
+			 */
+			if (unlikely(old)) {
+				d_lookup_done(dentry);
+				dput(dentry);
+				dentry = old;
+			}
+		}
+	}
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto out_err;
+	}
+	if (shared && !d_lock_update(dentry, base, name)) {
+		/*
+		 * Failed to get lock due to race with unlink or rename
+		 * - try again
+		 */
+		d_lookup_done(dentry);
+		dput(dentry);
+		goto retry;
+	}
+	return dentry;
+
+out_err:
+	if (shared)
+		inode_unlock_shared(dir);
+	else
+		inode_unlock(dir);
+	return ERR_PTR(err);
+}
+
+static void lookup_done_update(struct path *path, struct dentry *dentry,
+			       wait_queue_head_t *wq)
+{
+	struct inode *dir = path->dentry->d_inode;
+	bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
+
+	if (shared) {
+		d_lookup_done(dentry);
+		d_unlock_update(dentry);
+		inode_unlock_shared(dir);
+	} else {
+		inode_unlock(dir);
+	}
+}
+
 static struct dentry *lookup_fast(struct nameidata *nd,
 				  struct inode **inode,
 			          unsigned *seqp)
@@ -3182,16 +3270,32 @@  static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
 {
 	struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
 	struct inode *dir =  nd->path.dentry->d_inode;
+	bool have_par_update = false;
 	int error;
 
 	if (nd->flags & LOOKUP_DIRECTORY)
 		open_flag |= O_DIRECTORY;
 
+	/*
+	 * dentry is negative or d_in_lookup().  If this is a
+	 * shared-lock create we need to set DCACHE_PAR_UPDATE to ensure
+	 * exclusion.
+	 */
+	if ((open_flag & O_CREAT) &&
+	    (dir->i_flags & S_PAR_UPDATE)) {
+		if (!d_lock_update(dentry, NULL, NULL))
+			/* already exists, non-atomic open */
+			return dentry;
+		have_par_update = true;
+	}
+
 	file->f_path.dentry = DENTRY_NOT_SET;
 	file->f_path.mnt = nd->path.mnt;
 	error = dir->i_op->atomic_open(dir, dentry, file,
 				       open_to_namei_flags(open_flag), mode);
 	d_lookup_done(dentry);
+	if (have_par_update)
+		d_unlock_update(dentry);
 	if (!error) {
 		if (file->f_mode & FMODE_OPENED) {
 			if (unlikely(dentry != file->f_path.dentry)) {
@@ -3243,6 +3347,7 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	int error, create_error = 0;
 	umode_t mode = op->mode;
 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+	bool have_par_update = false;
 
 	if (unlikely(IS_DEADDIR(dir_inode)))
 		return ERR_PTR(-ENOENT);
@@ -3317,6 +3422,14 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 			dentry = res;
 		}
 	}
+	/*
+	 * If dentry is negative and this is a shared-lock
+	 * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
+	 */
+	if ((open_flag & O_CREAT) &&
+	    !dentry->d_inode &&
+	    (dir_inode->i_flags & S_PAR_UPDATE))
+		have_par_update = d_lock_update(dentry, NULL, NULL);
 
 	/* Negative dentry, just create the file */
 	if (!dentry->d_inode && (open_flag & O_CREAT)) {
@@ -3336,9 +3449,13 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		error = create_error;
 		goto out_dput;
 	}
+	if (have_par_update)
+		d_unlock_update(dentry);
 	return dentry;
 
 out_dput:
+	if (have_par_update)
+		d_unlock_update(dentry);
 	dput(dentry);
 	return ERR_PTR(error);
 }
@@ -3353,6 +3470,7 @@  static const char *open_last_lookups(struct nameidata *nd,
 	struct inode *inode;
 	struct dentry *dentry;
 	const char *res;
+	bool shared;
 
 	nd->flags |= op->intent;
 
@@ -3393,14 +3511,15 @@  static const char *open_last_lookups(struct nameidata *nd,
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
+	shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
+	if ((open_flag & O_CREAT) && !shared)
 		inode_lock(dir->d_inode);
 	else
 		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
 		fsnotify_create(dir->d_inode, dentry);
-	if (open_flag & O_CREAT)
+	if ((open_flag & O_CREAT) && !shared)
 		inode_unlock(dir->d_inode);
 	else
 		inode_unlock_shared(dir->d_inode);
@@ -3669,7 +3788,8 @@  struct file *do_file_open_root(const struct path *root,
 }
 
 static struct dentry *filename_create(int dfd, struct filename *name,
-				      struct path *path, unsigned int lookup_flags)
+				      struct path *path, unsigned int lookup_flags,
+				      wait_queue_head_t *wq)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct qstr last;
@@ -3701,10 +3821,9 @@  static struct dentry *filename_create(int dfd, struct filename *name,
 	 * Do the final lookup.
 	 */
 	lookup_flags |= LOOKUP_CREATE | LOOKUP_EXCL;
-	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path->dentry, lookup_flags);
+	dentry = lookup_hash_update(&last, path->dentry, lookup_flags, wq);
 	if (IS_ERR(dentry))
-		goto unlock;
+		goto drop_write;
 
 	error = -EEXIST;
 	if (d_is_positive(dentry))
@@ -3726,10 +3845,10 @@  static struct dentry *filename_create(int dfd, struct filename *name,
 	}
 	return dentry;
 fail:
+	lookup_done_update(path, dentry, wq);
 	dput(dentry);
 	dentry = ERR_PTR(error);
-unlock:
-	inode_unlock(path->dentry->d_inode);
+drop_write:
 	if (!err2)
 		mnt_drop_write(path->mnt);
 out:
@@ -3741,27 +3860,33 @@  struct dentry *kern_path_create(int dfd, const char *pathname,
 				struct path *path, unsigned int lookup_flags)
 {
 	struct filename *filename = getname_kernel(pathname);
-	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+	struct dentry *res = filename_create(dfd, filename, path, lookup_flags,
+					     NULL);
 
 	putname(filename);
 	return res;
 }
 EXPORT_SYMBOL(kern_path_create);
 
-void done_path_create(struct path *path, struct dentry *dentry)
+static void __done_path_create(struct path *path, struct dentry *dentry,
+			       wait_queue_head_t *wq)
 {
+	lookup_done_update(path, dentry, wq);
 	dput(dentry);
-	inode_unlock(path->dentry->d_inode);
 	mnt_drop_write(path->mnt);
 	path_put(path);
 }
+void done_path_create(struct path *path, struct dentry *dentry)
+{
+	__done_path_create(path, dentry, NULL);
+}
 EXPORT_SYMBOL(done_path_create);
 
 inline struct dentry *user_path_create(int dfd, const char __user *pathname,
-				struct path *path, unsigned int lookup_flags)
+				       struct path *path, unsigned int lookup_flags)
 {
 	struct filename *filename = getname(pathname);
-	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+	struct dentry *res = filename_create(dfd, filename, path, lookup_flags, NULL);
 
 	putname(filename);
 	return res;
@@ -3840,12 +3965,13 @@  static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	error = may_mknod(mode);
 	if (error)
 		goto out1;
 retry:
-	dentry = filename_create(dfd, name, &path, lookup_flags);
+	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out1;
@@ -3874,7 +4000,7 @@  static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 			break;
 	}
 out2:
-	done_path_create(&path, dentry);
+	__done_path_create(&path, dentry, &wq);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3943,9 +4069,10 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 retry:
-	dentry = filename_create(dfd, name, &path, lookup_flags);
+	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
@@ -3959,7 +4086,7 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
 	}
-	done_path_create(&path, dentry);
+	__done_path_create(&path, dentry, &wq);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -4043,6 +4170,7 @@  int do_rmdir(int dfd, struct filename *name)
 	struct qstr last;
 	int type;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
@@ -4064,8 +4192,7 @@  int do_rmdir(int dfd, struct filename *name)
 	if (error)
 		goto exit2;
 
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+	dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit3;
@@ -4079,9 +4206,9 @@  int do_rmdir(int dfd, struct filename *name)
 	mnt_userns = mnt_user_ns(path.mnt);
 	error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
 exit4:
+	lookup_done_update(&path, dentry, &wq);
 	dput(dentry);
 exit3:
-	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit2:
 	path_put(&path);
@@ -4106,13 +4233,14 @@  SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
  * @dentry:	victim
  * @delegated_inode: returns victim inode, if the inode is delegated.
  *
- * The caller must hold dir->i_mutex.
+ * The caller must either hold a write-lock on dir->i_rwsem, or
+ * a read-lock having atomically set DCACHE_PAR_UPDATE.
  *
  * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
  * return a reference to the inode in delegated_inode.  The caller
  * should then break the delegation on that inode and retry.  Because
  * breaking a delegation may take a long time, the caller should drop
- * dir->i_mutex before doing so.
+ * dir->i_rwsem before doing so.
  *
  * Alternatively, a caller may pass NULL for delegated_inode.  This may
  * be appropriate for callers that expect the underlying filesystem not
@@ -4171,9 +4299,11 @@  EXPORT_SYMBOL(vfs_unlink);
 
 /*
  * Make sure that the actual truncation of the file will occur outside its
- * directory's i_mutex.  Truncate can take a long time if there is a lot of
+ * directory's i_rwsem.  Truncate can take a long time if there is a lot of
  * writeout happening, and we don't want to prevent access to the directory
  * while waiting on the I/O.
+ * If the directory permits (S_PAR_UPDATE), we take a shared lock on the
+ * directory DCACHE_PAR_UPDATE to get exclusive access to the dentry.
  */
 int do_unlinkat(int dfd, struct filename *name)
 {
@@ -4185,6 +4315,7 @@  int do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
@@ -4197,9 +4328,9 @@  int do_unlinkat(int dfd, struct filename *name)
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto exit2;
+
 retry_deleg:
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+	dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
 		struct user_namespace *mnt_userns;
@@ -4218,9 +4349,9 @@  int do_unlinkat(int dfd, struct filename *name)
 		error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
 				   &delegated_inode);
 exit3:
+		lookup_done_update(&path, dentry, &wq);
 		dput(dentry);
 	}
-	inode_unlock(path.dentry->d_inode);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
 	inode = NULL;
@@ -4309,13 +4440,14 @@  int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if (IS_ERR(from)) {
 		error = PTR_ERR(from);
 		goto out_putnames;
 	}
 retry:
-	dentry = filename_create(newdfd, to, &path, lookup_flags);
+	dentry = filename_create(newdfd, to, &path, lookup_flags, &wq);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putnames;
@@ -4328,7 +4460,7 @@  int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 		error = vfs_symlink(mnt_userns, path.dentry->d_inode, dentry,
 				    from->name);
 	}
-	done_path_create(&path, dentry);
+	__done_path_create(&path, dentry, &wq);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -4457,6 +4589,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
 		error = -EINVAL;
@@ -4480,7 +4613,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 		goto out_putnames;
 
 	new_dentry = filename_create(newdfd, new, &new_path,
-					(how & LOOKUP_REVAL));
+				     (how & LOOKUP_REVAL), &wq);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out_putpath;
@@ -4498,7 +4631,7 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
 			 new_dentry, &delegated_inode);
 out_dput:
-	done_path_create(&new_path, new_dentry);
+	__done_path_create(&new_path, new_dentry, &wq);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
 		if (!error) {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 7bc7cf6b26f0..4a6c24aadb79 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1638,6 +1638,9 @@  __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
 	int ret;
 
 	if (flags & LOOKUP_RCU) {
+		if (dentry->d_flags & DCACHE_PAR_UPDATE)
+			/* Pending unlink */
+			return -ECHILD;
 		parent = READ_ONCE(dentry->d_parent);
 		dir = d_inode_rcu(parent);
 		if (!dir)
@@ -1646,6 +1649,9 @@  __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
 		if (parent != READ_ONCE(dentry->d_parent))
 			return -ECHILD;
 	} else {
+		/* Wait for unlink to complete */
+		wait_var_event(&dentry->d_flags,
+			       !(dentry->d_flags & DCACHE_PAR_UPDATE));
 		parent = dget_parent(dentry);
 		ret = reval(d_inode(parent), dentry, flags);
 		dput(parent);
@@ -2308,8 +2314,6 @@  static int nfs_safe_remove(struct dentry *dentry)
 			nfs_drop_nlink(inode);
 	} else
 		error = NFS_PROTO(dir)->remove(dir, dentry);
-	if (error == -ENOENT)
-		nfs_dentry_handle_enoent(dentry);
 	trace_nfs_remove_exit(dir, dentry, error);
 out:
 	return error;
@@ -2323,7 +2327,7 @@  static int nfs_safe_remove(struct dentry *dentry)
 int nfs_unlink(struct inode *dir, struct dentry *dentry)
 {
 	int error;
-	int need_rehash = 0;
+	bool did_set_par_update = false;
 
 	dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
 		dir->i_ino, dentry);
@@ -2337,15 +2341,26 @@  int nfs_unlink(struct inode *dir, struct dentry *dentry)
 		error = nfs_sillyrename(dir, dentry);
 		goto out;
 	}
-	if (!d_unhashed(dentry)) {
-		__d_drop(dentry);
-		need_rehash = 1;
+	/* We must prevent any concurrent open until the unlink
+	 * completes.  ->d_revalidate will wait for DCACHE_PAR_UPDATE
+	 * to clear, but if this happens to a non-parallel update, we
+	 * still want to block opens.  So set DCACHE_PAR_UPDATE
+	 * temporarily.
+	 */
+	if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
+		/* Must have exclusive lock on parent */
+		did_set_par_update = true;
+		dentry->d_flags |= DCACHE_PAR_UPDATE;
 	}
+
 	spin_unlock(&dentry->d_lock);
 	error = nfs_safe_remove(dentry);
 	nfs_dentry_remove_handle_error(dir, dentry, error);
-	if (need_rehash)
-		d_rehash(dentry);
+	if (did_set_par_update) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+		spin_unlock(&dentry->d_lock);
+	}
 out:
 	trace_nfs_unlink_exit(dir, dentry, error);
 	return error;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index a918c3a834b6..f7b58f216274 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -484,6 +484,8 @@  nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 
 		/* We can't support update_atime(), since the server will reset it */
 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		/* Parallel updates to directories are trivial */
+		inode->i_flags |= S_PAR_UPDATE;
 		inode->i_mode = fattr->mode;
 		nfsi->cache_validity = 0;
 		if ((fattr->valid & NFS_ATTR_FATTR_MODE) == 0
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 5fa11e1aca4c..d4db67e82a01 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -453,6 +453,7 @@  nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 	sdentry = NULL;
 	do {
 		int slen;
+		d_unlock_update(sdentry);
 		dput(sdentry);
 		sillycounter++;
 		slen = scnprintf(silly, sizeof(silly),
@@ -470,7 +471,8 @@  nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 		 */
 		if (IS_ERR(sdentry))
 			goto out;
-	} while (d_inode(sdentry) != NULL); /* need negative lookup */
+	} while (!d_lock_update_nested(sdentry, NULL, NULL,
+				       SINGLE_DEPTH_NESTING));
 
 	ihold(inode);
 
@@ -515,6 +517,7 @@  nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 	rpc_put_task(task);
 out_dput:
 	iput(inode);
+	d_unlock_update(sdentry);
 	dput(sdentry);
 out:
 	return error;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f5bba51480b2..6331cf4ac87a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -13,7 +13,9 @@ 
 #include <linux/rcupdate.h>
 #include <linux/lockref.h>
 #include <linux/stringhash.h>
+#include <linux/sched.h>
 #include <linux/wait.h>
+#include <linux/wait_bit.h>
 
 struct path;
 struct vfsmount;
@@ -96,6 +98,8 @@  struct dentry {
 	unsigned long d_time;		/* used by d_revalidate */
 	void *d_fsdata;			/* fs-specific data */
 
+	/* lockdep tracking of DCACHE_PAR_UPDATE locks */
+	struct lockdep_map		d_update_map;
 	union {
 		struct list_head d_lru;		/* LRU list */
 		wait_queue_head_t *d_wait;	/* in-lookup ones only */
@@ -211,6 +215,7 @@  struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_PAR_UPDATE		0x80000000 /* Being created or unlinked with shared lock */
 
 extern seqlock_t rename_lock;
 
@@ -599,4 +604,26 @@  struct name_snapshot {
 void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
 void release_dentry_name_snapshot(struct name_snapshot *);
 
+bool d_lock_update_nested(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *name,
+			  int subclass);
+static inline bool d_lock_update(struct dentry *dentry,
+				 struct dentry *base, const struct qstr *name)
+{
+	return d_lock_update_nested(dentry, base, name, 0);
+}
+
+static inline void d_unlock_update(struct dentry *dentry)
+{
+	if (IS_ERR_OR_NULL(dentry))
+		return;
+	if (dentry->d_flags & DCACHE_PAR_UPDATE) {
+		lock_map_release(&dentry->d_update_map);
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+		spin_unlock(&dentry->d_lock);
+		wake_up_var(&dentry->d_flags);
+	}
+}
+
 #endif	/* __LINUX_DCACHE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..19f902d8d4dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2162,6 +2162,7 @@  struct super_operations {
 #define S_CASEFOLD	(1 << 15) /* Casefolded file */
 #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
 #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PAR_UPDATE	(1 << 18) /* Parallel directory operations supported */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system