Message ID: 20250206054504.2950516-1-neilb@suse.de (mailing list archive)
Series: RFC: Allow concurrent and async changes in a directory
On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:
> This is my latest attempt at removing the requirement for an exclusive
> lock on a directory which performing updates in this. This version,
> inspired by Dave Chinner, goes a step further and allows async updates.
>
> The inode operation still requires the inode lock, at least a shared
> lock, but may return -EINPROGRESS and then continue asynchronously
> without needing any ongoing lock on the directory.
>
> An exclusive lock on the dentry is held across the entire operation.
>
> This change requires various extra checks. rmdir must ensure there is
> no async creation still happening. rename between directories must
> ensure none of the relevant ancestors are undergoing async rename.
> There may be other checks that I need to consider - mounting?

Mounting takes an exclusive lock on the target inode in do_lock_mount()
and finish_automount(). As long as dont_mount() can't happen
asynchronously in vfs_rmdir(), vfs_unlink() or vfs_rename() it should
be fine.

> One other important change since my previous posting is that I've
> dropped the idea of taking a separate exclusive lock on the directory
> when the fs doesn't support shared locking. This cannot work as it
> doesn't prevent lookups and filesystems don't expect a lookup while
> they are changing a directory. So instead we need to choose between
> exclusive or shared for the inode on a case-by-case basis.

Which is possibly fine if we do it similar to what I suggested in the
series. As it stands with the separate methods it's a no-go for me.
But that's a solvable problem, I think.

> To make this choice we divide all ops into four groups: create, remove,
> rename, open/create. If an inode has no operations in the group that
> require an exclusive lock, then a flag is set on the inode so that
> various code knows that a shared lock is sufficient. If the flag is
> not set, an exclusive lock is obtained.
>
> I've also added rename handling and converted NFS to use all _async ops.
>
> The motivation for this comes from the general increase in scale of
> systems. We can support very large directories and many-core systems,
> and applications that choose to use large directories can hit
> unnecessary contention.
>
> NFS can easily hit this when used over a high-latency link.
> Lustre already has code to allow concurrent directory updates in the
> back-end filesystem (ldiskfs - a slightly modified ext4).
> Lustre developers believe this would also benefit the client-side
> filesystem with large core counts.
>
> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations. I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used. I
> would welcome any advice on that matter, though I hope to find time to
> explore myself. For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.
>
> Test status: only light testing. It doesn't easily blow up, but lockdep
> complains that repeated calls to d_update_wait() are bad, even though
> it has balanced acquire and release calls. Weird?
> Thanks,
> NeilBrown
>
> [PATCH 01/19] VFS: introduce vfs_mkdir_return()
> [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
> [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
> [PATCH 04/19] VFS: change kern_path_locked() and
> [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
> [PATCH 06/19] VFS: repack DENTRY_ flags.
> [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
> [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
> [PATCH 09/19] VFS: add _async versions of the various directory
> [PATCH 10/19] VFS: introduce inode flags to report locking needs for
> [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
> [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
> [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
> [PATCH 14/19] VFS: Ensure no async updates happening in directory
> [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
> [PATCH 16/19] VFS: add lookup_and_lock_rename()
> [PATCH 17/19] nfsd: use lookup_and_lock_one() and
> [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
> [PATCH 19/19] nfs: switch to _async for all directory ops.
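Piecing the cover letter together, the intended calling pattern looks
roughly like the sketch below. This is a hedged illustration, not code
from the series: S_SHARED_CREATE, dentry_lock_update() and
dentry_unlock_update(), wait_for_dir_update(), mkdir_done() and the
->mkdir_async() signature are invented placeholders standing in for
what patches 09, 10, 11 and 18 actually add; only the inode_lock*()
helpers are real mainline API.

#include <linux/fs.h>
#include <linux/dcache.h>

/* Placeholders invented for illustration; cf. patches 10, 11 and 18. */
#define S_SHARED_CREATE	0x80000000	/* "creates may use a shared lock" */
void dentry_lock_update(struct dentry *dentry);
void dentry_unlock_update(struct dentry *dentry);
int wait_for_dir_update(struct dentry *dentry);

/* Placeholder completion callback: record status and wake the waiter. */
static void mkdir_done(struct dentry *dentry, int status)
{
}

static int mkdir_sketch(struct inode *dir, struct dentry *dentry,
			umode_t mode)
{
	/* Cf. patch 10: a per-inode flag reports whether ops in the
	 * "create" group can run under a shared directory lock. */
	bool shared = dir->i_flags & S_SHARED_CREATE;
	int err;

	if (shared)
		inode_lock_shared(dir);
	else
		inode_lock(dir);

	/* Cf. patch 11: the dentry itself stays exclusively locked for
	 * the whole operation, outliving the directory lock. */
	dentry_lock_update(dentry);

	/* Signature assumed; the series makes mkdir_async an inode op. */
	err = dir->i_op->mkdir_async(dir, dentry, mode, mkdir_done);

	/* The directory lock is only needed to start the operation. */
	if (shared)
		inode_unlock_shared(dir);
	else
		inode_unlock(dir);

	/* Until io_uring integration exists, an async op is simply
	 * waited for, per the cover letter. */
	if (err == -EINPROGRESS)
		err = wait_for_dir_update(dentry);

	dentry_unlock_update(dentry);
	return err;
}

The point of the split is that the directory rwsem is held only long
enough to launch the operation, while the per-dentry lock serialises
concurrent updates to the same name.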
>>>>> "NeilBrown" == NeilBrown <neilb@suse.de> writes: > This is my latest attempt at removing the requirement for an exclusive > lock on a directory which performing updates in this. This version, > inspired by Dave Chinner, goes a step further and allow async updates. This initial sentence reads poorly to me. I think you maybe are trying to say: This is my latest attempt to removing the requirement for writers to have an exclusive lock on a directory when performing updates on entries in that directory. This allows for parallel updates by multiple processes (connections? hosts? clients?) to improve scaling of large filesystems. I get what you're trying to do here, and I applaud it! I just struggled over the intro here. > The inode operation still requires the inode lock, at least a shared > lock, but may return -EINPROGRES and then continue asynchronously > without needing any ongoing lock on the directory. > An exclusive lock on the dentry is held across the entire operation. > This change requires various extra checks. rmdir must ensure there is > no async creation still happening. rename between directories must > ensure non of the relevant ancestors are undergoing async rename. There > may be or checks that I need to consider - mounting? > One other important change since my previous posting is that I've > dropped the idea of taking a separate exclusive lock on the directory > when the fs doesn't support shared locking. This cannot work as it > doeesn't prevent lookups and filesystems don't expect a lookup while > they are changing a directory. So instead we need to choose between > exclusive or shared for the inode on a case-by-case basis. > To make this choice we divide all ops into four groups: create, remove, > rename, open/create. If an inode has no operations in the group that > require an exclusive lock, then a flag is set on the inode so that > various code knows that a shared lock is sufficient. If the flag is not > set, an exclusive lock is obtained. > I've also added rename handling and converted NFS to use all _async ops. > The motivation for this comes from the general increase in scale of > systems. We can support very large directories and many-core systems > and applications that choose to use large directories can hit > unnecessary contention. > NFS can easily hit this when used over a high-latency link. > Lustre already has code to allow concurrent directory updates in the > back-end filesystem (ldiskfs - a slightly modified ext4). > Lustre developers believe this would also benefit the client-side > filesystem with large core counts. > The idea behind the async support is to eventually connect this to > io_uring so that one process can launch several concurrent directory > operations. I have not looked deeply into io_uring and cannot be > certain that the interface I've provided will be able to be used. I > would welcome any advice on that matter, though I hope to find time to > explore myself. For now if any _async op returns -EINPROGRESS we simply > wait for the callback to indicate completion. > Test status: only light testing. It doesn't easily blow up, but lockdep > complains that repeated calls to d_update_wait() are bad, even though > it has balanced acquire and release calls. Weird? 
On Fri, 07 Feb 2025, John Stoffel wrote:
> >>>>> "NeilBrown" == NeilBrown <neilb@suse.de> writes:
>
> > This is my latest attempt at removing the requirement for an exclusive
> > lock on a directory which performing updates in this. This version,
> > inspired by Dave Chinner, goes a step further and allows async updates.
>
> This initial sentence reads poorly to me. I think you maybe are trying
> to say:
>
>     This is my latest attempt at removing the requirement for writers
>     to have an exclusive lock on a directory when performing updates
>     on entries in that directory. This allows for parallel updates by
>     multiple processes (connections? hosts? clients?) to improve
>     scaling of large filesystems.
>
> I get what you're trying to do here, and I applaud it! I just struggled
> over the intro here.

Yes, my intro was rather poorly worded. I think your version is much
better. Thanks.

NeilBrown
On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:
> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations. I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used. I
> would welcome any advice on that matter, though I hope to find time to
> explore myself. For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.

OK, after looking through that and playing around with the locking
scheme of yours:

Separating directory rwsem for reads/modifications from locking of
individual dentries may be feasible, but it needs to be a lot more
careful about the states it sleeps in. Your current variant is rife
with deadlocks; for the "wait on dentry itself" case it's probably
possible to avoid, with some care; for "wait on parent" it's really not
an option.

Quite a bit of the headache comes from the fact that NFS et al. are
playing silly buggers with "OK, we see that lookup is for <operation>;
skip it, the call of the actual method will do the right thing". The
trouble is, d_lookup_done() on a not-really-looked-up dentry is fine
under an exclusive lock on the parent, but only because there won't be
a d_alloc_parallel() on the same name until we drop that exclusive
lock. Your scheme, OTOH, has a hard dependency upon those suckers
staying visible to d_alloc_parallel() until the actual operation is
done. Which means that this code, including the methods, is exposed to
in-lookup dentries.

What's more, a similar dependency is there for dentries getting
unhashed between the lookup and the end of the operation - something
which NFS cheerfully violates. If the method's argument gets hit with
d_drop() and d_rehash(), there's a window where it won't be found in
the dcache, leaving no indication that it's being operated upon.
Currently we are fine - the exclusive lock on the parent means that on
a dcache miss we try to grab the parent shared and repeat the dcache
lookup when we get that. Your variant does not have such exclusion -
the parent is held shared and the child dentry involved is not there to
be found during the d_drop()/d_rehash() window.

IOW, your in-update state might make sense, but not in the way it's
done at the moment - it's too brittle. And the part about async tree
topology modifications is bloody insane, IMO. I won't believe that to
be feasible until I see the algorithm and a proof of correctness;
preferably _before_ the actual code.
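To spell out the window being described: the fragment below is purely
illustrative, not code from the series or from NFS. Only d_drop() and
d_rehash() are real mainline helpers; the surrounding function and the
interleaving described in its comments are assumptions about how an
NFS-style method behaves when the parent is held only shared.

#include <linux/dcache.h>

/* Illustrative only: an NFS-style method body that briefly unhashes
 * its dentry while the parent directory rwsem is held only SHARED. */
static void method_body_sketch(struct dentry *dentry)
{
	d_drop(dentry);		/* unhash: invisible to dcache lookups */

	/*
	 * Window: a concurrent lookup of the same name misses in the
	 * dcache here.  Under today's exclusive-lock regime it would
	 * take the parent rwsem shared, block until the update is done,
	 * then retry and find the rehashed dentry.  With the parent
	 * held only shared it proceeds at once, and d_alloc_parallel()
	 * installs a brand-new in-lookup dentry for the same name, with
	 * nothing to show that the name is being operated upon.
	 */

	d_rehash(dentry);	/* hash back in: visible again */
}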