
[PATCH/RFC 00/10 v5] Improve scalability of directory operations

Message ID 166147828344.25420.13834885828450967910.stgit@noble.brown

Message

NeilBrown Aug. 26, 2022, 2:10 a.m. UTC
[I made up "v5" - I haven't been counting]

VFS currently holds an exclusive lock on the directory while making
changes: add, remove, rename.
When multiple threads make changes in the one directory, the contention
can be noticeable.
In the case of NFS with a high-latency link, this can easily be
demonstrated.  NFS doesn't really need VFS locking as the server ensures
correctness.

Lustre uses a single(?) directory for object storage, and has patches
for ext4 to support concurrent updates (Lustre accesses ext4 directly,
not via the VFS).

XFS (it is claimed) doesn't its own locking and doesn't need the VFS to
help at all.

This patch series allows filesystems to request a shared lock on
directories and provides serialisation on just the affected name, not the
whole directory.  It changes both the VFS and NFSD to use shared locks
when appropriate, and changes NFS to request shared locks.

The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
which acts as a bit-lock.  The ->d_lock spinlock is taken to set/clear
it, and wait_var_event() is used for waiting.  This flag is set on all
dentries that are part of a directory update, not just when a shared
lock is taken.
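
For illustration, here is a minimal sketch of that bit-lock pattern.
The helper names are mine, not the exact code in the patches:

	static void lock_name_for_update(struct dentry *de)
	{
		spin_lock(&de->d_lock);
		while (de->d_flags & DCACHE_PAR_UPDATE) {
			/* Another updater holds this name - wait for it. */
			spin_unlock(&de->d_lock);
			wait_var_event(&de->d_flags,
				       !(READ_ONCE(de->d_flags) & DCACHE_PAR_UPDATE));
			spin_lock(&de->d_lock);
		}
		de->d_flags |= DCACHE_PAR_UPDATE;
		spin_unlock(&de->d_lock);
	}

	static void unlock_name_for_update(struct dentry *de)
	{
		spin_lock(&de->d_lock);
		de->d_flags &= ~DCACHE_PAR_UPDATE;
		spin_unlock(&de->d_lock);
		wake_up_var(&de->d_flags);	/* wake wait_var_event() sleepers */
	}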

When a shared lock is taken we must use alloc_dentry_parallel() which
needs a wq (wait queue head) that must remain available until the update
is completed.  To make use of concurrent creates, kern_path_create()
would need to be passed a wq.  Rather than the churn required for that,
we use exclusive locking when no wq is provided.
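
Concretely, the choice looks something like this (a sketch using the
in-tree d_alloc_parallel() and lookup_one_len() for illustration - the
series' alloc_dentry_parallel() presumably plays the parallel-lookup
role; start_update() is a hypothetical wrapper, not a function from the
patches):

	static struct dentry *start_update(struct dentry *parent,
					   const struct qstr *name,
					   wait_queue_head_t *wq)
	{
		struct inode *dir = d_inode(parent);

		if (!wq) {
			/* No wq supplied: traditional exclusive lock. */
			inode_lock(dir);
			return lookup_one_len(name->name, parent, name->len);
		}
		/* wq supplied: shared lock plus a parallel-safe lookup. */
		inode_lock_shared(dir);
		return d_alloc_parallel(parent, name, wq);
	}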

One interesting consequence of this is that silly-rename becomes a
little more complex.  As the directory may not be exclusively locked,
the new silly-name needs to be locked (DCACHE_PAR_UPDATE) as well.
A new LOOKUP_SILLY_RENAME is added which helps implement this using
common code.

While testing I found some odd behaviour that was caused by
d_revalidate() racing with rename().  To resolve this I used
DCACHE_PAR_UPDATE to ensure they cannot race any more.
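
In effect, revalidation now takes the same per-name bit-lock that an
update would take.  A sketch, reusing the hypothetical helpers from the
example above (a real implementation must not sleep in RCU-walk mode):

	static int revalidate_serialised(struct dentry *de, unsigned int flags)
	{
		int ret;

		lock_name_for_update(de);	/* excludes a concurrent rename */
		ret = de->d_op->d_revalidate(de, flags);
		unlock_name_for_update(de);
		return ret;
	}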

Testing, review, or other comments would be most welcome,

NeilBrown


---

NeilBrown (10):
      VFS: support parallel updates in the one directory.
      VFS: move EEXIST and ENOENT tests into lookup_hash_update()
      VFS: move want_write checks into lookup_hash_update()
      VFS: move dput() and mnt_drop_write() into done_path_update()
      VFS: export done_path_update()
      VFS: support concurrent renames.
      VFS: hold DCACHE_PAR_UPDATE lock across d_revalidate()
      NFSD: allow parallel creates from nfsd
      VFS: add LOOKUP_SILLY_RENAME
      NFS: support parallel updates in the one directory.


 fs/dcache.c            |  72 ++++-
 fs/namei.c             | 616 ++++++++++++++++++++++++++++++++---------
 fs/nfs/dir.c           |  28 +-
 fs/nfs/fs_context.c    |   6 +-
 fs/nfs/internal.h      |   3 +-
 fs/nfs/unlink.c        |  51 +++-
 fs/nfsd/nfs3proc.c     |  28 +-
 fs/nfsd/nfs4proc.c     |  29 +-
 fs/nfsd/nfsfh.c        |   9 +
 fs/nfsd/nfsproc.c      |  29 +-
 fs/nfsd/vfs.c          | 177 +++++-------
 include/linux/dcache.h |  28 ++
 include/linux/fs.h     |   5 +-
 include/linux/namei.h  |  39 ++-
 14 files changed, 799 insertions(+), 321 deletions(-)


Comments

John Stoffel Aug. 26, 2022, 2:42 p.m. UTC | #1
>>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:

NeilBrown> [I made up "v5" - I haven't been counting]

My first comments, but I'm not a serious developer...

NeilBrown> VFS currently holds an exclusive lock on the directory while making
NeilBrown> changes: add, remove, rename.
NeilBrown> When multiple threads make changes in the one directory, the contention
NeilBrown> can be noticeable.
NeilBrown> In the case of NFS with a high latency link, this can easily be
NeilBrown> demonstrated.  NFS doesn't really need VFS locking as the server ensures
NeilBrown> correctness.

NeilBrown> Lustre uses a single(?) directory for object storage, and has patches
NeilBrown> for ext4 to support concurrent updates (Lustre accesses ext4 directly,
NeilBrown> not via the VFS).

NeilBrown> XFS (it is claimed) doesn't its own locking and doesn't need the VFS to
NeilBrown> help at all.

This sentence makes no sense to me... I assume you meant to say "...does
its own locking..."

NeilBrown> This patch series allows filesystems to request a shared lock on
NeilBrown> directories and provides serialisation on just the affected name, not the
NeilBrown> whole directory.  It changes both the VFS and NFSD to use shared locks
NeilBrown> when appropriate, and changes NFS to request shared locks.

Are there any performance results?  Why wouldn't we just do shared
locking across all VFS-based filesystems?

NeilBrown> The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
NeilBrown> which acts as a bit-lock.  The ->d_lock spinlock is taken to set/clear
NeilBrown> it, and wait_var_event() is used for waiting.  This flag is set on all
NeilBrown> dentries that are part of a directory update, not just when a shared
NeilBrown> lock is taken.

NeilBrown> When a shared lock is taken we must use alloc_dentry_parallel() which
NeilBrown> needs a wq (wait queue head) that must remain available until the update
NeilBrown> is completed.  To make use of concurrent creates, kern_path_create()
NeilBrown> would need to be passed a wq.  Rather than the churn required for that,
NeilBrown> we use exclusive locking when no wq is provided.

Is this a per-operation wq or a per-directory wq?  Can there be issues
if someone does something silly like having 1,000 directories, all of
which have multiple processes making parallel changes?  

Does it degrade gracefully if a wq can't be allocated?  

NeilBrown Aug. 26, 2022, 11:30 p.m. UTC | #2
On Sat, 27 Aug 2022, John Stoffel wrote:
> >>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:
> 
> NeilBrown> [I made up "v5" - I haven't been counting]
> 
> My first comments, but I'm not a serious developer...
> 
> NeilBrown> VFS currently holds an exclusive lock on the directory while making
> NeilBrown> changes: add, remove, rename.
> NeilBrown> When multiple threads make changes in the one directory, the contention
> NeilBrown> can be noticeable.
> NeilBrown> In the case of NFS with a high latency link, this can easily be
> NeilBrown> demonstrated.  NFS doesn't really need VFS locking as the server ensures
> NeilBrown> correctness.
> 
> NeilBrown> Lustre uses a single(?) directory for object storage, and has patches
> NeilBrown> for ext4 to support concurrent updates (Lustre accesses ext4 directly,
> NeilBrown> not via the VFS).
> 
> NeilBrown> XFS (it is claimed) doesn't its own locking and doesn't need the VFS to
> NeilBrown> help at all.
> 
> This sentence makes no sense to me... I assume you meant to say "...does
> its own locking..."

Thanks - you are correct.  "does its own locking".

> 
> NeilBrown> This patch series allows filesystems to request a shared lock on
> NeilBrown> directories and provides serialisation on just the affected name, not the
> NeilBrown> whole directory.  It changes both the VFS and NFSD to use shared locks
> NeilBrown> when appropriate, and changes NFS to request shared locks.
> 
> Are there any performance results?  Why wouldn't we just do shared
> locking across all VFS-based filesystems?

Daire Byrne has done some tests with NFS clients to an NFS server which
re-exports mounts from another server - so there are a couple of levels
of locking that can be removed.  At least one of these levels has
significant network latency (100ms or so, I think).  The results are much
what you would expect: many more file creations per second are
possible - from 15 creates-per-second up to 121 creates-per-second in one
test.
https://lore.kernel.org/linux-nfs/CAPt2mGNjWXad6e7nSUTu=0ez1qU1wBNegrntgHKm5hOeBs5gQA@mail.gmail.com/


> 
> NeilBrown> The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
> NeilBrown> which acts as a bit-lock.  The ->d_lock spinlock is taken to set/clear
> NeilBrown> it, and wait_var_event() is used for waiting.  This flag is set on all
> NeilBrown> dentries that are part of a directory update, not just when a shared
> NeilBrown> lock is taken.
> 
> NeilBrown> When a shared lock is taken we must use alloc_dentry_parallel() which
> NeilBrown> needs a wq (wait queue head) that must remain available until the update
> NeilBrown> is completed.  To make use of concurrent creates, kern_path_create()
> NeilBrown> would need to be passed a wq.  Rather than the churn required for that,
> NeilBrown> we use exclusive locking when no wq is provided.
> 
> Is this a per-operation wq or a per-directory wq?  Can there be issues
> if someone does something silly like having 1,000 directories, all of
> which have multiple processes making parallel changes?  

It is per-operation, though I expect to change that so the wq is taken
from a pool of shared wait queues.

Wait queues can be shared quite cheaply.  There is spin-lock contention
when multiple threads add/remove waiters to/from the queues.  Having
more queues in a pool than cores, and randomly selecting queues from the
pool, can keep that under control.
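
A sketch of that pooling idea (the size and names are invented for
illustration):

	#define NAME_WQ_POOL_SIZE 256	/* comfortably more than the core count */
	static wait_queue_head_t name_wq_pool[NAME_WQ_POOL_SIZE];
	/* each head is initialised with init_waitqueue_head() at boot */

	static wait_queue_head_t *get_name_wq(void)
	{
		/* Random selection spreads waiters across the pool so that
		 * contention on any one head's spinlock stays low.
		 */
		return &name_wq_pool[get_random_u32() % NAME_WQ_POOL_SIZE];
	}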

If there are dozens of waiters or more, then a wakeup might run more
slowly (and hold the lock for longer), but in this case wakeups should
be rare.

Most filesystem operations are uncontended at the name level, e.g. it is
rare that two threads will try to create the same name at the same time,
or that one thread looks up a name while another is unlinking it.  These
are the only times that wakeups would happen, so sharing a pool among
all filesystem accesses is unlikely to be a problem.

> 
> Does it degrade gracefully if a wq can't be allocated?  

In the current code, the wq is allocated on the stack.  I'm probably
going to change to a global allocation.  In either case, there is no
risk of allocation failure during a filesystem operation.

Thanks for the questions,
NeilBrown

