
[v3,0/2] rcu-based inode lookup for iget*

Message ID 20240611101633.507101-1-mjguzik@gmail.com (mailing list archive)
Series: rcu-based inode lookup for iget*

Message

Mateusz Guzik June 11, 2024, 10:16 a.m. UTC
I think the blurb that belongs here also needs to be in the commit
message for the first patch, so here it is copy-pasted with some
modifications at the end:

[quote]
Instantiating a new inode normally takes the global inode hash lock
twice:
1. once to check if it happens to already be present
2. once to add it to the hash

The back-to-back lock/unlock pattern is known to degrade performance
significantly, which is further exacerbated if the hash is heavily
populated (long chains to walk, extending hold time). Arguably the hash
sizing and hashing algorithm need to be revisited, but that's beyond the
scope of this patch.

A long term fix would introduce finer-grained locking. This has been
attempted several times, most recently in [1], but the effort appears
stalled.

A simpler idea, which solves the majority of the problem and which may
be good enough for the time being, is to use RCU for the initial lookup.
Basic RCU support is already present in the hash. As this is a temporary
measure I tried to keep the change as small as possible.

iget_locked consumers (notably ext4) get away without any changes
because the inode comparison method is built in.
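
For reference, the typical iget_locked caller keeps working as-is (the
function name below is illustrative, but this is the standard shape):

	struct inode *example_iget(struct super_block *sb, unsigned long ino)
	{
		struct inode *inode;

		inode = iget_locked(sb, ino);
		if (!inode)
			return ERR_PTR(-ENOMEM);
		if (!(inode->i_state & I_NEW))
			return inode;	/* already hashed and fully set up */

		/* ... read the inode from disk and populate it ... */

		unlock_new_inode(inode);
		return inode;
	}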

iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
removing the locking adds more problems (the inode can be changing
underneath the callback), it is not safe to assume all filesystems cope.
Thus iget5_locked_rcu, ilookup5_rcu and ilookup5_nowait_rcu get added,
requiring manual conversion.

In order to reduce code duplication, find_inode and find_inode_fast grow
an argument indicating whether the inode hash lock is held, which is
passed down should sleeping be necessary. They always take
rcu_read_lock, which is redundant when the hash lock is held but
harmless. Doing it conditionally would reduce readability for no real
gain that I can see. RCU-like restrictions were already imposed on the
callbacks due to the hash spinlock being held.
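
The resulting helper shape is roughly the following (a sketch only; the
argument name and the freeing/waiting details are as in patch 1):

	static struct inode *find_inode_fast(struct super_block *sb,
					     struct hlist_head *head,
					     unsigned long ino, bool locked)
	{
		struct inode *inode;

		if (locked)
			lockdep_assert_held(&inode_hash_lock);

		rcu_read_lock();	/* redundant if locked, but harmless */
		hlist_for_each_entry_rcu(inode, head, i_hash) {
			if (inode->i_ino != ino || inode->i_sb != sb)
				continue;
			/* ... i_lock, I_FREEING/I_WILL_FREE handling; "locked"
			 * is passed down if waiting becomes necessary ... */
			rcu_read_unlock();
			return inode;
		}
		rcu_read_unlock();
		return NULL;
	}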

There is a real cache-busting workload scanning millions of files in
parallel (it's a backup server thing), where the initial lookup is
guaranteed to fail, resulting in the two lock acquisitions.

Implemented below is a synthetic benchmark which provides the same
behavior. [I shall note the workload does not run on Linux; it was
causing trouble elsewhere. The benchmark below was used while addressing
said problems and was found to adequately represent the real workload.]

Total real time fluctuates by 1-2s.

With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB RAM vm:
[/quote]

Specific results:

ext4 (needed mkfs.ext4 -N 24000000):
before:	3.77s user 890.90s system 1939% cpu 46.118 total
after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)

btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

btrfs bottlenecks itself on its own locks here.

Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
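
For orientation, each benchmark thread does roughly the following over
its private tree (a simplified sketch, not the actual fstree source; the
path layout is assumed for illustration):

	#include <stdio.h>
	#include <sys/stat.h>

	static void walk_tree(const char *root)
	{
		char path[4096];
		struct stat st;

		/* 1000 dirs * 1000 files, stat(2) on every one of them */
		for (int d = 0; d < 1000; d++) {
			for (int f = 0; f < 1000; f++) {
				snprintf(path, sizeof(path), "%s/%d/%d", root, d, f);
				if (stat(path, &st) != 0)
					perror(path);
			}
		}
	}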

Filesystem rundown is as follows:
- ext4 is covered implicitly (no changes needed)
- xfs does not use the inode hash
- bcachefs is out of the picture as Kent decided to implement his own
  inode hashing based on rhashtable, for now private to his fs.

I have not looked at others.

[1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/

v3:
- export new routines with _GPL
- don't use the extern keyword
- add ilookup5_rcu to follow iget5_locked scheme

v2:
- add argument lists to new routines
- assert the inode hash lock is not held as applicable
- real btrfs patch included

Mateusz Guzik (2):
  vfs: add rcu-based find_inode variants for iget ops
  btrfs: use iget5_locked_rcu

 fs/btrfs/inode.c   |   2 +-
 fs/inode.c         | 119 ++++++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h |  10 +++-
 3 files changed, 112 insertions(+), 19 deletions(-)

Comments

Josef Bacik June 11, 2024, 4:49 p.m. UTC | #1
On Tue, Jun 11, 2024 at 12:16:30PM +0200, Mateusz Guzik wrote:
> I think the blurb that belongs here also needs to be in the commit
> message for the first patch, so here it is copy-pasted with some
> modifications at the end:
> 
> [quote]
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
> 
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably the hash
> sizing and hashing algorithm need to be revisited, but that's beyond
> the scope of this patch.
> 
> A long term fix would introduce finer-grained locking. This has been
> attempted several times, most recently in [1], but the effort appears
> stalled.
> 
> A simpler idea, which solves the majority of the problem and which may
> be good enough for the time being, is to use RCU for the initial
> lookup. Basic RCU support is already present in the hash. As this is a
> temporary measure I tried to keep the change as small as possible.
> 
> iget_locked consumers (notably ext4) get away without any changes
> because the inode comparison method is built in.
> 
> iget5_locked and ilookup5_nowait consumers pass a custom callback.
> Since removing the locking adds more problems (the inode can be
> changing underneath the callback), it is not safe to assume all
> filesystems cope. Thus iget5_locked_rcu, ilookup5_rcu and
> ilookup5_nowait_rcu get added, requiring manual conversion.
> 
> In order to reduce code duplication, find_inode and find_inode_fast
> grow an argument indicating whether the inode hash lock is held, which
> is passed down should sleeping be necessary. They always take
> rcu_read_lock, which is redundant when the hash lock is held but
> harmless. Doing it conditionally would reduce readability for no real
> gain that I can see. RCU-like restrictions were already imposed on the
> callbacks due to the hash spinlock being held.
> 
> There is a real cache-busting workload scanning millions of files in
> parallel (it's a backup server thing), where the initial lookup is
> guaranteed to fail, resulting in the two lock acquisitions.
> 
> Implemented below is a synthetic benchmark which provides the same
> behavior. [I shall note the workload does not run on Linux; it was
> causing trouble elsewhere. The benchmark below was used while
> addressing said problems and was found to adequately represent the
> real workload.]
> 
> Total real time fluctuates by 1-2s.
> 
> With 20 threads each walking a dedicated 1000 dirs * 1000 files
> directory tree to stat(2) on a 32 core + 24GB RAM vm:
> [/quote]
> 
> Specific results:
> 
> ext4 (needed mkfs.ext4 -N 24000000):
> before:	3.77s user 890.90s system 1939% cpu 46.118 total
> after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
> 
> btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
> before: 3.54s user 892.30s system 1966% cpu 45.549 total
> after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
> 
> btrfs bottlenecks itself on its own locks here.
> 
> Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
> 
> Filesystem rundown is as follows:
> - ext4 is covered implicitly (no changes needed)
> - xfs does not use the inode hash
> - bcachefs is out of the picture as Kent decided to implement his own
>   inode hashing based on rhashtable, for now private to his fs.
> 
> I have not looked at others.
> 
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/
> 

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef