[GIT,PULL] bcachefs changes for 6.12-rc1

Message ID dtolpfivc4fvdfbqgmljygycyqfqoranpsjty4sle7ouydycez@aw7v34oibdhm (mailing list archive)
State New

Pull-request

git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21

Message

Kent Overstreet Sept. 21, 2024, 7:27 p.m. UTC
Relatively quiet cycle here - which is good, because 6.11 was huge.

I think we'll be able to take off EXPERIMENTAL within a year; see
below for more.

Cheers,
Kent

The following changes since commit 16005147cca41a0f67b5def2a4656286f8c0db4a:

  bcachefs: Don't delete open files in online fsck (2024-09-09 09:41:47 -0400)

are available in the Git repository at:

  git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21

for you to fetch changes up to 025c55a4c7f11ea38521c6e797f3192ad8768c93:

  bcachefs: return err ptr instead of null in read sb clean (2024-09-21 11:39:49 -0400)

----------------------------------------------------------------
bcachefs changes for 6.12-rc1

rcu_pending, btree key cache rework: this solves lock contention in the
key cache, eliminating the biggest source of the srcu lock hold time
warnings, and drastically improving performance on some metadata-heavy
workloads - on multithreaded creates we're now 3-4x faster than xfs.

We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
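
Roughly, the shape of that change - illustrative only; the table and
field names here are approximations rather than what's actually in
fs/bcachefs/fs.c:

  /*
   * Sketch: a per-filesystem rhashtable keyed by (subvolume, inode
   * number), replacing insert_inode_hash()/ilookup() on the single
   * global inode_hashtable.
   */
  static const struct rhashtable_params bch2_vfs_inodes_params = {
          .head_offset         = offsetof(struct bch_inode_info, hash),
          .key_offset          = offsetof(struct bch_inode_info, ei_inum),
          .key_len             = sizeof(subvol_inum),
          .automatic_shrinking = true,
  };

  ret = rhashtable_init(&c->vfs_inodes_table, &bch2_vfs_inodes_params);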

for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open-coded
"subvolume_get_snapshot()" calls and also fixing another source of srcu
lock hold time warnings, by running each loop iteration in its own
transaction (as the existing for_each_btree_key() does).
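
For illustration, the usage pattern looks roughly like this (a sketch,
not verbatim from the tree; emit_dirent() is a stand-in and the
argument list is approximate):

  /*
   * Walk the dirents of one subvolume; each loop body runs in its own
   * transaction, so the srcu read lock is dropped and re-taken between
   * iterations instead of being held across the whole walk.
   */
  ret = bch2_trans_run(c,
          for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_dirents,
                          POS(inum.inum, pos),
                          POS(inum.inum, U64_MAX),
                          inum.subvol, 0, k, ({
                  emit_dirent(c, k);      /* stand-in for per-key work */
          })));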

More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node locks.

Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.

Rework of btree node pinning, which we use in backpointers fsck. The old
hacky implementation, where the shrinker just skipped over nodes in the
pinned range, was causing OOMs; instead we now use another shrinker with
a much higher seeks number for pinned nodes.
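
The mechanism, roughly - this is an illustration of the kernel shrinker
API rather than the actual bcachefs code; the callback names and seeks
value are stand-ins:

  static int btree_cache_pinned_shrinker_init(struct btree_cache *bc)
  {
          /*
           * A second shrinker whose ->seeks is much higher than
           * DEFAULT_SEEKS, so pinned nodes are reclaimed only under real
           * memory pressure instead of being skipped outright (which is
           * what let them pile up until OOM).
           */
          struct shrinker *s = shrinker_alloc(0, "btree_cache-pinned");
          if (!s)
                  return -ENOMEM;

          s->count_objects = pinned_nodes_count;  /* stand-in callback */
          s->scan_objects  = pinned_nodes_scan;   /* stand-in callback */
          s->seeks         = 8 * DEFAULT_SEEKS;   /* illustrative value */
          s->private_data  = bc;
          shrinker_register(s);
          return 0;
  }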

Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data to a
specific target.

Use __GFP_ACCOUNT and __GFP_RECLAIMABLE for btree node and key cache
allocations.

Idmap mounts are now supported - Hongbo.

Rename whiteouts are now supported - Hongbo.

Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.

Status, and when will we be taking off experimental:
----------------------------------------------------

Going by critical, user-facing bugs getting found and fixed, we're
nearly there. There are a couple of key items that need to be finished
before we can take off the experimental label:

- The end-user experience is still pretty painful when the root
  filesystem needs a fsck; we need some form of limited self healing so
  that necessary repair gets run automatically. Errors (by type) are
  recorded in the superblock, so what we need to do next is convert
  remaining inconsistent() errors to fsck() errors (so that all runtime
  inconsistencies are logged in the superblock), and we need to go
  through the list of fsck errors and classify them by which fsck passes
  are needed to repair them.

- We need comprehensive torture testing for all our repair paths, to
  shake out remaining bugs there. Thomas has been working on the tooling
  for this, so this is coming soonish.

Slightly less critical items:

- We need to improve the end-user experience for degraded mounts: right
  now, a degraded root filesystem means dropping to an initramfs shell
  or somehow inputting mount options manually (we don't want to allow
  degraded mounts without some form of user input, except on unattended
  servers) - we need the mount helper to prompt the user to allow
  mounting degraded, and make sure this works with systemd.

- Scalability: we have users running 100TB+ filesystems, and that's
  effectively the limit right now due to fsck times. We have some
  reworks in the pipeline to address this; we're aiming to make
  petabyte-sized filesystems practical.

----------------------------------------------------------------
Alan Huang (8):
      bcachefs: Remove unused parameter of bkey_mantissa
      bcachefs: Remove unused parameter of bkey_mantissa_bits_dropped
      bcachefs: Remove dead code in __build_ro_aux_tree
      bcachefs: Convert open-coded extra computation to helper
      bcachefs: Minimize the search range used to calculate the mantissa
      bcachefs: Remove the prev array stuff
      bcachefs: Remove unused parameter
      bcachefs: Refactor bch2_bset_fix_lookup_table

Alyssa Ross (1):
      bcachefs: Fix negative timespecs

Chen Yufan (1):
      bcachefs: Convert to use jiffies macros

Diogo Jahchan Koike (1):
      bcachefs: return err ptr instead of null in read sb clean

Feiko Nanninga (1):
      bcachefs: Fix sysfs rebalance duration waited formatting

Hongbo Li (2):
      bcachefs: support idmap mounts
      bcachefs: Fix compilation error for bch2_sb_member_alloc

Julian Sun (4):
      bcachefs: remove the unused macro definition
      bcachefs: fix macro definition allocate_dropping_locks_errcode
      bcachefs: fix macro definition allocate_dropping_locks
      bcachefs: remove the unused parameter in macro bkey_crc_next

Kent Overstreet (68):
      inode: make __iget() a static inline
      bcachefs: switch to rhashtable for vfs inodes hash
      bcachefs: Fix deadlock in __wait_on_freeing_inode()
      lib/generic-radix-tree.c: genradix_ptr_inlined()
      lib/generic-radix-tree.c: add preallocation
      bcachefs: rcu_pending
      bcachefs: rcu_pending now works in userspace
      bcachefs: Rip out freelists from btree key cache
      bcachefs: key cache can now allocate from pending
      bcachefs: data_allowed is now an opts.h option
      bcachefs: bch2_opt_set_sb() can now set (some) device options
      bcachefs: Opt_durability can now be set via bch2_opt_set_sb()
      bcachefs: Add check for btree_path ref overflow
      bcachefs: Btree path tracepoints
      bcachefs: kill bch2_btree_iter_peek_and_restart()
      bcachefs: bchfs_read(): call trans_begin() on every loop iter
      bcachefs: bch2_fiemap(): call trans_begin() on every loop iter
      bcachefs: for_each_btree_key_in_subvolume_upto()
      bcachefs: bch2_readdir() -> for_each_btree_key_in_subvolume_upto
      bcachefs: bch2_xattr_list() -> for_each_btree_key_in_subvolume_upto
      bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto
      bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto
      bcachefs: range_has_data() -> for_each_btree_key_in_subvolume_upto
      bcachefs: bch2_folio_set() -> for_each_btree_key_in_subvolume_upto
      bcachefs: quota_reserve_range() -> for_each_btree_key_in_subvolume_upto
      bcachefs: Move rebalance_status out of sysfs/internal
      bcachefs: promote_whole_extents is now a normal option
      bcachefs: trivial open_bucket_add_buckets() cleanup
      bcachefs: bch2_sb_nr_devices()
      bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc()
      bcachefs: bch2_time_stats_reset()
      bcachefs: Assert that we don't lock nodes when !trans->locked
      bcachefs: darray: convert to alloc_hooks()
      bcachefs: Switch gc bucket array to a genradix
      bcachefs: Add pinned to btree cache not freed counters
      bcachefs: do_encrypt() now handles allocation failures
      bcachefs: convert __bch2_encrypt_bio() to darray
      bcachefs: kill redundant is_vmalloc_addr()
      bcachefs: fix prototype to bch2_alloc_sectors_start_trans()
      bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation
      bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS
      bcachefs: Use __GFP_ACCOUNT for reclaimable memory
      bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes
      bcachefs: Options for recovery_passes, recovery_passes_exclude
      bcachefs: Move tabstop setup to bch2_dev_usage_to_text()
      bcachefs: bch2_dev_remove_alloc() -> alloc_background.c
      bcachefs: bch2_sb_member_alloc()
      bcachefs: improve "no device to read from" message
      bcachefs: bch2_opts_to_text()
      bcachefs: Progress indicator for extents_to_backpointers
      bcachefs: bch2_dev_rcu_noerror()
      bcachefs: Failed devices no longer require mounting in degraded mode
      bcachefs: Don't count "skipped access bit" as touched in btree cache scan
      bcachefs: btree cache counters should be size_t
      bcachefs: split up btree cache counters for live, freeable
      bcachefs: Rework btree node pinning
      bcachefs: EIO errcode cleanup
      bcachefs: stripe_to_mem()
      bcachefs: bch_stripe.disk_label
      bcachefs: ec_stripe_head.nr_created
      bcachefs: improve bch2_new_stripe_to_text()
      bcachefs: improve error message on too few devices for ec
      bcachefs: improve error messages in bch2_ec_read_extent()
      bcachefs: bch2_trigger_ptr() calculates sectors even when no device
      bcachefs: bch2_dev_remove_stripes()
      bcachefs: bch_fs.rw_devs_change_count
      bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
      bcachefs: Don't drop devices with stripe pointers

Matthew Wilcox (Oracle) (1):
      bcachefs: Do not check folio_has_private()

Nathan Chancellor (1):
      bcachefs: Fix format specifier in bch2_btree_key_cache_to_text()

Reed Riley (1):
      bcachefs: Replace div_u64 with div64_u64 where second param is u64

Sasha Finkelstein (1):
      bcachefs: Hook up RENAME_WHITEOUT in rename.

Thorsten Blum (3):
      bcachefs: Annotate struct bucket_array with __counted_by()
      bcachefs: Annotate struct bch_xattr with __counted_by()
      bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by()

Xiaxi Shen (1):
      bcachefs: Fix a spelling error in docs

Yang Li (1):
      bcachefs: Remove duplicated include in backpointers.c

Youling Tang (4):
      bcachefs: allocate inode by using alloc_inode_sb()
      bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
      bcachefs: drop unused posix acl handlers
      bcachefs: Simplify bch2_xattr_emit() implementation

 Documentation/filesystems/bcachefs/CodingStyle.rst |   2 +-
 fs/bcachefs/Kconfig                                |   7 +
 fs/bcachefs/Makefile                               |   1 +
 fs/bcachefs/acl.c                                  |   2 +-
 fs/bcachefs/alloc_background.c                     |  45 +-
 fs/bcachefs/alloc_background.h                     |   3 +-
 fs/bcachefs/alloc_foreground.c                     |  59 +-
 fs/bcachefs/alloc_foreground.h                     |   5 +-
 fs/bcachefs/backpointers.c                         | 106 +++-
 fs/bcachefs/backpointers.h                         |  23 +-
 fs/bcachefs/bcachefs.h                             |  14 +-
 fs/bcachefs/bcachefs_format.h                      |   2 +
 fs/bcachefs/bset.c                                 | 182 +++---
 fs/bcachefs/bset.h                                 |   4 +-
 fs/bcachefs/btree_cache.c                          | 273 ++++++---
 fs/bcachefs/btree_cache.h                          |   3 +
 fs/bcachefs/btree_gc.c                             |  21 +-
 fs/bcachefs/btree_io.c                             |   8 +-
 fs/bcachefs/btree_io.h                             |   4 +-
 fs/bcachefs/btree_iter.c                           |  63 +-
 fs/bcachefs/btree_iter.h                           |  52 +-
 fs/bcachefs/btree_key_cache.c                      | 405 +++----------
 fs/bcachefs/btree_key_cache_types.h                |  18 +-
 fs/bcachefs/btree_locking.h                        |  13 +-
 fs/bcachefs/btree_trans_commit.c                   |   2 +-
 fs/bcachefs/btree_types.h                          |  60 +-
 fs/bcachefs/btree_update.c                         |  12 +-
 fs/bcachefs/btree_update_interior.c                |  37 +-
 fs/bcachefs/btree_update_interior.h                |   2 +
 fs/bcachefs/buckets.c                              |  35 +-
 fs/bcachefs/buckets.h                              |  15 +-
 fs/bcachefs/buckets_types.h                        |   8 -
 fs/bcachefs/checksum.c                             | 101 ++--
 fs/bcachefs/clock.h                                |   9 -
 fs/bcachefs/darray.c                               |   4 +-
 fs/bcachefs/darray.h                               |  26 +-
 fs/bcachefs/data_update.c                          |   2 +-
 fs/bcachefs/dirent.c                               |  66 +--
 fs/bcachefs/ec.c                                   | 303 +++++++---
 fs/bcachefs/ec.h                                   |  11 +-
 fs/bcachefs/ec_format.h                            |   9 +-
 fs/bcachefs/ec_types.h                             |   1 +
 fs/bcachefs/errcode.h                              |  14 +-
 fs/bcachefs/extents.c                              |  33 +-
 fs/bcachefs/extents.h                              |  24 +-
 fs/bcachefs/fs-common.c                            |   5 +-
 fs/bcachefs/fs-io-buffered.c                       |  41 +-
 fs/bcachefs/fs-io-direct.c                         |   2 +-
 fs/bcachefs/fs-io-pagecache.c                      |  90 ++-
 fs/bcachefs/fs-io-pagecache.h                      |   4 +-
 fs/bcachefs/fs-io.c                                | 178 ++----
 fs/bcachefs/fs-ioctl.c                             |   4 +-
 fs/bcachefs/fs.c                                   | 427 +++++++++-----
 fs/bcachefs/fs.h                                   |  18 +-
 fs/bcachefs/inode.c                                |   2 +-
 fs/bcachefs/io_read.c                              |  18 +-
 fs/bcachefs/io_write.c                             |   7 +-
 fs/bcachefs/journal_io.c                           |   6 +-
 fs/bcachefs/journal_reclaim.c                      |   7 +-
 fs/bcachefs/opts.c                                 |  85 ++-
 fs/bcachefs/opts.h                                 |  61 +-
 fs/bcachefs/rcu_pending.c                          | 650 +++++++++++++++++++++
 fs/bcachefs/rcu_pending.h                          |  27 +
 fs/bcachefs/rebalance.c                            |   3 +
 fs/bcachefs/recovery.c                             |  22 +-
 fs/bcachefs/recovery_passes.c                      |  10 +-
 fs/bcachefs/replicas.c                             |  10 +-
 fs/bcachefs/replicas_format.h                      |   9 +-
 fs/bcachefs/sb-clean.c                             |   2 +-
 fs/bcachefs/sb-members.c                           |  57 ++
 fs/bcachefs/sb-members.h                           |  22 +-
 fs/bcachefs/str_hash.h                             |   2 +-
 fs/bcachefs/subvolume.h                            |  45 ++
 fs/bcachefs/subvolume_types.h                      |   3 +-
 fs/bcachefs/super-io.c                             |  12 +-
 fs/bcachefs/super.c                                |  85 +--
 fs/bcachefs/sysfs.c                                |  55 +-
 fs/bcachefs/thread_with_file.c                     |   2 +-
 fs/bcachefs/time_stats.c                           |  14 +
 fs/bcachefs/time_stats.h                           |   3 +-
 fs/bcachefs/trace.h                                | 465 ++++++++++++++-
 fs/bcachefs/util.c                                 |  16 +-
 fs/bcachefs/util.h                                 |   2 +-
 fs/bcachefs/xattr.c                                |  81 +--
 fs/bcachefs/xattr_format.h                         |   2 +-
 fs/inode.c                                         |   8 -
 include/linux/fs.h                                 |   9 +-
 include/linux/generic-radix-tree.h                 | 105 +++-
 lib/generic-radix-tree.c                           |  80 +--
 89 files changed, 3155 insertions(+), 1690 deletions(-)
 create mode 100644 fs/bcachefs/rcu_pending.c
 create mode 100644 fs/bcachefs/rcu_pending.h

Comments

Linus Torvalds Sept. 23, 2024, 5:18 p.m. UTC | #1
On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> We're now using an rhashtable instead of the system inode hash table;
> this is another significant performance improvement on multithreaded
> metadata workloads, eliminating more lock contention.

So I've pulled this, but I reacted to this issue - what is the load
where the inode hash table is actually causing issues?

Because honestly, the reason we're using that "one single lock" for
the inode hash table is that nobody has ever bothered.

In particular, it *should* be reasonably straightforward (lots of
small details, but not fundamentally hard) to just make the inode hash
table be a "bl" hash - with the locking done per-bucket, not using one
global lock.
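
(For reference, the "bl" primitives already exist in <linux/list_bl.h>;
a minimal sketch of what per-bucket locking looks like - the table and
helper here are hypothetical, not a proposed patch:)

  #include <linux/list_bl.h>

  /*
   * hlist_bl steals the low bit of the bucket head pointer as a
   * per-chain spinlock, so insert/remove/lookup only lock the bucket
   * they touch instead of one global inode_hash_lock.
   */
  static struct hlist_bl_head hashtable[1 << 14];   /* size is arbitrary */

  static void bucket_insert(struct hlist_bl_head *b, struct hlist_bl_node *n)
  {
          hlist_bl_lock(b);        /* bit_spin_lock on bit 0 of the head */
          hlist_bl_add_head(n, b);
          hlist_bl_unlock(b);
  }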

But nobody has ever seemed to care before. Because the inode hash just
isn't all that heavily used, since normal loads don't look up inodes
directly (they look up dentries, and you find the inodes off those -
and the *dentry* hash scales well thanks to both RCU and doing the
bucket lock thing).

So the fact that bcachefs cares makes me go "Why?"

Normal filesystems *never* call ilookup() and friends.  Why does
bcachefs do it so much that you decided that you need to use a whole
other hashtable?

                   Linus
pr-tracker-bot@kernel.org Sept. 23, 2024, 7:06 p.m. UTC | #2
The pull request you sent on Sat, 21 Sep 2024 15:27:58 -0400:

> git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b3f391fddf3cfaadda59ec8da8fd17f4520bbf42

Thank you!
Linus Torvalds Sept. 23, 2024, 7:07 p.m. UTC | #3
On Mon, 23 Sept 2024 at 10:18, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
>
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?

Oh, and I also only now noticed that a third of the commits are fairly
recent and from after the merge window even started.

Maybe there's a good reason for that, but it sure wasn't explained in
the pull request. Grr.

                 Linus
Kent Overstreet Sept. 23, 2024, 7:56 p.m. UTC | #4
On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
> 
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?
> 
> Because honestly, the reason we're using that "one single lock" for
> the inode hash table is that nobody has ever bothered.
> 
> In particular, it *should* be reasonably straightforward (lots of
> small details, but not fundamentally hard) to just make the inode hash
> table be a "bl" hash - with the locking done per-bucket, not using one
> global lock.

Dave was working on that - he posted patches and it seemed like they
were mostly done, not sure what happened with those.

> But nobody has ever seemed to care before. Because the inode hash just
> isn't all that heavily used, since normal loads don't look up inodes
> directly (they look up dentries, and you find the inodes off those -
> and the *dentry* hash scales well thanks to both RCU and doing the
> bucket lock thing).
> 
> So the fact that bcachefs cares makes me go "Why?"

I've seen benchmarks (not just on bcachefs, ext4 as well) where it was
actually _really_ heavy. They were synthetics, but I'd expect to see the
same effect from workloads where we're streaming a lot of inodes through
the cache - think rsync.

Another data point is that the btree key cache work I just did was also
for exactly this type of situation - it addressed lock contention when
streaming items into and out of the key cache, which are mostly inodes
(it's also used for the alloc btree) - and I had reports from multiple
users it was a drastic improvement, much more than expected, and
eliminated the main source of our "srcu lock held too long" warnings -
and if lock contention is bad enough to trip a 10 second warning, you
know it's bad, heh.

I try to really jump on these sorts of lock contention issues because
I've seen many instances where they caused performance cliffs - in
practice they're often O(n^2) issues, because as your workload scales
up it's usually not just the number of times you take the lock that
increases - usually something else is scaling too, and even the
increased cacheline bouncing from the lock contention itself becomes
problematic.
Kent Overstreet Sept. 23, 2024, 7:58 p.m. UTC | #5
On Mon, Sep 23, 2024 at 12:07:24PM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 10:18, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > We're now using an rhashtable instead of the system inode hash table;
> > > this is another significant performance improvement on multithreaded
> > > metadata workloads, eliminating more lock contention.
> >
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
> 
> Oh, and I also only now noticed that a third of the commits are fairly
> recent and from after the merge window even started.
> 
> Maybe there's a good reason for that, but it sure wasn't explained in
> the pull request. Grr.

I rebased to drop a patch that conflicts with Andrew's tree (the
PF_MEMALLOC_NORECLAIM stuff, which I would still like to chat with him
about since I missed him at Plumbers...)
Dave Chinner Sept. 24, 2024, 12:26 a.m. UTC | #6
On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > We're now using an rhashtable instead of the system inode hash table;
> > > this is another significant performance improvement on multithreaded
> > > metadata workloads, eliminating more lock contention.
> > 
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
> > 
> > Because honestly, the reason we're using that "one single lock" for
> > the inode hash table is that nobody has ever bothered.
> > 
> > In particular, it *should* be reasonably straightforward (lots of
> > small details, but not fundamentally hard) to just make the inode hash
> > table be a "bl" hash - with the locking done per-bucket, not using one
> > global lock.
> 
> Dave was working on that - he posted patches and it seemed like they
> were mostly done, not sure what happened with those.

I lost interest in that patchset a long while ago. The last posting
I did was largely as a community service to get the completed,
lockdep and RTPREEMPT compatible version of the hashbl
infrastructure it needed out there for other people to be able to
easily push this forward if they needed it. That was here:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/

and I posted it because Kent was asking about it because his
attempts to convert the inode hash to use rhashtables wasn't
working.

I've suggested multiple times since then when people have asked me
about that patch set that they are free to pick it up and finish it
off themselves. Unfortunately, that usually results in silence, a
comment of "I don't have time for that", and/or they go off and hack
around the issue in some other way....

> > But nobody has ever seemed to care before. Because the inode hash just
> > isn't all that heavily used, since normal loads don't look up inodes
> > directly (they look up dentries, and you find the inodes off those -
> > and the *dentry* hash scales well thanks to both RCU and doing the
> > bucket lock thing).

Not entirely true.

Yes, the dentry cache protects the inode hash from contention when
the workload repeatedly hits cached dentries.

However, the problematic workload is cold cache operations where
the dentry cache repeatedly misses. This places all the operational
concurrency directly on the inode hash as new inodes are inserted
into the hash. Add memory reclaim and that adds contention as it
removes inodes from the hash on eviction.

This happens in workloads that cycle lots of inodes through memory.
Concurrent cold cache lookups, create and unlink hammer the global
filesystem inode hash lock at more than 4 active tasks, especially
when there is also concurrent memory reclaim running evicting inodes
from the inode hash.

This inode cache miss contention issue was sort-of hacked around
recently by commit 7180f8d91fcb ("vfs: add rcu-based find_inode
variants for iget ops"). This allowed filesystems filesystems to use
-partially- RCU-based lookups rather fully locked lookups. In the
case of a hit, it will be an RCU operation, but as you state, cache
hits on the inode hash are the rare case, not the common case.

I say "sort-of hacked around" because the RCU lookup only avoids the
inode hash lock on the initial lookup that misses. Then insert is
still done under the inode_hash_lock and so inode_hash_lock
contention still definitely exists in this path. All the above
commit did was move the catastrophic cacheline contention breakdown
point to be higher than the test workload that was being run could
hit.

In review of that commit, I suggested that the inode hash_bl
conversion patches be finished off instead, but the response was "I
don't have time for that".

And so here we are again....

-Dave.
Kent Overstreet Sept. 24, 2024, 1:55 a.m. UTC | #7
On Tue, Sep 24, 2024 at 10:26:56AM GMT, Dave Chinner wrote:
> On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> > On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > We're now using an rhashtable instead of the system inode hash table;
> > > > this is another significant performance improvement on multithreaded
> > > > metadata workloads, eliminating more lock contention.
> > > 
> > > So I've pulled this, but I reacted to this issue - what is the load
> > > where the inode hash table is actually causing issues?
> > > 
> > > Because honestly, the reason we're using that "one single lock" for
> > > the inode hash table is that nobody has ever bothered.
> > > 
> > > In particular, it *should* be reasonably straightforward (lots of
> > > small details, but not fundamentally hard) to just make the inode hash
> > > table be a "bl" hash - with the locking done per-bucket, not using one
> > > global lock.
> > 
> > Dave was working on that - he posted patches and it seemed like they
> > were mostly done, not sure what happened with those.
> 
> I lost interest in that patchset a long while ago. The last posting
> I did was largely as a community service to get the completed,
> lockdep and RTPREEMPT compatible version of the hashbl
> infrastructure it needed out there for other people to be able to
> easily push this forward if they needed it. That was here:
> 
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/
> 
> and I posted it because Kent was asking about it because his
> attempts to convert the inode hash to use rhashtables wasn't
> working.
> 
> I've suggested multiple times since then when people have asked me
> about that patch set that they are free to pick it up and finish it
> off themselves. Unfortunately, that usually results in silence, a
> comment of "I don't have time for that", and/or they go off and hack
> around the issue in some other way....

Looking over that thread I'm not clear on what was missing - you
mentioned RTPREEMPT/lockdep but you had a patch for that, which looked
serviceable enough?
Linus Torvalds Sept. 24, 2024, 2:26 a.m. UTC | #8
On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
>
> However, the problematic workload is cold cache operations where
> the dentry cache repeatedly misses. This places all the operational
> concurrency directly on the inode hash as new inodes are inserted
> into the hash. Add memory reclaim and that adds contention as it
> removes inodes from the hash on eviction.

Yeah, and then we spend all the time just adding the inodes to the
hashes, and probably fairly seldom use them. Oh well.

And I had missed the issue with PREEMPT_RT and the fact that right now
the inode hash lock is outside the inode lock, which is problematic.

So it's all a bit nasty.

But I also assume most of the bad issues end up mainly showing up on
just fairly synthetic benchmarks with ramdisks, because even with a
good SSD I suspect the IO for the cold cache would still dominate?

               Linus
Linus Torvalds Sept. 24, 2024, 2:48 a.m. UTC | #9
On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

.. although I guess that could be solved by just making the spinlock
be external to the hash list, rather than part of the list like
list_bl.

You wouldn't need to do one lock per list head, you could just do some
grouping of the hash.

That said, the vfs inode hash code really is pretty disgusting and has
been accumulating these warts for a long time. So maybe a "filesystems
do their own thing" together with some helper infrastructure (like
using rhashtable) is actually the right thing to do.

              Linus
Kent Overstreet Sept. 24, 2024, 2:55 a.m. UTC | #10
On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
> >
> > However, the problematic workload is cold cache operations where
> > the dentry cache repeatedly misses. This places all the operational
> > concurrency directly on the inode hash as new inodes are inserted
> > into the hash. Add memory reclaim and that adds contention as it
> > removes inodes from the hash on eviction.
> 
> Yeah, and then we spend all the time just adding the inodes to the
> hashes, and probably fairly seldom use them. Oh well.
> 
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.
> 
> So it's all a bit nasty.
> 
> But I also assume most of the bad issues end up mainly showing up on
> just fairly synthetic benchmarks with ramdisks, because even with a
> good SSD I suspect the IO for the cold cache would still dominate?

Not for bcachefs, because filling into the vfs inode cache doesn't
require a disk read - they're cached in the inodes btree and much
smaller there. We use a varint encoding so they're typically 50-100
bytes, last I checked.

Compare to ~1k, plus or minus, in the vfs inode cache.

Thomas Bertshinger has been working on applications at LANL where
avoiding pulling into the vfs inode cache seems to make a significant
difference (file indexing in his case) - it turns out there's an xattr
syscall that's missing, which I believe he'll be submitting a patch for.

But stat/statx always pulls into the vfs inode cache, and that's likely
worth fixing.
Dave Chinner Sept. 24, 2024, 3:04 a.m. UTC | #11
On Mon, Sep 23, 2024 at 07:26:31PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
> >
> > However, the problematic workload is cold cache operations where
> > the dentry cache repeatedly misses. This places all the operational
> > concurrency directly on the inode hash as new inodes are inserted
> > into the hash. Add memory reclaim and that adds contention as it
> > removes inodes from the hash on eviction.
> 
> Yeah, and then we spend all the time just adding the inodes to the
> hashes, and probably fairly seldom use them. Oh well.
> 
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

*nod*

> So it's all a bit nasty.
> 
> But I also assume most of the bad issues end up mainly showing up on
> just fairly synthetic benchmarks with ramdisks, because even with a
> good SSD I suspect the IO for the cold cache would still dominate?

No, all the issues show up on consumer level NVMe SSDs - they have
more than enough IO concurrency to cause these CPU concurrency
related problems.

Keep in mind that when it comes to doing huge amounts of IO,
ramdisks are fundamentally flawed and don't scale.  That is, the IO
is synchronous memcpy() based and so consumes CPU time and both read
and write memory bandwidth, and concurrency is limited to the number
of CPUs in the system.

With NVMe SSDs, all the data movement is asynchronous and offloaded
to hardware with DMA engines that move the data. Those DMA engines
can often handle hundreds of concurrent IOs at once.

DMA sourced data is also only written to RAM once, and there are no
dependent data reads to slow down the DMA write to RAM like there is
with a data copy streamed through a CPU. IOWs, once the IO rates and
concurrency go up, it is generally much faster to use the CPU to
program the DMA engines to move the data than it is to move the data
with the CPU itself.

The testing I did (and so the numbers in those benchmarks) was done
on 2018-era PCIe 3.0 enterprise NVMe SSDs that could do
approximately 400k 4kB random read IOPS. The latest consumer PCIe
5.0 NVMe SSDs are *way faster* than these drives when subject to
highly concurrent IO requests...

-Dave.
Dave Chinner Sept. 24, 2024, 3:34 a.m. UTC | #12
On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote:
> > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > However, the problematic workload is cold cache operations where
> > > the dentry cache repeatedly misses. This places all the operational
> > > concurrency directly on the inode hash as new inodes are inserted
> > > into the hash. Add memory reclaim and that adds contention as it
> > > removes inodes from the hash on eviction.
> > 
> > Yeah, and then we spend all the time just adding the inodes to the
> > hashes, and probably fairly seldom use them. Oh well.
> > 
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
> > 
> > So it's all a bit nasty.
> > 
> > But I also assume most of the bad issues end up mainly showing up on
> > just fairly synthetic benchmarks with ramdisks, because even with a
> > good SSD I suspect the IO for the cold cache would still dominate?
> 
> Not for bcachefs, because filling into the vfs inode cache doesn't
> require a disk read - they're cached in the inodes btree and much
> smaller there. We use a varint encoding so they're typically 50-100
> bytes, last I checked.
> 
> Compare to ~1k, plus or minus, in the vfs inode cache.
> 
> Thomas Bertshinger has been working on applications at LANL where
> avoiding pulling into the vfs inode cache seems to make a significant
> difference (file indexing in his case) - it turns out there's an xattr
> syscall that's missing, which I believe he'll be submitting a patch for.
> 
> But stat/statx always pulls into the vfs inode cache, and that's likely
> worth fixing.

No, let's not even consider going there.

Unlike most people, old-time XFS developers have direct experience
with the problems that "uncached" inode access for stat purposes
causes.

XFS has had the bulkstat API for a long, long time (i.e. since 1998
on Irix). When it was first implemented on Irix, it was VFS cache
coherent. But in the early 2000s, that caused problems with HSMs
needing to scan billions inodes indexing petabytes of stored data
with certain SLA guarantees (i.e. needing to scan at least a million
inodes a second).  The CPU overhead of cache instantiation and
teardown was too great to meet those performance targets on 500MHz
MIPS CPUs.

So we converted bulkstat to run directly out of the XFS buffer cache
(i.e. uncached from the perspective of the VFS). This reduced the
CPU overhead per inode substantially, allowing bulkstat rates to
increase by a factor of 10. However, it introduced all sorts of
coherency problems between cached inode state vs what was stored in
the buffer cache. It was basically O_DIRECT for stat() and, as you'd
expect from that description, the coherency problems were horrible.
Detecting iallocated-but-not-yet-updated and
unlinked-but-not-yet-freed inodes were particularly consistent
sources of issues.

The only way to fix these coherency problems was to check the inode
cache for a resident inode first, which basically defeated the
entire purpose of bypassing the VFS cache in the first place.

So we went back to using coherent lookups once we stopped caring
about the old SGI HSM SLA requirements back in 2010:

commit 7dce11dbac54fce777eea0f5fb25b2694ccd7900
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Jun 23 18:11:11 2010 +1000

    xfs: always use iget in bulkstat

    The non-coherent bulkstat versions that look directly at the inode
    buffers causes various problems with performance optimizations that
    make increased use of just logging inodes.  This patch makes bulkstat
    always use iget, which should be fast enough for normal use with the
    radix-tree based inode cache introduced a while ago.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Essentially, we now always bring inodes into cache as fully
instantiated VFS inodes, and mark them with I_DONTCACHE so they get
torn down immediately after we've used them if they weren't already
cached. (i.e. all accesses are done in a single task context with
the inode hot in the CPU caches.)

Without the patchset I pointed to, bulkstat maxxes out the VFS inode
cache cycling on the sb->s_inode_list_lock at about 700,000
inodes/sec being cycled through the cache. With that patchset, it
maxxed out the bandwidth of the NVMe SSDs I was using  at ~7GB/s
read IO and cycling about 6 million inodes/sec through the VFS cache.

IOWs, we can make the VFS inode cache scale way beyond what most
users actually need right now.

The lesson in all this?

Don't hack around VFS scalability issues if it can be avoided.

If we fix the problems properly in the first place, then most fs
developers (let alone users!) won't even realise there was a
scalability problem in the VFS to begin with. Users will, however,
notice every inconsistency and/or corruption caused by fs developers
trying to work around VFS cache limitations...

-Dave.
Kent Overstreet Sept. 24, 2024, 3:47 a.m. UTC | #13
On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > But stat/statx always pulls into the vfs inode cache, and that's likely
> > worth fixing.
> 
> No, let's not even consider going there.
> 
> Unlike most people, old time XFS developers have direct experience
> with the problems that "uncached" inode access for stat purposes.
> 
> XFS has had the bulkstat API for a long, long time (i.e. since 1998
> on Irix). When it was first implemented on Irix, it was VFS cache
> coherent. But in the early 2000s, that caused problems with HSMs
> needing to scan billions inodes indexing petabytes of stored data
> with certain SLA guarantees (i.e. needing to scan at least a million
> inodes a second).  The CPU overhead of cache instantiation and
> teardown was too great to meet those performance targets on 500MHz
> MIPS CPUs.
> 
> So we converted bulkstat to run directly out of the XFS buffer cache
> (i.e. uncached from the perspective of the VFS). This reduced the
> CPU over per-inode substantially, allowing bulkstat rates to
> increase by a factor of 10. However, it introduced all sorts of
> coherency problems between cached inode state vs what was stored in
> the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> expect from that description, the coherency problems were horrible.
> Detecting iallocated-but-not-yet-updated and
> unlinked-but-not-yet-freed inodes were particularly consistent
> sources of issues.
> 
> The only way to fix these coherency problems was to check the inode
> cache for a resident inode first, which basically defeated the
> entire purpose of bypassing the VFS cache in the first place.

Eh? Of course it'd have to be coherent, but just checking if an inode is
present in the VFS cache is what, 1-2 cache misses? Depending on hash
table fill factor...

That's going to show up, but I have a hard time seeing that "defeating
the entire purpose" of bypassing the VFS cache, as you say.

> Don't hack around VFS scalability issues if it can be avoided.

Well, maybe if your dlock list patches make it in - I still see crazy
lock contention there...
Dave Chinner Sept. 24, 2024, 3:55 a.m. UTC | #14
On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
> 
> .. although I guess that could be solved by just making the spinlock
> be external to the hash list, rather than part of the list like
> list_bh.

That's effectively what the patch did - it added a spinlock per hash
list.

> You wouldn't need to do one lock per list head, you could just do some
> grouping of the hash.

Yes, that's a more complex thing to do, though - it is effectively
using hashed locks for the hash table.

> That said, the vfs inode hash code really is pretty disgusting and has
> been accumulating these warts for a long time. So maybe a "filesystems
> do their own thing" together with some helper infrastructure (like
> using rhashtable) is actually the right thing to do.

Perhaps.

XFS has done its own thing since 2007, but that was because the XFS
inode lifecycle has always existed outside the VFS inode life cycle.
i.e. the inode has to outlive evict() and ->destroy_inode because it
can still be dirty and tracked by the journal when the VFS evicts it
from the cache.

We also have internal inode lookup stuff that we don't want to go
anywhere near the VFS (e.g. inode writeback clustering from journal
item cleaning callbacks).

But this has come at a significant complexity cost, and some of the
biggest problems we've had to solve have been a result of this
architecture. e.g. memory reclaim getting blocked on dirty XFS
inodes that the VFS is unaware of via the fs specific callout in the
superblock shrinker caused all sorts of long tail memory allocation
latency issues for years. We've fixed that, but it took a long time
to work out how to avoid blocking in memory reclaim without driving
the system immediately into OOM kill frenzies....

There are also issues around RCU freeing of inodes and how the
filesystem level cache handles that. e.g. what happens when the
filesystem doesn't immediately free the inode after memory reclaim
evicts it, but then the fs re-instantiates it moments later when a
new lookup for that inode comes in? The rest of the VFS assumes that
the inode won't reappear until an RCU grace period expires, but
having the re-instantiation code call synchronize_rcu() (or
something similar) is not an option because this sort of situation
can occur thousands of times a second under memory pressure....

We know this because there is an outstanding issue in XFS around
this situation - none of the general solutions to the problem
we've tried have been viable. I suspect that any other filesystem
that implements its own inode cache and does VFS inode
re-instantiation on an inode cache hit will have similar issues,
and so there will be wider VFS architecture impacts as a result.

IOWs, it's not clear to me that this is a caching model we really
want to pursue in general because of a) the code duplication and b)
the issues such an inode life-cycle model has interacting with the
VFS life-cycle expectations...

-Dave.
Linus Torvalds Sept. 24, 2024, 4:57 p.m. UTC | #15
On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote:
>
> That's effectively what the patch did - it added a spinlock per hash
> list.

Yeah, no, I like your patches, they all seem to be doing sane things
and improve the code.

The part I don't like is how the basic model that your patches improve
seems quite a bit broken.

For example, that whole superblock list lock contention just makes me
go "Christ, that's stupid". Not because your patches to fix it with
Waiman's fancy list would be wrong or bad, but because the whole pain
is self-inflicted garbage.

And it's all historically very very understandable. It wasn't broken
when it was written.

That singly linked list is 100% sane in the historical context of "we
really don't want anything fancy for this". The whole list of inodes
for a superblock is basically unimportant, and its main use (only
use?) is for the final "check that we don't have busy inodes at umount
time, remove any pending stuff".

So using a stupid linked list was absolutely the right thing to do,
but now that just the *locking* for that list is a pain, it turns out
that we probably shouldn't use a list at all. Not because the list was
wrong, but because a flat list is such a pain for locking, since
there's no hierarchy to spread the locking out over.

(We used to have that kind of "flat lock" for the dcache too, but
"dcache_lock" went away many moons ago, and good riddance - but the
only reason it could go away is that the dcache has a hierarchy, so
that you can always lock just the local dentry (or the parent dentry)
if you are just careful).

> [ filesystems doing their own optimized thing ]
>
> IOWs, it's not clear to me that this is a caching model we really
> want to persue in general because of a) the code duplication and b)
> the issues such an inode life-cycle model has interacting with the
> VFS life-cycle expectations...

No, I'm sure you are right, and I'm just frustrated with this code
that probably _should_ look a lot more like the dcache code, but
doesn't.

I get the very strong feeling that we should have a per-superblock
hash table that we could also traverse the entries of. That way the
superblock inode list would get some structure (the hash buckets) that
would allow the locking to be distributed (and we'd only need one lock
and just share it between the "hash inode" and "add it to the
superblock list").

But that would require something much more involved than "improve the
current code".

               Linus
Kent Overstreet Sept. 24, 2024, 5:27 p.m. UTC | #16
On Tue, Sep 24, 2024 at 09:57:13AM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
> 
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
> 
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
> 
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.
> 
> And it's all historically very very understandable. It wasn't broken
> when it was written.
> 
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and it's main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
> 
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Well, lists are just a pain. They should really be radix trees, we're
not doing actual LRU anyways.

> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to persue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
> 
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
> 
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

I was considering that - that's what the bcachefs btree key cache does,
the shrinker just iterates over the rhashtable.

But referenced inodes aren't kept on the shrinker lists; I hate the
overhead of adding/removing, but Dave was emphatic on there being
workloads where that was really necessary and I'm somewhat convinced.

The other big issue is memcg; I think for that to work we'd need reclaim
to coalesce shrinker calls and shrink from multiple memcgs at the same
time.
Dave Chinner Sept. 25, 2024, 12:17 a.m. UTC | #17
On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
> 
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
> 
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
> 
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.

I'm not about to disagree with that assessment.

> And it's all historically very very understandable. It wasn't broken
> when it was written.
> 
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and it's main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
> 
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Right, that's effectively what the dlist infrastructure has taught
us - we need some kind of hierarchy to spread the locking over. But
what that hierarchy looks like for an "iterate every object" list
isn't really clear cut...

Another option I'd considered was to offload the iteration to
filesystems that have internal tracking mechanisms. e.g. use a
superblock method (say ->for_all_cached_inodes()) that gets passed a
callback function for the operation to perform on a stabilised
inode.

This would work for XFS - we already have such an internal callback
based cache-walking infrastructure - xfs_iwalk() - that is used
extensively for internal admin, gc and scrub/repair functions.
If we could use this for VFS icache traversals, then
we wouldn't need to maintain the sb->s_inodes list at all in XFS.
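
To make the shape concrete - a hypothetical sketch; nothing like this
exists in the tree today, and walk_s_inodes_list() is a stand-in name:

  /*
   * Hypothetical superblock op: the filesystem walks its own inode
   * cache and invokes the callback on each stabilised inode:
   *
   *   int (*for_all_cached_inodes)(struct super_block *sb,
   *                                int (*fn)(struct inode *, void *),
   *                                void *data);
   *
   * Generic "iterate every cached inode" code would then prefer it:
   */
  if (sb->s_op->for_all_cached_inodes)
          err = sb->s_op->for_all_cached_inodes(sb, fn, data);
  else
          err = walk_s_inodes_list(sb, fn, data);  /* today's s_inodes walk */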

But I didn't go down that route because I didn't think we wanted to
encourage each major filesystem to have their own unique internal
inode caching implementations with their own special subtle
differences. The XFS icache code is pretty complex and really
requires a lot of XFS expertise to understand - that makes changing
global inode caching behaviour or life cycle semantics much more
difficult than it already is.

That said, if we do decide that a model where filesystems will
provide their own inode caches is acceptable, then as a first step
we could convert the generic s_inodes list iteration to the callback
model fairly easily....

> (We used to have that kind of "flat lock" for the dcache too, but
> "dcache_lock" went away many moons ago, and good riddance - but the
> only reason it could go away is that the dcache has a hierarchy, so
> that you can always lock just the local dentry (or the parent dentry)
> if you are just careful).

> 
> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to persue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
> 
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
> 
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

The only problem with this is the size of the per-sb hash tables
needed for scalability - we can't allocate system sized hash tables
for every superblock just in case a superblock might be asked to
cache 100 million inodes. That's why Kent used rhashtables for the
bcachefs implementation - they resize according to how many objects
are being indexed, and hence scale both up and down.

That is also why XFS uses multiple radix trees per-sb in its icache
implementation - they scale up efficiently, yet have a small
memory footprint when only a few inodes are in cache in a little
used filesystem.

> But that would require something much more involved than "improve the
> current code".

Yup.

FWIW, I think all this "how do we cache inodes better" discussion is
somewhat glossing over a more important question we need to think
about first: do we even need a fully fledged inode cache anymore?

Every inode brought into cache is pinned by a dentry. The dentry
cache has an LRU and cache aging, and so by the time a dentry is
aged out of the dentry cache and the inode is finally dropped, it
has not been in use for some time. It has aged out of the current
working set.

Given that we've already aged the inode out of the working set, why
do we then need to dump the inode into another LRU and age that out
again before we reclaim the inode? What is this double caching
actually gaining us?

I've done experiments on XFS marking all inodes with I_DONTCACHE,
which means it gets removed from the inode cache the moment the
reference count goes to zero (i.e. when the dentry is dropped from
cache). This leaves the inode life cycle mirroring the dentry
life-cycle. i.e. the inode is evicted when the dentry is aged out as
per normal.
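
(For reference, the mechanism is roughly this - d_mark_dontcache() is
the existing helper; the call site is illustrative:)

  /*
   * d_mark_dontcache() sets I_DONTCACHE on the inode (and
   * DCACHE_DONTCACHE on its dentries), so the inode is evicted as soon
   * as the last reference goes away rather than parked on the LRU.
   */
  d_mark_dontcache(inode);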

On SSD based systems, I really don't see any performance degradation
for my typical workstation workloads like git tree operations and
kernel compiles. I don't see any noticeable impact on streaming
inode/data workloads, either. IOWs, the dentry cache appears to be
handling the working set maintenance duties pretty well for most
common workloads. Hence I question whether LRU-based inode cache
aging is needed at all.

So: should we be looking towards gutting the inode cache so that the
in-memory VFS inode lifecycle tracks only actively referenced inodes?
If so, then the management of the VFS inodes becomes a lot simpler as
the LRU lock, maintenance and shrinker-based reclaim all go away
entirely. Lots of code gets simpler if we trim down the VFS inode
life cycle to remove the caching of unreferenced inodes...

-Dave.
Dave Chinner Sept. 25, 2024, 1 a.m. UTC | #18
On Mon, Sep 23, 2024 at 11:47:54PM -0400, Kent Overstreet wrote:
> On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > > But stat/statx always pulls into the vfs inode cache, and that's likely
> > > worth fixing.
> > 
> > No, let's not even consider going there.
> > 
> > Unlike most people, old time XFS developers have direct experience
> > with the problems that "uncached" inode access for stat purposes causes.
> > 
> > XFS has had the bulkstat API for a long, long time (i.e. since 1998
> > on Irix). When it was first implemented on Irix, it was VFS cache
> > coherent. But in the early 2000s, that caused problems with HSMs
> > needing to scan billions of inodes indexing petabytes of stored data
> > with certain SLA guarantees (i.e. needing to scan at least a million
> > inodes a second).  The CPU overhead of cache instantiation and
> > teardown was too great to meet those performance targets on 500MHz
> > MIPS CPUs.
> > 
> > So we converted bulkstat to run directly out of the XFS buffer cache
> > (i.e. uncached from the perspective of the VFS). This reduced the
> > CPU overhead per inode substantially, allowing bulkstat rates to
> > increase by a factor of 10. However, it introduced all sorts of
> > coherency problems between cached inode state vs what was stored in
> > the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> > expect from that description, the coherency problems were horrible.
> > Detecting iallocated-but-not-yet-updated and
> > unlinked-but-not-yet-freed inodes was a particularly consistent
> > source of issues.
> > 
> > The only way to fix these coherency problems was to check the inode
> > cache for a resident inode first, which basically defeated the
> > entire purpose of bypassing the VFS cache in the first place.
> 
> Eh? Of course it'd have to be coherent, but just checking if an inode is
> present in the VFS cache is what, 1-2 cache misses? Depending on hash
> table fill factor...

Sure, when there is no contention and you have CPU to spare. But the
moment the lookup hits contention problems (i.e. we are exceeding
the cache lookup scalability capability), we are straight back to
running at VFS cache speed instead of uncached speed.

IOWs, needing to perform the cache lookup defeated the purpose of
using uncached lookups to avoid the cache scalability problems.

Keep in mind that not having a referenced inode opens up the code to
things like pre-emption races. i.e. a cache miss doesn't prevent
the current task from being preempted before it reads the inode
information into the user buffer. The VFS inode could be
instantiated and modified before the uncached access runs again and
pulls stale information from the underlying buffer and returns that
to userspace.

Those were the sorts of problems we continually had with using low
level inode information for stat operations vs using the up-to-date
VFS inode state....

-Dave.
Linus Torvalds Sept. 25, 2024, 1:45 a.m. UTC | #19
On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote:
>
> FWIW, I think all this "how do we cache inodes better" discussion is
> somewhat glossing over a more important question we need to think
> about first: do we even need a fully fledged inode cache anymore?

I'd be more than happy to try. I have to admit that it's largely my
mental picture (ie "inode caches don't really matter"), and why I was
surprised that the inode locks even show up on benchmarks.

If we just always drop inodes when the refcount goes to zero and never
have any cached inodes outside of other references, I think most of
the reasons for the superblock inode list just disappear. Most of the
heavy lifting has long since been moved to the dentry shrinking.

I'd worry that we have a lot of tuning that depends on the whole inode
LRU thing (that iirc takes the mapping size etc into account).

But maybe it would be worth trying to make I_DONTCACHE the default,
with the aim of removing the independent inode lifetimes entirely...

                Linus
Kent Overstreet Sept. 25, 2024, 2:13 a.m. UTC | #20
On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > table fill factor...
> 
> Sure, when there is no contention and you have CPU to spare. But the
> moment the lookup hits contention problems (i.e. we are exceeding
> the cache lookup scalability capability), we are straight back to
> running at VFS cache speed instead of uncached speed.

The cache lookups are just reads; they don't introduce scalability
issues, unless they're contending with other cores writing to those
cachelines - checking if an item is present in a hash table is trivial
to do locklessly.
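
Roughly the sort of presence check I mean - a sketch against a generic
RCU-protected hash table, with a made-up my_obj/table layout, not the
actual inode hash code:

#include <linux/hash.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

struct my_obj {
        struct hlist_node       hash;
        u64                     ino;
};

/* pure reads: no shared cachelines are dirtied on the lookup path */
static bool obj_is_cached(struct hlist_head *table, unsigned shift, u64 ino)
{
        struct my_obj *obj;
        bool found = false;

        rcu_read_lock();
        hlist_for_each_entry_rcu(obj, &table[hash_64(ino, shift)], hash)
                if (obj->ino == ino) {
                        found = true;
                        break;
                }
        rcu_read_unlock();

        return found;
}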

But pulling an inode into and then evicting it from the inode cache
entails a lot more work - just initializing a struct inode is
nontrivial, and then there's the (multiple) shared data structures you
have to manipulate.

> Keep in mind that not having a referenced inode opens up the code to
> things like pre-emption races. i.e. a cache miss doesn't prevent
> the current task from being preempted before it reads the inode
> information into the user buffer. The VFS inode could be
> instantiated and modified before the uncached access runs again and
> pulls stale information from the underlying buffer and returns that
> to userspace.

Yeah, if you're reading from a buffer cache that doesn't have a lock
that does get dicy - but for bcachefs where we're reading from a btree
node that does have a lock it's quite manageable.

And incidentally this sort of "we have a cache on top of the btree, but
sometimes we have to do direct access" is already something that comes
up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but
doesn't actually come up here for us since we don't use the vfs inode
cache as a writeback cache.

Now, for some completely different silliness, there's actually _three_
levels of caching for inodes in bcachefs: btree node cache, btree key
cache, then the vfs cache. In the first two, inodes are packed down to
~100 bytes so it's not that bad, but it does make you go "...what?". It
would be nice in theory to collapse - but the upside is that we don't
have the interactions between the vfs inode cache and journalling that
xfs has.

But if vfs inodes no longer have their own lifetime like you've been
talking about, that might open up interesting possibilities.
Kent Overstreet Sept. 25, 2024, 2:48 a.m. UTC | #21
+cc rhashtable folks for bug report

On Wed, Sep 25, 2024 at 10:17:46AM GMT, Dave Chinner wrote:
> On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote:
> > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > That's effectively what the patch did - it added a spinlock per hash
> > > list.
> > 
> > Yeah, no, I like your patches, they all seem to be doing sane things
> > and improve the code.
> > 
> > The part I don't like is how the basic model that your patches improve
> > seems quite a bit broken.
> > 
> > For example, that whole superblock list lock contention just makes me
> > go "Christ, that's stupid". Not because your patches to fix it with
> > Waiman's fancy list would be wrong or bad, but because the whole pain
> > is self-inflicted garbage.
> 
> I'm not about to disagree with that assessment.
> 
> > And it's all historically very very understandable. It wasn't broken
> > when it was written.
> > 
> > That singly linked list is 100% sane in the historical context of "we
> > really don't want anything fancy for this". The whole list of inodes
> > for a superblock is basically unimportant, and its main use (only
> > use?) is for the final "check that we don't have busy inodes at umount
> > time, remove any pending stuff".
> > 
> > So using a stupid linked list was absolutely the right thing to do,
> > but now that just the *locking* for that list is a pain, it turns out
> > that we probably shouldn't use a list at all. Not because the list was
> > wrong, but because a flat list is such a pain for locking, since
> > there's no hierarchy to spread the locking out over.
> 
> Right, that's effectively what the dlist infrastructure has taught
> us - we need some kind of hierarchy to spread the locking over. But
> what that hierarchy looks like for an "iterate every object" list
> isn't really clear cut...

If you think of dlist as just "faster lru list", I think that's a
completely viable approach.

So one thing I've been mulling over is a data structure for efficiently
tracking async op structs - "some async op went out to lunch" is
something that we still have crap debug tooling for, and those bugs take
forever to get fixed. There seems to be an outstanding bug in compaction
or migration that I've seen multiple reports for, and I doubt it's
bcachefs (we're quite standard there) - and it's some async op that
should've released a folio lock but hasn't.

And all we need for that is, roughly, an idr with a percpu buffer for
allocating/freeing slots in front. Which would be perfect for shrinker
lists, too: walking the list is a lockless radix tree walk, and
adding/removing entries mostly just hits the percpu buffers.

So now we have another use case for that data structure, so I'll try to
get that done in the near future and see how it does.
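
Very roughly, something like this - all names are made up, init and
error handling are elided, it's just to sketch the shape of it:

#include <linux/idr.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/xarray.h>

struct fast_list_pcpu {
        unsigned                nr;
        unsigned                slots[16];      /* recently freed slot numbers */
};

struct fast_list {
        struct xarray           table;          /* slot -> object, RCU walkable */
        struct ida              slot_ida;       /* hands out new slot numbers */
        struct fast_list_pcpu __percpu *pcpu;
};

static int fast_list_add(struct fast_list *l, void *obj)
{
        struct fast_list_pcpu *p;
        int slot;

        p = get_cpu_ptr(l->pcpu);
        if (p->nr)
                slot = p->slots[--p->nr];       /* reuse a CPU-local slot */
        else
                slot = ida_alloc(&l->slot_ida, GFP_NOWAIT);
        put_cpu_ptr(l->pcpu);

        if (slot < 0)
                return slot;

        xa_store(&l->table, slot, obj, GFP_KERNEL);     /* errors ignored in the sketch */
        return slot;
}

static void fast_list_remove(struct fast_list *l, unsigned slot)
{
        struct fast_list_pcpu *p;

        xa_erase(&l->table, slot);

        p = get_cpu_ptr(l->pcpu);
        if (p->nr < ARRAY_SIZE(p->slots))
                p->slots[p->nr++] = slot;       /* keep the slot CPU-local */
        else
                ida_free(&l->slot_ida, slot);   /* buffer full, hand it back */
        put_cpu_ptr(l->pcpu);
}

static void fast_list_walk(struct fast_list *l, void (*fn)(void *obj))
{
        unsigned long slot;
        void *obj;

        /* the walk itself is just a lockless radix tree walk */
        xa_for_each(&l->table, slot, obj)
                fn(obj);
}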

> > (We used to have that kind of "flat lock" for the dcache too, but
> > "dcache_lock" went away many moons ago, and good riddance - but the
> > only reason it could go away is that the dcache has a hierarchy, so
> > that you can always lock just the local dentry (or the parent dentry)
> > if you are just careful).
> 
> > 
> > > [ filesystems doing their own optimized thing ]
> > >
> > > IOWs, it's not clear to me that this is a caching model we really
> > > want to pursue in general because of a) the code duplication and b)
> > > the issues such an inode life-cycle model has interacting with the
> > > VFS life-cycle expectations...
> > 
> > No, I'm sure you are right, and I'm just frustrated with this code
> > that probably _should_ look a lot more like the dcache code, but
> > doesn't.
> > 
> > I get the very strong feeling that we should have a per-superblock
> > hash table that we could also traverse the entries of. That way the
> > superblock inode list would get some structure (the hash buckets) that
> > would allow the locking to be distributed (and we'd only need one lock
> > and just share it between the "hash inode" and "add it to the
> > superblock list").
> 
> The only problem with this is the size of the per-sb hash tables
> needed for scalability - we can't allocate system sized hash tables
> for every superblock just in case a superblock might be asked to
> cache 100 million inodes. That's why Kent used rhashtables for the
> bcachefs implementation - they resize according to how many objects
> are being indexed, and hence scale both up and down.

I've been noticing rhashtable resize is surprisingly heavy (the default
parameters don't ever shrink the table, which is why it's not seen as
much).

And, when I was torture testing that code I tripped over what appeared
to be an infinite loop in rht_bucket() when a rehash is in progress,
which I worked around in

  a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan()

(because it makes more sense to just not run while rehashing and avoid
the function call overhead of rht_bucket()). If rhashtable folks are
interested in looking at that I have a test that reproduces it.

> So: should we be looking towards gutting the inode cache so that the
> in-memory VFS inode lifecycle tracks only actively referenced inodes?
> If so, then the management of the VFS inodes becomes a lot simpler as
> the LRU lock, maintenance and shrinker-based reclaim all go away
> entirely. Lots of code gets simpler if we trim down the VFS inode
> life cycle to remove the caching of unreferenced inodes...

Sounds completely legit to me - except for the weird filesystems. I just
started cscoping and afs at least seems to do iget() without ever
attaching it to a dentry, I didn't look further.

So...

If fs/inode.c is grotty, perhaps it makes sense to start a
fs/inode_rhashtable.c with all the new semantics and convert filesystems
one by one, like we've been doing with the dio code and iomap.

I finally have my head sufficiently around all the weird inode lifetime
rules that I could do it, although it'll be a bit before I'm not
swamped.
Dave Chinner Sept. 25, 2024, 4:43 a.m. UTC | #22
On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote:
> On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > > table fill factor...
> > 
> > Sure, when there is no contention and you have CPU to spare. But the
> > moment the lookup hits contention problems (i.e. we are exceeding
> > the cache lookup scalability capability), we are straight back to
> > running at VFS cache speed instead of uncached speed.
> 
> The cache lookups are just reads; they don't introduce scalability
> issues, unless they're contending with other cores writing to those
> cachelines - checking if an item is present in a hash table is trivial
> to do locklessly.

Which was not something the VFS inode cache did until a couple of
months ago. Just because something is possible/easy today, it
doesn't mean it was possible or viable 15-20 years ago.

> But pulling an inode into and then evicting it from the inode cache
> entails a lot more work - just initializing a struct inode is
> nontrivial, and then there's the (multiple) shared data structures you
> have to manipulate.

Yes, but to avoid this we'd need to come up with a mechanism that is
generally safe for most filesystems, not just bcachefs.

I mean, if you can come up with a stat() mechanism that is safe
enough for us to read straight out of the XFS buffer cache for inode cache
misses, then we'll switch over to using it ASAP.

That's your challenge - if you want bcachefs to be able to do this,
then you have to make sure the infrastructure required works for
other filesystems just as safely, too.

> And incidentally this sort of "we have a cache on top of the btree, but
> sometimes we have to do direct access" is already something that comes
> up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but
> doesn't actually come up here for us since we don't use the vfs inode
> cache as a writeback cache.

And therein lies the problem for the general case. Most filesystems
do use writeback caching of inode metadata via the VFS inode state,
XFS included, and that's where all the dragons are hiding.

-Dave.
Kent Overstreet Sept. 25, 2024, 5:11 a.m. UTC | #23
On Wed, Sep 25, 2024 at 02:43:15PM GMT, Dave Chinner wrote:
> On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote:
> > On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > > > Eh? Of course it'd have to be coherent, but just checking if an inode is
> > > > present in the VFS cache is what, 1-2 cache misses? Depending on hash
> > > > table fill factor...
> > > 
> > > Sure, when there is no contention and you have CPU to spare. But the
> > > moment the lookup hits contention problems (i.e. we are exceeding
> > > the cache lookup scalability capability), we are straight back to
> > > running at VFS cache speed instead of uncached speed.
> > 
> > The cache lookups are just reads; they don't introduce scalability
> > issues, unless they're contending with other cores writing to those
> > cachelines - checking if an item is present in a hash table is trivial
> > to do locklessly.
> 
> Which was not something the VFS inode cache did until a couple of
> months ago. Just because something is possible/easy today, it
> doesn't mean it was possible or viable 15-20 years ago.

We've been doing lockless hash table lookups for a lot longer than that.

> > But pulling an inode into and then evicting it from the inode cache
> > entails a lot more work - just initializing a struct inode is
> > nontrivial, and then there's the (multiple) shared data structures you
> > have to manipulate.
> 
> Yes, but to avoid this we'd need to come up with a mechanism that is
> generally safe for most filesystems, not just bcachefs.
> 
> I mean, if you can come up with a stat() mechanism that is safe
> enough for us to read straight out of the XFS buffer cache for inode cache
> misses, then we'll switch over to using it ASAP.

So add a per buffer lock to your buffer cache, and check for vfs cache
residency under the lock.

Whether bulkstat is worth adding that lock, for a filesystem that
doesn't need it? That I can't say, it's a tradeoff.

> That's your challenge - if you want bcachefs to be able to do this,
> then you have to make sure the infrastructure required works for
> other filesystems just as safely, too.

So... you think bcachefs should be limited to the capabilities of what
other filesystems can do? Interesting position...
Christian Brauner Sept. 25, 2024, 11:41 a.m. UTC | #24
On Tue, Sep 24, 2024 at 06:45:55PM GMT, Linus Torvalds wrote:
> On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FWIW, I think all this "how do we cache inodes better" discussion is
> > somewhat glossing over a more important question we need to think
> > about first: do we even need a fully fledged inode cache anymore?
> 
> I'd be more than happy to try. I have to admit that it's largely my
> mental picture (ie "inode caches don't really matter"), and why I was
> surprised that the inode locks even show up on benchmarks.
> 
> If we just always drop inodes when the refcount goes to zero and never
> have any cached inodes outside of other references, I think most of
> the reasons for the superblock inode list just disappear. Most of the
> heavy lifting has long since been moved to the dentry shrinking.
> 
> I'd worry that we have a lot of tuning that depends on the whole inode
> LRU thing (that iirc takes the mapping size etc into account).
> 
> But maybe it would be worth trying to make I_DONTCACHE the default,
> with the aim of removing the independent inode lifetimes entirely...

I would be rather surprised if we didn't see regressions for some
workloads but who knows. If we wanted to test this then one easy way
would be to add a FS_INODE_NOCACHE fs_flag and have inode_init_always()
raise I_DONTCACHE based on that. That way we can experiment with this on
a per-fs basis.
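
A sketch of what that could look like - FS_INODE_NOCACHE and its value
are hypothetical, the rest just piggybacks on the existing I_DONTCACHE
handling in iput_final():

#include <linux/fs.h>

#define FS_INODE_NOCACHE        64      /* hypothetical fs_flags bit, value arbitrary */

/* would be called from inode_init_always() */
static void inode_maybe_dontcache(struct super_block *sb, struct inode *inode)
{
        if (sb->s_type->fs_flags & FS_INODE_NOCACHE)
                inode->i_state |= I_DONTCACHE;
}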
Herbert Xu Sept. 27, 2024, 12:48 a.m. UTC | #25
On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote:
>
> I've been noticing rhashtable resize is surprisingly heavy (the default
> parameters don't ever shrink the table, which is why it's not seen as
> much).

Most rhashtable users enable automatic shrinking.

> And, when I was torture testing that code I tripped over what appeared
> to be an infinite loop in rht_bucket() when a rehash is in progress,
> which I worked around in
> 
>   a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan()

You must not walk the rhashtable internal data structures by hand.
If you need to iterate through the whole table, use the provided
walking mechanism.

However, as rhashtable is dynamic, walking it during a resize may
cause you to see some elements twice.  If you want to avoid that,
set up your own linked list of all elements for a completely stable
walk.
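
For reference, the supported pattern looks roughly like this (struct
my_obj stands in for the real entry type, error handling trimmed):

#include <linux/err.h>
#include <linux/rhashtable.h>

struct my_obj {
        struct rhash_head       node;   /* stand-in for the real entry type */
};

static void walk_all(struct rhashtable *ht)
{
        struct rhashtable_iter iter;
        struct my_obj *obj;

        rhashtable_walk_enter(ht, &iter);
        rhashtable_walk_start(&iter);

        while ((obj = rhashtable_walk_next(&iter)) != NULL) {
                if (IS_ERR(obj)) {
                        if (PTR_ERR(obj) == -EAGAIN)
                                continue;       /* table is being resized, keep going */
                        break;
                }
                /* process obj; it may be seen twice during a resize */
        }

        rhashtable_walk_stop(&iter);
        rhashtable_walk_exit(&iter);
}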

Cheers,
Kent Overstreet Sept. 28, 2024, 12:11 a.m. UTC | #26
On Fri, Sep 27, 2024 at 08:48:14AM GMT, Herbert Xu wrote:
> On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote:
> >
> > I've been noticing rhashtable resize is surprisingly heavy (the default
> > parameters don't ever shrink the table, which is why it's not seen as
> > much).
> 
> Most rhashtable users enable automatic shrinking.
> 
> > And, when I was torture testing that code I tripped over what appeared
> > to be an infinite loop in rht_bucket() when a rehash is in progress,
> > which I worked around in
> > 
> >   a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan()
> 
> You must not walk the rhashtable internal data structures by hand.
> If you need to iterate through the whole table, use the provided
> walking mechanism.

I was just doing rht_for_each_entry_rcu() (open coded because you don't
provide a safe version).

> However, as rhashtable is dynamic, walking it during a resize may
> cause you to see some elements twice.  If you want to avoid that,
> set up your own linked list of all elements for a completely stable
> walk.

Yeah that's fine, but like I said, I was seeing an infinite loop inside
rht_bucket(). That would be an rhashtable bug...
Herbert Xu Sept. 28, 2024, 12:47 a.m. UTC | #27
On Fri, Sep 27, 2024 at 08:11:06PM -0400, Kent Overstreet wrote:
>
> > > And, when I was torture testing that code I tripped over what appeared
> > > to be an infinite loop in rht_bucket() when a rehash is in progress,
> > > which I worked around in
> > > 
> > >   a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan()
> > 
> > You must not walk the rhashtable internal data structures by hand.
> > If you need to iterate through the whole table, use the provided
> > walking mechanism.
> 
> I was just doing rht_for_each_entry_rcu() (open coded because you don't
> provide a safe version).

I'm talking about the commit a592cd above.  You were doing this:

	for (i = 0; i < tbl->size; i++)
		rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {

This is absolutely not allowed.  The rhashtable must only be accessed
through the published API and not iterated over directly.  This is
guaranteed to break during a resize/rehash.

Cheers,