[RFC,0/7] vfs: improving inode cache iteration scalability

Message ID 20241002014017.3801899-1-david@fromorbit.com (mailing list archive)

Message

Dave Chinner Oct. 2, 2024, 1:33 a.m. UTC
The management of the sb->s_inodes list is a scalability limitation;
it is protected by a single lock and every inode that is
instantiated has to be added to the list. Those inodes then need to
be removed from the list when evicted from cache. Hence every inode
that moves through the VFS inode cache must take this global scope
lock twice.
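
For reference, the add and remove sides of this in fs/inode.c boil
down to the pattern below (simplified) - one global lock round trip
at instantiation and another at eviction:

	/* inode instantiation */
	spin_lock(&inode->i_sb->s_inode_list_lock);
	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
	spin_unlock(&inode->i_sb->s_inode_list_lock);

	/* inode eviction */
	spin_lock(&inode->i_sb->s_inode_list_lock);
	list_del_init(&inode->i_sb_list);
	spin_unlock(&inode->i_sb->s_inode_list_lock);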

This proves to be a significant limiting factor for concurrent file
access workloads that repeatedly miss the dentry cache on lookup.
Directory search and traversal workloads are particularly prone to
these issues, though on XFS we have enough concurrency capability
in file creation and unlink for the sb->s_inodes list to be a
limitation there as well.

Previous efforts to solve this problem have largely centered
around reworking the sb->s_inodes list into something more
scalable, such as this long-standing patchset:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/

However, a recent discussion about inode cache behaviour that arose
from the bcachefs 6.12-rc1 pull request opened a new direction for
us to explore. With both XFS and bcachefs now providing their own
per-superblock inode cache implementations, we should try to make
use of these inode caches as first class citizens.

With that new direction in mind, it became obvious that XFS could
elide the sb->s_inodes list completely - "the best part is no part"
- if iteration was not reliant on open-coded sb->s_inodes list
walks.

We already use the internal inode cache for iteration, and we have
filters for selecting specific inodes to operate on with specific
callback operations. If we had an abstraction for iterating all
VFS inodes, we could easily implement that directly on the XFS
inode cache.

This is what this patchset aims to implement.

There are two superblock iterator functions provided. The first is a
generic iterator that provides safe, reference counted inodes for
the callback to operate on. This is generally what most sb->s_inodes
iterators use, and it allows the iterator to drop locks and perform
blocking operations on the inode before moving to the next inode in
the sb->s_inodes list.
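
To give an idea of the shape of the interface, it is a per-inode
callback invoked by the iterator, along the lines of the sketch
below (illustrative prototypes only - see the patches for the
exact names and signatures):

	typedef int (*ino_iter_fn)(struct inode *inode, void *private_data);

	/*
	 * Walk all VFS inodes on @sb, calling @iter_fn on each inode
	 * with a valid reference held. The callback can block and
	 * drop locks before returning.
	 */
	int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
			      void *private_data, unsigned int flags);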

There is one quirk to this interface - INO_ITER_REFERENCE - because
fsnotify iterates the inode cache -after- evict_inodes() has been
called during superblock shutdown to evict all non-referenced
inodes. Hence it should only find referenced inodes, and it has
a check to skip unreferenced inodes. This flag does the same.

However, I suspect this is now somewhat sub-optimal because LSMs can
hold references to inodes beyond evict_inodes(), and they don't get
torn down until after fsnotify evicts the referenced inodes it
holds. Further, the landlock LSM doesn't have checks for
unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
guard is not consistently applied.

I'm undecided on how best to handle this, but it does not need to be
solved for this patchset to work. fsnotify and landlock don't
need to run -after- evict_inodes(), but moving them to before
evict_inodes() means we would then do three full inode cache
iterations to evict all the inodes from the cache. That doesn't seem
like a good idea when there might be hundreds of millions of cached
inodes at unmount.

Similarly, needing the iterator to be aware that there should be no
unreferenced inodes left when they run doesn't seem like a good
idea, either. So perhaps the answer is that the iterator checks for
SB_ACTIVE (or some other similar flag) that indicates the superblock
is being torn down and so will skip zero-referenced inodes
automatically in this case. Like I said - this doesn't need to be
solved right now, it's just something to be aware of.
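
i.e. something along these lines in the reference-counted iterator
(sketch only):

	/* superblock is being torn down - skip unreferenced inodes */
	if (!(sb->s_flags & SB_ACTIVE) && !atomic_read(&inode->i_count))
		continue;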

The second iterator is the "unsafe" iterator variant that only
provides the callback with an existence guarantee. It does this by
holding the rcu_read_lock() to guarantee that the inode is not freed
from under the callback. There are no validity checks performed on
the inode - it is entirely up to the callback to validate the inode
can be operated on safely.

Hence the "unsafe" variant is intended for very specific internal
uses only. Nobody should be adding new uses of this function
without careful review, as there are very few good reasons for
external access to inodes without holding a valid reference. I
have not decided whether the
unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
check in them to clearly document the context under which they are
running.
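
Again purely as a sketch of the shape of the thing (name
illustrative):

	/*
	 * Existence guarantee only: @iter_fn is run on each inode
	 * under rcu_read_lock() with no reference held and no
	 * validity checks performed. The callback must not block and
	 * must validate the inode itself before touching it.
	 */
	void super_iter_inodes_unsafe(struct super_block *sb,
				      ino_iter_fn iter_fn,
				      void *private_data);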

The patchset converts all the open coded iterators to use these
new iterator functions, which means the only use of sb->s_inodes
is now confined to fs/super.c (iterator API) and fs/inode.c
(add/remove API). A new superblock operation is then added to
call out from the iterators into the filesystem to allow them to run
the iteration instead of walking the sb->s_inodes list.
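
The callout is just another superblock method taking the same
callback - roughly (sketch, see the patches for the real
prototype):

	/* in struct super_operations */
	/*
	 * Iterate all VFS-visible inodes in the filesystem's own
	 * inode cache. If not implemented, the generic iterators
	 * fall back to walking sb->s_inodes.
	 */
	int (*iter_vfs_inodes)(struct super_block *sb, ino_iter_fn iter_fn,
			       void *private_data, unsigned int flags);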

XFS is then converted to use this new superblock operation. I didn't
use the existing iterator function for this functionality right now
as it is currently based on radix tree tag lookups. It also uses a
batched 'lookup and lock' mechanism that complicated matters as I
developed this code. Hence I open coded a new, simpler cache walk
for testing purposes.

Now that I have stuff working and I think I have the callback API
semantics settled, batched radix tree lookups should still work to
minimise the iteration overhead. Also, we might want to tag VFS
inodes in the radix tree so that we can filter them efficiently for
traversals. This would allow us to use the existing generic inode
cache walker rather than a separate variant as this patch set
implements. This can be done as future work, though.

In terms of scalability improvements, a quick 'will it scale' test
demonstrates where the sb->s_inodes list hurts. Running a sharded,
share-nothing cold cache workload with 100,000 files per thread in
per-thread directories gives the following results on a 4-node 64p
machine with 128GB RAM.

The workloads "walk", "chmod" and "unlink" are all directory
traversal workloads that stream cold cache inodes into the cache.
There is enough memory on this test machine that these inodes are
not being reclaimed during the workload, and are being freed between
steps via drop_caches (which iterates the inode cache and so
explicitly tests the new iteration APIs!). Hence the sb->s_inodes
scalability issues aren't as bad in these tests as when memory is
tight and inodes are being reclaimed (i.e. the issues are worse in
real workloads).

The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
inodes via walking the internal inode btrees. It uses
d_mark_dontcache() so it is actually tearing down each inode as soon
as it has been sampled by the bulkstat code. Hence it is doing two
sb->s_inodes list manipulations per inode and so shows scalability
issues much earlier than the other workloads.

Before:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.269      3.225      4.557      7.316      1.306
       xfs     800000     8      4.844      3.227      4.702      7.905      1.908
       xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
       xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
       xfs    6400000    64     15.344     11.144     14.162     18.604     15.494

Bulkstat starts to show issues at 8 threads, walk and chmod between
16 and 32 threads, and unlink is limited by internal XFS stuff.
Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
sb->s_inodes list management.

After:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.140      3.502      4.154      7.242      1.164
       xfs     800000     8      4.637      2.836      4.444      7.896      1.093
       xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
       xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
       xfs    6400000    64     14.112      3.953     10.365     18.620      1.270

When patched, walk shows little in the way of scalability degradation
out to 64 threads, chmod is significantly improved at 32-64 threads,
and bulkstat shows perfect scalability out to 64 threads now.

I did a couple of other longer running, higher inode count tests
with bulkstat to get an idea of inode cache streaming rates - 32
million inodes scanned in 4.4 seconds at 64 threads. That's about
7.2 million inodes/s being streamed through the inode cache, with
IO rates peaking well above 5.5GB/s (near IO bound).

Hence raw VFS inode cache throughput sees a ~17x scalability
improvement on XFS at 64 threads (and probably a -lot- more on
higher CPU count machines).  That's far better performance than I
ever got from the dlist conversion of the sb->s_inodes list in
previous patchsets, so this seems like a much better direction to be
heading for optimising the way we cache inodes.

I haven't done a lot of testing on this patchset yet - it boots and
appears to work OK for block devices, ext4 and XFS, but checking
that stuff like quota on/off still works properly on ext4 hasn't
been done yet.

What do people think of moving towards per-sb inode caching and
traversal mechanisms like this?

-Dave.

Comments

Christian Brauner Oct. 2, 2024, 10 a.m. UTC | #1
On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> [...]
> 
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there. So really
a way that would've solved this issue generically would have been my
preference.

But the reality is that xfs has been doing that private inode cache for
a long time and reading through 5/7 and 6/7 it clearly provides value
for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
(Though I would like to s/iter_vfs_inodes/iter_inodes/g).
Dave Chinner Oct. 2, 2024, 12:34 p.m. UTC | #2
On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > What do people think of moving towards per-sb inode caching and
> > traversal mechanisms like this?
> 
> Patches 1-4 are great cleanups that I would like us to merge even
> independent of the rest.

Yes, they make it much easier to manage the iteration code.

> I don't have big conceptual issues with the series otherwise. The only
> thing that makes me a bit uneasy is that we are now providing an api
> that may encourage filesystems to do their own inode caching even if
> they don't really have a need for it just because it's there.  So really
> a way that would've solved this issue generically would have been my
> preference.

Well, that's the problem, isn't it? :/

There really isn't a good generic solution for global list access
and management.  The dlist stuff kinda works, but it still has
significant overhead and doesn't get rid of spinlock contention
completely because of the lack of locality between list add and
remove operations.

i.e. dlist is optimised for low contention add operations (i.e.
local to the CPU). However, removal is not a local operation - it
almost always happens on a different CPU to the add operation.
Hence removal always pulls the list and lock away from the CPU that
"owns" them, and hence there is still contention when inodes are
streaming through memory. This causes enough overhead that dlist
operations are still very visible in CPU profiles during scalability
testing...

XFS (and now bcachefs) have their own per-sb inode cache
implementations, and hence for them the sb->s_inodes list is pure
overhead.  If we restructure the generic inode cache infrastructure
to also be per-sb (this suggestion from Linus was what lead me to
this patch set), then they will also likely not need the
sb->s_inodes list, too.

That's the longer term "generic solution" to the sb->s_inodes list
scalability problem (i.e. get rid of it!), but it's a much larger
and longer term undertaking. Once we know what that new generic
inode cache infrastructure looks like, we'll probably only want to
be converting one filesystem at a time to the new infrastructure.

We'll need infrastructure to allow alternative per-sb iteration
mechanisms for such a conversion to take place - the converted
filesystems will likely call a generic ->iter_vfs_inodes()
implementation based on the per-sb inode cache infrastructure
rather than iterating sb->s_inodes. Eventually that generic method
will replace the sb->s_inodes iteration entirely, and we'll end up
with only a couple of filesystems using the callout again.

> But the reality is that xfs has been doing that private inode cache for
> a long time and reading through 5/7 and 6/7 it clearly provides value
> for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
> (Though I would like to s/iter_vfs_inodes/iter_inodes/g).

I named it that way because, from my filesystem centric point of
view, there is a very distinct separation between VFS and filesystem
inodes. The VFS inode (struct inode) is a subset of the filesystem
inode structure and, in XFS's case, a subset of the filesystem inode
life cycle, too.

i.e. this method should not iterate cached filesystem inodes that
exist outside the VFS inode lifecycle or VFS visibility even though
they may be present in the filesystem's internal inode cache.

-Dave.
Kent Overstreet Oct. 2, 2024, 7:29 p.m. UTC | #3
On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > What do people think of moving towards per-sb inode caching and
> > > traversal mechanisms like this?
> > 
> > Patches 1-4 are great cleanups that I would like us to merge even
> > independent of the rest.
> 
> Yes, they make it much easier to manage the iteration code.
> 
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
> 
> Well, that's the problem, isn't it? :/
> 
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

There is though; I haven't posted it yet because it still needs some
work, but the concept works and performs about the same as dlock-list.

https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list

The thing that needs to be sorted before posting is that it can't shrink
the radix tree. generic-radix-tree doesn't support shrinking, and I
could add that, but then ida doesn't provide a way to query the highest
id allocated (xarray doesn't support backwards iteration).

So I'm going to try it using idr and see how that performs (idr is not
really the right data structure for this, split ida and item radix tree
is better, so might end up doing something else).

But - this approach with more work will work for the list_lru lock
contention as well.

From 32cb8103ecfacdd5ed8e1eb390221c3f8339de6f Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 28 Sep 2024 16:22:38 -0400
Subject: [PATCH] lib/fast_list.c

A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/fast_list.h b/include/linux/fast_list.h
new file mode 100644
index 000000000000..7d5d8592864d
--- /dev/null
+++ b/include/linux/fast_list.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_FAST_LIST_H
+#define _LINUX_FAST_LIST_H
+
+#include <linux/generic-radix-tree.h>
+#include <linux/idr.h>
+#include <linux/percpu.h>
+
+struct fast_list_pcpu;
+
+struct fast_list {
+	GENRADIX(void *)	items;
+	struct ida		slots_allocated;
+	struct fast_list_pcpu	*buffer;
+};
+
+int fast_list_get_idx(struct fast_list *l);
+int fast_list_add(struct fast_list *l, void *item);
+void fast_list_remove(struct fast_list *l, unsigned idx);
+void fast_list_exit(struct fast_list *l);
+int fast_list_init(struct fast_list *l);
+
+#endif /* _LINUX_FAST_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41..85cf5a0d36b1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,7 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o fast_list.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
 obj-y += string_helpers.o
 obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/lib/fast_list.c b/lib/fast_list.c
new file mode 100644
index 000000000000..bbb69bb29687
--- /dev/null
+++ b/lib/fast_list.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fast, unordered lists
+ *
+ * Supports add, remove, and iterate
+ *
+ * Underneath, they're a radix tree and an IDA, with a percpu buffer for slot
+ * allocation and freeing.
+ *
+ * This means that adding, removing, and iterating over items is lockless,
+ * except when refilling/emptying the percpu slot buffers.
+ */
+
+#include <linux/fast_list.h>
+
+struct fast_list_pcpu {
+	size_t			nr;
+	size_t			entries[31];
+};
+
+/**
+ * fast_list_get_idx - get a slot in a fast_list
+ * @l:		list to get slot in
+ *
+ * This allocates a slot in the radix tree without storing to it, so that we can
+ * take the potential memory allocation failure early and do the list add later
+ * when we can't take an allocation failure.
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_get_idx(struct fast_list *l)
+{
+	int idx;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(!lp->nr))
+		while (lp->nr <= ARRAY_SIZE(lp->entries) / 2) {
+			idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_NOWAIT|__GFP_NOWARN);
+			if (unlikely(idx < 0)) {
+				preempt_enable();
+				idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_KERNEL);
+				if (unlikely(idx < 0))
+					return idx;
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(!genradix_ptr_alloc_inlined(&l->items, idx,
+							GFP_NOWAIT|__GFP_NOWARN))) {
+				preempt_enable();
+				if (!genradix_ptr_alloc(&l->items, idx, GFP_KERNEL)) {
+					ida_free(&l->slots_allocated, idx);
+					return -ENOMEM;
+				}
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+				ida_free(&l->slots_allocated, idx);
+			else
+				lp->entries[lp->nr++] = idx;
+		}
+
+	idx = lp->entries[--lp->nr];
+	preempt_enable();
+
+	return idx;
+}
+
+/**
+ * fast_list_add - add an item to a fast_list
+ * @l:		list
+ * @item:	item to add
+ *
+ * Allocates a slot in the radix tree and stores to it and then returns the
+ * slot index, which must be passed to fast_list_remove().
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_add(struct fast_list *l, void *item)
+{
+	int idx = fast_list_get_idx(l);
+	if (idx < 0)
+		return idx;
+
+	*genradix_ptr_inlined(&l->items, idx) = item;
+	return idx;
+}
+
+/**
+ * fast_list_remove - remove an item from a fast_list
+ * @l:		list
+ * @idx:	item's slot index
+ *
+ * Zeroes out the slot in the radix tree and frees the slot for future
+ * fast_list_add() operations.
+ */
+void fast_list_remove(struct fast_list *l, unsigned idx)
+{
+	if (!idx)
+		return;
+
+	*genradix_ptr_inlined(&l->items, idx) = NULL;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+		while (lp->nr >= ARRAY_SIZE(lp->entries) / 2) {
+			ida_free(&l->slots_allocated, idx);
+			idx = lp->entries[--lp->nr];
+		}
+
+	lp->entries[lp->nr++] = idx;
+	preempt_enable();
+}
+
+void fast_list_exit(struct fast_list *l)
+{
+	/* XXX: warn if list isn't empty */
+	free_percpu(l->buffer);
+	ida_destroy(&l->slots_allocated);
+	genradix_free(&l->items);
+}
+
+int fast_list_init(struct fast_list *l)
+{
+	genradix_init(&l->items);
+	ida_init(&l->slots_allocated);
+	l->buffer = alloc_percpu(*l->buffer);
+	if (!l->buffer)
+		return -ENOMEM;
+	return 0;
+}
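
Usage ends up looking something like this (sketch only - the
i_fast_list field used to remember the slot index is purely for
illustration):

	struct fast_list	list;
	int			idx;

	fast_list_init(&list);

	/* add: stash the returned slot index for the later remove */
	idx = fast_list_add(&list, inode);
	if (idx < 0)
		return idx;
	inode->i_fast_list = idx;

	/* ... */

	fast_list_remove(&list, inode->i_fast_list);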
Linus Torvalds Oct. 2, 2024, 7:49 p.m. UTC | #4
On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
>
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
>
> Well, that's the problem, isn't it? :/
>
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

I much prefer the approach taken in your patch series, to let the
filesystem own the inode list and keeping the old model as the
"default list".

In many ways, that is how *most* of the VFS layer works - it exposes
helper functions that the filesystems can use (and most do), but
doesn't force them.

Yes, the VFS layer does force some things - you can't avoid using
dentries, for example, because that's literally how the VFS layer
deals with filenames (and things like mounting etc). And honestly, the
VFS layer does a better job of filename caching than any filesystem
really can do, and with the whole UNIX mount model, filenames
fundamentally cross filesystem boundaries anyway.

But clearly the VFS layer inode list handling isn't the best it can
be, and unless we can fix that in some fundamental way (and I don't
love the "let's use crazy lists instead of a simple one" models) I do
think that just letting filesystems do their own thing if they have
something better is a good model.

That's how we deal with all the basic IO, after all. The VFS layer has
lots of support routines, but filesystems don't *have* to use things
like generic_file_read_iter() and friends.

Yes, most filesystems do use generic_file_read_iter() in some form or
other (sometimes raw, sometimes wrapped with filesystem logic),
because it fits their model, it's convenient, and it handles all the
normal stuff well, but you don't *have* to use it if you have special
needs.

Taking that approach to the inode caching sounds sane to me, and I
generally like Dave's series. It looks like an improvement to me.

              Linus
Kent Overstreet Oct. 2, 2024, 8:28 p.m. UTC | #5
On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > [...]
> 
> I much prefer the approach taken in your patch series, to let the
> filesystem own the inode list and keeping the old model as the
> "default list".
> 
> In many ways, that is how *most* of the VFS layer works - it exposes
> helper functions that the filesystems can use (and most do), but
> doesn't force them.
> 
> Yes, the VFS layer does force some things - you can't avoid using
> dentries, for example, because that's literally how the VFS layer
> deals with filenames (and things like mounting etc). And honestly, the
> VFS layer does a better job of filename caching than any filesystem
> really can do, and with the whole UNIX mount model, filenames
> fundamentally cross filesystem boundaries anyway.
> 
> But clearly the VFS layer inode list handling isn't the best it can
> be, and unless we can fix that in some fundamental way (and I don't
> love the "let's use crazy lists instead of a simple one" models) I do
> think that just letting filesystems do their own thing if they have
> something better is a good model.

Well, I don't love adding more indirection and callbacks.

The underlying approach in this patchset of "just use the inode hash
table if that's available" - that I _do_ like, but this seems like
the wrong way to go about it, we're significantly adding to the amount
of special purpose "things" filesystems have to do if they want to
perform well.

Converting the standard inode hash table to an rhashtable (or more
likely, creating a new standard implementation and converting
filesystems one at a time) still needs to happen, and then the "use the
hash table for iteration" approach could use that without every
filesystem having to specialize.

Failing that, or even regardless, I think we do need either dlock-list
or fast-list. "I need some sort of generic list, but fast" is something
I've seen come up way too many times.
Dave Chinner Oct. 2, 2024, 10:23 p.m. UTC | #6
On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > [...]
> > 
> > Well, that's the problem, isn't it? :/
> > 
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> There is though; I haven't posted it yet because it still needs some
> work, but the concept works and performs about the same as dlock-list.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> 
> The thing that needs to be sorted before posting is that it can't shrink
> the radix tree. generic-radix-tree doesn't support shrinking, and I
> could add that, but then ida doesn't provide a way to query the highest
> id allocated (xarray doesn't support backwards iteration).

That's an interesting construct, but...

> So I'm going to try it using idr and see how that performs (idr is not
> really the right data structure for this, split ida and item radix tree
> is better, so might end up doing something else).
> 
> But - this approach with more work will work for the list_lru lock
> contention as well.

....  it isn't a generic solution because it is dependent on
blocking memory allocation succeeding for list_add() operations.

Hence this cannot do list operations under external synchronisation
constructs like spinlocks or rcu_read_lock(). It also introduces
interesting interactions with memory reclaim - what happens when we have
to add an object to one of these lists from memory reclaim context?

Taking the example of list_lru, this list construct will not work
for a variety of reasons. Some of them are:

- list_lru_add() being called from list_lru_add_obj() under RCU for
  memcg aware LRUs so cannot block and must not fail.
- list_lru_add_obj() is called under spinlocks from inode_lru_add(),
  the xfs buffer and dquot caches, the workingset code from under
  the address space mapping xarray lock, etc. Again, this must not
  fail.
- list_lru_add() operations can take place in large numbers in
  memory reclaim context (e.g. dentry reclaim drops inodes which
  adds them to the inode lru). Hence memory reclaim becomes even
  more dependent on PF_MEMALLOC memory allocation making forwards
  progress.
- adding long tail list latency to what are currently O(1) fast path
  operations (e.g. multiple allocations and tree splits for LRUs
  tracking millions of objects) is not desirable.
- LRU lists are -ordered- (it's right there in the name!) and this
  appears to be an unordered list construct.

So while I think this is an interesting idea that might be useful in
some cases, I don't think it is a viable generic scalable list
construct we can use in areas like list_lru or global list
management that run under external synchronisation mechanisms.

-Dave.
Dave Chinner Oct. 2, 2024, 11:17 p.m. UTC | #7
On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > > [...]
> > 
> > I much prefer the approach taken in your patch series, to let the
> > filesystem own the inode list and keeping the old model as the
> > "default list".
> > 
> > In many ways, that is how *most* of the VFS layer works - it exposes
> > helper functions that the filesystems can use (and most do), but
> > doesn't force them.
> > 
> > Yes, the VFS layer does force some things - you can't avoid using
> > dentries, for example, because that's literally how the VFS layer
> > deals with filenames (and things like mounting etc). And honestly, the
> > VFS layer does a better job of filename caching than any filesystem
> > really can do, and with the whole UNIX mount model, filenames
> > fundamentally cross filesystem boundaries anyway.
> > 
> > But clearly the VFS layer inode list handling isn't the best it can
> > be, and unless we can fix that in some fundamental way (and I don't
> > love the "let's use crazy lists instead of a simple one" models) I do
> > think that just letting filesystems do their own thing if they have
> > something better is a good model.
> 
> Well, I don't love adding more indirection and callbacks.

It's way better than open coding inode cache traversals everywhere.

The callback model is simply "call this function on every object",
and it allows implementations the freedom to decide how they are
going to run those callbacks.

For example, this abstraction allows XFS to parallelise the
traversal. We currently run the traversal across all inodes in a
single thread, but now that XFS is walking the inode cache we can
push each shard off to a workqueue and run each shard concurrently.
IOWs, we can actually make the traversal of large caches much, much
faster without changing the semantics of the operation the traversal
is trying to achieve.

We simply cannot do things like that without a new iteration model.
Abstraction is necessary to facilitate a new iteration model, and a
model that provides independent object callbacks allows scope for
concurrent processing of individual objects.

> The underlying approach in this patchset of "just use the inode hash
> table if that's available" - that I _do_ like, but this seems like
> the wrong way to go about it, we're significantly adding to the amount
> of special purpose "things" filesystems have to do if they want to
> perform well.

I've already addressed this in my response to Christian. This is a
mechanism that allows filesystems to be moved one-by-one to a new
generic cache and iteration implementation without impacting
existing code. Once we have that, scalability of the inode cache and
traversals should not be a reason for filesystems "doing their own
thing" because the generic infrastructure will be sufficient for
most filesystem implementations.

> Converting the standard inode hash table to an rhashtable (or more
> likely, creating a new standard implementation and converting
> filesystems one at a time) still needs to happen, and then the "use the
> hash table for iteration" approach could use that without every
> filesystem having to specialize.

Yes, but this still doesn't help filesystems like XFS where the
structure of the inode cache is highly optimised for the specific
on-disk and in-memory locality of inodes. We aren't going to be
converting XFS to a rhashtable based inode cache anytime soon
because it simply doesn't provide the functionality we require.
e.g. efficient lockless sequential inode number ordered traversal in
-every- inode cluster writeback operation.

> Failing that, or even regardless, I think we do need either dlock-list
> or fast-list. "I need some sort of generic list, but fast" is something
> I've seen come up way too many times.

There's nothing stopping you from using the dlist patchset for your
own purposes. It's public code - just make sure you retain the
correct attributions. :)

-Dave.
Kent Overstreet Oct. 2, 2024, 11:20 p.m. UTC | #8
On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > [...]
> > 
> > There is though; I haven't posted it yet because it still needs some
> > work, but the concept works and performs about the same as dlock-list.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> > 
> > The thing that needs to be sorted before posting is that it can't shrink
> > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > could add that, but then ida doesn't provide a way to query the highest
> > id allocated (xarray doesn't support backwards iteration).
> 
> That's an interesting construct, but...
> 
> > So I'm going to try it using idr and see how that performs (idr is not
> > really the right data structure for this, split ida and item radix tree
> > is better, so might end up doing something else).
> > 
> > But - this approach with more work will work for the list_lru lock
> > contention as well.
> 
> ....  it isn't a generic solution because it is dependent on
> blocking memory allocation succeeding for list_add() operations.
> 
> Hence this cannot do list operations under external synchronisation
> constructs like spinlocks or rcu_read_lock(). It also introduces
> interesting interactions with memory reclaim - what happens when we have
> to add an object to one of these lists from memory reclaim context?
> 
> Taking the example of list_lru, this list construct will not work
> for a variety of reasons. Some of them are:
> 
> - list_lru_add() being called from list_lru_add_obj() under RCU for
>   memcg aware LRUs so cannot block and must not fail.
> - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
>   the xfs buffer and dquot caches, the workingset code from under
>   the address space mapping xarray lock, etc. Again, this must not
>   fail.
> - list_lru_add() operations can take place in large numbers in
>   memory reclaim context (e.g. dentry reclaim drops inodes which
>   adds them to the inode lru). Hence memory reclaim becomes even
>   more dependent on PF_MEMALLOC memory allocation making forwards
>   progress.
> - adding long tail list latency to what are currently O(1) fast path
>   operations (e.g. multiple allocations and tree splits for LRUs
>   tracking millions of objects) is not desirable.
> 
> So while I think this is an interesting idea that might be useful in
> some cases, I don't think it is a viable generic scalable list
> construct we can use in areas like list_lru or global list
> management that run under external synchronisation mechanisms.

There are difficulties, but given the fundamental scalability and
locking issues with linked lists, I think this is the approach we want
if we can make it work.

A couple things that help - we've already determined that the inode LRU
can go away for most filesystems, and we can preallocate slots without
actually adding objects. Iteration will see NULLs that it skips over,
so we can't simply preallocate a slot for everything if nr_live_objects
/ nr_lru_objects is too big. But, we can certainly preallocate slots on
a given code path and then release them back to the percpu buffer if
they're not used.
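
Concretely, with the fast_list above that would be something like
this, where &lock stands for whatever external synchronisation the
caller already holds (sketch, error handling elided):

	/* where blocking allocation is safe: reserve a slot */
	idx = fast_list_get_idx(&list);

	/* later, under the spinlock: the slot is already allocated,
	 * so this is just a store */
	spin_lock(&lock);
	*genradix_ptr(&list.items, idx) = item;
	spin_unlock(&lock);

and a reserved slot that ends up unused just gets handed back with
fast_list_remove().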

> - LRU lists are -ordered- (it's right there in the name!) and this
>   appears to be an unordered list construct.

Yes, it is. But in actual practice cache replacement policy tends not to
matter nearly as much as people think; there's many papers showing real
world hit ratio of common algorithms is only a fudge factor from random
replacement - the main thing you want is an accessed bit (or counter, if
you want the analogous version of n-lru for n > 2), and we'll still have
that.