Message ID: 20241002014017.3801899-1-david@fromorbit.com (mailing list archive)
Series: vfs: improving inode cache iteration scalability
On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> The management of the sb->s_inodes list is a scalability limitation;
> it is protected by a single lock and every inode that is instantiated
> has to be added to the list. Those inodes then need to be removed from
> the list when evicted from cache. Hence every inode that moves through
> the VFS inode cache must take this global scope lock twice.
>
> This proves to be a significant limiting factor for concurrent file
> access workloads that repeatedly miss the dentry cache on lookup.
> Directory search and traversal workloads are particularly prone to
> these issues, though on XFS we have enough concurrency capability in
> file creation and unlink for the sb->s_inodes list to be a limitation
> there as well.
>
> Previous efforts to solve this problem have largely centered around
> reworking the sb->s_inodes list into something more scalable, such as
> this longstanding patchset does:
>
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/
>
> However, a recent discussion about inode cache behaviour that arose
> from the bcachefs 6.12-rc1 pull request opened a new direction for us
> to explore. With both XFS and bcachefs now providing their own
> per-superblock inode cache implementations, we should try to make use
> of these inode caches as first class citizens.
>
> With that new direction in mind, it became obvious that XFS could
> elide the sb->s_inodes list completely - "the best part is no part" -
> if iteration was not reliant on open-coded sb->s_inodes list walks.
>
> We already use the internal inode cache for iteration, and we have
> filters for selecting specific inodes to operate on with specific
> callback operations. If we had an abstraction for iterating all VFS
> inodes, we could easily implement that directly on the XFS inode
> cache.
>
> This is what this patchset aims to implement.
>
> There are two superblock iterator functions provided.
> The first is a generic iterator that provides safe, reference counted
> inodes for the callback to operate on. This is generally what most
> sb->s_inodes iterators use, and it allows the iterator to drop locks
> and perform blocking operations on the inode before moving to the
> next inode in the sb->s_inodes list.
>
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced inodes.
> Hence it should only find referenced inodes, and it has a check to
> skip unreferenced inodes. This flag does the same.
>
> However, I suspect this is now somewhat sub-optimal because LSMs can
> hold references to inodes beyond evict_inodes(), and they don't get
> torn down until after fsnotify evicts the referenced inodes it holds.
> However, the landlock LSM doesn't have checks for unreferenced inodes
> (i.e. doesn't use INO_ITER_REFERENCE), so this guard is not
> consistently applied.
>
> I'm undecided on how best to handle this, but it does not need to be
> solved for this patchset to work. fsnotify and landlock don't need to
> run -after- evict_inodes(), but moving them to before evict_inodes()
> means we now do three full inode cache iterations to evict all the
> inodes from the cache. That doesn't seem like a good idea when there
> might be hundreds of millions of cached inodes at unmount.
>
> Similarly, needing the iterator to be aware that there should be no
> unreferenced inodes left when they run doesn't seem like a good idea,
> either. So perhaps the answer is that the iterator checks for
> SB_ACTIVE (or some other similar flag) that indicates the superblock
> is being torn down and so will skip zero-referenced inodes
> automatically in this case. Like I said - this doesn't need to be
> solved right now, it's just something to be aware of.
> The second iterator is the "unsafe" iterator variant that only
> provides the callback with an existence guarantee. It does this by
> holding the rcu_read_lock() to guarantee that the inode is not freed
> from under the callback. There are no validity checks performed on
> the inode - it is entirely up to the callback to validate that the
> inode can be operated on safely.
>
> Hence the "unsafe" variant is only for very specific internal uses.
> Nobody should be adding new uses of this function, as there are very
> few good reasons for external access to inodes without holding a
> valid reference. I have not decided whether the unsafe callbacks
> should require a lockdep_assert_in_rcu_read_lock() check in them to
> clearly document the context under which they are running.
>
> The patchset converts all the open coded iterators to use these new
> iterator functions, which means the only use of sb->s_inodes is now
> confined to fs/super.c (iterator API) and fs/inode.c (add/remove
> API). A new superblock operation is then added to call out from the
> iterators into the filesystem to allow them to run the iteration
> instead of walking the sb->s_inodes list.
>
> XFS is then converted to use this new superblock operation. I didn't
> use the existing iterator function for this functionality right now
> as it is currently based on radix tree tag lookups. It also uses a
> batched 'lookup and lock' mechanism that complicated matters as I
> developed this code. Hence I open coded a new, simpler cache walk for
> testing purposes.
>
> Now that I have stuff working and I think I have the callback API
> semantics settled, batched radix tree lookups should still work to
> minimise the iteration overhead. Also, we might want to tag VFS
> inodes in the radix tree so that we can filter them efficiently for
> traversals. This would allow us to use the existing generic inode
> cache walker rather than the separate variant this patch set
> implements.
> This can be done as future work, though.
>
> In terms of scalability improvements, a quick 'will it scale' test
> demonstrates where the sb->s_inodes list hurts. Running a sharded,
> share-nothing cold cache workload with 100,000 files per thread in
> per-thread directories gives the following results on a 4-node 64p
> machine with 128GB RAM.
>
> The workloads "walk", "chmod" and "unlink" are all directory
> traversal workloads that stream cold cache inodes into the cache.
> There is enough memory on this test machine that these inodes are not
> being reclaimed during the workload, and are being freed between
> steps via drop_caches (which iterates the inode cache and so
> explicitly tests the new iteration APIs!). Hence the sb->s_inodes
> scalability issues aren't as bad in these tests as when memory is
> tight and inodes are being reclaimed (i.e. the issues are worse in
> real workloads).
>
> The "bulkstat" workload uses the XFS bulkstat ioctl to iterate inodes
> via walking the internal inode btrees. It uses d_mark_dontcache() so
> it is actually tearing down each inode as soon as it has been sampled
> by the bulkstat code. Hence it is doing two sb->s_inodes list
> manipulations per inode and so shows scalability issues much earlier
> than the other workloads.
>
> Before:
>
> Filesystem    Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
>  xfs         400000     4      4.269   3.225   4.557   7.316    1.306
>  xfs         800000     8      4.844   3.227   4.702   7.905    1.908
>  xfs        1600000    16      6.286   3.296   5.592   8.838    4.392
>  xfs        3200000    32      8.912   5.681   8.505  11.724    7.085
>  xfs        6400000    64     15.344  11.144  14.162  18.604   15.494
>
> Bulkstat starts to show issues at 8 threads, walk and chmod between
> 16 and 32 threads, and unlink is limited by internal XFS stuff.
> Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
> sb->s_inodes list management.
> After:
>
> Filesystem    Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
>  xfs         400000     4      4.140   3.502   4.154   7.242    1.164
>  xfs         800000     8      4.637   2.836   4.444   7.896    1.093
>  xfs        1600000    16      5.549   3.054   5.213   8.696    1.107
>  xfs        3200000    32      8.387   3.218   6.867  10.668    1.125
>  xfs        6400000    64     14.112   3.953  10.365  18.620    1.270
>
> When patched, walk shows little in the way of scalability
> degradation out to 64 threads, chmod is significantly improved at
> 32-64 threads, and bulkstat shows perfect scalability out to 64
> threads now.
>
> I did a couple of other longer running, higher inode count tests with
> bulkstat to get an idea of inode cache streaming rates - 32 million
> inodes scanned in 4.4 seconds at 64 threads. That's about 7.2 million
> inodes/s being streamed through the inode cache, with IO rates
> peaking well above 5.5GB/s (near IO bound).
>
> Hence raw VFS inode cache throughput sees a ~17x scalability
> improvement on XFS at 64 threads (and probably a -lot- more on higher
> CPU count machines). That's far better performance than I ever got
> from the dlist conversion of the sb->s_inodes list in previous
> patchsets, so this seems like a much better direction to be heading
> for optimising the way we cache inodes.
>
> I haven't done a lot of testing on this patchset yet - it boots and
> appears to work OK for block devices, ext4 and XFS, but checking that
> stuff like quota on/off is still working properly on ext4 hasn't been
> done yet.
>
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there.
So really a way that would've solved this issue generically would have
been my preference. But the reality is that xfs has been doing that
private inode cache for a long time and reading through 5/7 and 6/7 it
clearly provides value for xfs. So I find it hard to object to adding
->iter_vfs_inodes() (though I would like to
s/iter_vfs_inodes/iter_inodes/g).
On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > What do people think of moving towards per-sb inode caching and
> > traversal mechanisms like this?
>
> Patches 1-4 are great cleanups that I would like us to merge even
> independent of the rest.

Yes, they make it much easier to manage the iteration code.

> I don't have big conceptual issues with the series otherwise. The
> only thing that makes me a bit uneasy is that we are now providing an
> api that may encourage filesystems to do their own inode caching even
> if they don't really have a need for it just because it's there. So
> really a way that would've solved this issue generically would have
> been my preference.

Well, that's the problem, isn't it? :/

There really isn't a good generic solution for global list access and
management. The dlist stuff kinda works, but it still has significant
overhead and doesn't get rid of spinlock contention completely because
of the lack of locality between list add and remove operations.

i.e. dlist is optimised for low contention add operations (i.e. local
to the CPU). However, removal is not a local operation - it almost
always happens on a different CPU to the add operation. Hence removal
always pulls the list and lock away from the CPU that "owns" them, and
hence there is still contention when inodes are streaming through
memory. This causes enough overhead that dlist operations are still
very visible in CPU profiles during scalability testing...

XFS (and now bcachefs) have their own per-sb inode cache
implementations, and hence for them the sb->s_inodes list is pure
overhead. If we restructure the generic inode cache infrastructure to
also be per-sb (this suggestion from Linus was what led me to this
patch set), then they will also likely not need the sb->s_inodes list,
too.

That's the longer term "generic solution" to the sb->s_inodes list
scalability problem (i.e.
get rid of it!), but it's a much larger and longer term undertaking.
Once we know what that new generic inode cache infrastructure looks
like, we'll probably only want to be converting one filesystem at a
time to the new infrastructure. We'll need infrastructure to allow
alternative per-sb iteration mechanisms for such a conversion to take
place - the converted filesystems will likely call a generic
->iter_vfs_inodes() implementation based on the per-sb inode cache
infrastructure rather than iterating sb->s_inodes. Eventually that
generic method replaces the sb->s_inodes iteration, and we end up with
only a couple of filesystems using the callout again.

> But the reality is that xfs has been doing that private inode cache
> for a long time and reading through 5/7 and 6/7 it clearly provides
> value for xfs. So I find it hard to object to adding
> ->iter_vfs_inodes() (Though I would like to
> s/iter_vfs_inodes/iter_inodes/g).

I named it that way because, from my filesystem centric point of view,
there is a very distinct separation between VFS and filesystem inodes.
The VFS inode (struct inode) is a subset of the filesystem inode
structure and, in XFS's case, a subset of the filesystem inode life
cycle, too. i.e. this method should not iterate cached filesystem
inodes that exist outside the VFS inode lifecycle or VFS visibility
even though they may be present in the filesystem's internal inode
cache.

-Dave.
On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > What do people think of moving towards per-sb inode caching and
> > > traversal mechanisms like this?
> >
> > Patches 1-4 are great cleanups that I would like us to merge even
> > independent of the rest.
>
> Yes, they make it much easier to manage the iteration code.
>
> > I don't have big conceptual issues with the series otherwise. The
> > only thing that makes me a bit uneasy is that we are now providing
> > an api that may encourage filesystems to do their own inode caching
> > even if they don't really have a need for it just because it's
> > there. So really a way that would've solved this issue generically
> > would have been my preference.
>
> Well, that's the problem, isn't it? :/
>
> There really isn't a good generic solution for global list access
> and management. The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

There is though; I haven't posted it yet because it still needs some
work, but the concept works and performs about the same as dlock-list.

https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list

The thing that needs to be sorted before posting is that it can't
shrink the radix tree. generic-radix-tree doesn't support shrinking,
and I could add that, but then ida doesn't provide a way to query the
highest id allocated (xarray doesn't support backwards iteration).

So I'm going to try it using idr and see how that performs (idr is not
really the right data structure for this, split ida and item radix
tree is better, so might end up doing something else).

But - this approach with more work will work for the list_lru lock
contention as well.
From 32cb8103ecfacdd5ed8e1eb390221c3f8339de6f Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 28 Sep 2024 16:22:38 -0400
Subject: [PATCH] lib/fast_list.c

A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding,
removing and iteration is generally lockless, only hitting the lock in
ida when the percpu buffer is full or empty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/fast_list.h b/include/linux/fast_list.h
new file mode 100644
index 000000000000..7d5d8592864d
--- /dev/null
+++ b/include/linux/fast_list.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_FAST_LIST_H
+#define _LINUX_FAST_LIST_H
+
+#include <linux/generic-radix-tree.h>
+#include <linux/idr.h>
+#include <linux/percpu.h>
+
+struct fast_list_pcpu;
+
+struct fast_list {
+	GENRADIX(void *)	items;
+	struct ida		slots_allocated;
+	struct fast_list_pcpu	*buffer;
+};
+
+int fast_list_get_idx(struct fast_list *l);
+int fast_list_add(struct fast_list *l, void *item);
+void fast_list_remove(struct fast_list *l, unsigned idx);
+void fast_list_exit(struct fast_list *l);
+int fast_list_init(struct fast_list *l);
+
+#endif /* _LINUX_FAST_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41..85cf5a0d36b1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,7 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o fast_list.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
 obj-y += string_helpers.o
 obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/lib/fast_list.c
b/lib/fast_list.c
new file mode 100644
index 000000000000..bbb69bb29687
--- /dev/null
+++ b/lib/fast_list.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fast, unordered lists
+ *
+ * Supports add, remove, and iterate
+ *
+ * Underneath, they're a radix tree and an IDA, with a percpu buffer
+ * for slot allocation and freeing.
+ *
+ * This means that adding, removing, and iterating over items is
+ * lockless, except when refilling/emptying the percpu slot buffers.
+ */
+
+#include <linux/fast_list.h>
+
+struct fast_list_pcpu {
+	size_t			nr;
+	size_t			entries[31];
+};
+
+/**
+ * fast_list_get_idx - get a slot in a fast_list
+ * @l: list to get slot in
+ *
+ * This allocates a slot in the radix tree without storing to it, so
+ * that we can take the potential memory allocation failure early and
+ * do the list add later when we can't take an allocation failure.
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_get_idx(struct fast_list *l)
+{
+	int idx;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(!lp->nr))
+		while (lp->nr <= ARRAY_SIZE(lp->entries) / 2) {
+			idx = ida_alloc_range(&l->slots_allocated, 1, ~0,
+					      GFP_NOWAIT|__GFP_NOWARN);
+			if (unlikely(idx < 0)) {
+				preempt_enable();
+				idx = ida_alloc_range(&l->slots_allocated, 1, ~0,
+						      GFP_KERNEL);
+				if (unlikely(idx < 0))
+					return idx;
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(!genradix_ptr_alloc_inlined(&l->items, idx,
+						GFP_NOWAIT|__GFP_NOWARN))) {
+				preempt_enable();
+				if (!genradix_ptr_alloc(&l->items, idx, GFP_KERNEL)) {
+					ida_free(&l->slots_allocated, idx);
+					return -ENOMEM;
+				}
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+				ida_free(&l->slots_allocated, idx);
+			else
+				lp->entries[lp->nr++] = idx;
+		}
+
+	idx = lp->entries[--lp->nr];
+	preempt_enable();
+
+	return idx;
+}
+
+/**
+ * fast_list_add - add an item to a
fast_list
+ * @l: list
+ * @item: item to add
+ *
+ * Allocates a slot in the radix tree and stores to it and then
+ * returns the slot index, which must be passed to fast_list_remove().
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_add(struct fast_list *l, void *item)
+{
+	int idx = fast_list_get_idx(l);
+	if (idx < 0)
+		return idx;
+
+	*genradix_ptr_inlined(&l->items, idx) = item;
+	return idx;
+}
+
+/**
+ * fast_list_remove - remove an item from a fast_list
+ * @l: list
+ * @idx: item's slot index
+ *
+ * Zeroes out the slot in the radix tree and frees the slot for future
+ * fast_list_add() operations.
+ */
+void fast_list_remove(struct fast_list *l, unsigned idx)
+{
+	if (!idx)
+		return;
+
+	*genradix_ptr_inlined(&l->items, idx) = NULL;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+		while (lp->nr >= ARRAY_SIZE(lp->entries) / 2) {
+			ida_free(&l->slots_allocated, idx);
+			idx = lp->entries[--lp->nr];
+		}
+
+	lp->entries[lp->nr++] = idx;
+	preempt_enable();
+}
+
+void fast_list_exit(struct fast_list *l)
+{
+	/* XXX: warn if list isn't empty */
+	free_percpu(l->buffer);
+	ida_destroy(&l->slots_allocated);
+	genradix_free(&l->items);
+}
+
+int fast_list_init(struct fast_list *l)
+{
+	genradix_init(&l->items);
+	ida_init(&l->slots_allocated);
+	l->buffer = alloc_percpu(*l->buffer);
+	if (!l->buffer)
+		return -ENOMEM;
+	return 0;
+}
On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
>
> > I don't have big conceptual issues with the series otherwise. The
> > only thing that makes me a bit uneasy is that we are now providing
> > an api that may encourage filesystems to do their own inode caching
> > even if they don't really have a need for it just because it's
> > there. So really a way that would've solved this issue generically
> > would have been my preference.
>
> Well, that's the problem, isn't it? :/
>
> There really isn't a good generic solution for global list access
> and management. The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

I much prefer the approach taken in your patch series, to let the
filesystem own the inode list and keeping the old model as the
"default list".

In many ways, that is how *most* of the VFS layer works - it exposes
helper functions that the filesystems can use (and most do), but
doesn't force them.

Yes, the VFS layer does force some things - you can't avoid using
dentries, for example, because that's literally how the VFS layer
deals with filenames (and things like mounting etc). And honestly, the
VFS layer does a better job of filename caching than any filesystem
really can do, and with the whole UNIX mount model, filenames
fundamentally cross filesystem boundaries anyway.

But clearly the VFS layer inode list handling isn't the best it can
be, and unless we can fix that in some fundamental way (and I don't
love the "let's use crazy lists instead of a simple one" models) I do
think that just letting filesystems do their own thing if they have
something better is a good model.

That's how we deal with all the basic IO, after all.
The VFS layer has lots of support routines, but filesystems don't
*have* to use things like generic_file_read_iter() and friends.

Yes, most filesystems do use generic_file_read_iter() in some form or
other (sometimes raw, sometimes wrapped with filesystem logic),
because it fits their model, it's convenient, and it handles all the
normal stuff well, but you don't *have* to use it if you have special
needs.

Taking that approach to the inode caching sounds sane to me, and I
generally like Dave's series. It looks like an improvement to me.

              Linus
On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> >
> > > I don't have big conceptual issues with the series otherwise. The
> > > only thing that makes me a bit uneasy is that we are now
> > > providing an api that may encourage filesystems to do their own
> > > inode caching even if they don't really have a need for it just
> > > because it's there. So really a way that would've solved this
> > > issue generically would have been my preference.
> >
> > Well, that's the problem, isn't it? :/
> >
> > There really isn't a good generic solution for global list access
> > and management. The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
>
> I much prefer the approach taken in your patch series, to let the
> filesystem own the inode list and keeping the old model as the
> "default list".
>
> In many ways, that is how *most* of the VFS layer works - it exposes
> helper functions that the filesystems can use (and most do), but
> doesn't force them.
>
> Yes, the VFS layer does force some things - you can't avoid using
> dentries, for example, because that's literally how the VFS layer
> deals with filenames (and things like mounting etc). And honestly,
> the VFS layer does a better job of filename caching than any
> filesystem really can do, and with the whole UNIX mount model,
> filenames fundamentally cross filesystem boundaries anyway.
Well, I don't love adding more indirection and callbacks.

The underlying approach in this patchset of "just use the inode hash
table if that's available" - that I _do_ like, but this seems like the
wrong way to go about it; we're significantly adding to the amount of
special purpose "things" filesystems have to do if they want to
perform well.

Converting the standard inode hash table to an rhashtable (or more
likely, creating a new standard implementation and converting
filesystems one at a time) still needs to happen, and then the "use
the hash table for iteration" approach could use that without every
filesystem having to specialize.

Failing that, or even regardless, I think we do need either dlock-list
or fast-list. "I need some sort of generic list, but fast" is
something I've seen come up way too many times.
On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > What do people think of moving towards per-sb inode caching and
> > > > traversal mechanisms like this?
> > >
> > > Patches 1-4 are great cleanups that I would like us to merge even
> > > independent of the rest.
> >
> > Yes, they make it much easier to manage the iteration code.
> >
> > > I don't have big conceptual issues with the series otherwise. The
> > > only thing that makes me a bit uneasy is that we are now
> > > providing an api that may encourage filesystems to do their own
> > > inode caching even if they don't really have a need for it just
> > > because it's there. So really a way that would've solved this
> > > issue generically would have been my preference.
> >
> > Well, that's the problem, isn't it? :/
> >
> > There really isn't a good generic solution for global list access
> > and management. The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
>
> There is though; I haven't posted it yet because it still needs some
> work, but the concept works and performs about the same as
> dlock-list.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
>
> The thing that needs to be sorted before posting is that it can't
> shrink the radix tree. generic-radix-tree doesn't support shrinking,
> and I could add that, but then ida doesn't provide a way to query the
> highest id allocated (xarray doesn't support backwards iteration).

That's an interesting construct, but...
> So I'm going to try it using idr and see how that performs (idr is
> not really the right data structure for this, split ida and item
> radix tree is better, so might end up doing something else).
>
> But - this approach with more work will work for the list_lru lock
> contention as well.

.... it isn't a generic solution because it is dependent on blocking
memory allocation succeeding for list_add() operations. Hence this
cannot do list operations under external synchronisation constructs
like spinlocks or rcu_read_lock().

It also introduces interesting interactions with memory reclaim - what
happens when we have to add an object to one of these lists from
memory reclaim context?

Taking the example of list_lru, this list construct will not work for
a variety of reasons. Some of them are:

- list_lru_add() being called from list_lru_add_obj() under RCU for
  memcg aware LRUs, so it cannot block and must not fail.

- list_lru_add_obj() is called under spinlocks from inode_lru_add(),
  the xfs buffer and dquot caches, the workingset code from under the
  address space mapping xarray lock, etc. Again, this must not fail.

- list_lru_add() operations can take place in large numbers in memory
  reclaim context (e.g. dentry reclaim drops inodes, which adds them
  to the inode lru). Hence memory reclaim becomes even more dependent
  on PF_MEMALLOC memory allocation making forwards progress.

- adding long tail list latency to what are currently O(1) fast path
  operations (e.g. multiple allocations and tree splits for LRUs
  tracking millions of objects) is not desirable.

- LRU lists are -ordered- (it's right there in the name!) and this
  appears to be an unordered list construct.

So while I think this is an interesting idea that might be useful in
some cases, I don't think it is a viable generic scalable list
construct we can use in areas like list_lru or global list management
that run under external synchronisation mechanisms.

-Dave.
On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > >
> > > > I don't have big conceptual issues with the series otherwise.
> > > > The only thing that makes me a bit uneasy is that we are now
> > > > providing an api that may encourage filesystems to do their own
> > > > inode caching even if they don't really have a need for it just
> > > > because it's there. So really a way that would've solved this
> > > > issue generically would have been my preference.
> > >
> > > Well, that's the problem, isn't it? :/
> > >
> > > There really isn't a good generic solution for global list access
> > > and management. The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> >
> > I much prefer the approach taken in your patch series, to let the
> > filesystem own the inode list and keeping the old model as the
> > "default list".
> >
> > In many ways, that is how *most* of the VFS layer works - it
> > exposes helper functions that the filesystems can use (and most
> > do), but doesn't force them.
> >
> > Yes, the VFS layer does force some things - you can't avoid using
> > dentries, for example, because that's literally how the VFS layer
> > deals with filenames (and things like mounting etc). And honestly,
> > the VFS layer does a better job of filename caching than any
> > filesystem really can do, and with the whole UNIX mount model,
> > filenames fundamentally cross filesystem boundaries anyway.
> >
> > But clearly the VFS layer inode list handling isn't the best it
> > can be, and unless we can fix that in some fundamental way (and I
> > don't love the "let's use crazy lists instead of a simple one"
> > models) I do think that just letting filesystems do their own thing
> > if they have something better is a good model.
>
> Well, I don't love adding more indirection and callbacks.

It's way better than open coding inode cache traversals everywhere.
The callback model is simply "call this function on every object", and
it allows implementations the freedom to decide how they are going to
run those callbacks.

For example, this abstraction allows XFS to parallelise the traversal.
We currently run the traversal across all inodes in a single thread,
but now that XFS is walking the inode cache we can push each shard off
to a workqueue and run each shard concurrently. IOWs, we can actually
make the traversal of large caches much, much faster without changing
the semantics of the operation the traversal is trying to achieve.

We simply cannot do things like that without a new iteration model.
Abstraction is necessary to facilitate a new iteration model, and a
model that provides independent object callbacks allows scope for
concurrent processing of individual objects.

> The underlying approach in this patchset of "just use the inode hash
> table if that's available" - that I _do_ like, but this seems like
> the wrong way to go about it, we're significantly adding to the
> amount of special purpose "things" filesystems have to do if they
> want to perform well.

I've already addressed this in my response to Christian. This is a
mechanism that allows filesystems to be moved one-by-one to a new
generic cache and iteration implementation without impacting existing
code.
Once we have that, scalability of the inode cache and traversals
should not be a reason for filesystems "doing their own thing" because
the generic infrastructure will be sufficient for most filesystem
implementations.

> Converting the standard inode hash table to an rhashtable (or more
> likely, creating a new standard implementation and converting
> filesystems one at a time) still needs to happen, and then the "use the
> hash table for iteration" approach could use that without every
> filesystem having to specialize.

Yes, but this still doesn't help filesystems like XFS where the
structure of the inode cache is highly optimised for the specific
on-disk and in-memory locality of inodes. We aren't going to be
converting XFS to a rhashtable based inode cache anytime soon because
it simply doesn't provide the functionality we require. e.g. efficient
lockless sequential inode number ordered traversal in -every- inode
cluster writeback operation.

> Failing that, or even regardless, I think we do need either dlock-list
> or fast-list. "I need some sort of generic list, but fast" is something
> I've seen come up way too many times.

There's nothing stopping you from using the dlist patchset for your
own purposes. It's public code - just make sure you retain the correct
attributions. :)

-Dave.
On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > > What do people think of moving towards per-sb inode caching and
> > > > > traversal mechanisms like this?
> > > >
> > > > Patches 1-4 are great cleanups that I would like us to merge even
> > > > independent of the rest.
> > >
> > > Yes, they make it much easier to manage the iteration code.
> > >
> > > > I don't have big conceptual issues with the series otherwise. The only
> > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > that may encourage filesystems to do their own inode caching even if
> > > > they don't really have a need for it just because it's there. So really
> > > > a way that would've solved this issue generically would have been my
> > > > preference.
> > >
> > > Well, that's the problem, isn't it? :/
> > >
> > > There really isn't a good generic solution for global list access
> > > and management. The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> >
> > There is though; I haven't posted it yet because it still needs some
> > work, but the concept works and performs about the same as dlock-list.
> >
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> >
> > The thing that needs to be sorted before posting is that it can't shrink
> > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > could add that, but then ida doesn't provide a way to query the highest
> > id allocated (xarray doesn't support backwards iteration).
>
> That's an interesting construct, but...
>
> > So I'm going to try it using idr and see how that performs (idr is not
> > really the right data structure for this, split ida and item radix tree
> > is better, so might end up doing something else).
> >
> > But - this approach with more work will work for the list_lru lock
> > contention as well.
>
> .... it isn't a generic solution because it is dependent on
> blocking memory allocation succeeding for list_add() operations.
>
> Hence this cannot do list operations under external synchronisation
> constructs like spinlocks or rcu_read_lock(). It also introduces
> interesting interactions with memory reclaim - what happens when we
> have to add an object to one of these lists from memory reclaim
> context?
>
> Taking the example of list_lru, this list construct will not work
> for a variety of reasons. Some of them are:
>
> - list_lru_add() being called from list_lru_add_obj() under RCU for
>   memcg aware LRUs so cannot block and must not fail.
> - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
>   the xfs buffer and dquot caches, the workingset code from under
>   the address space mapping xarray lock, etc. Again, this must not
>   fail.
> - list_lru_add() operations can take place in large numbers in
>   memory reclaim context (e.g. dentry reclaim drops inodes which
>   adds them to the inode lru). Hence memory reclaim becomes even
>   more dependent on PF_MEMALLOC memory allocation making forwards
>   progress.
> - adding long tail list latency to what are currently O(1) fast path
>   operations (e.g. multiple allocations/tree splits for LRUs
>   tracking millions of objects) is not desirable.
>
> So while I think this is an interesting idea that might be useful in
> some cases, I don't think it is a viable generic scalable list
> construct we can use in areas like list_lru or global list
> management that run under external synchronisation mechanisms.
There are difficulties, but given the fundamental scalability and
locking issues with linked lists, I think this is the approach we want
if we can make it work.

A couple of things that help: we've already determined that the inode
LRU can go away for most filesystems, and we can preallocate slots
without actually adding objects.

Iteration will see NULLs that it skips over, so we can't simply
preallocate a slot for everything if nr_live_objects / nr_lru_objects
is too big. But we can certainly preallocate slots on a given code
path and then release them back to the percpu buffer if they're not
used.

> - LRU lists are -ordered- (it's right there in the name!) and this
>   appears to be an unordered list construct.

Yes, it is. But in actual practice cache replacement policy tends not
to matter nearly as much as people think; there are many papers
showing that the real world hit ratio of common algorithms is only a
fudge factor away from random replacement - the main thing you want is
an accessed bit (or counter, if you want the analogous version of
n-lru for n > 2), and we'll still have that.