Message ID: dtolpfivc4fvdfbqgmljygycyqfqoranpsjty4sle7ouydycez@aw7v34oibdhm (mailing list archive)
Series: [GIT,PULL] bcachefs changes for 6.12-rc1
On Sat, 21 Sept 2024 at 12:28, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > We're now using an rhashtable instead of the system inode hash table; > this is another significant performance improvement on multithreaded > metadata workloads, eliminating more lock contention. So I've pulled this, but I reacted to this issue - what is the load where the inode hash table is actually causing issues? Because honestly, the reason we're using that "one single lock" for the inode hash table is that nobody has ever bothered. In particular, it *should* be reasonably straightforward (lots of small details, but not fundamentally hard) to just make the inode hash table be a "bl" hash - with the locking done per-bucket, not using one global lock. But nobody has ever seemed to care before. Because the inode hash just isn't all that heavily used, since normal loads don't look up inodes directly (they look up dentries, and you find the inodes off those - and the *dentry* hash scales well thanks to both RCU and doing the bucket lock thing). So the fact that bcachefs cares makes me go "Why?" Normal filesystems *never* call ilookup() and friends. Why does bcachefs do it so much that you decided that you need to use a whole other hashtable? Linus
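A minimal sketch of the per-bucket "bl" locking Linus describes, assuming struct inode's hash linkage were converted from an hlist_node to an hlist_bl_node - the table size and the hash_find() helper are illustrative, not the kernel's actual layout:

    #include <linux/list_bl.h>

    /* Bit 0 of each bucket head doubles as that bucket's lock, so two
     * lookups contend only when they hash to the same bucket. */
    static struct hlist_bl_head inode_hashtable[1 << 16];

    static struct inode *hash_find(struct hlist_bl_head *b, unsigned long ino)
    {
        struct hlist_bl_node *node;
        struct inode *inode;

        hlist_bl_lock(b);               /* serializes this bucket only */
        hlist_bl_for_each_entry(inode, node, b, i_hash) {
            if (inode->i_ino == ino) {
                /* real code would take a reference before unlocking */
                hlist_bl_unlock(b);
                return inode;
            }
        }
        hlist_bl_unlock(b);
        return NULL;
    }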
The pull request you sent on Sat, 21 Sep 2024 15:27:58 -0400:
> git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b3f391fddf3cfaadda59ec8da8fd17f4520bbf42
Thank you!
On Mon, 23 Sept 2024 at 10:18, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > We're now using an rhashtable instead of the system inode hash table; > > this is another significant performance improvement on multithreaded > > metadata workloads, eliminating more lock contention. > > So I've pulled this, but I reacted to this issue - what is the load > where the inode hash table is actually causing issues? Oh, and I also only now noticed that a third of the commits are fairly recent and from after the merge window even started. Maybe there's a good reason for that, but it sure wasn't explained in the pull request. Grr. Linus
On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote: > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > We're now using an rhashtable instead of the system inode hash table; > > this is another significant performance improvement on multithreaded > > metadata workloads, eliminating more lock contention. > > So I've pulled this, but I reacted to this issue - what is the load > where the inode hash table is actually causing issues? > > Because honestly, the reason we're using that "one single lock" for > the inode hash table is that nobody has ever bothered. > > In particular, it *should* be reasonably straightforward (lots of > small details, but not fundamentally hard) to just make the inode hash > table be a "bl" hash - with the locking done per-bucket, not using one > global lock. Dave was working on that - he posted patches and it seemed like they were mostly done, not sure what happened with those. > But nobody has ever seemed to care before. Because the inode hash just > isn't all that heavily used, since normal loads don't look up inodes > directly (they look up dentries, and you find the inodes off those - > and the *dentry* hash scales well thanks to both RCU and doing the > bucket lock thing). > > So the fact that bcachefs cares makes me go "Why?" I've seen benchmarks (not just on bcachefs, ext4 as well) where it was actually _really_ heavy. They were synthetics, but I'd expect to see the same effect from workloads where we're streaming a lot of inodes through the cache - think rsync. Another data point is that the btree key cache work I just did was also for exactly this type of situation - it addressed lock contention when streaming items into and out of the key cache, which are mostly inodes (it's also used for the alloc btree) - and I had reports from multiple users it was a drastic improvement, much more than expected, and eliminated the main source of our "srcu lock held too long" warnings - and if lock contention is bad enough to trip a 10 second warning, you know it's bad, heh. I try to really jump on these sorts of lock contention issues because I've seen many instances where they caused performance cliffs - in practice they're often O(n^2) issues because as your workload scales up it's usually not just number of instances where you take the lock that increases, usually something else is scaling too - even just increased cacheline bouncing from the lock contention itself will be problematic.
On Mon, Sep 23, 2024 at 12:07:24PM GMT, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 10:18, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > We're now using an rhashtable instead of the system inode hash table; > > > this is another significant performance improvement on multithreaded > > > metadata workloads, eliminating more lock contention. > > > > So I've pulled this, but I reacted to this issue - what is the load > > where the inode hash table is actually causing issues? > > Oh, and I also only now noticed that a third of the commits are fairly > recent and from after the merge window even started. > > Maybe there's a good reason for that, but it sure wasn't explained in > the pull request. Grr. I rebased to drop a patch that conflicts with Andrew's tree (the PF_MEMALLOC_NORECLAIM stuff, which I would still like to chat with him about since I missed him at Plumbers...)
On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote: > On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote: > > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > We're now using an rhashtable instead of the system inode hash table; > > > this is another significant performance improvement on multithreaded > > > metadata workloads, eliminating more lock contention. > > > > So I've pulled this, but I reacted to this issue - what is the load > > where the inode hash table is actually causing issues? > > > > Because honestly, the reason we're using that "one single lock" for > > the inode hash table is that nobody has ever bothered. > > > > In particular, it *should* be reasonably straightforward (lots of > > small details, but not fundamentally hard) to just make the inode hash > > table be a "bl" hash - with the locking done per-bucket, not using one > > global lock. > > Dave was working on that - he posted patches and it seemed like they > were mostly done, not sure what happened with those. I lost interest in that patchset a long while ago. The last posting I did was largely as a community service to get the completed, lockdep and RTPREEMPT compatible version of the hashbl infrastructure it needed out there for other people to be able to easily push this forward if they needed it. That was here: https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/ and I posted it because Kent was asking about it because his attempts to convert the inode hash to use rhashtables weren't working. I've suggested multiple times since then when people have asked me about that patch set that they are free to pick it up and finish it off themselves. Unfortunately, that usually results in silence, a comment of "I don't have time for that", and/or they go off and hack around the issue in some other way.... > > But nobody has ever seemed to care before. Because the inode hash just > > isn't all that heavily used, since normal loads don't look up inodes > > directly (they look up dentries, and you find the inodes off those - > > and the *dentry* hash scales well thanks to both RCU and doing the > > bucket lock thing). Not entirely true. Yes, the dentry cache protects the inode hash from contention when the workload repeatedly hits cached dentries. However, the problematic workload is cold cache operations where the dentry cache repeatedly misses. This places all the operational concurrency directly on the inode hash as new inodes are inserted into the hash. Add memory reclaim and that adds contention as it removes inodes from the hash on eviction. This happens in workloads that cycle lots of inodes through memory. Concurrent cold cache lookups, creates and unlinks hammer the global filesystem inode hash lock at more than 4 active tasks, especially when there is also concurrent memory reclaim running evicting inodes from the inode hash. This inode cache miss contention issue was sort-of hacked around recently by commit 7180f8d91fcb ("vfs: add rcu-based find_inode variants for iget ops"). This allowed filesystems to use -partially- RCU-based lookups rather than fully locked lookups. In the case of a hit, it will be an RCU operation, but as you state, cache hits on the inode hash are the rare case, not the common case. I say "sort-of hacked around" because the RCU lookup only avoids the inode hash lock on the initial lookup that misses.
Then insert is still done under the inode_hash_lock and so inode_hash_lock contention still definitely exists in this path. All the above commit did was move the catastrophic cacheline contention breakdown point to be higher than the test workload that was being run could hit. In review of that commit, I suggested that the inode hash_bl conversion patches be finished off instead, but the response was "I don't have time for that". And so here we are again.... -Dave.
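The shape of that partial fix, as a hedged sketch - rcu_find_inode() is a stand-in name, not the helper the commit actually added, and the miss path shows where inode_hash_lock contention remains:

    static struct inode *lookup_or_create(struct super_block *sb, unsigned long ino)
    {
        struct inode *inode;

        /* Hit path: lockless RCU hash walk, no inode_hash_lock at all. */
        rcu_read_lock();
        inode = rcu_find_inode(sb, ino);    /* stand-in name */
        if (inode && !igrab(inode))
            inode = NULL;                   /* being evicted: treat as a miss */
        rcu_read_unlock();
        if (inode)
            return inode;

        /* Miss path: allocate, recheck, insert - still fully serialized
         * on the global lock, which is the contention described above. */
        inode = new_inode(sb);
        spin_lock(&inode_hash_lock);
        /* ... recheck for a racing insert, then hash the new inode ... */
        spin_unlock(&inode_hash_lock);
        return inode;
    }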
On Tue, Sep 24, 2024 at 10:26:56AM GMT, Dave Chinner wrote: > On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote: > > On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote: > > > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > We're now using an rhashtable instead of the system inode hash table; > > > > this is another significant performance improvement on multithreaded > > > > metadata workloads, eliminating more lock contention. > > > > > > So I've pulled this, but I reacted to this issue - what is the load > > > where the inode hash table is actually causing issues? > > > > > > Because honestly, the reason we're using that "one single lock" for > > > the inode hash table is that nobody has ever bothered. > > > > > > In particular, it *should* be reasonably straightforward (lots of > > > small details, but not fundamentally hard) to just make the inode hash > > > table be a "bl" hash - with the locking done per-bucket, not using one > > > global lock. > > > > Dave was working on that - he posted patches and it seemed like they > > were mostly done, not sure what happened with those. > > I lost interest in that patchset a long while ago. The last posting > I did was largely as a community service to get the completed, > lockdep and RTPREEMPT compatible version of the hashbl > infrastructure it needed out there for other people to be able to > easily push this forward if they needed it. That was here: > > https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/ > > and I posted it because Kent was asking about it because his > attempts to convert the inode hash to use rhashtables wasn't > working. > > I've suggested multiple times since then when people have asked me > about that patch set that they are free to pick it up and finish it > off themselves. Unfortunately, that usually results in silence, a > comment of "I don't have time for that", and/or they go off and hack > around the issue in some other way.... Looking over that thread I'm not clear on what was missing - you mentioned RTPREEMPT/lockdep but you had a patch for that, which looked serviceable enough?
On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > However, the problematic workload is cold cache operations where > the dentry cache repeatedly misses. This places all the operational > concurrency directly on the inode hash as new inodes are inserted > into the hash. Add memory reclaim and that adds contention as it > removes inodes from the hash on eviction. Yeah, and then we spend all the time just adding the inodes to the hashes, and probably fairly seldom use them. Oh well. And I had missed the issue with PREEMPT_RT and the fact that right now the inode hash lock is outside the inode lock, which is problematic. So it's all a bit nasty. But I also assume most of the bad issues end up mainly showing up on just fairly synthetic benchmarks with ramdisks, because even with a good SSD I suspect the IO for the cold cache would still dominate? Linus
On Mon, 23 Sept 2024 at 19:26, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And I had missed the issue with PREEMPT_RT and the fact that right now > the inode hash lock is outside the inode lock, which is problematic. .. although I guess that could be solved by just making the spinlock be external to the hash list, rather than part of the list like list_bl. You wouldn't need to do one lock per list head, you could just do some grouping of the hash. That said, the vfs inode hash code really is pretty disgusting and has been accumulating these warts for a long time. So maybe a "filesystems do their own thing" together with some helper infrastructure (like using rhashtable) is actually the right thing to do. Linus
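That grouping could be as simple as a striped lock array sized independently of the table; all names and sizes here are illustrative:

    #define HASH_BITS   16
    #define LOCK_BITS   10      /* 1024 locks covering 65536 buckets */

    static struct hlist_head inode_buckets[1 << HASH_BITS];
    static spinlock_t bucket_locks[1 << LOCK_BITS];

    /* Each lock covers 2^(HASH_BITS - LOCK_BITS) buckets: the locks stay
     * external to the list heads, but contention still drops by roughly
     * three orders of magnitude versus one global lock. */
    static spinlock_t *inode_hash_lock_for(unsigned int hash)
    {
        return &bucket_locks[hash & ((1 << LOCK_BITS) - 1)];
    }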
On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > > > However, the problematic workload is cold cache operations where > > the dentry cache repeatedly misses. This places all the operational > > concurrency directly on the inode hash as new inodes are inserted > > into the hash. Add memory reclaim and that adds contention as it > > removes inodes from the hash on eviction. > > Yeah, and then we spend all the time just adding the inodes to the > hashes, and probably fairly seldom use them. Oh well. > > And I had missed the issue with PREEMPT_RT and the fact that right now > the inode hash lock is outside the inode lock, which is problematic. > > So it's all a bit nasty. > > But I also assume most of the bad issues end up mainly showing up on > just fairly synthetic benchmarks with ramdisks, because even with a > good SSD I suspect the IO for the cold cache would still dominate? Not for bcachefs, because filling into the vfs inode cache doesn't require a disk read - they're cached in the inodes btree and much smaller there. We use a varint encoding so they're typically 50-100 bytes, last I checked. Compare to ~1k, plus or minus, in the vfs inode cache. Thomas Bertshinger has been working on applications at LANL where avoiding pulling into the vfs inode cache seems to make a significant difference (file indexing in his case) - it turns out there's an xattr syscall that's missing, which I believe he'll be submitting a patch for. But stat/statx always pulls into the vfs inode cache, and that's likely worth fixing.
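As an aside on the size claim: a generic LEB128-style varint shows why small on-disk inode fields shrink so much - a field that would otherwise occupy a fixed 8 bytes costs only 1-2 bytes when its value is small. bcachefs's actual bch2_varint_encode differs in detail; this is just the idea:

    /* Emit v as 7 bits per byte, low bits first; the top bit of each
     * byte flags whether another byte follows. Returns bytes written. */
    static int varint_encode(unsigned char *out, unsigned long long v)
    {
        int bytes = 0;

        do {
            unsigned char byte = v & 0x7f;

            v >>= 7;
            if (v)
                byte |= 0x80;   /* continuation bit */
            out[bytes++] = byte;
        } while (v);

        return bytes;   /* 1 byte for v < 128, 2 for v < 16384, ... */
    }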
On Mon, Sep 23, 2024 at 07:26:31PM -0700, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > > > However, the problematic workload is cold cache operations where > > the dentry cache repeatedly misses. This places all the operational > > concurrency directly on the inode hash as new inodes are inserted > > into the hash. Add memory reclaim and that adds contention as it > > removes inodes from the hash on eviction. > > Yeah, and then we spend all the time just adding the inodes to the > hashes, and probably fairly seldom use them. Oh well. > > And I had missed the issue with PREEMPT_RT and the fact that right now > the inode hash lock is outside the inode lock, which is problematic. *nod* > So it's all a bit nasty. > > But I also assume most of the bad issues end up mainly showing up on > just fairly synthetic benchmarks with ramdisks, because even with a > good SSD I suspect the IO for the cold cache would still dominate? No, all the issues show up on consumer level NVMe SSDs - they have more than enough IO concurrency to cause these CPU concurrency related problems. Keep in mind that when it comes to doing huge amounts of IO, ramdisks are fundamentally flawed and don't scale. That is, the IO is synchronous memcpy() based and so consumes CPU time and both read and write memory bandwidth, and concurrency is limited to the number of CPUs in the system. With NVMe SSDs, all the data movement is asynchronous and offloaded to hardware with DMA engines that move the data. Those DMA engines can often handle hundreds of concurrent IOs at once. DMA sourced data is also only written to RAM once, and there are no dependent data reads to slow down the DMA write to RAM like there is with a data copy streamed through a CPU. IOWs, once the IO rates and concurrency go up, it is generally much faster to use the CPU to program the DMA engines to move the data than it is to move the data with the CPU itself. The testing I did (and so the numbers in those benchmarks) was done on 2018-era PCIe 3.0 enterprise NVMe SSDs that could do approximately 400k 4kB random read IOPS. The latest consumer PCIe 5.0 NVMe SSDs are *way faster* than these drives when subject to highly concurrent IO requests... -Dave.
On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote: > On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote: > > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@fromorbit.com> wrote: > > > > > > However, the problematic workload is cold cache operations where > > > the dentry cache repeatedly misses. This places all the operational > > > concurrency directly on the inode hash as new inodes are inserted > > > into the hash. Add memory reclaim and that adds contention as it > > > removes inodes from the hash on eviction. > > > > Yeah, and then we spend all the time just adding the inodes to the > > hashes, and probably fairly seldom use them. Oh well. > > > > And I had missed the issue with PREEMPT_RT and the fact that right now > > the inode hash lock is outside the inode lock, which is problematic. > > > > So it's all a bit nasty. > > > > But I also assume most of the bad issues end up mainly showing up on > > just fairly synthetic benchmarks with ramdisks, because even with a > > good SSD I suspect the IO for the cold cache would still dominate? > > Not for bcachefs, because filling into the vfs inode cache doesn't > require a disk read - they're cached in the inodes btree and much > smaller there. We use a varint encoding so they're typically 50-100 > bytes, last I checked. > > Compare to ~1k, plus or minus, in the vfs inode cache. > > Thomas Bertshinger has been working on applications at LANL where > avoiding pulling into the vfs inode cache seems to make a significant > difference (file indexing in his case) - it turns out there's an xattr > syscall that's missing, which I believe he'll be submitting a patch for. > > But stat/statx always pulls into the vfs inode cache, and that's likely > worth fixing. No, let's not even consider going there. Unlike most people, old time XFS developers have direct experience with the problems that "uncached" inode access for stat purposes causes. XFS has had the bulkstat API for a long, long time (i.e. since 1998 on Irix). When it was first implemented on Irix, it was VFS cache coherent. But in the early 2000s, that caused problems with HSMs needing to scan billions of inodes indexing petabytes of stored data with certain SLA guarantees (i.e. needing to scan at least a million inodes a second). The CPU overhead of cache instantiation and teardown was too great to meet those performance targets on 500MHz MIPS CPUs. So we converted bulkstat to run directly out of the XFS buffer cache (i.e. uncached from the perspective of the VFS). This reduced the CPU overhead per inode substantially, allowing bulkstat rates to increase by a factor of 10. However, it introduced all sorts of coherency problems between cached inode state vs what was stored in the buffer cache. It was basically O_DIRECT for stat() and, as you'd expect from that description, the coherency problems were horrible. Detecting iallocated-but-not-yet-updated and unlinked-but-not-yet-freed inodes were particularly consistent sources of issues. The only way to fix these coherency problems was to check the inode cache for a resident inode first, which basically defeated the entire purpose of bypassing the VFS cache in the first place.
So we went back to using coherent lookups once we stopped caring about the old SGI HSM SLA requirements back in 2010: commit 7dce11dbac54fce777eea0f5fb25b2694ccd7900 Author: Christoph Hellwig <hch@lst.de> Date: Wed Jun 23 18:11:11 2010 +1000 xfs: always use iget in bulkstat The non-coherent bulkstat versions that look directly at the inode buffers cause various problems with performance optimizations that make increased use of just logging inodes. This patch makes bulkstat always use iget, which should be fast enough for normal use with the radix-tree based inode cache introduced a while ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Essentially, we now always bring inodes into cache as fully instantiated VFS inodes, and mark them with I_DONT_CACHE so they get torn down immediately after we've used them if they weren't already cached. (i.e. all accesses are done in a single task context with the inode hot in the CPU caches.) Without the patchset I pointed to, bulkstat maxes out the VFS inode cache cycling on the sb->s_inode_list_lock at about 700,000 inodes/sec being cycled through the cache. With that patchset, it maxed out the bandwidth of the NVMe SSDs I was using at ~7GB/s read IO and cycling about 6 million inodes/sec through the VFS cache. IOWs, we can make the VFS inode cache scale way beyond what most users actually need right now. The lesson in all this? Don't hack around VFS scalability issues if it can be avoided. If we fix the problems properly in the first place, then most fs developers (let alone users!) won't even realise there was a scalability problem in the VFS to begin with. Users will, however, notice every inconsistency and/or corruption caused by fs developers trying to work around VFS cache limitations... -Dave.
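The coherent pattern that replaced it boils down to the following - fs_iget() is a stand-in for the filesystem's own iget, while d_mark_dontcache() is the real helper that sets I_DONTCACHE (spelled I_DONT_CACHE above):

    static int bulkstat_one(struct super_block *sb, unsigned long ino,
                            struct kstat *stat)
    {
        struct inode *inode = fs_iget(sb, ino); /* stand-in for the fs iget */

        if (IS_ERR(inode))
            return PTR_ERR(inode);

        d_mark_dontcache(inode);    /* don't retain after the last iput */
        /* ... fill *stat from the fully instantiated, coherent inode ... */
        iput(inode);                /* an otherwise-unused inode dies here */
        return 0;
    }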
On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote: > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote: > > But stat/statx always pulls into the vfs inode cache, and that's likely > > worth fixing. > > No, let's not even consider going there. > > Unlike most people, old time XFS developers have direct experience > with the problems that "uncached" inode access for stat purposes. > > XFS has had the bulkstat API for a long, long time (i.e. since 1998 > on Irix). When it was first implemented on Irix, it was VFS cache > coherent. But in the early 2000s, that caused problems with HSMs > needing to scan billions inodes indexing petabytes of stored data > with certain SLA guarantees (i.e. needing to scan at least a million > inodes a second). The CPU overhead of cache instantiation and > teardown was too great to meet those performance targets on 500MHz > MIPS CPUs. > > So we converted bulkstat to run directly out of the XFS buffer cache > (i.e. uncached from the perspective of the VFS). This reduced the > CPU over per-inode substantially, allowing bulkstat rates to > increase by a factor of 10. However, it introduced all sorts of > coherency problems between cached inode state vs what was stored in > the buffer cache. It was basically O_DIRECT for stat() and, as you'd > expect from that description, the coherency problems were horrible. > Detecting iallocated-but-not-yet-updated and > unlinked-but-not-yet-freed inodes were particularly consistent > sources of issues. > > The only way to fix these coherency problems was to check the inode > cache for a resident inode first, which basically defeated the > entire purpose of bypassing the VFS cache in the first place. Eh? Of course it'd have to be coherent, but just checking if an inode is present in the VFS cache is what, 1-2 cache misses? Depending on hash table fill factor... That's going to show up, but I have a hard time seeing that "defeating the entire purpose" of bypassing the VFS cache, as you say. > Don't hack around VFS scalability issues if it can be avoided. Well, maybe if your dlock list patches make it in - I still see crazy lock contention there...
On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 19:26, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > And I had missed the issue with PREEMPT_RT and the fact that right now > > the inode hash lock is outside the inode lock, which is problematic. > > .. although I guess that could be solved by just making the spinlock > be external to the hash list, rather than part of the list like > list_bh. That's effectively what the patch did - it added a spinlock per hash list. > You wouldn't need to do one lock per list head, you could just do some > grouping of the hash. Yes, that's a more complex thing to do, though - it is effectively using hashed locks for the hash table. > That said, the vfs inode hash code really is pretty disgusting and has > been accumulating these warts for a long time. So maybe a "filesystems > do their own thing" together with some helper infrastructure (like > using rhashtable) is actually the right thing to do. Perhaps. XFS has done its own thing since 2007, but that was because the XFS inode lifecycle has always existed outside the VFS inode life cycle. i.e. the inode has to outlive evict() and ->destroy_inode because it can still be dirty and tracked by the journal when the VFS evicts it from the cache. We also have internal inode lookup stuff that we don't want to go anywhere near the VFS (e.g. inode writeback clustering from journal item cleaning callbacks). But this has come at a significant complexity cost, and some of the biggest problems we've had to solve have been a result of this architecture. e.g. memory reclaim getting blocked on dirty XFS inodes that the VFS is unaware of via the fs specific callout in the superblock shrinker caused all sorts of long tail memory allocation latency issues for years. We've fixed that, but it took a long time to work out how to avoid blocking in memory reclaim without driving the system immediately into OOM kill frenzies.... There are also issues around RCU freeing of inodes and how the filesystem level cache handles that. e.g. what happens when the filesystem doesn't immediately free the inode after memory reclaim evicts it, but then the fs re-instantiates it moments later when a new lookup for that inode comes in? The rest of the VFS assumes that the inode won't reappear until an RCU grace period expires, but having the re-instantiation code call synchronize_rcu() (or something similar) is not an option because this sort of situation can occur thousands of times a second under memory pressure.... We know this because there is an outstanding issue in XFS around this situation - none of the general solutions to the problem we've tried have been viable. I suspect that any other filesystem that implements its own inode cache and does VFS inode re-instantiation on an inode cache hit will have similar issues, and so there will be wider VFS architecture impacts as a result. IOWs, it's not clear to me that this is a caching model we really want to pursue in general because of a) the code duplication and b) the issues such an inode life-cycle model has interacting with the VFS life-cycle expectations... -Dave.
On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > That's effectively what the patch did - it added a spinlock per hash > list. Yeah, no, I like your patches, they all seem to be doing sane things and improve the code. The part I don't like is how the basic model that your patches improve seems quite a bit broken. For example, that whole superblock list lock contention just makes me go "Christ, that's stupid". Not because your patches to fix it with Waiman's fancy list would be wrong or bad, but because the whole pain is self-inflicted garbage. And it's all historically very very understandable. It wasn't broken when it was written. That singly linked list is 100% sane in the historical context of "we really don't want anything fancy for this". The whole list of inodes for a superblock is basically unimportant, and its main use (only use?) is for the final "check that we don't have busy inodes at umount time, remove any pending stuff". So using a stupid linked list was absolutely the right thing to do, but now that just the *locking* for that list is a pain, it turns out that we probably shouldn't use a list at all. Not because the list was wrong, but because a flat list is such a pain for locking, since there's no hierarchy to spread the locking out over. (We used to have that kind of "flat lock" for the dcache too, but "dcache_lock" went away many moons ago, and good riddance - but the only reason it could go away is that the dcache has a hierarchy, so that you can always lock just the local dentry (or the parent dentry) if you are just careful). > [ filesystems doing their own optimized thing ] > > IOWs, it's not clear to me that this is a caching model we really > want to persue in general because of a) the code duplication and b) > the issues such an inode life-cycle model has interacting with the > VFS life-cycle expectations... No, I'm sure you are right, and I'm just frustrated with this code that probably _should_ look a lot more like the dcache code, but doesn't. I get the very strong feeling that we should have a per-superblock hash table that we could also traverse the entries of. That way the superblock inode list would get some structure (the hash buckets) that would allow the locking to be distributed (and we'd only need one lock and just share it between the "hash inode" and "add it to the superblock list"). But that would require something much more involved than "improve the current code". Linus
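Concretely, the structure Linus is gesturing at might look something like this - entirely illustrative, nothing of the sort exists in fs/inode.c today:

    /* Per-superblock hash whose buckets also serve as the "every inode
     * of this sb" iteration structure: one per-bucket lock then covers
     * both hashing an inode and tracking it for the umount-time check. */
    struct sb_inode_bucket {
        spinlock_t          lock;       /* per bucket, not per sb */
        struct hlist_head   inodes;
    };

    struct sb_inode_table {
        unsigned int            nbuckets;   /* sized (or resized) per sb */
        struct sb_inode_bucket  *buckets;
    };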
On Tue, Sep 24, 2024 at 09:57:13AM GMT, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > That's effectively what the patch did - it added a spinlock per hash > > list. > > Yeah, no, I like your patches, they all seem to be doing sane things > and improve the code. > > The part I don't like is how the basic model that your patches improve > seems quite a bit broken. > > For example, that whole superblock list lock contention just makes me > go "Christ, that's stupid". Not because your patches to fix it with > Waiman's fancy list would be wrong or bad, but because the whole pain > is self-inflicted garbage. > > And it's all historically very very understandable. It wasn't broken > when it was written. > > That singly linked list is 100% sane in the historical context of "we > really don't want anything fancy for this". The whole list of inodes > for a superblock is basically unimportant, and it's main use (only > use?) is for the final "check that we don't have busy inodes at umount > time, remove any pending stuff". > > So using a stupid linked list was absolutely the right thing to do, > but now that just the *locking* for that list is a pain, it turns out > that we probably shouldn't use a list at all. Not because the list was > wrong, but because a flat list is such a pain for locking, since > there's no hierarchy to spread the locking out over. Well, lists are just a pain. They should really be radix trees, we're not doing actual LRU anyways. > > [ filesystems doing their own optimized thing ] > > > > IOWs, it's not clear to me that this is a caching model we really > > want to persue in general because of a) the code duplication and b) > > the issues such an inode life-cycle model has interacting with the > > VFS life-cycle expectations... > > No, I'm sure you are right, and I'm just frustrated with this code > that probably _should_ look a lot more like the dcache code, but > doesn't. > > I get the very strong feeling that we should have a per-superblock > hash table that we could also traverse the entries of. That way the > superblock inode list would get some structure (the hash buckets) that > would allow the locking to be distributed (and we'd only need one lock > and just share it between the "hash inode" and "add it to the > superblock list"). I was considering that - that's what the bcachefs btree key cache does, the shrinker just iterates over the rhashtable. But referenced inodes aren't kept on the shrinker lists; I hate the overhead of adding/removing, but Dave was emphatic on there being workloads where that was really necessary and I'm somewhat convinced. The other big issue is memcg; I think for that to work we'd need reclaim to coalesce shrinker calls and shrink from multiple memcgs at the same time.
On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote: > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > That's effectively what the patch did - it added a spinlock per hash > > list. > > Yeah, no, I like your patches, they all seem to be doing sane things > and improve the code. > > The part I don't like is how the basic model that your patches improve > seems quite a bit broken. > > For example, that whole superblock list lock contention just makes me > go "Christ, that's stupid". Not because your patches to fix it with > Waiman's fancy list would be wrong or bad, but because the whole pain > is self-inflicted garbage. I'm not about to disagree with that assessment. > And it's all historically very very understandable. It wasn't broken > when it was written. > > That singly linked list is 100% sane in the historical context of "we > really don't want anything fancy for this". The whole list of inodes > for a superblock is basically unimportant, and it's main use (only > use?) is for the final "check that we don't have busy inodes at umount > time, remove any pending stuff". > > So using a stupid linked list was absolutely the right thing to do, > but now that just the *locking* for that list is a pain, it turns out > that we probably shouldn't use a list at all. Not because the list was > wrong, but because a flat list is such a pain for locking, since > there's no hierarchy to spread the locking out over. Right, that's effectively what the dlist infrastructure has taught us - we need some kind of hierarchy to spread the locking over. But what that hierarchy looks like for an "iterate every object" list isn't really clear cut... Another option I'd considered was to offload the iteration to filesystems that have internal tracking mechanisms. e.g. use a superblock method (say ->for_all_cached_inodes()) that gets passed a callback function for the operation to perform on a stabilised inode. This would work for XFS - we already have such an internal callback based cache-walking infrastructure - xfs_iwalk() - that is used extensively for internal admin, gc and scrub/repair functions. If we could use this for VFS icache traversals, then we wouldn't need to maintain the sb->s_inodes list at all in XFS. But I didn't go down that route because I didn't think we wanted to encourage each major filesystem to have their own unique internal inode caching implementations with their own special subtle differences. The XFS icache code is pretty complex and really requires a lot of XFS expertise to understand - that makes changing global inode caching behaviour or life cycle semantics much more difficult than it already is. That said, if we do decide that a model where filesystems will provide their own inode caches is acceptable, then as a first step we could convert the generic s_inodes list iteration to the callback model fairly easily.... > (We used to have that kind of "flat lock" for the dcache too, but > "dcache_lock" went away many moons ago, and good riddance - but the > only reason it could go away is that the dcache has a hierarchy, so > that you can always lock just the local dentry (or the parent dentry) > if you are just careful). > > > [ filesystems doing their own optimized thing ] > > > > IOWs, it's not clear to me that this is a caching model we really > > want to persue in general because of a) the code duplication and b) > > the issues such an inode life-cycle model has interacting with the > > VFS life-cycle expectations... 
> > No, I'm sure you are right, and I'm just frustrated with this code > that probably _should_ look a lot more like the dcache code, but > doesn't. > > I get the very strong feeling that we should have a per-superblock > hash table that we could also traverse the entries of. That way the > superblock inode list would get some structure (the hash buckets) that > would allow the locking to be distributed (and we'd only need one lock > and just share it between the "hash inode" and "add it to the > superblock list"). The only problem with this is the size of the per-sb hash tables needed for scalability - we can't allocate system sized hash tables for every superblock just in case a superblock might be asked to cache 100 million inodes. That's why Kent used rhashtables for the bcachefs implementation - they resize according to how many objects are being indexed, and hence scale both up and down. That is also why XFS uses multiple radix trees per-sb in its icache implementation - they scale up efficiently, yet have a small memory footprint when only a few inodes are in cache in a little used filesystem. > But that would require something much more involved than "improve the > current code". Yup. FWIW, I think all this "how do we cache inodes better" discussion is somewhat glossing over a more important question we need to think about first: do we even need a fully fledged inode cache anymore? Every inode brought into cache is pinned by a dentry. The dentry cache has an LRU and cache aging, and so by the time a dentry is aged out of the dentry cache and the inode is finally dropped, it has not been in use for some time. It has aged out of the current working set. Given that we've already aged the inode out of the working set, why do we then need to dump the inode into another LRU and age that out again before we reclaim the inode? What is this double caching actually gaining us? I've done experiments on XFS marking all inodes with I_DONT_CACHE, which means it gets removed from the inode cache the moment the reference count goes to zero (i.e. when the dentry is dropped from cache). This leaves the inode life cycle mirroring the dentry life cycle. i.e. the inode is evicted when the dentry is aged out as per normal. On SSD based systems, I really don't see any performance degradation for my typical workstation workloads like git tree operations and kernel compiles. I don't see any noticeable impact on streaming inode/data workloads, either. IOWs, the dentry cache appears to be handling the working set maintenance duties pretty well for most common workloads. Hence I question whether LRU based inode cache aging is needed at all. So: should we be looking towards gutting the inode cache and so the in-memory VFS inode lifecycle tracks actively referenced inodes? If so, then the management of the VFS inodes becomes a lot simpler as the LRU lock, maintenance and shrinker-based reclaim goes away entirely. Lots of code gets simpler if we trim down the VFS inode life cycle to remove the caching of unreferenced inodes... -Dave.
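The superblock-method idea Dave floats above could take a shape like this - a hypothetical sketch, not an existing interface:

    /* Hypothetical ->for_all_cached_inodes(): the VFS hands the
     * filesystem a callback instead of walking sb->s_inodes itself, so
     * e.g. XFS could back it with xfs_iwalk() and drop the s_inodes
     * list entirely. */
    struct super_operations {
        /* ... existing methods ... */
        void (*for_all_cached_inodes)(struct super_block *sb,
                                      void (*fn)(struct inode *inode,
                                                 void *data),
                                      void *data);
    };

    /* Caller side, e.g. the umount-time busy-inode check:
     *     sb->s_op->for_all_cached_inodes(sb, warn_if_busy, NULL);
     */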
On Mon, Sep 23, 2024 at 11:47:54PM -0400, Kent Overstreet wrote: > On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote: > > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote: > > > But stat/statx always pulls into the vfs inode cache, and that's likely > > > worth fixing. > > > > No, let's not even consider going there. > > > > Unlike most people, old time XFS developers have direct experience > > with the problems that "uncached" inode access for stat purposes. > > > > XFS has had the bulkstat API for a long, long time (i.e. since 1998 > > on Irix). When it was first implemented on Irix, it was VFS cache > > coherent. But in the early 2000s, that caused problems with HSMs > > needing to scan billions inodes indexing petabytes of stored data > > with certain SLA guarantees (i.e. needing to scan at least a million > > inodes a second). The CPU overhead of cache instantiation and > > teardown was too great to meet those performance targets on 500MHz > > MIPS CPUs. > > > > So we converted bulkstat to run directly out of the XFS buffer cache > > (i.e. uncached from the perspective of the VFS). This reduced the > > CPU over per-inode substantially, allowing bulkstat rates to > > increase by a factor of 10. However, it introduced all sorts of > > coherency problems between cached inode state vs what was stored in > > the buffer cache. It was basically O_DIRECT for stat() and, as you'd > > expect from that description, the coherency problems were horrible. > > Detecting iallocated-but-not-yet-updated and > > unlinked-but-not-yet-freed inodes were particularly consistent > > sources of issues. > > > > The only way to fix these coherency problems was to check the inode > > cache for a resident inode first, which basically defeated the > > entire purpose of bypassing the VFS cache in the first place. > > Eh? Of course it'd have to be coherent, but just checking if an inode is > present in the VFS cache is what, 1-2 cache misses? Depending on hash > table fill factor... Sure, when there is no contention and you have CPU to spare. But the moment the lookup hits contention problems (i.e. we are exceeding the cache lookup scalability capability), we are straight back to running at VFS cache speed instead of uncached speed. IOWs, needing to perform the cache lookup defeated the purpose of using uncached lookups to avoid the cache scalability problems. Keep in mind that not having a referenced inode opens up the code to things like pre-emption races. i.e. a cache miss doesn't prevent the current task from being preempted before it reads the inode information into the user buffer. The VFS inode could be instantiated and modified before the uncached access runs again and pulls stale information from the underlying buffer and returns that to userspace. Those were the sorts of problems we continually had with using low level inode information for stat operations vs using the up-to-date VFS inode state.... -Dave.
On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote: > > FWIW, I think all this "how do we cache inodes better" discussion is > somehwat glossing over a more important question we need to think > about first: do we even need a fully fledged inode cache anymore? I'd be more than happy to try. I have to admit that it's largely my mental picture (ie "inode caches don't really matter"), and why I was surprised that the inode locks even show up on benchmarks. If we just always drop inodes when the refcount goes to zero and never have any cached inodes outside of other references, I think most of the reasons for the superblock inode list just disappear. Most of the heavy lifting has long since been moved to the dentry shrinking. I'd worry that we have a lot of tuning that depends on the whole inode LRU thing (that iirc takes the mapping size etc into account). But maybe it would be worth trying to make I_DONTCACHE the default, with the aim of removing the independent inode lifetimes entirely... Linus
On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote: > > Eh? Of course it'd have to be coherent, but just checking if an inode is > > present in the VFS cache is what, 1-2 cache misses? Depending on hash > > table fill factor... > > Sure, when there is no contention and you have CPU to spare. But the > moment the lookup hits contention problems (i.e. we are exceeding > the cache lookup scalability capability), we are straight back to > running a VFS cache speed instead of uncached speed. The cache lookups are just reads; they don't introduce scalability issues, unless they're contending with other cores writing to those cachelines - checking if an item is present in a hash table is trivial to do locklessly. But pulling an inode into and then evicting it from the inode cache entails a lot more work - just initializing a struct inode is nontrivial, and then there's the (multiple) shared data structures you have to manipulate. > Keep in mind that not having a referenced inode opens up the code to > things like pre-emption races. i.e. a cache miss doesn't prevent > the current task from being preempted before it reads the inode > information into the user buffer. The VFS inode could bei > instantiated and modified before the uncached access runs again and > pulls stale information from the underlying buffer and returns that > to userspace. Yeah, if you're reading from a buffer cache that doesn't have a lock that does get dicy - but for bcachefs where we're reading from a btree node that does have a lock it's quite manageable. And incidentally this sort of "we have a cache on top of the btree, but sometimes we have to do direct access" is already something that comes up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but doesn't actually come up here for us since we don't use the vfs inode cache as a writeback cache. Now, for some completely different sillyness, there's actually _three_ levels of caching for inodes in bcachefs: btree node cache, btree key cache, then the vfs cache. In the first two inodes are packed down to ~100 bytes so it's not that bad, but it does make you go "...what?". It would be nice in theory to collapse - but the upside is that we don't have the interactions between the vfs inode cache and journalling that xfs has. But if vfs inodes no longer have their own lifetime like you've been talking about, that might open up interesting possibilities.
+cc rhashtable folks for bug report On Wed, Sep 25, 2024 at 10:17:46AM GMT, Dave Chinner wrote: > On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote: > > On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@fromorbit.com> wrote: > > > > > > That's effectively what the patch did - it added a spinlock per hash > > > list. > > > > Yeah, no, I like your patches, they all seem to be doing sane things > > and improve the code. > > > > The part I don't like is how the basic model that your patches improve > > seems quite a bit broken. > > > > For example, that whole superblock list lock contention just makes me > > go "Christ, that's stupid". Not because your patches to fix it with > > Waiman's fancy list would be wrong or bad, but because the whole pain > > is self-inflicted garbage. > > I'm not about to disagree with that assessment. > > > And it's all historically very very understandable. It wasn't broken > > when it was written. > > > > That singly linked list is 100% sane in the historical context of "we > > really don't want anything fancy for this". The whole list of inodes > > for a superblock is basically unimportant, and it's main use (only > > use?) is for the final "check that we don't have busy inodes at umount > > time, remove any pending stuff". > > > > So using a stupid linked list was absolutely the right thing to do, > > but now that just the *locking* for that list is a pain, it turns out > > that we probably shouldn't use a list at all. Not because the list was > > wrong, but because a flat list is such a pain for locking, since > > there's no hierarchy to spread the locking out over. > > Right, that's effectively what the dlist infrastructure has taught > us - we need some kind of heirarchy to spread the locking over. But > what that heirachy is for a "iterate every object" list looks like > isn't really clear cut... If you think of dlist as just "faster lru list", I think that's a completely viable approach. So one thing I've been mulling over is a data structure for efficiently tracking async op structs - "some async op went out to lunch" is something that we still have crap debug tooling for, and those bugs take forever to get fixed. There seems to be an outstanding bug in compaction or migration that I've seen multiple reports for, and I doubt it's bcachefs (we're quite standard there) - and it's some async op that should've released a folio lock but isn't. And all we need for that is, roughly, an idr with a percpu buffer for allocating/freeing slots in front. Which would be perfect for shrinker lists, too: walking the list is a lockless radix tree walk, and adding/removing entries mostly just hits the percpu buffers. So now we have another use case for that data structure, so I'll try to get that done in the near future and see how it does. > > (We used to have that kind of "flat lock" for the dcache too, but > > "dcache_lock" went away many moons ago, and good riddance - but the > > only reason it could go away is that the dcache has a hierarchy, so > > that you can always lock just the local dentry (or the parent dentry) > > if you are just careful). > > > > > > [ filesystems doing their own optimized thing ] > > > > > > IOWs, it's not clear to me that this is a caching model we really > > > want to persue in general because of a) the code duplication and b) > > > the issues such an inode life-cycle model has interacting with the > > > VFS life-cycle expectations... 
> > > > No, I'm sure you are right, and I'm just frustrated with this code > > that probably _should_ look a lot more like the dcache code, but > > doesn't. > > > > I get the very strong feeling that we should have a per-superblock > > hash table that we could also traverse the entries of. That way the > > superblock inode list would get some structure (the hash buckets) that > > would allow the locking to be distributed (and we'd only need one lock > > and just share it between the "hash inode" and "add it to the > > superblock list"). > > The only problem with this is the size of the per-sb hash tables > needed for scalability - we can't allocate system sized hash tables > for every superblock just in case a superblock might be asked to > cache 100 million inodes. That's why Kent used rhashtables for the > bcachefs implementation - they resize according to how many objects > are being indexed, and hence scale both up and down. I've been noticing rhashtable resize is surprisingly heavy (the default parameters don't ever shrink the table, which is why it's not seen as much). And, when I was torture testing that code I tripped over what appeared to be an infinite loop in rht_bucket() when a rehash is in progress, which I worked around in a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() (because it makes more sense to just not run while rehashing and avoid the function call overhead of rht_bucket()). If rhashtable folks are interested in looking at that I have a test that reproduces it. > So: should we be looking towards gutting the inode cache and so the > in-memory VFS inode lifecycle tracks actively referenced inodes? If > so, then the management of the VFS inodes becomes a lot simpler as > the LRU lock, maintenance and shrinker-based reclaim goes away > entirely. Lots of code gets simpler if we trim down the VFS inode > life cycle to remove the caching of unreferenced inodes... Sounds completely legit to me - except for the weird filesystems. I just started cscoping and afs at least seems to do iget() without ever attaching it to a dentry, I didn't look further. So... If fs/inode.c is grotty, perhaps it makes sense to start a fs/inode_rhashtable.c with all the new semantics and convert filesystems one by one, like we've been doing with the dio code and iomap. I finally have my head sufficiently around all the weird inode lifetime rules that I could do it, although it'll be a bit before I'm not swamped.
On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote: > On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote: > > > Eh? Of course it'd have to be coherent, but just checking if an inode is > > > present in the VFS cache is what, 1-2 cache misses? Depending on hash > > > table fill factor... > > > > Sure, when there is no contention and you have CPU to spare. But the > > moment the lookup hits contention problems (i.e. we are exceeding > > the cache lookup scalability capability), we are straight back to > > running a VFS cache speed instead of uncached speed. > > The cache lookups are just reads; they don't introduce scalability > issues, unless they're contending with other cores writing to those > cachelines - checking if an item is present in a hash table is trivial > to do locklessly. Which was not something the VFS inode cache did until a couple of months ago. Just because something is possible/easy today, it doesn't mean it was possible or viable 15-20 years ago. > But pulling an inode into and then evicting it from the inode cache > entails a lot more work - just initializing a struct inode is > nontrivial, and then there's the (multiple) shared data structures you > have to manipulate. Yes, but to avoid this we'd need to come up with a mechanism that is generally safe for most filesystems, not just bcachefs. I mean, if you can come up with a stat() mechanism that is safe enough for us to read straight out of the XFS buffer cache for inode cache misses, then we'll switch over to using it ASAP. That's your challenge - if you want bcachefs to be able to do this, then you have to make sure the infrastructure required works for other filesystems just as safely, too. > And incidentally this sort of "we have a cache on top of the btree, but > sometimes we have to do direct access" is already something that comes > up a lot in bcachefs, primarily for the alloc btree. _Tons_ of fun, but > doesn't actually come up here for us since we don't use the vfs inode > cache as a writeback cache. And there-in lies the problem for the general case. Most filesystems do use writeback caching of inode metadata via the VFS inode state, XFS included, and that's where all the dragons are hiding. -Dave.
On Wed, Sep 25, 2024 at 02:43:15PM GMT, Dave Chinner wrote: > On Tue, Sep 24, 2024 at 10:13:01PM -0400, Kent Overstreet wrote: > > On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote: > > > > Eh? Of course it'd have to be coherent, but just checking if an inode is > > > > present in the VFS cache is what, 1-2 cache misses? Depending on hash > > > > table fill factor... > > > > > > Sure, when there is no contention and you have CPU to spare. But the > > > moment the lookup hits contention problems (i.e. we are exceeding > > > the cache lookup scalability capability), we are straight back to > > > running a VFS cache speed instead of uncached speed. > > > > The cache lookups are just reads; they don't introduce scalability > > issues, unless they're contending with other cores writing to those > > cachelines - checking if an item is present in a hash table is trivial > > to do locklessly. > > Which was not something the VFS inode cache did until a couple of > months ago. Just because something is possible/easy today, it > doesn't mean it was possible or viable 15-20 years ago. We've been doing lockless hash table lookups for a lot longer than that. > > But pulling an inode into and then evicting it from the inode cache > > entails a lot more work - just initializing a struct inode is > > nontrivial, and then there's the (multiple) shared data structures you > > have to manipulate. > > Yes, but to avoid this we'd need to come up with a mechanism that is > generally safe for most filesystems, not just bcachefs. > > I mean, if you can come up with a stat() mechanism that is safe > enough for us read straight out the XFS buffer cache for inode cache > misses, then we'll switch over to using it ASAP. So add a per buffer lock to your buffer cache, and check for vfs cache residency under the lock. Whether bulkstat is worth adding that lock, for a filesystem that doesn't need it? That I can't say, it's a tradeoff. > That's your challenge - if you want bcachefs to be able to do this, > then you have to make sure the infrastructure required works for > other filesystems just as safely, too. So... you think bcachefs should be limited to the capabilities of what other filesystems can do? Interesting position...
On Tue, Sep 24, 2024 at 06:45:55PM GMT, Linus Torvalds wrote: > On Tue, 24 Sept 2024 at 17:17, Dave Chinner <david@fromorbit.com> wrote: > > > > FWIW, I think all this "how do we cache inodes better" discussion is > > somehwat glossing over a more important question we need to think > > about first: do we even need a fully fledged inode cache anymore? > > I'd be more than happy to try. I have to admit that it's largely my > mental picture (ie "inode caches don't really matter"), and why I was > surprised that the inode locks even show up on benchmarks. > > If we just always drop inodes when the refcount goes to zero and never > have any cached inodes outside of other references, I think most of > the reasons for the superblock inode list just disappear. Most of the > heavy lifting has long since been moved to the dentry shrinking. > > I'd worry that we have a lot of tuning that depends on the whole inode > LRU thing (that iirc takes the mapping size etc into account). > > But maybe it would be worth trying to make I_DONTCACHE the default, > with the aim of removing the independent inode lifetimes entirely... I would be rather surprised if we didn't see regressions for some workloads but who knows. If we wanted to test this then one easy way would be to add a FS_INODE_NOCACHE fs_flag and have inode_init_always() raise I_DONTCACHE based on that. That way we can experiment with this on a per-fs basis.
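A sketch of that experiment - FS_INODE_NOCACHE and its bit value are made up here, while inode_init_always() and I_DONTCACHE are real:

    #define FS_INODE_NOCACHE    (1 << 16)   /* hypothetical fs_flags bit */

    int inode_init_always(struct super_block *sb, struct inode *inode)
    {
        /* ... existing initialization unchanged ... */

        /* Opt-in experiment: filesystems setting the flag get inodes
         * that are evicted as soon as the last reference is dropped. */
        if (sb->s_type->fs_flags & FS_INODE_NOCACHE)
            inode->i_state |= I_DONTCACHE;

        return 0;
    }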
On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote: > > I've been noticing rhashtable resize is surprisingly heavy (the default > parameters don't ever shrink the table, which is why it's not seen as > much). Most rhashtable users enable automatic shrinking. > And, when I was torture testing that code I tripped over what appeared > to be an infinite loop in rht_bucket() when a rehsah is in progress, > which I worked around in > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() You must not walk the rhashtable internal data structures by hand. If you need to iterate through the whole table, use the provided walking mechanism. However, as rhashtable is dynamic, walking it during a resize may cause you to see some elements twice. If you want to avoid that, set up your own linked list of all elements for a completely stable walk. Cheers,
On Fri, Sep 27, 2024 at 08:48:14AM GMT, Herbert Xu wrote: > On Tue, Sep 24, 2024 at 10:48:07PM -0400, Kent Overstreet wrote: > > > > I've been noticing rhashtable resize is surprisingly heavy (the default > > parameters don't ever shrink the table, which is why it's not seen as > > much). > > Most rhashtable users enable automatic shrinking. > > > And, when I was torture testing that code I tripped over what appeared > > to be an infinite loop in rht_bucket() when a rehsah is in progress, > > which I worked around in > > > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() > > You must not walk the rhashtable internal data structures by hand. > If you need to iterate through the whole table, use the provided > walking mechanism. I was just doing rht_for_each_entry_rcu() (open coded because you don't provide a safe version). > However, as rhashtable is dynamic, walking it during a resize may > cause you to see some elements twice. If you want to avoid that, > set up your own linked list of all elements for a completely stable > walk. Yeah that's fine, but like I said, I was seeing an infinite loop inside rht_bucket(). That would be an rhashtable bug...
On Fri, Sep 27, 2024 at 08:11:06PM -0400, Kent Overstreet wrote: > > > > And, when I was torture testing that code I tripped over what appeared > > > to be an infinite loop in rht_bucket() when a rehsah is in progress, > > > which I worked around in > > > > > > a592cdf5164d bcachefs: don't use rht_bucket() in btree_key_cache_scan() > > > > You must not walk the rhashtable internal data structures by hand. > > If you need to iterate through the whole table, use the provided > > walking mechanism. > > I was just doing rht_for_each_entry_rcu() (open coded because you don't > provide a safe version). I'm talking about the commit a592cd above. You were doing this: for (i = 0; i < tbl->size; i++) rht_for_each_entry_rcu(ck, pos, tbl, i, hash) { This is absolutely not allowed. The rhashtable must only be accessed through the published API and not iterated over directly. This is guaranteed to break during a resize/rehash. Cheers,