Message ID | 20240222203726.1101861-1-willy@infradead.org (mailing list archive) |
---|---|
Headers | show |
Series | Rosebush, a new hash table | expand |
在 2024/2/23 04:37, Matthew Wilcox (Oracle) 写道: > Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. > I've written a load of documentation about how it works, mostly in > Documentation/core-api/rosebush.rst but some is dotted through the > rosebush.c file too. > > You can see this code as a further exploration of the "Linked lists are > evil" design space. For the workloads which a hashtable is suited to, > it has lower overhead than either the maple tree or the rhashtable. > It cannot support ranges (so is not a replacement for the maple tree), > but it does have per-bucket locks so is more scalable for write-heavy > workloads. I suspect one could reimplement the rhashtable API on top > of the rosebush, but I am not interested in doing that work myself. > > The per-object overhead is 12 bytes, as opposed to 16 bytes for the > rhashtable and 32 bytes for the maple tree. The constant overhead is also > small, being just 16 bytes for the struct rosebush. The exact amount > of memory consumed for a given number of objects is going to depend on > the distribution of hashes; here are some estimated consumptions for > power-of-ten entries distributed evenly over a 32-bit hash space in the > various data structures: > > number xarray maple rhash rosebush > 1 3472 272 280 256 > 10 32272 784 424 256 > 100 262kB 3600 1864 2080 > 1000 [1] 34576 17224 16432 > 10k [1] 343k 168392 131344 > 100k [1] 3.4M 1731272 2101264 > > As you can see, rosebush and rhashtable are close the whole way. > Rosebush moves in larger chunks because it doubles each time; there's > no actual need to double the bucket size, but that works well with > the slab allocator's existing slabs. As noted in the documentation, > we could create our own slabs and get closer to the 12 bytes per object > minimum consumption. [2] > > Where I expect rosebush to shine is on dependent cache misses. > I've assumed an average chain length of 10 for rhashtable in the above > memory calculations. That means on average a lookup would take five cache > misses that can't be speculated. Rosebush does a linear walk of 4-byte > hashes looking for matches, so the CPU can usefully speculate the entire > array of hash values (indeed, we tell it exactly how many we're going to > look at) and then take a single cache miss fetching the correct pointer. > Add that to the cache miss to fetch the bucket and that's just two cache > misses rather than five. > > I have not yet converted any code to use the rosebush. The API is > designed for use by the dcache, and I expect it will evolve as it actually Hello, It seems that the advantage of this hash table is that it does not need to traverse the linked list and has fewer cache misses. I want to know how much performance improvement is expected if it is applied to dcache? Thanks, Peng > gets used. I think there's probably some more refactoring to be done. > I am not aware of any bugs, but the test suite is pitifully small. > The current algorithm grows the buckets more aggressively than the table; > that's probably exactly the wrong thing to do for good performance. > > This version is full of debugging printks. You should probably take > them out if you're going to try to benchmark it. The regex '^print' > should find them all. Contributions welcome; I really want to get back > to working on folios, but this felt like an urgent problem to be fixed. > > [1] I stopped trying to estimate the memory costs for xarray; I couldn't > be bothered to as it's not a serious competitor for this use case. > > [2] We have ideas for improving the maple tree memory consumption for > this kind of workload; a new node type for pivots that fit in 4 bytes and > sparse nodes to avoid putting a NULL entry after each occupied entry. > The maple tree really is optimised for densely packed ranges at the > moment. > > Matthew Wilcox (Oracle) (1): > rosebush: Add new data structure > > Documentation/core-api/index.rst | 1 + > Documentation/core-api/rosebush.rst | 135 ++++++ > MAINTAINERS | 8 + > include/linux/rosebush.h | 41 ++ > lib/Kconfig.debug | 3 + > lib/Makefile | 3 +- > lib/rosebush.c | 707 ++++++++++++++++++++++++++++ > lib/test_rosebush.c | 135 ++++++ > 8 files changed, 1032 insertions(+), 1 deletion(-) > create mode 100644 Documentation/core-api/rosebush.rst > create mode 100644 include/linux/rosebush.h > create mode 100644 lib/rosebush.c > create mode 100644 lib/test_rosebush.c >
Hi Matthew, On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote: > Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. > I've written a load of documentation about how it works, mostly in > Documentation/core-api/rosebush.rst but some is dotted through the > rosebush.c file too. If you're interested, WireGuard has some pretty primitive hashtables, for which maybe Rosebush would be an interesting replacement: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.c https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.h#n17 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/ratelimiter.c#n167 In peerlookup.c, note the "At the moment, we limit" comment for an idea of some of the hairy issues involved in replacing these. But I wouldn't be entirely opposed to it, if you see some interesting potential for Rosebush here. That's a very tentative interest -- maybe it won't work out in the end -- but nonetheless, seeing this piqued my curiosity. If you're looking to see how this behaves in a place beyond dcache, this might be something to play with. Jason
On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote: > Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. > I've written a load of documentation about how it works, mostly in > Documentation/core-api/rosebush.rst but some is dotted through the > rosebush.c file too. > > You can see this code as a further exploration of the "Linked lists are > evil" design space. For the workloads which a hashtable is suited to, > it has lower overhead than either the maple tree or the rhashtable. > It cannot support ranges (so is not a replacement for the maple tree), > but it does have per-bucket locks so is more scalable for write-heavy > workloads. I suspect one could reimplement the rhashtable API on top > of the rosebush, but I am not interested in doing that work myself. > > The per-object overhead is 12 bytes, as opposed to 16 bytes for the > rhashtable and 32 bytes for the maple tree. The constant overhead is also > small, being just 16 bytes for the struct rosebush. The exact amount > of memory consumed for a given number of objects is going to depend on > the distribution of hashes; here are some estimated consumptions for > power-of-ten entries distributed evenly over a 32-bit hash space in the > various data structures: > > number xarray maple rhash rosebush > 1 3472 272 280 256 > 10 32272 784 424 256 > 100 262kB 3600 1864 2080 > 1000 [1] 34576 17224 16432 > 10k [1] 343k 168392 131344 > 100k [1] 3.4M 1731272 2101264 So I think the interesting numbers to see (besides performance numbers) are going to be the fill factor numbers under real world use. It's an interesting technique, I've played around with it a bit (actually using it in bcachefs for the nocow locks hash table), but no idea if it makes sense as a general purpose thing yet... you also mentioned that a motivation was API mismatch between rhashtable and dcache - could you elaborate on that?
On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote: > > Where I expect rosebush to shine is on dependent cache misses. > I've assumed an average chain length of 10 for rhashtable in the above > memory calculations. That means on average a lookup would take five cache > misses that can't be speculated. Rosebush does a linear walk of 4-byte Normally an rhashtable gets resized when it reaches 75% capacity so the average chain length should always be one. Cheers,
From: Herbert Xu > Sent: 24 February 2024 00:21 > > On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote: > > > > Where I expect rosebush to shine is on dependent cache misses. > > I've assumed an average chain length of 10 for rhashtable in the above > > memory calculations. That means on average a lookup would take five cache > > misses that can't be speculated. Rosebush does a linear walk of 4-byte > > Normally an rhashtable gets resized when it reaches 75% capacity > so the average chain length should always be one. The average length of non-empty hash chains is more interesting. You don't usually search for items in empty chains. The only way you'll get all the chains of length one is if you've carefully picked the data so that it hashed that way. I remember playing around with the elf symbol table for a browser and all its shared libraries. While the hash function is pretty trivial, it really didn't matter whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash chains were always long. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > > > Normally an rhashtable gets resized when it reaches 75% capacity > > so the average chain length should always be one. > > The average length of non-empty hash chains is more interesting. > You don't usually search for items in empty chains. > The only way you'll get all the chains of length one is if you've > carefully picked the data so that it hashed that way. Sure. But given the 75% capacity, you'd need a really bad hash function to get an *average* (not worst-case) chain length of 10. > I remember playing around with the elf symbol table for a browser > and all its shared libraries. > While the hash function is pretty trivial, it really didn't matter > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > chains were always long. Even in the unlikely event of bad luck and everything bunches up together, we change theh hash function (through hash_rnd) every time we resize so you would expect things to even out after the resize event. A rehash is also automatically triggered if the worst-case chain length exceeds 16. Cheers,
On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > From: Herbert Xu > > Sent: 24 February 2024 00:21 > > > > On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote: > > > > > > Where I expect rosebush to shine is on dependent cache misses. > > > I've assumed an average chain length of 10 for rhashtable in the above > > > memory calculations. That means on average a lookup would take five cache > > > misses that can't be speculated. Rosebush does a linear walk of 4-byte > > > > Normally an rhashtable gets resized when it reaches 75% capacity > > so the average chain length should always be one. > > The average length of non-empty hash chains is more interesting. > You don't usually search for items in empty chains. > The only way you'll get all the chains of length one is if you've > carefully picked the data so that it hashed that way. > > I remember playing around with the elf symbol table for a browser > and all its shared libraries. > While the hash function is pretty trivial, it really didn't matter > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > chains were always long. that's a pretty bad hash, even golden ratio hash would be better, but still bad; you really should be using at least jhash. you really want a good avalanche effect, because in real world usage your entropy is often only in a relatively few bits. when I implemented cuckoo (which is more obviously sensitive to a weak hash function), I had to go with siphash, even jhash wasn't giving me great reslts. and looking at the code it's not hard to see why, it's all adds, and the rotates are byte aligned... you want mixed adds and xors and the rotates to be more prime-ish. right idea, just old... what would be ideal is something more like siphash, but with fewer rounds, so same number of instructions as jhash. xxhash might fit the bill, I haven't looked at the code yet...
On Sun, Feb 25, 2024 at 08:50:36AM +0800, Herbert Xu wrote: > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > > > > > Normally an rhashtable gets resized when it reaches 75% capacity > > > so the average chain length should always be one. > > > > The average length of non-empty hash chains is more interesting. > > You don't usually search for items in empty chains. > > The only way you'll get all the chains of length one is if you've > > carefully picked the data so that it hashed that way. > > Sure. But given the 75% capacity, you'd need a really bad hash > function to get an *average* (not worst-case) chain length of > 10. > > > I remember playing around with the elf symbol table for a browser > > and all its shared libraries. > > While the hash function is pretty trivial, it really didn't matter > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > > chains were always long. > > Even in the unlikely event of bad luck and everything bunches up > together, we change theh hash function (through hash_rnd) every > time we resize so you would expect things to even out after the > resize event. > > A rehash is also automatically triggered if the worst-case chain > length exceeds 16. 16!? that's crap, use a decent hash function and 3-5 should be your worst upper bound.
On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote: > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > > I remember playing around with the elf symbol table for a browser > > and all its shared libraries. > > While the hash function is pretty trivial, it really didn't matter > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > > chains were always long. > > that's a pretty bad hash, even golden ratio hash would be better, but > still bad; you really should be using at least jhash. There's a "fun" effect; essentially the "biased observer" effect which leads students to erroneously conclude that the majority of classes are oversubscribed. As somebody observed in this thread, for some usecases you only look up hashes which actually exist. Task a trivial example where you have four entries unevenly distributed between two buckets, three in one bucket and one in the other. Now 3/4 of your lookups hit in one bucket and 1/4 in the other bucket. Obviously it's not as pronounced if you have 1000 buckets with 1000 entries randomly distributed between the buckets. But that distribution is not nearly as even as you might expect: $ ./distrib 0: 362 1: 371 2: 193 3: 57 4: 13 5: 4 That's using lrand48() to decide which bucket to use, so not even a "quality of hash" problem, just a "your mathematical intuition may not be right here". To put this data another way, 371 entries are in a bucket with a single entry, 384 are in a bucket with two entries, 171 are in a 3-entry bucket, 52 are in a 4-entry bucket and 20 are in a 5-entry bucket. $ cat distrib.c #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> int bucket[1000]; int freq[10]; int main(int argc, char **argv) { int i; for (i = 0; i < 1000; i++) bucket[lrand48() % 1000]++; for (i = 0; i < 1000; i++) freq[bucket[i]]++; for (i = 0; i < 10; i++) printf("%d: %d\n", i, freq[i]); return 0; } (ok, quibble about "well, 1000 doesn't divide INT_MAX evenly so your random number generation is biased", but i maintain that will not materially affect these results due to it affecting only 0.00003% of numbers generated)
On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote: > > Task a trivial example where you have four entries unevenly distributed > between two buckets, three in one bucket and one in the other. Now 3/4 > of your lookups hit in one bucket and 1/4 in the other bucket. > Obviously it's not as pronounced if you have 1000 buckets with 1000 > entries randomly distributed between the buckets. But that distribution > is not nearly as even as you might expect: > > $ ./distrib > 0: 362 > 1: 371 > 2: 193 > 3: 57 > 4: 13 > 5: 4 Indeed, that's why rhashtable only triggers a forced rehash at a chain length of 16 even though we expect the average chain length to be just 1. The theoretical worst-case value is expected to be O(lg n/lg lg n). However, I think 16 was picked because it was sufficient even for a hash table that filled all memory. Of course if anyone can provide some calculation showing that this is insufficient I'm happy to raise the limit. Cheers,
On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote: > On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote: > > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote: > > > I remember playing around with the elf symbol table for a browser > > > and all its shared libraries. > > > While the hash function is pretty trivial, it really didn't matter > > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash > > > chains were always long. > > > > that's a pretty bad hash, even golden ratio hash would be better, but > > still bad; you really should be using at least jhash. > > There's a "fun" effect; essentially the "biased observer" effect which > leads students to erroneously conclude that the majority of classes are > oversubscribed. As somebody observed in this thread, for some usecases > you only look up hashes which actually exist. > > Task a trivial example where you have four entries unevenly distributed > between two buckets, three in one bucket and one in the other. Now 3/4 > of your lookups hit in one bucket and 1/4 in the other bucket. > Obviously it's not as pronounced if you have 1000 buckets with 1000 > entries randomly distributed between the buckets. But that distribution > is not nearly as even as you might expect: > > $ ./distrib > 0: 362 > 1: 371 > 2: 193 > 3: 57 > 4: 13 > 5: 4 > > That's using lrand48() to decide which bucket to use, so not even a > "quality of hash" problem, just a "your mathematical intuition may not > be right here". well, golden ratio hash - hash_32(i, 32) 0: 368 1: 264 2: 368 3: 0 but your distribution actually is accurate in general, golden ratio hash is relly nice for sequential integers. the actual problem with your test is that you're testing 100% occupancy - no one does that. 75% occupancy, siphash: 0: 933 1: 60 2: 6 3: 1 4: 0 that looks about right to me.
On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote: > > but your distribution actually is accurate in general, golden ratio hash > is relly nice for sequential integers. the actual problem with your test > is that you're testing 100% occupancy - no one does that. > > 75% occupancy, siphash: > 0: 933 > 1: 60 > 2: 6 > 3: 1 > 4: 0 > > that looks about right to me. The point is that the worst-case length grows with the size of the table so it won't always be 3. You need to take into account the largest table size that you will support. Cheers,
On Sun, Feb 25, 2024 at 01:53:35PM +0800, Herbert Xu wrote: > On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote: > > > > but your distribution actually is accurate in general, golden ratio hash > > is relly nice for sequential integers. the actual problem with your test > > is that you're testing 100% occupancy - no one does that. > > > > 75% occupancy, siphash: > > 0: 933 > > 1: 60 > > 2: 6 > > 3: 1 > > 4: 0 > > > > that looks about right to me. > > The point is that the worst-case length grows with the size of > the table so it won't always be 3. You need to take into account > the largest table size that you will support. ok, but - one million entries, siphash, 75% fill factor 0: 472053 1: 354786 2: 132663 3: 33267 4: 6218 5: 884 6: 110 7: 17 8: 2 9: 0 100 million: 0: 51342703 1: 34224025 2: 11413241 3: 2534946 4: 421816 5: 56271 6: 6346 7: 593 8: 56 9: 3 10: 0 it's a log curve - chain length of 16 means you picked a bad hash function.
On Sun, Feb 25, 2024 at 01:14:39AM -0500, Kent Overstreet wrote: > > it's a log curve - Yes it's O(log N/log log N). > chain length of 16 means you picked a bad hash > function. Or that someone is trying to attack you. The number 16 is the cut-off where we decide that someone has discovered our hash secret and can cause a particular to chain to grow without bound. At this point rhashtable will force a rehash with a different secret. Cheers,
From: Kent Overstreet > Sent: 25 February 2024 03:19 .. > when I implemented cuckoo (which is more obviously sensitive to a weak > hash function), I had to go with siphash, even jhash wasn't giving me > great reslts. and looking at the code it's not hard to see why, it's all > adds, and the rotates are byte aligned... you want mixed adds and xors > and the rotates to be more prime-ish. > > right idea, just old... > > what would be ideal is something more like siphash, but with fewer > rounds, so same number of instructions as jhash. xxhash might fit the > bill, I haven't looked at the code yet... There is likely to be a point where scanning a list of values for the right hash value is faster than executing a hash function that is good enough to separate them to separate buckets. You don't want to scan a linked list because they have horrid cache footprints. The locking is equally horrid - especially for remove. Arrays of pointers ar ethe way forward :-) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, Feb 25, 2024 at 02:47:45PM +0000, David Laight wrote: > From: Kent Overstreet > > Sent: 25 February 2024 03:19 > .. > > when I implemented cuckoo (which is more obviously sensitive to a weak > > hash function), I had to go with siphash, even jhash wasn't giving me > > great reslts. and looking at the code it's not hard to see why, it's all > > adds, and the rotates are byte aligned... you want mixed adds and xors > > and the rotates to be more prime-ish. > > > > right idea, just old... > > > > what would be ideal is something more like siphash, but with fewer > > rounds, so same number of instructions as jhash. xxhash might fit the > > bill, I haven't looked at the code yet... > > There is likely to be a point where scanning a list of values > for the right hash value is faster than executing a hash function > that is good enough to separate them to separate buckets. Executing a bunch of alu ops that parallelize nicely is _fast_ (it's all xors/adds/rotates, chacha20 is a good example of how the modern stuff works). Also, for 2-way cuckoo there's an xor trick so you don't need to compute a second hash. But the key thing about cuckoo is that the loads on lookup _aren't dependent_ - they can run in parallel. Every cache miss that goes all the way to DRAM is stupidly expensive, remember - hundreds of instructions, had you been able to keep the pipeline fed. > You don't want to scan a linked list because they have horrid > cache footprints. > The locking is equally horrid - especially for remove. > Arrays of pointers ar ethe way forward :-) Well, maybe; I'm waiting to see fill factor numbers and benchmarks. Fill factor was my concern when I was playing around with the concept; I used it in a place where the hash table didn't need to be that big, and the point was to avoid having to separately allocate the entries (and it avoids the hassle of tombstone entries with linear/quadratic probing). The fact that it requires a second dependent load, because buckets are separately allocated, also looks like a big negative to me. I still think a good lockless cuckoo hashing implementation ought to beat it.