Message ID | 20230523214539.226387-3-corwin@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add the dm-vdo deduplication and compression device mapper target. | expand |
On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote: > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally > written by Austin Appleby and placed in the public domain. This version has > been modified to produce the same result on both big endian and little > endian processors, making it suitable for use in portable persistent data. > > Signed-off-by: J. corwin Coburn <corwin@redhat.com> > --- > drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ > drivers/md/dm-vdo/murmurhash3.h | 15 +++ > 2 files changed, 190 insertions(+) > create mode 100644 drivers/md/dm-vdo/murmurhash3.c > create mode 100644 drivers/md/dm-vdo/murmurhash3.h Do we really need yet another hash algorithm? xxHash is another very fast non-cryptographic hash algorithm that is already available in the kernel (lib/xxhash.c). - Eric -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
On 5/23/23 6:06 PM, Eric Biggers wrote: > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote: >> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally >> written by Austin Appleby and placed in the public domain. This version has >> been modified to produce the same result on both big endian and little >> endian processors, making it suitable for use in portable persistent data. >> >> Signed-off-by: J. corwin Coburn <corwin@redhat.com> >> --- >> drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ >> drivers/md/dm-vdo/murmurhash3.h | 15 +++ >> 2 files changed, 190 insertions(+) >> create mode 100644 drivers/md/dm-vdo/murmurhash3.c >> create mode 100644 drivers/md/dm-vdo/murmurhash3.h > > Do we really need yet another hash algorithm? > > xxHash is another very fast non-cryptographic hash algorithm that is already > available in the kernel (lib/xxhash.c). > > - Eric The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in deployment since 2013, and, if I am reading correctly, xxHash did not have a 128 bit variant until 2019. It would certainly be possible to switch vdo to xxHash, but it would be problematic for existing users. corwin -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote: > On 5/23/23 6:06 PM, Eric Biggers wrote: > > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote: > > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally > > > written by Austin Appleby and placed in the public domain. This version has > > > been modified to produce the same result on both big endian and little > > > endian processors, making it suitable for use in portable persistent data. > > > > > > Signed-off-by: J. corwin Coburn <corwin@redhat.com> > > > --- > > > drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ > > > drivers/md/dm-vdo/murmurhash3.h | 15 +++ > > > 2 files changed, 190 insertions(+) > > > create mode 100644 drivers/md/dm-vdo/murmurhash3.c > > > create mode 100644 drivers/md/dm-vdo/murmurhash3.h > > > > Do we really need yet another hash algorithm? > > > > xxHash is another very fast non-cryptographic hash algorithm that is already > > available in the kernel (lib/xxhash.c). > > > > - Eric > > The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in > deployment since 2013, and, if I am reading correctly, xxHash did not have a > 128 bit variant until 2019. Why do you need a 128-bit non-cryptographic hash algorithm? What problem are you trying to solve? > > It would certainly be possible to switch vdo to xxHash, but it would be > problematic for existing users. > Well, this commit doesn't mention that the choice was forced for compatibility reasons. It sounds like the on-disk format (and presumably the UAPI, too?) for dm-vdo was already set in stone before it was ever sent out for review. That takes away some motivation to bother reviewing it, since many review comments will be met with "we won't change this"... - Eric -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote: > On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote: > > On 5/23/23 6:06 PM, Eric Biggers wrote: > > > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote: > > > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally > > > > written by Austin Appleby and placed in the public domain. This version has > > > > been modified to produce the same result on both big endian and little > > > > endian processors, making it suitable for use in portable persistent data. > > > > > > > > Signed-off-by: J. corwin Coburn <corwin@redhat.com> > > > > --- > > > > drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ > > > > drivers/md/dm-vdo/murmurhash3.h | 15 +++ > > > > 2 files changed, 190 insertions(+) > > > > create mode 100644 drivers/md/dm-vdo/murmurhash3.c > > > > create mode 100644 drivers/md/dm-vdo/murmurhash3.h > > > > > > Do we really need yet another hash algorithm? > > > > > > xxHash is another very fast non-cryptographic hash algorithm that is already > > > available in the kernel (lib/xxhash.c). > > > > > > - Eric > > > > The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in > > deployment since 2013, and, if I am reading correctly, xxHash did not have a > > 128 bit variant until 2019. > > Why do you need a 128-bit non-cryptographic hash algorithm? What problem are > you trying to solve? To elaborate a bit: the reason this seems strange to me is that when people say they need a 128-bit (or larger) non-cryptographic hash function, usually they are either mistaken and 64-bit would suffice, or they actually need a cryptographic hash function. In this case, the hash function seems to be used for data deduplication. This is unusual, since data deduplication normally requires a cryptographic hash algorithm. I see that this is touched on by the following paragraph in vdo-design.rst (though it incorrectly calls the hash an "identifier for the block"): Each block of data is hashed to produce a 16-byte block name which serves as an identifier for the block. An index record consists of this block name paired with the presumed location of that data on the underlying storage. However, it is not possible to guarantee that the index is accurate. Most often, this occurs because it is too costly to update the index when a block is over-written or discarded. Doing so would require either storing the block name along with the blocks, which is difficult to do efficiently in block-based storage, or reading and rehashing each block before overwriting it. Inaccuracy can also result from a hash collision where two different blocks have the same name. In practice, this is extremely unlikely, but because vdo does not use a cryptographic hash, a malicious workload can be constructed. Because of these inaccuracies, vdo treats the locations in the index as hints, and reads each indicated block to verify that it is indeed a duplicate before sharing the existing block with a new one. So, dm-vdo handles hash collisions by reading back and verifying that the data matches before allowing deduplication. That solves the security problem. However, it does seem strange, and it's more complex than the usual solution of just using a cryptographic hash. Note that cryptographic hashing is very fast on modern CPUs with modern algorithms. So, some more details about the rationale for designing the data deduplication in this (IMO unusual) way should be included. - Eric -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
On 5/23/23 7:06 PM, Eric Biggers wrote: > On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote: >> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote: >>> On 5/23/23 6:06 PM, Eric Biggers wrote: >>>> On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote: >>>>> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally >>>>> written by Austin Appleby and placed in the public domain. This version has >>>>> been modified to produce the same result on both big endian and little >>>>> endian processors, making it suitable for use in portable persistent data. >>>>> >>>>> Signed-off-by: J. corwin Coburn <corwin@redhat.com> >>>>> --- >>>>> drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ >>>>> drivers/md/dm-vdo/murmurhash3.h | 15 +++ >>>>> 2 files changed, 190 insertions(+) >>>>> create mode 100644 drivers/md/dm-vdo/murmurhash3.c >>>>> create mode 100644 drivers/md/dm-vdo/murmurhash3.h >>>> >>>> Do we really need yet another hash algorithm? >>>> >>>> xxHash is another very fast non-cryptographic hash algorithm that is already >>>> available in the kernel (lib/xxhash.c). >>>> >>>> - Eric >>> >>> The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in >>> deployment since 2013, and, if I am reading correctly, xxHash did not have a >>> 128 bit variant until 2019. Before I dive into the more technical details of the index, let me elaborate a bit as well. One of the key ideas in the design of vdo is that the index is largely independent of the data store. The index is used only for deduplication hints and is not required to read and write data. As such, switching hash algorithms is not out of the question. It would be inconvenient for existing deployments as their existing indexes would cease to work after an upgrade, and so they would lose deduplication on new writes against data written before the upgrade. However, there would be no loss of data. As such, if we determine that avoiding this issue for existing deployments is not worth the cost of supporting another hash algorithm, we are open to making that change. The on disk format of the data store portion of vdo is more amenable to upgrade, so nothing is set in stone. >> >> Why do you need a 128-bit non-cryptographic hash algorithm? What problem are >> you trying to solve? > > To elaborate a bit: the reason this seems strange to me is that when people say > they need a 128-bit (or larger) non-cryptographic hash function, usually they > are either mistaken and 64-bit would suffice, or they actually need a > cryptographic hash function. > > In this case, the hash function seems to be used for data deduplication. This > is unusual, since data deduplication normally requires a cryptographic hash > algorithm. > > I see that this is touched on by the following paragraph in vdo-design.rst > (though it incorrectly calls the hash an "identifier for the block"): > > Each block of data is hashed to produce a 16-byte block name which serves as > an identifier for the block. An index record consists of this block name > paired with the presumed location of that data on the underlying storage. > However, it is not possible to guarantee that the index is accurate. Most > often, this occurs because it is too costly to update the index when a block > is over-written or discarded. Doing so would require either storing the > block name along with the blocks, which is difficult to do efficiently in > block-based storage, or reading and rehashing each block before overwriting > it. Inaccuracy can also result from a hash collision where two different > blocks have the same name. In practice, this is extremely unlikely, but > because vdo does not use a cryptographic hash, a malicious workload can be > constructed. Because of these inaccuracies, vdo treats the locations in the > index as hints, and reads each indicated block to verify that it is indeed a > duplicate before sharing the existing block with a new one. > > So, dm-vdo handles hash collisions by reading back and verifying that the data > matches before allowing deduplication. > > That solves the security problem. However, it does seem strange, and it's more > complex than the usual solution of just using a cryptographic hash. Note that > cryptographic hashing is very fast on modern CPUs with modern algorithms. Merely using a cryptographic hash doesn't solve the overwrite problem (detailed in the paragraph you quote above). Since vdo has to do a read-verify anyway, there is no benefit to using a stronger hash. > > So, some more details about the rationale for designing the data deduplication > in this (IMO unusual) way should be included. > > - Eric > The design of the index is intended to provide sufficient performance without using too much memory. The index lookup for old records (those not so new that they haven't been written out yet) is detailed in vdo-design.rst: In order to find records in older chapters, the index also maintains a higher level structure called the volume index, which contains entries mapping a block name to the chapter containing its newest record. This mapping is updated as records for the block name are copied or updated, ensuring that only the newer record for a given block name is findable. Older records for a block name can no longer be found even though they have not been deleted. Like the chapter index, the volume index uses only a subset of the block name as its key and can not definitively say that a record exists for a name. It can only say which chapter would contain the record if a record exists. The volume index is stored entirely in memory and is saved to storage only when the vdo target is shut down. From the viewpoint of a request for a particular block name, first it will look up the name in the volume index which will indicate either that the record is new, or which chapter to search. If the latter, the request looks up its name in the chapter index to determine if the record is new, or which record page to search. Finally, if not new, the request will look for its record on the indicated record page. This process may require up to two page reads per request (one for the chapter index page and one for the request page). However, recently accessed pages are cached so that these page reads can be amortized across many block name requests. The volume index and the chapter indexes are implemented using a memory-efficient structure called a delta index. Instead of storing the entire key (the block name) for each entry, the entries are sorted by name and only the difference between adjacent keys (the delta) is stored. Because we expect the hashes to be evenly distributed, the size of the deltas follows an exponential distribution. Because of this distribution, the deltas are expressed in a Huffman code to take up even less space. The entire sorted list of keys is called a delta list. This structure allows the index to use many fewer bytes per entry than a traditional hash table, but it is slightly more expensive to look up entries, because a request must read every entry in a delta list to add up the deltas in order to find the record it needs. The delta index reduces this lookup cost by splitting its key space into many sub-lists, each starting at a fixed key value, so that each individual list is short. Even with the compression inherent in the delta list structure, keeping the entire delta for each entry in memory is too costly. The multi-level structure of the index allows us to reduce the memory size while still achieving acceptable performance. The volume index is computed on only a portion of the full hash, greatly reducing the size of each entry. Similarly, the chapter index is computed on a different subset of the full hash thereby reducing the amount of data that needs to be read and cached. These savings come at the cost of false positives in both the volume index and the chapter index, each of which result in spurious reads and hence reduced performance. In order to bound the reading required for each lookup, when the volume index needs to store multiple entries whose volume index bits collide (but whose full hashed do not), all but the first entry are stored not as deltas but as "collision records" which contain the full 128 bit hash. These records ensure that any lookup requires reading at most one index page and one record page. However, collision records consume much more memory than normal entries. The number of bits used by the volume and chapter index are tuned to balance memory against the false positive rate (note that the false positive rate in the chapter index would be significantly higher if the chapter index bits were to overlap with the volume index bits). We have found that 8 bytes for the volume index and 6 for the chapter index give good performance with an acceptable memory budget (approximately 3 bytes per entry). The index also supports a "sparse" mode which uses ten times as many chapters as the default, "dense" mode with a volume index of the same size. This mode leverages the temporal locality which is usually present in workloads with high dedupe rates. Rather than keeping every entry in the volume index, only 1/32 of entries in most of the index are retained in the volume index. On lookup, if a hash is not found in the volume index, the cached chapter indexes are also searched before deciding the entry is new. 5 of the remainin bits of the hash are used to select which entries are retained in the sparse portion of the volume index. corwin -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
diff --git a/drivers/md/dm-vdo/murmurhash3.c b/drivers/md/dm-vdo/murmurhash3.c new file mode 100644 index 00000000000..c68d0d15390 --- /dev/null +++ b/drivers/md/dm-vdo/murmurhash3.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: LGPL-2.1+ +/* + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. The author hereby disclaims copyright to this source code. + * + * Adapted by John Wiele (jwiele@redhat.com). + */ + +#include "murmurhash3.h" + +static inline u64 rotl64(u64 x, s8 r) +{ + return (x << r) | (x >> (64 - r)); +} + +#define ROTL64(x, y) rotl64(x, y) +static __always_inline u64 getblock64(const u64 *p, int i) +{ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + return p[i]; +#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + return __builtin_bswap64(p[i]); +#else +#error "can't figure out byte order" +#endif +} + +static __always_inline void putblock64(u64 *p, int i, u64 value) +{ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + p[i] = value; +#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + p[i] = __builtin_bswap64(value); +#else +#error "can't figure out byte order" +#endif +} + +/* Finalization mix - force all bits of a hash block to avalanche */ + +static __always_inline u64 fmix64(u64 k) +{ + k ^= k >> 33; + k *= 0xff51afd7ed558ccdLLU; + k ^= k >> 33; + k *= 0xc4ceb9fe1a85ec53LLU; + k ^= k >> 33; + + return k; +} + +void murmurhash3_128(const void *key, const int len, const u32 seed, void *out) +{ + const u8 *data = (const u8 *)key; + const int nblocks = len / 16; + + u64 h1 = seed; + u64 h2 = seed; + + const u64 c1 = 0x87c37b91114253d5LLU; + const u64 c2 = 0x4cf5ad432745937fLLU; + + /* body */ + + const u64 *blocks = (const u64 *)(data); + + int i; + + for (i = 0; i < nblocks; i++) { + u64 k1 = getblock64(blocks, i * 2 + 0); + u64 k2 = getblock64(blocks, i * 2 + 1); + + k1 *= c1; + k1 = ROTL64(k1, 31); + k1 *= c2; + h1 ^= k1; + + h1 = ROTL64(h1, 27); + h1 += h2; + h1 = h1 * 5 + 0x52dce729; + + k2 *= c2; + k2 = ROTL64(k2, 33); + k2 *= c1; + h2 ^= k2; + + h2 = ROTL64(h2, 31); + h2 += h1; + h2 = h2 * 5 + 0x38495ab5; + } + + /* tail */ + + { + const u8 *tail = (const u8 *)(data + nblocks * 16); + + u64 k1 = 0; + u64 k2 = 0; + + switch (len & 15) { + case 15: + k2 ^= ((u64)tail[14]) << 48; + fallthrough; + case 14: + k2 ^= ((u64)tail[13]) << 40; + fallthrough; + case 13: + k2 ^= ((u64)tail[12]) << 32; + fallthrough; + case 12: + k2 ^= ((u64)tail[11]) << 24; + fallthrough; + case 11: + k2 ^= ((u64)tail[10]) << 16; + fallthrough; + case 10: + k2 ^= ((u64)tail[9]) << 8; + fallthrough; + case 9: + k2 ^= ((u64)tail[8]) << 0; + k2 *= c2; + k2 = ROTL64(k2, 33); + k2 *= c1; + h2 ^= k2; + fallthrough; + + case 8: + k1 ^= ((u64)tail[7]) << 56; + fallthrough; + case 7: + k1 ^= ((u64)tail[6]) << 48; + fallthrough; + case 6: + k1 ^= ((u64)tail[5]) << 40; + fallthrough; + case 5: + k1 ^= ((u64)tail[4]) << 32; + fallthrough; + case 4: + k1 ^= ((u64)tail[3]) << 24; + fallthrough; + case 3: + k1 ^= ((u64)tail[2]) << 16; + fallthrough; + case 2: + k1 ^= ((u64)tail[1]) << 8; + fallthrough; + case 1: + k1 ^= ((u64)tail[0]) << 0; + k1 *= c1; + k1 = ROTL64(k1, 31); + k1 *= c2; + h1 ^= k1; + break; + default: + break; + }; + } + /* finalization */ + + h1 ^= len; + h2 ^= len; + + h1 += h2; + h2 += h1; + + h1 = fmix64(h1); + h2 = fmix64(h2); + + h1 += h2; + h2 += h1; + + putblock64((u64 *)out, 0, h1); + putblock64((u64 *)out, 1, h2); +} diff --git a/drivers/md/dm-vdo/murmurhash3.h b/drivers/md/dm-vdo/murmurhash3.h new file mode 100644 index 00000000000..d84711ddb65 --- /dev/null +++ b/drivers/md/dm-vdo/murmurhash3.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: LGPL-2.1+ */ +/* + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. The author hereby disclaims copyright to this source code. + */ + +#ifndef _MURMURHASH3_H_ +#define _MURMURHASH3_H_ + +#include <linux/compiler.h> +#include <linux/types.h> + +void murmurhash3_128(const void *key, int len, u32 seed, void *out); + +#endif /* _MURMURHASH3_H_ */
MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally written by Austin Appleby and placed in the public domain. This version has been modified to produce the same result on both big endian and little endian processors, making it suitable for use in portable persistent data. Signed-off-by: J. corwin Coburn <corwin@redhat.com> --- drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/murmurhash3.h | 15 +++ 2 files changed, 190 insertions(+) create mode 100644 drivers/md/dm-vdo/murmurhash3.c create mode 100644 drivers/md/dm-vdo/murmurhash3.h