diff mbox series

[v2,02/39] Add the MurmurHash3 fast hashing algorithm.

Message ID 20230523214539.226387-3-corwin@redhat.com (mailing list archive)
State New, archived
Headers show
Series Add the dm-vdo deduplication and compression device mapper target. | expand

Commit Message

corwin May 23, 2023, 9:45 p.m. UTC
MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
written by Austin Appleby and placed in the public domain. This version has
been modified to produce the same result on both big endian and little
endian processors, making it suitable for use in portable persistent data.

Signed-off-by: J. corwin Coburn <corwin@redhat.com>
---
 drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
 drivers/md/dm-vdo/murmurhash3.h |  15 +++
 2 files changed, 190 insertions(+)
 create mode 100644 drivers/md/dm-vdo/murmurhash3.c
 create mode 100644 drivers/md/dm-vdo/murmurhash3.h

Comments

Eric Biggers May 23, 2023, 10:06 p.m. UTC | #1
On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
> written by Austin Appleby and placed in the public domain. This version has
> been modified to produce the same result on both big endian and little
> endian processors, making it suitable for use in portable persistent data.
> 
> Signed-off-by: J. corwin Coburn <corwin@redhat.com>
> ---
>  drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
>  drivers/md/dm-vdo/murmurhash3.h |  15 +++
>  2 files changed, 190 insertions(+)
>  create mode 100644 drivers/md/dm-vdo/murmurhash3.c
>  create mode 100644 drivers/md/dm-vdo/murmurhash3.h

Do we really need yet another hash algorithm?

xxHash is another very fast non-cryptographic hash algorithm that is already
available in the kernel (lib/xxhash.c).

- Eric
corwin May 23, 2023, 10:13 p.m. UTC | #2
On 5/23/23 6:06 PM, Eric Biggers wrote:
> On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
>> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
>> written by Austin Appleby and placed in the public domain. This version has
>> been modified to produce the same result on both big endian and little
>> endian processors, making it suitable for use in portable persistent data.
>>
>> Signed-off-by: J. corwin Coburn <corwin@redhat.com>
>> ---
>>   drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
>>   drivers/md/dm-vdo/murmurhash3.h |  15 +++
>>   2 files changed, 190 insertions(+)
>>   create mode 100644 drivers/md/dm-vdo/murmurhash3.c
>>   create mode 100644 drivers/md/dm-vdo/murmurhash3.h
> 
> Do we really need yet another hash algorithm?
> 
> xxHash is another very fast non-cryptographic hash algorithm that is already
> available in the kernel (lib/xxhash.c).
> 
> - Eric

The main reason why vdo uses Murmur3 and not xxHash is that vdo has been 
in deployment since 2013, and, if I am reading correctly, xxHash did not 
have a 128 bit variant until 2019.

It would certainly be possible to switch vdo to xxHash, but it would be 
problematic for existing users.

corwin
Eric Biggers May 23, 2023, 10:25 p.m. UTC | #3
On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
> On 5/23/23 6:06 PM, Eric Biggers wrote:
> > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
> > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
> > > written by Austin Appleby and placed in the public domain. This version has
> > > been modified to produce the same result on both big endian and little
> > > endian processors, making it suitable for use in portable persistent data.
> > > 
> > > Signed-off-by: J. corwin Coburn <corwin@redhat.com>
> > > ---
> > >   drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
> > >   drivers/md/dm-vdo/murmurhash3.h |  15 +++
> > >   2 files changed, 190 insertions(+)
> > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.c
> > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.h
> > 
> > Do we really need yet another hash algorithm?
> > 
> > xxHash is another very fast non-cryptographic hash algorithm that is already
> > available in the kernel (lib/xxhash.c).
> > 
> > - Eric
> 
> The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in
> deployment since 2013, and, if I am reading correctly, xxHash did not have a
> 128 bit variant until 2019.

Why do you need a 128-bit non-cryptographic hash algorithm?  What problem are
you trying to solve?

> 
> It would certainly be possible to switch vdo to xxHash, but it would be
> problematic for existing users.
> 

Well, this commit doesn't mention that the choice was forced for compatibility
reasons.

It sounds like the on-disk format (and presumably the UAPI, too?) for dm-vdo was
already set in stone before it was ever sent out for review.

That takes away some motivation to bother reviewing it, since many review
comments will be met with "we won't change this"...

- Eric
Eric Biggers May 23, 2023, 11:06 p.m. UTC | #4
On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote:
> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
> > On 5/23/23 6:06 PM, Eric Biggers wrote:
> > > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
> > > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
> > > > written by Austin Appleby and placed in the public domain. This version has
> > > > been modified to produce the same result on both big endian and little
> > > > endian processors, making it suitable for use in portable persistent data.
> > > > 
> > > > Signed-off-by: J. corwin Coburn <corwin@redhat.com>
> > > > ---
> > > >   drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
> > > >   drivers/md/dm-vdo/murmurhash3.h |  15 +++
> > > >   2 files changed, 190 insertions(+)
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.c
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.h
> > > 
> > > Do we really need yet another hash algorithm?
> > > 
> > > xxHash is another very fast non-cryptographic hash algorithm that is already
> > > available in the kernel (lib/xxhash.c).
> > > 
> > > - Eric
> > 
> > The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in
> > deployment since 2013, and, if I am reading correctly, xxHash did not have a
> > 128 bit variant until 2019.
> 
> Why do you need a 128-bit non-cryptographic hash algorithm?  What problem are
> you trying to solve?

To elaborate a bit: the reason this seems strange to me is that when people say
they need a 128-bit (or larger) non-cryptographic hash function, usually they
are either mistaken and 64-bit would suffice, or they actually need a
cryptographic hash function.

In this case, the hash function seems to be used for data deduplication.  This
is unusual, since data deduplication normally requires a cryptographic hash
algorithm.

I see that this is touched on by the following paragraph in vdo-design.rst
(though it incorrectly calls the hash an "identifier for the block"):

    Each block of data is hashed to produce a 16-byte block name which serves as
    an identifier for the block. An index record consists of this block name
    paired with the presumed location of that data on the underlying storage.
    However, it is not possible to guarantee that the index is accurate. Most
    often, this occurs because it is too costly to update the index when a block
    is over-written or discarded. Doing so would require either storing the
    block name along with the blocks, which is difficult to do efficiently in
    block-based storage, or reading and rehashing each block before overwriting
    it. Inaccuracy can also result from a hash collision where two different
    blocks have the same name. In practice, this is extremely unlikely, but
    because vdo does not use a cryptographic hash, a malicious workload can be
    constructed. Because of these inaccuracies, vdo treats the locations in the
    index as hints, and reads each indicated block to verify that it is indeed a
    duplicate before sharing the existing block with a new one.

So, dm-vdo handles hash collisions by reading back and verifying that the data
matches before allowing deduplication.

That solves the security problem.  However, it does seem strange, and it's more
complex than the usual solution of just using a cryptographic hash.  Note that
cryptographic hashing is very fast on modern CPUs with modern algorithms.

So, some more details about the rationale for designing the data deduplication
in this (IMO unusual) way should be included.

- Eric
corwin May 24, 2023, 4:15 a.m. UTC | #5
On 5/23/23 7:06 PM, Eric Biggers wrote:
 > On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote:
 >> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
 >>> On 5/23/23 6:06 PM, Eric Biggers wrote:
 >>>> On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
 >>>>> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was 
originally
 >>>>> written by Austin Appleby and placed in the public domain. This 
version has
 >>>>> been modified to produce the same result on both big endian and 
little
 >>>>> endian processors, making it suitable for use in portable 
persistent data.
 >>>>>
 >>>>> Signed-off-by: J. corwin Coburn <corwin@redhat.com>
 >>>>> ---
 >>>>>    drivers/md/dm-vdo/murmurhash3.c | 175 
++++++++++++++++++++++++++++++++
 >>>>>    drivers/md/dm-vdo/murmurhash3.h |  15 +++
 >>>>>    2 files changed, 190 insertions(+)
 >>>>>    create mode 100644 drivers/md/dm-vdo/murmurhash3.c
 >>>>>    create mode 100644 drivers/md/dm-vdo/murmurhash3.h
 >>>>
 >>>> Do we really need yet another hash algorithm?
 >>>>
 >>>> xxHash is another very fast non-cryptographic hash algorithm that 
is already
 >>>> available in the kernel (lib/xxhash.c).
 >>>>
 >>>> - Eric
 >>>
 >>> The main reason why vdo uses Murmur3 and not xxHash is that vdo has 
been in
 >>> deployment since 2013, and, if I am reading correctly, xxHash did 
not have a
 >>> 128 bit variant until 2019.

Before I dive into the more technical details of the index, let me
elaborate a bit as well. One of the key ideas in the design of vdo is
that the index is largely independent of the data store. The index is
used only for deduplication hints and is not required to read and
write data.

As such, switching hash algorithms is not out of the question. It
would be inconvenient for existing deployments as their existing
indexes would cease to work after an upgrade, and so they would lose
deduplication on new writes against data written before the
upgrade. However, there would be no loss of data. As such, if we
determine that avoiding this issue for existing deployments is not
worth the cost of supporting another hash algorithm, we are open to
making that change.

The on disk format of the data store portion of vdo is more amenable
to upgrade, so nothing is set in stone.

 >>
 >> Why do you need a 128-bit non-cryptographic hash algorithm?  What 
problem are
 >> you trying to solve?
 >
 > To elaborate a bit: the reason this seems strange to me is that when 
people say
 > they need a 128-bit (or larger) non-cryptographic hash function, 
usually they
 > are either mistaken and 64-bit would suffice, or they actually need a
 > cryptographic hash function.
 >
 > In this case, the hash function seems to be used for data 
deduplication.  This
 > is unusual, since data deduplication normally requires a 
cryptographic hash
 > algorithm.
 >
 > I see that this is touched on by the following paragraph in 
vdo-design.rst
 > (though it incorrectly calls the hash an "identifier for the block"):
 >
 >      Each block of data is hashed to produce a 16-byte block name 
which serves as
 >      an identifier for the block. An index record consists of this 
block name
 >      paired with the presumed location of that data on the underlying 
storage.
 >      However, it is not possible to guarantee that the index is 
accurate. Most
 >      often, this occurs because it is too costly to update the index 
when a block
 >      is over-written or discarded. Doing so would require either 
storing the
 >      block name along with the blocks, which is difficult to do 
efficiently in
 >      block-based storage, or reading and rehashing each block before 
overwriting
 >      it. Inaccuracy can also result from a hash collision where two 
different
 >      blocks have the same name. In practice, this is extremely 
unlikely, but
 >      because vdo does not use a cryptographic hash, a malicious 
workload can be
 >      constructed. Because of these inaccuracies, vdo treats the 
locations in the
 >      index as hints, and reads each indicated block to verify that it 
is indeed a
 >      duplicate before sharing the existing block with a new one.
 >
 > So, dm-vdo handles hash collisions by reading back and verifying that 
the data
 > matches before allowing deduplication.
 >
 > That solves the security problem.  However, it does seem strange, and 
it's more
 > complex than the usual solution of just using a cryptographic hash. 
Note that
 > cryptographic hashing is very fast on modern CPUs with modern algorithms.

Merely using a cryptographic hash doesn't solve the overwrite problem
(detailed in the paragraph you quote above). Since vdo has to do a
read-verify anyway, there is no benefit to using a stronger hash.

 >
 > So, some more details about the rationale for designing the data 
deduplication
 > in this (IMO unusual) way should be included.
 >
 > - Eric
 >

The design of the index is intended to provide sufficient performance
without using too much memory. The index lookup for old records (those
not so new that they haven't been written out yet) is detailed in
vdo-design.rst:

       In order to find records in older chapters, the index also
       maintains a higher level structure called the volume index,
       which contains entries mapping a block name to the chapter
       containing its newest record. This mapping is updated as records
       for the block name are copied or updated, ensuring that only the
       newer record for a given block name is findable. Older records
       for a block name can no longer be found even though they have
       not been deleted. Like the chapter index, the volume index uses
       only a subset of the block name as its key and can not
       definitively say that a record exists for a name. It can only
       say which chapter would contain the record if a record
       exists. The volume index is stored entirely in memory and is
       saved to storage only when the vdo target is shut down.

       From the viewpoint of a request for a particular block name,
       first it will look up the name in the volume index which will
       indicate either that the record is new, or which chapter to
       search. If the latter, the request looks up its name in the
       chapter index to determine if the record is new, or which record
       page to search. Finally, if not new, the request will look for
       its record on the indicated record page. This process may
       require up to two page reads per request (one for the chapter
       index page and one for the request page). However, recently
       accessed pages are cached so that these page reads can be
       amortized across many block name requests.

       The volume index and the chapter indexes are implemented using a
       memory-efficient structure called a delta index. Instead of
       storing the entire key (the block name) for each entry, the
       entries are sorted by name and only the difference between
       adjacent keys (the delta) is stored. Because we expect the
       hashes to be evenly distributed, the size of the deltas follows
       an exponential distribution. Because of this distribution, the
       deltas are expressed in a Huffman code to take up even less
       space. The entire sorted list of keys is called a delta
       list. This structure allows the index to use many fewer bytes
       per entry than a traditional hash table, but it is slightly more
       expensive to look up entries, because a request must read every
       entry in a delta list to add up the deltas in order to find the
       record it needs. The delta index reduces this lookup cost by
       splitting its key space into many sub-lists, each starting at a
       fixed key value, so that each individual list is short.

Even with the compression inherent in the delta list structure,
keeping the entire delta for each entry in memory is too costly. The
multi-level structure of the index allows us to reduce the memory size
while still achieving acceptable performance. The volume index is
computed on only a portion of the full hash, greatly reducing the size
of each entry. Similarly, the chapter index is computed on a different
subset of the full hash thereby reducing the amount of data that needs
to be read and cached. These savings come at the cost of false
positives in both the volume index and the chapter index, each of
which result in spurious reads and hence reduced performance.

In order to bound the reading required for each lookup, when the
volume index needs to store multiple entries whose volume index bits
collide (but whose full hashed do not), all but the first entry are
stored not as deltas but as "collision records" which contain the full
128 bit hash. These records ensure that any lookup requires reading at
most one index page and one record page. However, collision records
consume much more memory than normal entries.

The number of bits used by the volume and chapter index are tuned to
balance memory against the false positive rate (note that the false
positive rate in the chapter index would be significantly higher if
the chapter index bits were to overlap with the volume index bits). We
have found that 8 bytes for the volume index and 6 for the chapter
index give good performance with an acceptable memory budget
(approximately 3 bytes per entry).

The index also supports a "sparse" mode which uses ten times as many
chapters as the default, "dense" mode with a volume index of the same
size. This mode leverages the temporal locality which is usually
present in workloads with high dedupe rates. Rather than keeping every
entry in the volume index, only 1/32 of entries in most of the index
are retained in the volume index. On lookup, if a hash is not found in
the volume index, the cached chapter indexes are also searched before
deciding the entry is new. 5 of the remainin bits of the hash are used
to select which entries are retained in the sparse portion of the
volume index.

corwin
diff mbox series

Patch

diff --git a/drivers/md/dm-vdo/murmurhash3.c b/drivers/md/dm-vdo/murmurhash3.c
new file mode 100644
index 00000000000..c68d0d15390
--- /dev/null
+++ b/drivers/md/dm-vdo/murmurhash3.c
@@ -0,0 +1,175 @@ 
+// SPDX-License-Identifier: LGPL-2.1+
+/*
+ * MurmurHash3 was written by Austin Appleby, and is placed in the public
+ * domain. The author hereby disclaims copyright to this source code.
+ *
+ * Adapted by John Wiele (jwiele@redhat.com).
+ */
+
+#include "murmurhash3.h"
+
+static inline u64 rotl64(u64 x, s8 r)
+{
+	return (x << r) | (x >> (64 - r));
+}
+
+#define ROTL64(x, y) rotl64(x, y)
+static __always_inline u64 getblock64(const u64 *p, int i)
+{
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	return p[i];
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	return __builtin_bswap64(p[i]);
+#else
+#error "can't figure out byte order"
+#endif
+}
+
+static __always_inline void putblock64(u64 *p, int i, u64 value)
+{
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	p[i] = value;
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	p[i] = __builtin_bswap64(value);
+#else
+#error "can't figure out byte order"
+#endif
+}
+
+/* Finalization mix - force all bits of a hash block to avalanche */
+
+static __always_inline u64 fmix64(u64 k)
+{
+	k ^= k >> 33;
+	k *= 0xff51afd7ed558ccdLLU;
+	k ^= k >> 33;
+	k *= 0xc4ceb9fe1a85ec53LLU;
+	k ^= k >> 33;
+
+	return k;
+}
+
+void murmurhash3_128(const void *key, const int len, const u32 seed, void *out)
+{
+	const u8 *data = (const u8 *)key;
+	const int nblocks = len / 16;
+
+	u64 h1 = seed;
+	u64 h2 = seed;
+
+	const u64 c1 = 0x87c37b91114253d5LLU;
+	const u64 c2 = 0x4cf5ad432745937fLLU;
+
+	/* body */
+
+	const u64 *blocks = (const u64 *)(data);
+
+	int i;
+
+	for (i = 0; i < nblocks; i++) {
+		u64 k1 = getblock64(blocks, i * 2 + 0);
+		u64 k2 = getblock64(blocks, i * 2 + 1);
+
+		k1 *= c1;
+		k1 = ROTL64(k1, 31);
+		k1 *= c2;
+		h1 ^= k1;
+
+		h1 = ROTL64(h1, 27);
+		h1 += h2;
+		h1 = h1 * 5 + 0x52dce729;
+
+		k2 *= c2;
+		k2 = ROTL64(k2, 33);
+		k2 *= c1;
+		h2 ^= k2;
+
+		h2 = ROTL64(h2, 31);
+		h2 += h1;
+		h2 = h2 * 5 + 0x38495ab5;
+	}
+
+	/* tail */
+
+	{
+		const u8 *tail = (const u8 *)(data + nblocks * 16);
+
+		u64 k1 = 0;
+		u64 k2 = 0;
+
+		switch (len & 15) {
+		case 15:
+			k2 ^= ((u64)tail[14]) << 48;
+			fallthrough;
+		case 14:
+			k2 ^= ((u64)tail[13]) << 40;
+			fallthrough;
+		case 13:
+			k2 ^= ((u64)tail[12]) << 32;
+			fallthrough;
+		case 12:
+			k2 ^= ((u64)tail[11]) << 24;
+			fallthrough;
+		case 11:
+			k2 ^= ((u64)tail[10]) << 16;
+			fallthrough;
+		case 10:
+			k2 ^= ((u64)tail[9]) << 8;
+			fallthrough;
+		case 9:
+			k2 ^= ((u64)tail[8]) << 0;
+			k2 *= c2;
+			k2 = ROTL64(k2, 33);
+			k2 *= c1;
+			h2 ^= k2;
+			fallthrough;
+
+		case 8:
+			k1 ^= ((u64)tail[7]) << 56;
+			fallthrough;
+		case 7:
+			k1 ^= ((u64)tail[6]) << 48;
+			fallthrough;
+		case 6:
+			k1 ^= ((u64)tail[5]) << 40;
+			fallthrough;
+		case 5:
+			k1 ^= ((u64)tail[4]) << 32;
+			fallthrough;
+		case 4:
+			k1 ^= ((u64)tail[3]) << 24;
+			fallthrough;
+		case 3:
+			k1 ^= ((u64)tail[2]) << 16;
+			fallthrough;
+		case 2:
+			k1 ^= ((u64)tail[1]) << 8;
+			fallthrough;
+		case 1:
+			k1 ^= ((u64)tail[0]) << 0;
+			k1 *= c1;
+			k1 = ROTL64(k1, 31);
+			k1 *= c2;
+			h1 ^= k1;
+			break;
+		default:
+			break;
+		};
+	}
+	/* finalization */
+
+	h1 ^= len;
+	h2 ^= len;
+
+	h1 += h2;
+	h2 += h1;
+
+	h1 = fmix64(h1);
+	h2 = fmix64(h2);
+
+	h1 += h2;
+	h2 += h1;
+
+	putblock64((u64 *)out, 0, h1);
+	putblock64((u64 *)out, 1, h2);
+}
diff --git a/drivers/md/dm-vdo/murmurhash3.h b/drivers/md/dm-vdo/murmurhash3.h
new file mode 100644
index 00000000000..d84711ddb65
--- /dev/null
+++ b/drivers/md/dm-vdo/murmurhash3.h
@@ -0,0 +1,15 @@ 
+/* SPDX-License-Identifier: LGPL-2.1+ */
+/*
+ * MurmurHash3 was written by Austin Appleby, and is placed in the public
+ * domain. The author hereby disclaims copyright to this source code.
+ */
+
+#ifndef _MURMURHASH3_H_
+#define _MURMURHASH3_H_
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+void murmurhash3_128(const void *key, int len, u32 seed, void *out);
+
+#endif /* _MURMURHASH3_H_ */