[v7,01/20] btrfs: dedup: Introduce dedup framework and its header

Message ID 1455774178-3595-2-git-send-email-quwenruo@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Qu Wenruo Feb. 18, 2016, 5:42 a.m. UTC
From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the header for the btrfs online (write-time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedup
methods and 1 dedup hash.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h   |   5 +++
 fs/btrfs/dedup.h   | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |   1 +
 3 files changed, 133 insertions(+)
 create mode 100644 fs/btrfs/dedup.h

Comments

NeilBrown March 9, 2016, 9:27 p.m. UTC | #1
On Thu, Feb 18 2016, Qu Wenruo wrote:

> +
> +/*
> + * Dedup storage backend
> + * On disk is persist storage but overhead is large
> + * In memory is fast but will lose all its hash on umount
> + */
> +#define BTRFS_DEDUP_BACKEND_INMEMORY		0
> +#define BTRFS_DEDUP_BACKEND_ONDISK		1
> +#define BTRFS_DEDUP_BACKEND_LAST		2

Hi,

This may seem petty, but I'm here to complain about the names. :-)

Firstly, "2" is *not* the LAST backend.  The LAST backend is clearly
"ONDISK", which is "1".
"2" is the number of backends, or the count of them.
So
> +#define BTRFS_DEDUP_BACKEND_LAST		1

would be OK, as would

> +#define BTRFS_DEDUP_BACKEND_COUNT		2

but what you have is wrong.

The place where you use this define:

+	if (backend >= BTRFS_DEDUP_BACKEND_LAST)
+		return -EINVAL;

is correct, but it looks wrong.  It looks like it is saying that it is
invalid to use the LAST backend!
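
To illustrate the naming point, here is a minimal userspace sketch of the
same bounds check using the COUNT name. The helper check_backend() is
hypothetical and only for illustration; it is not part of the patch.

```c
#include <errno.h>

/* Backend identifiers, numbered from 0 as in the patch. */
#define BTRFS_DEDUP_BACKEND_INMEMORY	0
#define BTRFS_DEDUP_BACKEND_ONDISK	1
/* Number of backends; not itself a valid backend (the suggested name). */
#define BTRFS_DEDUP_BACKEND_COUNT	2

/*
 * Hypothetical helper, for illustration only.
 * Returns 0 for a known backend and -EINVAL for anything else.
 */
static int check_backend(unsigned int backend)
{
	/* Valid identifiers are 0 .. COUNT - 1, so ">= COUNT" reads naturally. */
	if (backend >= BTRFS_DEDUP_BACKEND_COUNT)
		return -EINVAL;
	return 0;
}
```

With this naming, check_backend(BTRFS_DEDUP_BACKEND_ONDISK) returns 0, and
the comparison no longer appears to reject the last real backend.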

Secondly, you use "dup" as an abbreviation of "duplicate".
The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
It would be nice if we could be consistent and all use the same
abbreviation.

Thanks,
NeilBrown
Qu Wenruo March 10, 2016, 12:57 a.m. UTC | #2
NeilBrown wrote on 2016/03/10 08:27 +1100:
> On Thu, Feb 18 2016, Qu Wenruo wrote:
>
>> +
>> +/*
>> + * Dedup storage backend
>> + * On disk is persist storage but overhead is large
>> + * In memory is fast but will lose all its hash on umount
>> + */
>> +#define BTRFS_DEDUP_BACKEND_INMEMORY		0
>> +#define BTRFS_DEDUP_BACKEND_ONDISK		1
>> +#define BTRFS_DEDUP_BACKEND_LAST		2
>
> Hi,
>
> This may seem petty, but I'm here to complain about the names. :-)

Any complaint is better than no complaint. :)

>
> Firstly, "2" is *not* the LAST backend.  The LAST backend is clearly
> "ONDISK", which is "1".
> "2" is the number of backends, or the count of them.
> So
>> +#define BTRFS_DEDUP_BACKEND_LAST		1
>
> would be OK, as would
>
>> +#define BTRFS_DEDUP_BACKEND_COUNT		2
>
> but what you have is wrong.
>
> The place where you use this define:
>
> +	if (backend >= BTRFS_DEDUP_BACKEND_LAST)
> +		return -EINVAL;
>
> is correct, but it looks wrong.  It looks like it is saying that it is
> invalid to use the LAST backend!

Makes sense, I'll use BACKEND_COUNT as the name.

>
> Secondly, you use "dup" as an abbreviation of "duplicate".
> The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
> It would be nice if we could be consistent and all use the same
> abbreviation.

Yes, current kernel VFS level offline dedup uses the name "dedupe".
But on the other hand, ZFS uses the name "dedup" for their online dedup.

And personally speaking, I'd like some difference to distinguish inline 
dedup and offline dedup.
In that case, the extra "e" seems somewhat useful.
With "e", it's intended for offline use. Without "e", it's intended for 
online use.

Thanks,
Qu

>
> Thanks,
> NeilBrown
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba March 11, 2016, 11:43 a.m. UTC | #3
On Thu, Mar 10, 2016 at 08:57:12AM +0800, Qu Wenruo wrote:
> > The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
> > It would be nice if we could be consistent and all use the same
> > abbreviation.
> 
> Yes, current kernel VFS level offline dedup uses the name "dedupe".
> But on the other hand, ZFS uses the name "dedup" for their online dedup.
> 
> And personally speaking, I'd like some difference to distinguish inline 
> dedup and offline dedup.
> In that case, the extra "e" seems somewhat useful.
> With "e", it's intended for offline use. Without "e", it's intended for 
> online use.

Such difference is very subtle and I think we should stick to just one
spelling, which shall be 'dedupe'.
Qu Wenruo March 12, 2016, 8:16 a.m. UTC | #4
On 03/11/2016 07:43 PM, David Sterba wrote:
> On Thu, Mar 10, 2016 at 08:57:12AM +0800, Qu Wenruo wrote:
>>> The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
>>> It would be nice if we could be consistent and all use the same
>>> abbreviation.
>>
>> Yes, current kernel VFS level offline dedup uses the name "dedupe".
>> But on the other hand, ZFS uses the name "dedup" for their online dedup.
>>
>> And personally speaking, I'd like some difference to distinguish inline
>> dedup and offline dedup.
>> In that case, the extra "e" seems somewhat useful.
>> With "e", it's intended for offline use. Without "e", it's intended for
>> online use.
>
> Such difference is very subtle and I think we should stick to just one
> spelling, which shall be 'dedupe'.

OK, I'll change them to 'dedupe' in next bug fix version.

Thanks,
Qu
Qu Wenruo March 13, 2016, 5:16 a.m. UTC | #5
Qu Wenruo wrote on 2016/03/12 16:16 +0800:
>
>
> On 03/11/2016 07:43 PM, David Sterba wrote:
>> On Thu, Mar 10, 2016 at 08:57:12AM +0800, Qu Wenruo wrote:
>>>> The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
>>>> It would be nice if we could be consistent and all use the same
>>>> abbreviation.
>>>
>>> Yes, current kernel VFS level offline dedup uses the name "dedupe".
>>> But on the other hand, ZFS uses the name "dedup" for their online dedup.
>>>
>>> And personally speaking, I'd like some difference to distinguish inline
>>> dedup and offline dedup.
>>> In that case, the extra "e" seems somewhat useful.
>>> With "e", it's intended for offline use. Without "e", it's intended for
>>> online use.
>>
>> Such difference is very subtle and I think we should stick to just one
>> spelling, which shall be 'dedupe'.
>
> OK, I'll change them to 'dedupe' in next bug fix version.
>
> Thanks,
> Qu
>
>
BTW, I have always wondered why de-duplication can be shortened to
'dedupe'.

I don't see any 'e' in the whole word "DUPlication".
Or is it an abbreviation of "DUPlicatE" instead of "DUPlication"?

Thanks,
Qu



NeilBrown March 13, 2016, 11:33 a.m. UTC | #6
On Sun, Mar 13 2016, Qu Wenruo wrote:

> Qu Wenruo wrote on 2016/03/12 16:16 +0800:
>>
>>
>> On 03/11/2016 07:43 PM, David Sterba wrote:
>>> On Thu, Mar 10, 2016 at 08:57:12AM +0800, Qu Wenruo wrote:
>>>>> The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
>>>>> It would be nice if we could be consistent and all use the same
>>>>> abbreviation.
>>>>
>>>> Yes, current kernel VFS level offline dedup uses the name "dedupe".
>>>> But on the other hand, ZFS uses the name "dedup" for their online dedup.
>>>>
>>>> And personally speaking, I'd like some difference to distinguish inline
>>>> dedup and offline dedup.
>>>> In that case, the extra "e" seems somewhat useful.
>>>> With "e", it's intended for offline use. Without "e", it's intended for
>>>> online use.
>>>
>>> Such difference is very subtle and I think we should stick to just one
>>> spelling, which shall be 'dedupe'.
>>
>> OK, I'll change them to 'dedupe' in next bug fix version.
>>
>> Thanks,
>> Qu
>>
>>
> BTW, I am always interested in, why de-duplication can be shorted as 
> 'dedupe'.

The "u" in "duplicate" is pronounced as a long vowel sound, almost like
   d-you-plicate.

Normal pronunciation rules for English indicate that "dup" should be
pronounced with a short vowel sound, like "cup".  So "dup" sounds wrong.

To make a vowel long you can add an 'e' at the end of a word.
So:
   tub or cub  have a short "u"
   tube or cube  have a long "u".

by analogy, "dupe" has a long "u" and so sounds like the first syllable
of "duplicate".

NeilBrown


>
> I didn't see any 'e' in the whole word "DUPlication".
> Or it's an abbreviation of "DUPlicatE" instead of "DUPlication"?
>
> Thanks,
> Qu
>
>
>
Duncan March 13, 2016, 4:55 p.m. UTC | #7
NeilBrown posted on Sun, 13 Mar 2016 22:33:22 +1100 as excerpted:

> On Sun, Mar 13 2016, Qu Wenruo wrote:
> 
>> BTW, I am always interested in, why de-duplication can be shorted as
>> 'dedupe'.

>> I didn't see any 'e' in the whole word "DUPlication".
>> Or it's an abbreviation of "DUPlicatE" instead of "DUPlication"?
> 
> The "u" in "duplicate" is pronounced as a long vowel sound, almost like
> d-you-plicate.

> To make a vowel long you can add an 'e' at the end of a word.

> by analogy, "dupe" has a long "u" and so sounds like the first syllable
> of "duplicate".

As a native (USian but with some years growing up in the then recently 
independent former Crown colony of Kenya, influencing my personal 
preferences) English speaker, while what Neil says about short "u" vs. 
long "u" is correct, I agree with Qu that the "e" in dupe doesn't make so 
much sense, and would, other things being equal, vastly prefer dedup to 
dedupe, myself.

However, there's some value in consistency, and given the previous dedupe 
precedent in-kernel, sticking to that for consistency reasons makes sense.

But were this debate to have been about the original usage, I'd have 
definitely favored dedup all the way, as notwithstanding Neil's argument 
above, adding the "e" makes little sense to me either.  So only because 
it's already in use in kernel code, but if this /were/ the original 
kernel code...

So I definitely understand your confusion, Qu, and have the same personal 
preference even as a native English speaker. =:^)
Nicholas D Steeves March 15, 2016, 10:08 p.m. UTC | #8
On 13 March 2016 at 12:55, Duncan <1i5t5.duncan@cox.net> wrote:
> NeilBrown posted on Sun, 13 Mar 2016 22:33:22 +1100 as excerpted:
>
>> On Sun, Mar 13 2016, Qu Wenruo wrote:
>>
>>> BTW, I am always interested in, why de-duplication can be shorted as
>>> 'dedupe'.
>
>>> I didn't see any 'e' in the whole word "DUPlication".
>>> Or it's an abbreviation of "DUPlicatE" instead of "DUPlication"?
>>
>> The "u" in "duplicate" is pronounced as a long vowel sound, almost like
>> d-you-plicate.
>
>> To make a vowel long you can add an 'e' at the end of a word.
>
>> by analogy, "dupe" has a long "u" and so sounds like the first syllable
>> of "duplicate".
>
> As a native (USian but with some years growing up in the then recently
> independent former Crown colony of Kenya, influencing my personal
> preferences) English speaker, while what Neil says about short "u" vs.
> long "u" is correct, I agree with Qu that the "e" in dupe doesn't make so
> much sense, and would, other things being equal, vastly prefer dedup to
> dedupe, myself.
>
> However, there's some value in consistency, and given the previous dedupe
> precedent in-kernel, sticking to that for consistency reasons makes sense.
>
> But were this debate to have been about the original usage, I'd have
> definitely favored dedup all the way, as not withstanding Neil's argument
> above, adding the "e" makes little sense to me either.  So only because
> it's already in use in kernel code, but if this /were/ the original
> kernel code...
>
> So I definitely understand your confusion, Qu, and have the same personal
> preference even as a native English speaker. =:^)

I'm not sure to what degree the following is a relevant concern, and
I'm guessing it's not, other than for laughs, but to me "dedupe" reads
as "de-dupe" or "undupe".  While it functions as the inverse of the
verb "to dupe", I don't think one can "be unduped" or "be unfooled".
What is that old aphorism?  "Once duped twice shy"? ;-)

Honestly I'm surprised that a verb-form of "tuple" hasn't yet emerged,
because if it had we might be saying "detup" instead of "dedup".

Best regards,
Nicholas
Duncan March 15, 2016, 11:19 p.m. UTC | #9
Nicholas D Steeves posted on Tue, 15 Mar 2016 18:08:41 -0400 as excerpted:

> I'm not sure to what degree the following is a relevant concern, and I'm
> guessing it's not, other than for laughs, but to me "dedupe" reads as
> "de-dupe" or "undupe".  While it functions as the inverse of the verb
> "to dupe", I don't think one can "be unduped" or "be unfooled". What is
> that old aphorism?  "Once duped twice shy"? ;-)

That's the obvious association, yes, and the negative connotations of 
dupe are surely why I have such a personal negative reaction to dedupe.  
But precedent and current usage being what they are...
Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bc6a87e..094db5c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1866,6 +1866,11 @@  struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	/* Inband de-duplication related structures*/
+	unsigned int dedup_enabled:1;
+	struct btrfs_dedup_info *dedup_info;
+	struct mutex dedup_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedup.h b/fs/btrfs/dedup.h
new file mode 100644
index 0000000..8e1ff03
--- /dev/null
+++ b/fs/btrfs/dedup.h
@@ -0,0 +1,127 @@ 
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUP__
+#define __BTRFS_DEDUP__
+
+#include <linux/btrfs.h>
+#include <linux/wait.h>
+#include <crypto/hash.h>
+
+/*
+ * Dedup storage backend
+ * On disk is persist storage but overhead is large
+ * In memory is fast but will lose all its hash on umount
+ */
+#define BTRFS_DEDUP_BACKEND_INMEMORY		0
+#define BTRFS_DEDUP_BACKEND_ONDISK		1
+#define BTRFS_DEDUP_BACKEND_LAST		2
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUP_BLOCKSIZE_MAX	(8 * 1024 * 1024)
+#define BTRFS_DEDUP_BLOCKSIZE_MIN	(16 * 1024)
+#define BTRFS_DEDUP_BLOCKSIZE_DEFAULT	(128 * 1024)
+
+/* Hash algorithm; only SHA256 is supported so far */
+#define BTRFS_DEDUP_HASH_SHA256		0
+
+static int btrfs_dedup_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedup.c
+ *
+ * Different dedup backends should have their own hash structure
+ */
+struct btrfs_dedup_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedup hash */
+	u8 hash[];
+};
+
+struct btrfs_dedup_info {
+	/* dedup blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_type;
+
+	struct crypto_shash *dedup_driver;
+	struct mutex lock;
+
+	/* following members are only used in in-memory dedup mode */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+int btrfs_dedup_hash_size(u16 type);
+struct btrfs_dedup_hash *btrfs_dedup_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedup info
+ * Called at dedup enable time.
+ */
+int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+		       u64 blocksize, u64 limit_nr);
+
+/*
+ * Disable dedup and invalidate all its dedup data.
+ * Called at dedup disable time.
+ */
+int btrfs_dedup_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedup_bs) has valid data.
+ */
+int btrfs_dedup_calc_hash(struct btrfs_fs_info *fs_info,
+			  struct inode *inode, u64 start,
+			  struct btrfs_dedup_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedup_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ */
+int btrfs_dedup_search(struct btrfs_fs_info *fs_info,
+		       struct inode *inode, u64 file_pos,
+		       struct btrfs_dedup_hash *hash);
+
+/* Add a dedup hash into dedup info */
+int btrfs_dedup_add(struct btrfs_trans_handle *trans,
+		    struct btrfs_fs_info *fs_info,
+		    struct btrfs_dedup_hash *hash);
+
+/* Remove a dedup hash from dedup info */
+int btrfs_dedup_del(struct btrfs_trans_handle *trans,
+		    struct btrfs_fs_info *fs_info, u64 bytenr);
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index de68b8b..bbc17f2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2582,6 +2582,7 @@  int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
+	mutex_init(&fs_info->dedup_ioctl_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
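
The btrfs_dedup_hash structure in the patch ends in a flexible array member
whose length depends on the hash type. This patch only declares
btrfs_dedup_hash_size() and btrfs_dedup_alloc_hash(), so the following is a
userspace sketch of how such an allocation could work, not the
implementation from the series. All names in it (struct dedup_hash,
dedup_hash_size(), dedup_alloc_hash()) are stand-ins, calloc() takes the
place of a kernel allocator, and returning -1 for an unknown type is an
assumption.

```c
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for struct btrfs_dedup_hash from the patch. */
struct dedup_hash {
	uint64_t bytenr;
	uint32_t num_bytes;
	uint8_t hash[];		/* flexible array member, sized at allocation */
};

#define DEDUP_HASH_SHA256	0

/* Digest sizes in bytes, indexed by hash type; SHA-256 produces 32 bytes. */
static const int dedup_sizes[] = { 32 };

/* Assumed behaviour: digest size for a known type, -1 otherwise. */
static int dedup_hash_size(uint16_t type)
{
	if (type >= sizeof(dedup_sizes) / sizeof(dedup_sizes[0]))
		return -1;
	return dedup_sizes[type];
}

/* Allocate a zeroed hash record with room for the digest behind the header. */
static struct dedup_hash *dedup_alloc_hash(uint16_t type)
{
	int size = dedup_hash_size(type);

	if (size < 0)
		return NULL;
	return calloc(1, sizeof(struct dedup_hash) + (size_t)size);
}
```

A caller would obtain a record with dedup_alloc_hash(DEDUP_HASH_SHA256),
fill hash[0..31] with the digest, and release it with free().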