diff mbox series

[RFC] btrfs: rescue: Introduce new rescue mount option, rescue=skipdatacsum, to skip data csum verification

Message ID 20191029062149.217995-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [RFC] btrfs: rescue: Introduce new rescue mount option, rescue=skipdatacsum, to skip data csum verification | expand

Commit Message

Qu Wenruo Oct. 29, 2019, 6:21 a.m. UTC
When btrfs csum tree is corrupted, e.g. one csum tree block get csum
error for all its copies, we will have no chance to read out the data.

  # Hook before we submit a read bio, will fill up the csum buffer
  # for this read
  btrfs_submit_bio_hook()
  |- btrfs_lookup_bio_sums()
     |- __btrfs_lookup_bio_sums()
        |- item = btrfs_lookup_csum();
        |  Failed with -EIO
        |- memset(csum, 0, csum_size);	# Set the csum buffer to 0
        |- goto found;			# Go to next page lookup

  # Now that btrfs_io_bio->csum has 0 filled for failure range

  # Hook after the read bio is completed, will do the csum verification
  btrfs_readpage_end_io_hook()
  |- __readpage_endio_check()
     |- do csum calculation
     |  Since all csum value is set to 0, the csum verification will
     |  fail.
     |- btrfs_print_data_csum_error();

  # We got EIO in user space, without any usable data

The best solution to this problem would be introduce an extra companion
bitmap for csum verification, notifying the caller whether we failed to
read out the csum or the csum is really all 0.

But that solution can be complex and only works for above described
case.

Here we introduce a new rescue= mount option to completely skip data
csum check.
Since data csum check is completely skipped, for profiles with extra
mirrors/copies, it will return the first copy only, which is not
optimized, but should be good enough for rescue usage.

This option only affects data csum verification, doesn't affect data
csum calcuation, so new data write will still be protected by csum (if
it doesn't get affected by csum tree corruption).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
This patch needs rescue= mount options patchset.

Reason for RFC:
Need extra feedbacks. I guess users who has experienced data salvage
would like this. But I'm not sure if it's welcomed to add such mount
option.
---
 fs/btrfs/compression.c | 3 ++-
 fs/btrfs/ctree.h       | 1 +
 fs/btrfs/inode.c       | 3 +++
 fs/btrfs/super.c       | 6 ++++++
 4 files changed, 12 insertions(+), 1 deletion(-)

Comments

Roman Mamedov Oct. 29, 2019, 6:53 a.m. UTC | #1
On Tue, 29 Oct 2019 14:21:49 +0800
Qu Wenruo <wqu@suse.com> wrote:

> Here we introduce a new rescue= mount option to completely skip data
> csum check.
> Since data csum check is completely skipped, for profiles with extra
> mirrors/copies, it will return the first copy only, which is not
> optimized, but should be good enough for rescue usage.
> 
> This option only affects data csum verification, doesn't affect data
> csum calcuation, so new data write will still be protected by csum (if
> it doesn't get affected by csum tree corruption).

Maybe just make the "nodatasum" mount option skip verification of existing
checksums as well? Actually before seeing this patch I believed "nodatasum"
already does that.

Also for consistency note that datasum/nodatasum call it just "sum" in the
mount option name, not "csum".
Qu Wenruo Oct. 29, 2019, 7:09 a.m. UTC | #2
On 2019/10/29 下午2:53, Roman Mamedov wrote:
> On Tue, 29 Oct 2019 14:21:49 +0800
> Qu Wenruo <wqu@suse.com> wrote:
> 
>> Here we introduce a new rescue= mount option to completely skip data
>> csum check.
>> Since data csum check is completely skipped, for profiles with extra
>> mirrors/copies, it will return the first copy only, which is not
>> optimized, but should be good enough for rescue usage.
>>
>> This option only affects data csum verification, doesn't affect data
>> csum calcuation, so new data write will still be protected by csum (if
>> it doesn't get affected by csum tree corruption).
> 
> Maybe just make the "nodatasum" mount option skip verification of existing
> checksums as well? Actually before seeing this patch I believed "nodatasum"
> already does that.

Nodatasum only applies to new writes, doesn't affect existing csum
verification.

Normally nodatasum is just considered as a performance optimization, not
something we would like to use for rescue usage.

> 
> Also for consistency note that datasum/nodatasum call it just "sum" in the
> mount option name, not "csum".

That makes sense.

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b05b361e2062..08dfc2255490 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -85,7 +85,8 @@  static int check_compressed_csum(struct btrfs_inode *inode,
 	u8 csum[BTRFS_CSUM_SIZE];
 	u8 *cb_sum = cb->sums;
 
-	if (inode->flags & BTRFS_INODE_NODATASUM)
+	if (inode->flags & BTRFS_INODE_NODATASUM ||
+	    btrfs_test_opt(fs_info, SKIP_DATACSUM))
 		return 0;
 
 	shash->tfm = fs_info->csum_shash;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ff24b607bc91..bc64abb42a99 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1191,6 +1191,7 @@  static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_SKIP_BG		(1 << 29)
+#define BTRFS_MOUNT_SKIP_DATACSUM	(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0546401bc0a..bfd1ee35e2e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3275,6 +3275,9 @@  static int __readpage_endio_check(struct inode *inode,
 	u8 *csum_expected;
 	u8 csum[BTRFS_CSUM_SIZE];
 
+	if (btrfs_test_opt(fs_info, SKIP_DATACSUM))
+		return 0;
+
 	csum_expected = ((u8 *)io_bio->csum) + icsum * csum_size;
 
 	kaddr = kmap_atomic(page);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index ab61b0364960..e442ef36ac24 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -333,6 +333,7 @@  enum {
 	Opt_usebackuproot,
 	Opt_nologreplay,
 	Opt_rescue_skip_bg,
+	Opt_rescue_skip_datacsum,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -430,6 +431,7 @@  static const match_table_t rescue_tokens = {
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_rescue_skip_bg, "skipbg"},
+	{Opt_rescue_skip_datacsum, "skipdatacsum"},
 	{Opt_err, NULL},
 };
 
@@ -466,6 +468,10 @@  static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
 			btrfs_set_and_info(info, SKIP_BG,
 				"skip mount time block group searching");
 			break;
+		case Opt_rescue_skip_datacsum:
+			btrfs_set_and_info(info, SKIP_DATACSUM,
+				"skip data checksum verification");
+			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized rescue option '%s'", p);
 			ret = -EINVAL;