From patchwork Sat Sep 3 08:19:21 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
Date: Sat, 3 Sep 2022 16:19:21 +0800

The new ioctls are to address the disadvantages of the existing
btrfs_scrub_dev():

a) One thread per device
   This can cause multiple block groups to be marked read-only for scrub,
   reducing the available space temporarily.
   It also causes higher CPU/IO usage.
   Scrub should use the minimal amount of CPU and cause as little IO as
   possible.

b) Extra IO for RAID56
   For data stripes, we cause at least 2x IO if we run "btrfs scrub start ":
   1x from scrubbing the device of the data stripe, and another 1x from
   scrubbing the parity stripe.
   This duplicated IO should definitely be avoided.

c) Bad progress report for RAID56
   We can not report any repaired P/Q bytes at all.

Points a) and b) will be addressed by the new one-thread-per-fs
btrfs_scrub_fs ioctl.
While c) will be addressed by the new btrfs_scrub_fs_progress structure,
which has better comments and classification for all errors.

This patch is only a skeleton for the new family of ioctls; they will
return -EOPNOTSUPP for now.

Signed-off-by: Qu Wenruo
Reported-by: kernel test robot
---
 fs/btrfs/ioctl.c           |   6 ++
 include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fe0cc816b4eb..3df3bcdf06eb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_scrub_cancel(fs_info);
 	case BTRFS_IOC_SCRUB_PROGRESS:
 		return btrfs_ioctl_scrub_progress(fs_info, argp);
+	case BTRFS_IOC_SCRUB_FS:
+		return -EOPNOTSUPP;
+	case BTRFS_IOC_SCRUB_FS_CANCEL:
+		return -EOPNOTSUPP;
+	case BTRFS_IOC_SCRUB_FS_PROGRESS:
+		return -EOPNOTSUPP;
 	case BTRFS_IOC_BALANCE_V2:
 		return btrfs_ioctl_balance(file, argp);
 	case BTRFS_IOC_BALANCE_CTL:
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 7ada84e4a3ed..795ed33843ce 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -191,6 +191,174 @@ struct btrfs_ioctl_scrub_args {
 	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
 };
 
+struct btrfs_scrub_fs_progress {
+	/*
+	 * Fatal errors, including -ENOMEM, or csum/extent tree search errors.
+	 *
+	 * Normally after hitting such fatal errors, we error out, thus later
+	 * accounting will no longer be reliable.
+	 */
+	__u16 nr_fatal_errors;
+
+	/*
+	 * All super errors, from invalid members and IO errors, go into
+	 * nr_super_errors.
+	 */
+	__u16 nr_super_errors;
+
+	/* Super block accounting. */
+	__u16 nr_super_scrubbed;
+	__u16 nr_super_repaired;
+
+	/*
+	 * Data accounting in bytes.
+	 *
+	 * We only care about how many bytes we scrubbed, thus there is no
+	 * accounting for the number of extents.
+	 *
+	 * This accounting includes the extra mirrors.
+	 * E.g. for RAID1, one 16KiB extent will cause 32KiB in @data_scrubbed.
+	 */
+	__u64 data_scrubbed;
+
+	/* How many bytes can be recovered. */
+	__u64 data_recoverable;
+
+	/*
+	 * How many bytes are uncertain; this can only happen for NODATASUM
+	 * cases.
+	 * Including NODATASUM with no extra mirror/parity to verify, or
+	 * with extra mirrors which mismatch with each other.
+	 */
+	__u64 data_nocsum_uncertain;
+
+	/*
+	 * For data error bytes, these are definite errors, including:
+	 *
+	 * - IO failure, including missing dev.
+	 * - Data csum mismatch
+	 *   (csum tree search failures go into the fatal case above).
+	 */
+	__u64 data_io_fail;
+	__u64 data_csum_mismatch;
+
+	/*
+	 * All the unmentioned cases, including data matching its csum (which
+	 * of course implies the IO succeeded) and data that has no csum but
+	 * matches all other copies/parities, are the expected cases, no need
+	 * to record them.
+	 */
+
+	/*
+	 * Metadata accounting in bytes, pretty much the same as data.
+	 *
+	 * And since metadata has a mandatory csum, there is no uncertain case.
+	 */
+	__u64 meta_scrubbed;
+	__u64 meta_recoverable;
+
+	/*
+	 * For meta, the checks are mostly progressive:
+	 *
+	 * - Unable to read
+	 *   @meta_io_fail
+	 *
+	 * - Unable to pass basic sanity checks (e.g. bytenr check)
+	 *   @meta_invalid
+	 *
+	 * - Pass basic sanity checks, but bad csum
+	 *   @meta_bad_csum
+	 *
+	 * - Pass basic checks and csum, but bad transid
+	 *   @meta_bad_transid
+	 *
+	 * - Pass all checks
+	 *   The expected case, no special accounting needed.
+	 */
+	__u64 meta_io_fail;
+	__u64 meta_invalid;
+	__u64 meta_bad_csum;
+	__u64 meta_bad_transid;
+
+	/*
+	 * Parity accounting.
+	 *
+	 * NOTE: unused data sectors (which still contribute to P/Q
+	 * calculation, like the following case) don't contribute to any
+	 * accounting.
+	 *
+	 * Data 1: |<--- Unused ---->| <<<
+	 * Data 2: |<- Data extent ->|
+	 * Parity: |<--- Parity ---->|
+	 */
+	__u64 parity_scrubbed;
+	__u64 parity_recoverable;
+
+	/*
+	 * This happens when there is not enough info to determine if the
+	 * parity is correct, mostly when the vertical stripes are
+	 * *all* NODATASUM sectors.
+	 *
+	 * If there is any sector with a checksum in the vertical stripe,
+	 * the parity itself will no longer be uncertain.
+	 */
+	__u64 parity_uncertain;
+
+	/*
+	 * For parity, the checks are progressive too:
+	 *
+	 * - Unable to read
+	 *   @parity_io_fail
+	 *
+	 * - Mismatch, while a vertical data stripe has a csum and
+	 *   the data stripe csum matches
+	 *   @parity_mismatch
+	 *   We want to repair the parity then.
+	 *
+	 * - Mismatch, while a vertical data stripe has a csum and the data
+	 *   csum mismatches, and the rebuilt data passes the csum.
+	 *   This will go into @data_recoverable or @data_csum_mismatch instead.
+	 *
+	 * - Mismatch but no vertical data stripe has a csum
+	 *   @parity_uncertain
+	 */
+	__u64 parity_io_fail;
+	__u64 parity_mismatch;
+
+	/* Padding to 256 bytes, and for later expansion. */
+	__u64 __unused[15];
+};
+static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
+
+/*
+ * Readonly scrub_fs will not try any repair (thus the *_repaired members
+ * in scrub_fs_progress should always be 0).
+ */
+#define BTRFS_SCRUB_FS_FLAG_READONLY	(1ULL << 0)
+
+/*
+ * All supported flags.
+ *
+ * From the very beginning, the scrub_fs ioctl rejects any unsupported
+ * flags, making later expansion much simpler.
+ */
+#define BTRFS_SCRUB_FS_FLAG_SUPP	(BTRFS_SCRUB_FS_FLAG_READONLY)
+
+struct btrfs_ioctl_scrub_fs_args {
+	/* Input, logical bytenr to start the scrub */
+	__u64 start;
+
+	/* Input, the logical bytenr end (inclusive) */
+	__u64 end;
+
+	__u64 flags;
+	__u64 reserved[8];
+	struct btrfs_scrub_fs_progress progress; /* out */
+
+	/* pad to 1K */
+	__u8 unused[1024 - 24 - 64 - sizeof(struct btrfs_scrub_fs_progress)];
+};
+
 #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
 #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
 struct btrfs_ioctl_dev_replace_start_params {
@@ -1137,5 +1305,10 @@ enum btrfs_err_code {
 					struct btrfs_ioctl_encoded_io_args)
 #define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
 					struct btrfs_ioctl_encoded_io_args)
+#define BTRFS_IOC_SCRUB_FS _IOWR(BTRFS_IOCTL_MAGIC, 65, \
+				 struct btrfs_ioctl_scrub_fs_args)
+#define BTRFS_IOC_SCRUB_FS_CANCEL _IO(BTRFS_IOCTL_MAGIC, 66)
+#define BTRFS_IOC_SCRUB_FS_PROGRESS _IOWR(BTRFS_IOCTL_MAGIC, 67, \
+					  struct btrfs_ioctl_scrub_fs_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
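For context, a minimal sketch of how user space could drive the new ioctl
once it is implemented (illustrative only, not part of the patch; it assumes
a uapi header with the additions above installed, and trims error handling):

	/*
	 * Hypothetical user-space example: run a read-only scrub_fs over the
	 * whole filesystem and print two of the new byte counters.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	int scrub_fs_readonly(const char *mnt)
	{
		struct btrfs_ioctl_scrub_fs_args args;
		int fd = open(mnt, O_RDONLY);
		int ret;

		if (fd < 0)
			return -1;

		memset(&args, 0, sizeof(args));
		args.start = 0;
		args.end = (__u64)-1;	/* inclusive end, so the whole fs */
		args.flags = BTRFS_SCRUB_FS_FLAG_READONLY;

		ret = ioctl(fd, BTRFS_IOC_SCRUB_FS, &args);
		if (ret == 0)
			printf("data scrubbed: %llu bytes, meta scrubbed: %llu bytes\n",
			       (unsigned long long)args.progress.data_scrubbed,
			       (unsigned long long)args.progress.meta_scrubbed);
		close(fd);
		return ret;
	}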
From patchwork Sat Sep 3 08:19:22 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 2/9] btrfs: scrub: introduce place holder for btrfs_scrub_fs()
Date: Sat, 3 Sep 2022 16:19:22 +0800

The new function btrfs_scrub_fs() will do the exclusion checking against
regular scrub and dev-replace, then return -EOPNOTSUPP as a place holder.

To let regular scrub/dev-replace be exclusive against btrfs_scrub_fs() as
well, also introduce the btrfs_fs_info::scrub_fs_running member.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h |   4 ++
 fs/btrfs/ioctl.c |  41 +++++++++++++++++-
 fs/btrfs/scrub.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3dc30f5e6fd0..0b360d9ec2e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -955,6 +955,7 @@ struct btrfs_fs_info {
 	/* private scrub information */
 	struct mutex scrub_lock;
 	atomic_t scrubs_running;
+	atomic_t scrub_fs_running;
 	atomic_t scrub_pause_req;
 	atomic_t scrubs_paused;
 	atomic_t scrub_cancel_req;
@@ -4063,6 +4064,9 @@ int btrfs_should_ignore_reloc_root(struct btrfs_root *root);
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		    u64 end, struct btrfs_scrub_progress *progress,
 		    int readonly, int is_dev_replace);
+int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+		   struct btrfs_scrub_fs_progress *progress,
+		   bool readonly);
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info);
 void btrfs_scrub_continue(struct btrfs_fs_info *fs_info);
 int btrfs_scrub_cancel(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3df3bcdf06eb..8219e2554734 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4112,6 +4112,45 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static long btrfs_ioctl_scrub_fs(struct file *file, void __user *arg)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(file_inode(file)->i_sb);
+	struct btrfs_ioctl_scrub_fs_args *sfsa;
+	bool readonly = false;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	sfsa = memdup_user(arg, sizeof(*sfsa));
+	if (IS_ERR(sfsa))
+		return PTR_ERR(sfsa);
+
+	if (sfsa->flags & ~BTRFS_SCRUB_FS_FLAG_SUPP) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if (sfsa->flags & BTRFS_SCRUB_FS_FLAG_READONLY)
+		readonly = true;
+
+	if (!readonly) {
+		ret = mnt_want_write_file(file);
+		if (ret)
+			goto out;
+	}
+
+	ret = btrfs_scrub_fs(fs_info, sfsa->start, sfsa->end, &sfsa->progress,
+			     readonly);
+	if (copy_to_user(arg, sfsa, sizeof(*sfsa)))
+		ret = -EFAULT;
+
+	if (!readonly)
+		mnt_drop_write_file(file);
+out:
+	kfree(sfsa);
+	return ret;
+}
+
 static long btrfs_ioctl_get_dev_stats(struct btrfs_fs_info *fs_info,
 				      void __user *arg)
 {
@@ -5509,7 +5548,7 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SCRUB_PROGRESS:
 		return btrfs_ioctl_scrub_progress(fs_info, argp);
 	case BTRFS_IOC_SCRUB_FS:
-		return -EOPNOTSUPP;
+		return btrfs_ioctl_scrub_fs(file, argp);
 	case BTRFS_IOC_SCRUB_FS_CANCEL:
 		return -EOPNOTSUPP;
 	case BTRFS_IOC_SCRUB_FS_PROGRESS:
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 755273b77a3f..09a1ab6ac54e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4295,6 +4295,15 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	}
 
 	mutex_lock(&fs_info->scrub_lock);
+
+	/* Conflict with the scrub_fs ioctls. */
+	if (atomic_read(&fs_info->scrub_fs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
 	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
 	    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
 		mutex_unlock(&fs_info->scrub_lock);
@@ -4416,6 +4425,102 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+/*
+ * Unlike btrfs_scrub_dev(), this function works completely at the logical
+ * bytenr level, and has the following advantages:
+ *
+ * - Better error reporting
+ *   The new btrfs_scrub_fs_progress has better classified errors, and more
+ *   members to include parity errors.
+ *
+ * - Always scrub one block group at a time
+ *   btrfs_scrub_dev() works by starting one scrub for each device.
+ *   This can cause unsynchronized progress, and mark multiple block groups
+ *   RO, reducing the available space unnecessarily.
+ *
+ * - Less IO for RAID56
+ *   Instead of treating RAID56 data and P/Q stripes differently, here we
+ *   only scrub a full stripe at most once, instead of the 2x read for data
+ *   stripes (one from scrubbing the data stripe itself, the other from
+ *   scrubbing the P/Q stripe).
+ *
+ * - No bio reshaping and streamlined code
+ *   Always submit bios for all involved mirrors (or data/P/Q stripes for
+ *   RAID56), wait for the IO, then run the checks.
+ *
+ *   Thus there are at most nr_mirrors (nr_stripes for RAID56) bios
+ *   on-the-fly, and for each device, there is always at most one bio for
+ *   scrub.
+ *
+ *   This greatly simplifies all the involved code.
+ *
+ * - No need to support dev-replace
+ *   Thus we can have simpler code.
+ *
+ * Unfortunately this ioctl has the following disadvantages so far:
+ *
+ * - No resume after unmount
+ *   We may need an extra on-disk format to save the progress.
+ *   Thus we may need a new RO compat flag for the resume ability.
+ *
+ * - Conflicts with dev-replace/scrub
+ *
+ * - Needs kernel support.
+ *
+ * - Not fully finished
+ */
+int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+		   struct btrfs_scrub_fs_progress *progress,
+		   bool readonly)
+{
+	int ret;
+
+	if (btrfs_fs_closing(fs_info))
+		return -EAGAIN;
+
+	if (btrfs_is_zoned(fs_info))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Metadata and data units should be able to be contained inside one
+	 * stripe.
+	 */
+	ASSERT(fs_info->nodesize <= BTRFS_STRIPE_LEN);
+	ASSERT(fs_info->sectorsize <= BTRFS_STRIPE_LEN);
+
+	mutex_lock(&fs_info->scrub_lock);
+	/* This function conflicts with scrub/dev-replace. */
+	if (atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -EINPROGRESS;
+	}
+
+	/* And there can only be one running btrfs_scrub_fs(). */
+	if (atomic_read(&fs_info->scrub_fs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -EINPROGRESS;
+	}
+
+	__scrub_blocked_if_needed(fs_info);
+	atomic_inc(&fs_info->scrub_fs_running);
+
+	/* This is to allow the existing scrub pause to be reused. */
+	atomic_inc(&fs_info->scrubs_running);
+	btrfs_info(fs_info, "scrub_fs: started");
+	mutex_unlock(&fs_info->scrub_lock);
+
+	/* Place holder for the real workload. */
+	ret = -EOPNOTSUPP;
+
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	atomic_dec(&fs_info->scrub_fs_running);
+	btrfs_info(fs_info, "scrub_fs: finished with status: %d", ret);
+	mutex_unlock(&fs_info->scrub_lock);
+	wake_up(&fs_info->scrub_pause_wait);
+
+	return ret;
+}
+
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info)
 {
 	mutex_lock(&fs_info->scrub_lock);

From patchwork Sat Sep 3 08:19:23 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 3/9] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs()
Date: Sat, 3 Sep 2022 16:19:23 +0800

This new helper is mostly the same as scrub_enumerate_chunks(), but with
some small changes:

- No need for dev-replace branches

- No need to search the dev-extent tree
  We can directly iterate the block groups.

The new helper currently only iterates all the bgs, doing nothing for the
iterated bgs yet.
Also one smaller helper is introduced:

- scrub_fs_alloc_ctx()
  To allocate a scrub_fs_ctx, which has way fewer members (for now and
  for the future) compared to scrub_ctx.

  The scrub_fs_ctx will have a very well defined lifespan (only inside
  btrfs_scrub_fs(), and there can only be one scrub_fs_ctx, thus it does
  not need to be reference counted).

Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 164 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 162 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 09a1ab6ac54e..cf4dc384427e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -198,6 +198,24 @@ struct scrub_ctx {
 	refcount_t		refs;
 };
 
+/* This structure should only have a lifespan inside btrfs_scrub_fs(). */
+struct scrub_fs_ctx {
+	struct btrfs_fs_info	*fs_info;
+
+	/* Current block group we're scrubbing. */
+	struct btrfs_block_group *cur_bg;
+
+	/* Current logical bytenr being scrubbed. */
+	u64			cur_logical;
+
+	atomic_t		sectors_under_io;
+
+	bool			readonly;
+
+	/* There will be one and only one thread touching @stat. */
+	struct btrfs_scrub_fs_progress stat;
+};
+
 struct scrub_warning {
 	struct btrfs_path	*path;
 	u64			extent_item_size;
@@ -4425,6 +4443,126 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
+					       bool readonly)
+{
+	struct scrub_fs_ctx *sfctx;
+	int ret;
+
+	sfctx = kzalloc(sizeof(*sfctx), GFP_KERNEL);
+	if (!sfctx) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	sfctx->fs_info = fs_info;
+	sfctx->readonly = readonly;
+	atomic_set(&sfctx->sectors_under_io, 0);
+	return sfctx;
+error:
+	kfree(sfctx);
+	return ERR_PTR(ret);
+}
+
+static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	u64 cur = start;
+	int ret;
+
+	while (cur < end) {
+		struct btrfs_block_group *bg;
+		bool ro_set = false;
+
+		bg = btrfs_lookup_first_block_group(fs_info, cur);
+		if (!bg)
+			break;
+		if (bg->start + bg->length >= end) {
+			btrfs_put_block_group(bg);
+			break;
+		}
+		spin_lock(&bg->lock);
+
+		/* Already deleted bg, skip to the next one. */
+		if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
+			spin_unlock(&bg->lock);
+			cur = bg->start + bg->length;
+			btrfs_put_block_group(bg);
+			continue;
+		}
+		btrfs_freeze_block_group(bg);
+		spin_unlock(&bg->lock);
+
+		/*
+		 * We need to call btrfs_inc_block_group_ro() with scrub paused,
+		 * to avoid a deadlock caused by:
+		 * btrfs_inc_block_group_ro()
+		 * -> btrfs_wait_for_commit()
+		 * -> btrfs_commit_transaction()
+		 * -> btrfs_scrub_pause()
+		 */
+		scrub_pause_on(fs_info);
+
+		/*
+		 * Check the comments before btrfs_inc_block_group_ro() inside
+		 * scrub_enumerate_chunks() for the reasons.
+		 */
+		ret = btrfs_inc_block_group_ro(bg, false);
+		if (ret == 0)
+			ro_set = true;
+		if (ret == -ETXTBSY) {
+			btrfs_warn(fs_info,
+		   "skipping scrub of block group %llu due to active swapfile",
+				   bg->start);
+			scrub_pause_off(fs_info);
+			ret = 0;
+			goto next;
+		}
+		if (ret < 0 && ret != -ENOSPC) {
+			btrfs_warn(fs_info,
+				   "failed setting block group ro: %d", ret);
+			scrub_pause_off(fs_info);
+			goto next;
+		}
+
+		scrub_pause_off(fs_info);
+
+		/* Place holder for the real chunk scrubbing code. */
+		ret = 0;
+
+		if (ro_set)
+			btrfs_dec_block_group_ro(bg);
+
+		/*
+		 * We might have prevented the cleaner kthread from deleting
+		 * this block group if it was already unused because we raced
+		 * and set it to RO mode first.
+		 * So add it back to the unused list, otherwise it might not
+		 * ever be deleted unless a manual balance is triggered or it
+		 * becomes used and unused again.
+		 */
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags) &&
+		    !bg->ro && bg->reserved == 0 && bg->used == 0) {
+			spin_unlock(&bg->lock);
+			if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+				btrfs_discard_queue_work(&fs_info->discard_ctl,
+							 bg);
+			else
+				btrfs_mark_bg_unused(bg);
+		} else {
+			spin_unlock(&bg->lock);
+		}
+next:
+		cur = bg->start + bg->length;
+
+		btrfs_unfreeze_block_group(bg);
+		btrfs_put_block_group(bg);
+		if (ret)
+			break;
+	}
+	return ret;
+}
+
 /*
  * Unlike btrfs_scrub_dev(), this function works completely at the logical
  * bytenr level, and has the following advantages:
@@ -4472,6 +4610,8 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 		   struct btrfs_scrub_fs_progress *progress,
 		   bool readonly)
 {
+	struct scrub_fs_ctx *sfctx;
+	unsigned int nofs_flag;
 	int ret;
 
 	if (btrfs_fs_closing(fs_info))
@@ -4508,8 +4648,25 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 	btrfs_info(fs_info, "scrub_fs: started");
 	mutex_unlock(&fs_info->scrub_lock);
 
-	/* Place holder for the real workload. */
-	ret = -EOPNOTSUPP;
+	sfctx = scrub_fs_alloc_ctx(fs_info, readonly);
+	if (IS_ERR(sfctx)) {
+		ret = PTR_ERR(sfctx);
+		sfctx = NULL;
+		goto out;
+	}
+
+	if (progress)
+		memcpy(&sfctx->stat, progress, sizeof(*progress));
+
+	/*
+	 * Check the comments before memalloc_nofs_save() in btrfs_scrub_dev()
+	 * for the reasons.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = scrub_fs_iterate_bgs(sfctx, start, end);
+	memalloc_nofs_restore(nofs_flag);
+out:
 	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&fs_info->scrubs_running);
@@ -4518,6 +4675,9 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 	mutex_unlock(&fs_info->scrub_lock);
 	wake_up(&fs_info->scrub_pause_wait);
 
+	if (progress && sfctx)
+		memcpy(progress, &sfctx->stat, sizeof(*progress));
+	kfree(sfctx);
 	return ret;
 }
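In condensed form, the single-context lifecycle this patch establishes looks
like the sketch below (simplified; locking, pausing and error handling are
elided, and the function name is illustrative only):

	/* Simplified sketch of the scrub_fs_ctx lifecycle added above. */
	static int scrub_fs_outline(struct btrfs_fs_info *fs_info, u64 start,
				    u64 end,
				    struct btrfs_scrub_fs_progress *progress,
				    bool readonly)
	{
		struct scrub_fs_ctx *sfctx;
		int ret;

		/* Exactly one ctx per run, so no refcounting is needed. */
		sfctx = scrub_fs_alloc_ctx(fs_info, readonly);
		if (IS_ERR(sfctx))
			return PTR_ERR(sfctx);

		/* Resume from the caller-provided progress, if any. */
		if (progress)
			memcpy(&sfctx->stat, progress, sizeof(*progress));

		/* One block group at a time, in logical bytenr order. */
		ret = scrub_fs_iterate_bgs(sfctx, start, end);

		/* Hand the accumulated stats back before freeing the ctx. */
		if (progress)
			memcpy(progress, &sfctx->stat, sizeof(*progress));
		kfree(sfctx);
		return ret;
	}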
From patchwork Sat Sep 3 08:19:24 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 4/9] btrfs: scrub: introduce place holder helper scrub_fs_block_group()
Date: Sat, 3 Sep 2022 16:19:24 +0800

The main place holder helper scrub_fs_block_group() will:

- Initialize various needed members inside scrub_fs_ctx
  This includes:
  * Calculate the nr_copies for non-RAID56 profiles, or grab nr_stripes
    for RAID56 profiles.
  * Allocate memory for the sectors/pages arrays, and csum_buf if it's a
    data bg.
  * Initialize all sectors to the type UNUSED.

  All the above memory will stay for each stripe we run, thus we only
  need to allocate it once per bg.

- Iterate the stripes containing any used sector
  This is the code to be implemented.

- Clean up the above memory before we finish the block group.

The real work of scrubbing a stripe is not yet implemented.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 233 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cf4dc384427e..46885f966bb3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -198,6 +198,45 @@ struct scrub_ctx {
 	refcount_t		refs;
 };
 
+#define SCRUB_FS_SECTOR_FLAG_UNUSED	(1 << 0)
+#define SCRUB_FS_SECTOR_FLAG_DATA	(1 << 1)
+#define SCRUB_FS_SECTOR_FLAG_META	(1 << 2)
+#define SCRUB_FS_SECTOR_FLAG_PARITY	(1 << 3)
+
+/*
+ * Represent a sector.
+ *
+ * To access the content of a sector, the caller should have the index inside
+ * the scrub_fs_ctx->sectors[] array, and use that index to calculate the page
+ * and page offset inside the scrub_fs_ctx->pages[] array.
+ *
+ * To get the logical/physical bytenr of a sector, the caller should use
+ * scrub_fs_ctx->bioc and the sector index to calculate the logical/physical
+ * bytenr.
+ */
+struct scrub_fs_sector {
+	unsigned int flags;
+	union {
+		/*
+		 * For SCRUB_FS_SECTOR_FLAG_DATA, either it points to some byte
+		 * inside scrub_fs_ctx->csum_buf, or it's NULL for the
+		 * NODATACSUM case.
+		 */
+		u8 *csum;
+
+		/*
+		 * For SCRUB_FS_SECTOR_FLAG_META, this records the generation
+		 * and the logical bytenr of the tree block.
+		 * (So we can grab the first sector to calculate their inline
+		 * csum.)
+		 */
+		struct {
+			u64 eb_logical;
+			u64 eb_generation;
+		};
+	};
+};
+
 /* This structure should only have a lifespan inside btrfs_scrub_fs(). */
 struct scrub_fs_ctx {
 	struct btrfs_fs_info	*fs_info;
@@ -208,12 +247,64 @@ struct scrub_fs_ctx {
 	/* Current logical bytenr being scrubbed. */
 	u64			cur_logical;
 
 	atomic_t		sectors_under_io;
 
 	bool			readonly;
 
 	/* There will be one and only one thread touching @stat. */
 	struct btrfs_scrub_fs_progress stat;
+
+	/*
+	 * How many sectors we read per stripe.
+	 *
+	 * For now, it's fixed to BTRFS_STRIPE_LEN / sectorsize.
+	 *
+	 * This can be enlarged to full stripe size / sectorsize
+	 * for the later RAID0/10/5/6 code.
+	 */
+	int			sectors_per_stripe;
+
+	/*
+	 * For non-RAID56 profiles, we only care how many copies the block
+	 * group has.
+	 * For RAID56 profiles, we care how many stripes the block group
+	 * has (including data and parities).
+	 */
+	union {
+		int		nr_stripes;
+		int		nr_copies;
+	};
+
+	/*
+	 * The total number of sectors we scrub in one run (including
+	 * the extra mirrors/parities).
+	 *
+	 * For non-RAID56 profiles, it would be:
+	 * nr_copies * (BTRFS_STRIPE_LEN / sectorsize).
+	 *
+	 * For RAID56 profiles, it would be:
+	 * nr_stripes * (BTRFS_STRIPE_LEN / sectorsize).
+	 */
+	int			total_sectors;
+
+	/* Page array for the above total_sectors. */
+	struct page		**pages;
+
+	/*
+	 * Sector array for the above total_sectors. The page content will be
+	 * inside the above pages array.
+	 *
+	 * Both arrays should be initialized when starting to scrub a block
+	 * group.
+	 */
+	struct scrub_fs_sector	*sectors;
+
+	/*
+	 * Csum buffer allocated for the stripe.
+	 *
+	 * All sectors in different mirrors for the same logical bytenr
+	 * would point to the same location inside the buffer.
+	 */
+	u8			*csum_buf;
 };
 
 struct scrub_warning {
@@ -4464,6 +4555,147 @@ static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
 	return ERR_PTR(ret);
 }
 
+/*
+ * Clean up the memory allocation, mostly after finishing a bg, or for the
+ * error path.
+ */
+static void scrub_fs_cleanup_for_bg(struct scrub_fs_ctx *sfctx)
+{
+	int i;
+	const int nr_pages = sfctx->nr_copies * (BTRFS_STRIPE_LEN >> PAGE_SHIFT);
+
+	if (sfctx->pages) {
+		for (i = 0; i < nr_pages; i++) {
+			if (sfctx->pages[i]) {
+				__free_page(sfctx->pages[i]);
+				sfctx->pages[i] = NULL;
+			}
+		}
+	}
+	kfree(sfctx->pages);
+	sfctx->pages = NULL;
+
+	kfree(sfctx->sectors);
+	sfctx->sectors = NULL;
+
+	kfree(sfctx->csum_buf);
+	sfctx->csum_buf = NULL;
+
+	/* NOTE: the block group will only be put inside scrub_fs_iterate_bgs(). */
+	sfctx->cur_bg = NULL;
+}
+
+/* Do the block group specific initialization. */
+static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
+				struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	bool is_raid56 = !!(bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK);
+	int ret = 0;
+	int nr_pages;
+	int i;
+
+	/*
+	 * One stripe should be page aligned, aka, PAGE_SIZE should not be
+	 * larger than 64K.
+	 */
+	ASSERT(IS_ALIGNED(BTRFS_STRIPE_LEN, PAGE_SIZE));
+
+	/* The last run should have cleaned up all the memory. */
+	ASSERT(!sfctx->cur_bg);
+	ASSERT(!sfctx->pages);
+	ASSERT(!sfctx->sectors);
+	ASSERT(!sfctx->csum_buf);
+
+	read_lock(&map_tree->lock);
+	em = lookup_extent_mapping(map_tree, bg->start, bg->length);
+	read_unlock(&map_tree->lock);
+
+	/*
+	 * Might have been an unused block group deleted by the cleaner
+	 * kthread or relocation.
+	 */
+	if (!em) {
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags))
+			ret = -EINVAL;
+		spin_unlock(&bg->lock);
+		return ret;
+	}
+	/*
+	 * Since we're ensured to be executed without any other
+	 * dev-replace/scrub running, num_stripes should be the total
+	 * number of stripes, without the replace target device.
+	 */
+	if (is_raid56)
+		sfctx->nr_stripes = em->map_lookup->num_stripes;
+	free_extent_map(em);
+
+	if (!is_raid56)
+		sfctx->nr_copies = btrfs_num_copies(fs_info, bg->start,
+						    fs_info->sectorsize);
+	sfctx->sectors_per_stripe = BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits;
+	sfctx->total_sectors = sfctx->sectors_per_stripe * sfctx->nr_copies;
+
+	nr_pages = (BTRFS_STRIPE_LEN >> PAGE_SHIFT) * sfctx->nr_copies;
+
+	sfctx->pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!sfctx->pages)
+		goto enomem;
+
+	for (i = 0; i < nr_pages; i++) {
+		sfctx->pages[i] = alloc_page(GFP_KERNEL);
+		if (!sfctx->pages[i])
+			goto enomem;
+	}
+
+	sfctx->sectors = kcalloc(sfctx->total_sectors,
+				 sizeof(struct scrub_fs_sector), GFP_KERNEL);
+	if (!sfctx->sectors)
+		goto enomem;
+
+	for (i = 0; i < sfctx->total_sectors; i++)
+		sfctx->sectors[i].flags = SCRUB_FS_SECTOR_FLAG_UNUSED;
+
+	if (bg->flags & BTRFS_BLOCK_GROUP_DATA) {
+		sfctx->csum_buf = kzalloc(fs_info->csum_size *
+					  sfctx->sectors_per_stripe, GFP_KERNEL);
+		if (!sfctx->csum_buf)
+			goto enomem;
+	}
+	sfctx->cur_bg = bg;
+	sfctx->cur_logical = bg->start;
+	return 0;
+
+enomem:
+	sfctx->stat.nr_fatal_errors++;
+	scrub_fs_cleanup_for_bg(sfctx);
+	return -ENOMEM;
+}
+
+static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
+				struct btrfs_block_group *bg)
+{
+	int ret;
+
+	/* Not yet supported, just skip RAID56 bgs for now. */
+	if (bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return 0;
+
+	ret = scrub_fs_init_for_bg(sfctx, bg);
+	if (ret < 0)
+		return ret;
+
+	/* Place holder for the loop iterating the sectors. */
+	ret = 0;
+
+	scrub_fs_cleanup_for_bg(sfctx);
+	return ret;
+}
+
 static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
 {
 	struct btrfs_fs_info *fs_info = sfctx->fs_info;
@@ -4527,8 +4759,7 @@ static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
 
 		scrub_pause_off(fs_info);
 
-		/* Place holder for the real chunk scrubbing code. */
-		ret = 0;
+		ret = scrub_fs_block_group(sfctx, bg);
 
 		if (ro_set)
 			btrfs_dec_block_group_ro(bg);
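As a concrete worked example of the per-bg sizes scrub_fs_init_for_bg()
computes (the numbers assume 4KiB sectorsize, 4KiB PAGE_SIZE, a RAID1 data
block group and 4-byte crc32c csums; the example_* names are illustrative
only):

	static const int example_stripe_len = 64 * 1024;	/* BTRFS_STRIPE_LEN */
	static const int example_sectorsize = 4096;
	static const int example_nr_copies  = 2;		/* RAID1 */
	static const int example_csum_size  = 4;		/* crc32c */

	/* sectors_per_stripe = 64K / 4K = 16 */
	/* total_sectors      = 16 * 2   = 32 sector slots across both mirrors */
	/* pages              = (64K >> PAGE_SHIFT) * 2 = 32 pages */
	/* csum_buf           = 16 * 4   = 64 bytes, shared by both mirrors */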

From patchwork Sat Sep 3 08:19:25 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation
Date: Sat, 3 Sep 2022 16:19:25 +0800

This patch will introduce two new major helpers:

- scrub_fs_locate_and_fill_stripe()
  This will find a stripe which contains any extent.
  And then fill the corresponding sectors inside the sectors[] array with
  the extent type.

  If it's a metadata extent, it will also fill the eb_generation member.

- scrub_fs_fill_stripe_csum()
  This is for data block groups only.
  This helper will find all the csums for the stripe, and copy the csum
  into the corresponding position inside scrub_fs_ctx->csum_buf.
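The csum sharing scheme boils down to one slot per logical sector, pointed
to by every mirror of that sector. A minimal sketch of the address
computation (the function name is illustrative only):

	/*
	 * Sketch: resolve the csum_buf slot for a data sector. All mirrors
	 * of the same logical sector share this one slot.
	 */
	static u8 *example_csum_slot(struct scrub_fs_ctx *sfctx, int sector_nr)
	{
		/* One csum_size slot per logical sector in the stripe. */
		return sfctx->csum_buf + sector_nr * sfctx->fs_info->csum_size;
	}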
Signed-off-by: Qu Wenruo
Reported-by: kernel test robot
---
 fs/btrfs/scrub.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 301 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 46885f966bb3..96daed3f3274 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4534,6 +4534,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static struct scrub_fs_sector *scrub_fs_get_sector(struct scrub_fs_ctx *sfctx,
+						   int sector_nr, int mirror_nr)
+{
+	/* Basic boundary checks. */
+	ASSERT(sector_nr >= 0 && sector_nr < sfctx->sectors_per_stripe);
+	ASSERT(mirror_nr >= 0 && mirror_nr < sfctx->nr_copies);
+
+	return &sfctx->sectors[mirror_nr * sfctx->sectors_per_stripe + sector_nr];
+}
+
 static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
 					       bool readonly)
 {
@@ -4675,10 +4685,264 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
 	return -ENOMEM;
 }
 
+static int scrub_fs_fill_sector_types(struct scrub_fs_ctx *sfctx,
+				      u64 stripe_start, u64 extent_start,
+				      u64 extent_len, u64 extent_flags,
+				      u64 extent_gen)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const u64 stripe_end = stripe_start + (sfctx->sectors_per_stripe <<
+					       fs_info->sectorsize_bits);
+	const u64 real_start = max(stripe_start, extent_start);
+	const u64 real_len = min(stripe_end, extent_start + extent_len) - real_start;
+	bool is_meta = false;
+	u64 cur_logical;
+	int sector_flags;
+
+	if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		sector_flags = SCRUB_FS_SECTOR_FLAG_META;
+		is_meta = true;
+		/* Metadata should never cross the stripe boundary. */
+		if (extent_start != real_start) {
+			btrfs_err(fs_info,
+		"tree block at bytenr %llu crossed stripe boundary",
+				  extent_start);
+			return -EUCLEAN;
+		}
+	} else {
+		sector_flags = SCRUB_FS_SECTOR_FLAG_DATA;
+	}
+
+	for (cur_logical = real_start; cur_logical < real_start + real_len;
+	     cur_logical += fs_info->sectorsize) {
+		const int sector_nr = (cur_logical - stripe_start) >>
+				      fs_info->sectorsize_bits;
+		int mirror_nr;
+
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+			struct scrub_fs_sector *sector =
+				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+			/*
+			 * All sectors in the range should not have been
+			 * initialized yet.
+			 */
+			ASSERT(sector->flags == SCRUB_FS_SECTOR_FLAG_UNUSED);
+			ASSERT(sector->csum == NULL);
+			ASSERT(sector->eb_generation == 0);
+
+			sector->flags = sector_flags;
+			/*
+			 * Here we only populate the eb_* members; csums will
+			 * be filled later by a dedicated csum tree search.
+			 */
+			if (is_meta) {
+				sector->eb_logical = extent_start;
+				sector->eb_generation = extent_gen;
+			}
+		}
+	}
+	return 0;
+}
+
+/*
+ * Locate a stripe which has any extent inside it.
+ *
+ * @start:	logical bytenr to start the search. The resulting stripe
+ *		should be >= @start.
+ * @found_ret:	logical bytenr of the found stripe. Should also be a stripe
+ *		start bytenr.
+ *
+ * Return 0 if we found such a stripe, and update @found_ret; furthermore,
+ * fill the sfctx->sectors[] array with the needed extent info (generation
+ * for tree blocks, csum for data extents).
+ *
+ * Return <0 if we hit fatal errors.
+ *
+ * Return >0 if there is no more stripe containing any extent after @start.
+ */ +static int scrub_fs_locate_and_fill_stripe(struct scrub_fs_ctx *sfctx, u64 start, + u64 *found_ret) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct btrfs_path path = {0}; + struct btrfs_root *extent_root = btrfs_extent_root(fs_info, + sfctx->cur_bg->start); + const u64 bg_start = sfctx->cur_bg->start; + const u64 bg_end = bg_start + sfctx->cur_bg->length; + const u32 stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits; + u64 cur_logical = start; + /* + * The full stripe start we found. If 0, it means we haven't yet found + * any extent. + */ + u64 stripe_start = 0; + u64 extent_start; + u64 extent_size; + u64 extent_flags; + u64 extent_gen; + int ret; + + path.search_commit_root = true; + path.skip_locking = true; + + /* Initial search to find any extent inside the block group. */ + ret = find_first_extent_item(extent_root, &path, cur_logical, + bg_end - cur_logical); + /* Either error out or no more extent items. */ + if (ret) + goto out; + + get_extent_info(&path, &extent_start, &extent_size, &extent_flags, + &extent_gen); + /* + * Note here a full stripe for RAID56 may not be power of 2, thus + * we have to use rounddown(), not round_down(). + */ + stripe_start = rounddown(max(extent_start, cur_logical) - bg_start, + stripe_len) + bg_start; + *found_ret = stripe_start; + + scrub_fs_fill_sector_types(sfctx, stripe_start, extent_start, + extent_size, extent_flags, extent_gen); + + cur_logical = min(stripe_start + stripe_len, extent_start + extent_size); + + /* Now iterate all the remaining extents inside the stripe. */ + while (cur_logical < stripe_start + stripe_len) { + ret = find_first_extent_item(extent_root, &path, cur_logical, + stripe_start + stripe_len - cur_logical); + if (ret) + goto out; + + get_extent_info(&path, &extent_start, &extent_size, + &extent_flags, &extent_gen); + scrub_fs_fill_sector_types(sfctx, stripe_start, extent_start, + extent_size, extent_flags, extent_gen); + cur_logical = extent_start + extent_size; + } +out: + btrfs_release_path(&path); + /* + * Found nothing, the first get_extent_info() returned error or no + * extent found at all, just return @ret directly. + */ + if (!stripe_start) + return ret; + + /* + * Now we have hit at least one extent, if ret > 0, then it means + * we still need to handle the extents we found, in that case we + * return 0, so we will scrub what we found. + */ + if (ret > 0) + ret = 0; + return ret; +} + +static void scrub_fs_fill_one_ordered_sum(struct scrub_fs_ctx *sfctx, + struct btrfs_ordered_sum *sum) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const u64 stripe_start = sfctx->cur_logical; + const u32 stripe_len = sfctx->sectors_per_stripe << + fs_info->sectorsize_bits; + u64 cur; + + ASSERT(stripe_start <= sum->bytenr && + sum->bytenr + sum->len <= stripe_start + stripe_len); + + for (cur = sum->bytenr; cur < sum->bytenr + sum->len; + cur += fs_info->sectorsize) { + int sector_nr = (cur - stripe_start) >> fs_info->sectorsize_bits; + int mirror_nr; + u8 *csum = sum->sums + (((cur - sum->bytenr) >> + fs_info->sectorsize_bits) * fs_info->csum_size); + + /* Fill csum_buf first. */ + memcpy(sfctx->csum_buf + sector_nr * fs_info->csum_size, + csum, fs_info->csum_size); + + /* Make sectors in all mirrors to point to the correct csum. 
*/ + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + + ASSERT(sector->flags & SCRUB_FS_SECTOR_FLAG_DATA); + sector->csum = sfctx->csum_buf + sector_nr * fs_info->csum_size; + } + } +} + +static int scrub_fs_fill_stripe_csum(struct scrub_fs_ctx *sfctx) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct btrfs_root *csum_root = btrfs_csum_root(fs_info, + sfctx->cur_bg->start); + const u64 stripe_start = sfctx->cur_logical; + const u32 stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits; + LIST_HEAD(csum_list); + int ret; + + ret = btrfs_lookup_csums_range(csum_root, stripe_start, + stripe_start + stripe_len - 1, + &csum_list, true); + if (ret < 0) + return ret; + + /* Extract csum_list and fill them into csum_buf. */ + while (!list_empty(&csum_list)) { + struct btrfs_ordered_sum *sum; + + sum = list_first_entry(&csum_list, struct btrfs_ordered_sum, + list); + scrub_fs_fill_one_ordered_sum(sfctx, sum); + list_del(&sum->list); + kfree(sum); + } + return 0; +} + +/* + * Reset the content of pages/csum_buf and reset sector types/csum, so + * no leftover data for the next run. + */ +static void scrub_fs_reset_stripe(struct scrub_fs_ctx *sfctx) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const int nr_pages = (sfctx->total_sectors << + fs_info->sectorsize_bits) >> PAGE_SHIFT; + int i; + + ASSERT(nr_pages); + + /* Zero page content. */ + for (i = 0; i < nr_pages; i++) + memzero_page(sfctx->pages[i], 0, PAGE_SIZE); + + /* Zero csum_buf. */ + if (sfctx->csum_buf) + memset(sfctx->csum_buf, 0, sfctx->sectors_per_stripe * + fs_info->csum_size); + + /* Clear sector types and its csum pointer. */ + for (i = 0; i < sfctx->total_sectors; i++) { + struct scrub_fs_sector *sector = &sfctx->sectors[i]; + + sector->flags = SCRUB_FS_SECTOR_FLAG_UNUSED; + sector->csum = NULL; + sector->eb_generation = 0; + sector->eb_logical = 0; + } +} static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx, struct btrfs_block_group *bg) { + const struct btrfs_fs_info *fs_info = sfctx->fs_info; + bool is_data = bg->flags & BTRFS_BLOCK_GROUP_DATA; + u32 stripe_len; + u64 cur_logical = bg->start; int ret; /* Not yet supported, just skip RAID56 bgs for now. */ @@ -4689,8 +4953,43 @@ static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx, if (ret < 0) return ret; - /* Place holder for the loop itearting the sectors. */ - ret = 0; + /* + * We can only trust anything inside sfctx after + * scrub_fs_init_for_bg(). + */ + stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits; + ASSERT(stripe_len); + + while (cur_logical < bg->start + bg->length) { + u64 stripe_start; + + ret = scrub_fs_locate_and_fill_stripe(sfctx, cur_logical, + &stripe_start); + if (ret < 0) + break; + + /* No more extent left in the bg, we have finished the bg. */ + if (ret > 0) { + ret = 0; + break; + } + + sfctx->cur_logical = stripe_start; + + if (is_data) { + ret = scrub_fs_fill_stripe_csum(sfctx); + if (ret < 0) + break; + } + + /* Place holder for real stripe scrubbing. */ + ret = 0; + + /* Reset the stripe for next run. 
+		 */
+		scrub_fs_reset_stripe(sfctx);
+
+		cur_logical = stripe_start + stripe_len;
+	}
 
 	scrub_fs_cleanup_for_bg(sfctx);
 	return ret;

From patchwork Sat Sep 3 08:19:26 2022
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 6/9] btrfs: scrub: submit and wait for the read of each copy
Date: Sat, 3 Sep 2022 16:19:26 +0800

This patch introduces a helper, scrub_fs_one_stripe().

Currently it's only doing the following work:

- Submit bios for each copy of the 64K stripe
  We don't need to skip any range which doesn't have data/metadata;
  skipping would only eat up the IOPS performance of the disk.

  At per-stripe initialization time we have marked all sectors unused,
  until the extent tree search marks the needed sectors DATA/METADATA.
  So at verification time we can skip those unused sectors.

- Wait for the bios to finish
  No csum verification yet.
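The submit-and-wait pattern the diff below implements boils down to the
following sketch (simplified; bioc setup and error handling elided, and the
function name is illustrative only):

	/* Sketch: one read bio per mirror, then wait for all of them. */
	static void example_submit_and_wait(struct scrub_fs_ctx *sfctx,
					    struct btrfs_io_context *bioc)
	{
		int i;

		/* Each bio adds sectors_per_stripe to sectors_under_io. */
		for (i = 0; i < sfctx->nr_copies; i++)
			submit_stripe_read_bio(sfctx, bioc, i);

		/*
		 * scrub_fs_read_endio() subtracts the per-bio sector count
		 * and wakes us; we proceed once all reads have completed.
		 */
		wait_event(sfctx->wait,
			   atomic_read(&sfctx->sectors_under_io) == 0);
	}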
Signed-off-by: Qu Wenruo --- fs/btrfs/scrub.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 218 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 96daed3f3274..70a54c6d8888 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -203,6 +203,11 @@ struct scrub_ctx { #define SCRUB_FS_SECTOR_FLAG_META (1 << 2) #define SCRUB_FS_SECTOR_FLAG_PARITY (1 << 3) +/* This marks if the sector belongs to a missing device. */ +#define SCRUB_FS_SECTOR_FLAG_DEV_MISSING (1 << 4) +#define SCRUB_FS_SECTOR_FLAG_IO_ERROR (1 << 5) +#define SCRUB_FS_SECTOR_FLAG_IO_DONE (1 << 6) + /* * Represent a sector. * @@ -305,6 +310,8 @@ struct scrub_fs_ctx { * would point to the same location inside the buffer. */ u8 *csum_buf; + + wait_queue_head_t wait; }; struct scrub_warning { @@ -4559,6 +4566,7 @@ static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info, sfctx->fs_info = fs_info; sfctx->readonly = readonly; atomic_set(&sfctx->sectors_under_io, 0); + init_waitqueue_head(&sfctx->wait); return sfctx; error: kfree(sfctx); @@ -4936,6 +4944,213 @@ static void scrub_fs_reset_stripe(struct scrub_fs_ctx *sfctx) } } +static void mark_missing_dev_sectors(struct scrub_fs_ctx *sfctx, + int stripe_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const int sectors_per_stripe = BTRFS_STRIPE_LEN >> + fs_info->sectorsize_bits; + int i; + + for (i = 0; i < sectors_per_stripe; i++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, i, stripe_nr); + + sector->flags |= SCRUB_FS_SECTOR_FLAG_DEV_MISSING; + } +} + +static struct page *scrub_fs_get_page(struct scrub_fs_ctx *sfctx, + int sector_nr) +{ + const int sectors_per_stripe = BTRFS_STRIPE_LEN >> + sfctx->fs_info->sectorsize_bits; + int page_index; + + /* Basic checks to make sure we're accessing a valid sector. */ + ASSERT(sector_nr >= 0 && sector_nr < sfctx->nr_copies * sectors_per_stripe); + + page_index = sector_nr / (PAGE_SIZE >> sfctx->fs_info->sectorsize_bits); + + ASSERT(sfctx->pages[page_index]); + return sfctx->pages[page_index]; +} + +static unsigned int scrub_fs_get_page_offset(struct scrub_fs_ctx *sfctx, + int sector_nr) +{ + const int sectors_per_stripe = BTRFS_STRIPE_LEN >> + sfctx->fs_info->sectorsize_bits; + + /* Basic checks to make sure we're accessing a valid sector. */ + ASSERT(sector_nr >= 0 && sector_nr < sfctx->nr_copies * sectors_per_stripe); + + return offset_in_page(sector_nr << sfctx->fs_info->sectorsize_bits); +} + +static int scrub_fs_get_stripe_nr(struct scrub_fs_ctx *sfctx, + struct page *first_page, + unsigned int page_off) +{ + const int pages_per_stripe = BTRFS_STRIPE_LEN >> PAGE_SHIFT; + bool found = false; + int i; + + /* The first sector should always be page aligned. */ + ASSERT(page_off == 0); + + for (i = 0; i < pages_per_stripe * sfctx->nr_copies; i++) { + if (first_page == sfctx->pages[i]) { + found = true; + break; + } + } + if (!found) + return -1; + + ASSERT(IS_ALIGNED(i, pages_per_stripe)); + + return i / pages_per_stripe; +} + +static void scrub_fs_read_endio(struct bio *bio) +{ + struct scrub_fs_ctx *sfctx = bio->bi_private; + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct bio_vec *first_bvec = bio_first_bvec_all(bio); + struct bio_vec *bvec; + struct bvec_iter_all iter_all; + int bio_size = 0; + bool error = (bio->bi_status != BLK_STS_OK); + const int stripe_nr = scrub_fs_get_stripe_nr(sfctx, first_bvec->bv_page, + first_bvec->bv_offset); + int i; + + /* Grab the bio size for later sanity checks. 
*/ + bio_for_each_segment_all(bvec, bio, iter_all) + bio_size += bvec->bv_len; + + /* We always submit a bio for a stripe length. */ + ASSERT(bio_size == BTRFS_STRIPE_LEN); + + for (i = 0; i < sfctx->sectors_per_stripe; i++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, i, stripe_nr); + /* + * Here we only set the sector flags, don't do any stat update, + * that will be done by the main thread when doing verification. + */ + if (error) + sector->flags |= SCRUB_FS_SECTOR_FLAG_IO_ERROR; + else + sector->flags |= SCRUB_FS_SECTOR_FLAG_IO_DONE; + } + atomic_sub(bio_size >> fs_info->sectorsize_bits, + &sfctx->sectors_under_io); + wake_up(&sfctx->wait); + bio_put(bio); +} + +static void submit_stripe_read_bio(struct scrub_fs_ctx *sfctx, + struct btrfs_io_context *bioc, + int stripe_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const int sectors_per_stripe = BTRFS_STRIPE_LEN >> + fs_info->sectorsize_bits; + struct btrfs_io_stripe *stripe = &bioc->stripes[stripe_nr]; + struct btrfs_device *dev = stripe->dev; + struct bio *bio; + int ret; + int i; + + /* + * Missing device, just mark the sectors with the missing device flag + * and continue to the next copy. + */ + if (!dev || !dev->bdev) { + mark_missing_dev_sectors(sfctx, stripe_nr); + return; + } + + /* Submit a bio to read one stripe length. */ + bio = bio_alloc(dev->bdev, BIO_MAX_VECS, + REQ_OP_READ | REQ_BACKGROUND, GFP_KERNEL); + + /* Bio is backed by a mempool, thus the allocation should not fail. */ + ASSERT(bio); + + bio->bi_iter.bi_sector = stripe->physical >> SECTOR_SHIFT; + for (i = sectors_per_stripe * stripe_nr; + i < sectors_per_stripe * (stripe_nr + 1); i++) { + struct page *page = scrub_fs_get_page(sfctx, i); + unsigned int page_off = scrub_fs_get_page_offset(sfctx, i); + + ret = bio_add_page(bio, page, fs_info->sectorsize, page_off); + + /* + * Should not fail as we will at most add STRIPE_LEN / 4K + * (aka, 16) sectors, way smaller than BIO_MAX_VECS. + */ + ASSERT(ret == fs_info->sectorsize); + } + + bio->bi_private = sfctx; + bio->bi_end_io = scrub_fs_read_endio; + atomic_add(sectors_per_stripe, &sfctx->sectors_under_io); + submit_bio(bio); +} + +static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct btrfs_io_context *bioc = NULL; + u64 mapped_len = BTRFS_STRIPE_LEN; + int i; + int ret; + + /* We should be at a stripe start inside the current block group. */ + ASSERT(sfctx->cur_bg->start <= sfctx->cur_logical && + sfctx->cur_logical < sfctx->cur_bg->start + + sfctx->cur_bg->length); + ASSERT(IS_ALIGNED(sfctx->cur_logical - sfctx->cur_bg->start, + BTRFS_STRIPE_LEN)); + + btrfs_bio_counter_inc_blocked(fs_info); + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, + sfctx->cur_logical, &mapped_len, &bioc); + if (ret < 0) + goto out; + + if (mapped_len < BTRFS_STRIPE_LEN) { + btrfs_err_rl(fs_info, + "get short map for bytenr %llu, got mapped length %llu expect %u", + sfctx->cur_logical, mapped_len, BTRFS_STRIPE_LEN); + ret = -EUCLEAN; + sfctx->stat.nr_fatal_errors++; + goto out; + } + + if (bioc->num_stripes != sfctx->nr_copies) { + btrfs_err_rl(fs_info, + "got unexpected number of stripes, got %d stripes expect %d", + bioc->num_stripes, sfctx->nr_copies); + ret = -EUCLEAN; + sfctx->stat.nr_fatal_errors++; + goto out; + } + + for (i = 0; i < sfctx->nr_copies; i++) + submit_stripe_read_bio(sfctx, bioc, i); + wait_event(sfctx->wait, atomic_read(&sfctx->sectors_under_io) == 0); + + /* Place holder to verify the read data.
*/ +out: + btrfs_put_bioc(bioc); + btrfs_bio_counter_dec(fs_info); + return ret; +} + static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx, struct btrfs_block_group *bg) { @@ -4982,8 +5197,9 @@ static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx, break; } - /* Place holder for real stripe scrubbing. */ - ret = 0; + ret = scrub_fs_one_stripe(sfctx); + if (ret < 0) + break; /* Reset the stripe for next run. */ scrub_fs_reset_stripe(sfctx); From patchwork Sat Sep 3 08:19:27 2022 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH PoC 7/9] btrfs: scrub: implement metadata verification code for scrub_fs Date: Sat, 3 Sep 2022 16:19:27 +0800 Message-Id: <1b48b161aa6cd64378cdb83178deea49ef0625ce.1662191784.git.wqu@suse.com> This patch introduces the following functions: - scrub_fs_verify_one_stripe() The entry point for all verification code, which iterates over every sector in the same vertical stripe. - scrub_fs_verify_meta() The helper to verify metadata in one vertical stripe.
(Since there is no RAID56 support yet, one vertical stripe just contains the same data from different mirrors.) - scrub_fs_verify_one_meta() This is where the real work happens; the checks include: * Basic metadata header checks (bytenr, fsid, level) For this part, we refactor those checks from validate_extent_buffer() into btrfs_validate_eb_basic(), allowing us to suppress the error messages. * Checksum verification For this part, we refactor this one check from validate_extent_buffer() into btrfs_validate_eb_csum(), allowing us to suppress the error message. * Tree check verification (NEW) This one is new; the old scrub code never fully utilized the extent buffer facilities, thus it only did very basic checks. Now scrub_fs has (almost) the same checks as the tree block read routine. Signed-off-by: Qu Wenruo --- fs/btrfs/disk-io.c | 83 ++++++++++++++++------- fs/btrfs/disk-io.h | 2 + fs/btrfs/scrub.c | 164 ++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 222 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index e67614afcf4f..6eda9e615ae7 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -457,55 +457,87 @@ static int check_tree_block_fsid(struct extent_buffer *eb) return 1; } -/* Do basic extent buffer checks at read time */ -static int validate_extent_buffer(struct extent_buffer *eb) +/* + * The very basic extent buffer checks, including: + * + * - Bytenr check + * - FSID check + * - Level check + * + * If @error_message is true, it will output error messages (rate limited). + */ +int btrfs_validate_eb_basic(struct extent_buffer *eb, bool error_message) { struct btrfs_fs_info *fs_info = eb->fs_info; u64 found_start; - const u32 csum_size = fs_info->csum_size; u8 found_level; - u8 result[BTRFS_CSUM_SIZE]; - const u8 *header_csum; int ret = 0; found_start = btrfs_header_bytenr(eb); if (found_start != eb->start) { - btrfs_err_rl(fs_info, + if (error_message) + btrfs_err_rl(fs_info, "bad tree block start, mirror %u want %llu have %llu", - eb->read_mirror, eb->start, found_start); - ret = -EIO; - goto out; + eb->read_mirror, eb->start, found_start); + return -EIO; } if (check_tree_block_fsid(eb)) { - btrfs_err_rl(fs_info, "bad fsid on logical %llu mirror %u", - eb->start, eb->read_mirror); - ret = -EIO; - goto out; + if (error_message) + btrfs_err_rl(fs_info, "bad fsid on logical %llu mirror %u", + eb->start, eb->read_mirror); + return -EIO; } found_level = btrfs_header_level(eb); if (found_level >= BTRFS_MAX_LEVEL) { - btrfs_err(fs_info, - "bad tree block level, mirror %u level %d on logical %llu", - eb->read_mirror, btrfs_header_level(eb), eb->start); - ret = -EIO; - goto out; + if (error_message) + btrfs_err(fs_info, + "bad tree block level, mirror %u level %d on logical %llu", + eb->read_mirror, btrfs_header_level(eb), eb->start); + return -EIO; } + return ret; +} + +int btrfs_validate_eb_csum(struct extent_buffer *eb, bool error_message) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + u8 result[BTRFS_CSUM_SIZE]; + const u8 *header_csum; + const u32 csum_size = fs_info->csum_size; csum_tree_block(eb, result); header_csum = page_address(eb->pages[0]) + get_eb_offset_in_page(eb, offsetof(struct btrfs_header, csum)); if (memcmp(result, header_csum, csum_size) != 0) { - btrfs_warn_rl(fs_info, + if (error_message) + btrfs_warn_rl(fs_info, "checksum verify failed on logical %llu mirror %u wanted " CSUM_FMT " found " CSUM_FMT " level %d", - eb->start, eb->read_mirror, - CSUM_FMT_VALUE(csum_size, header_csum), - CSUM_FMT_VALUE(csum_size, result), -
btrfs_header_level(eb)); - ret = -EUCLEAN; - goto out; + eb->start, eb->read_mirror, + CSUM_FMT_VALUE(csum_size, header_csum), + CSUM_FMT_VALUE(csum_size, result), + btrfs_header_level(eb)); + return -EUCLEAN; } + return 0; +} + +/* Do basic extent buffer checks at read time */ +static inline int validate_extent_buffer(struct extent_buffer *eb) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + u8 found_level; + int ret = 0; + + ret = btrfs_validate_eb_basic(eb, true); + if (ret < 0) + return ret; + ret = btrfs_validate_eb_csum(eb, true); + if (ret < 0) + return ret; + + found_level = btrfs_header_level(eb); /* * If this is a leaf block and it is corrupt, set the corrupt bit so * that we don't try and read the other copies of this block, just @@ -525,7 +557,6 @@ static int validate_extent_buffer(struct extent_buffer *eb) btrfs_err(fs_info, "read time tree block corruption detected on logical %llu mirror %u", eb->start, eb->read_mirror); -out: return ret; } diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 47ad8e0a2d33..3d154873d4e5 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -80,6 +80,8 @@ void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info, int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio, struct page *page, u64 start, u64 end, int mirror); +int btrfs_validate_eb_basic(struct extent_buffer *eb, bool error_message); +int btrfs_validate_eb_csum(struct extent_buffer *eb, bool error_message); void btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int mirror_num); #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 70a54c6d8888..f587d0373517 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -20,6 +20,7 @@ #include "rcu-string.h" #include "raid56.h" #include "block-group.h" +#include "tree-checker.h" #include "zoned.h" /* @@ -208,6 +209,9 @@ struct scrub_ctx { #define SCRUB_FS_SECTOR_FLAG_IO_ERROR (1 << 5) #define SCRUB_FS_SECTOR_FLAG_IO_DONE (1 << 6) +/* This marks if the sector is a good one (aka, passed all checks). */ +#define SCRUB_FS_SECTOR_FLAG_GOOD (1 << 7) + /* * Represent a sector. * @@ -311,6 +315,11 @@ struct scrub_fs_ctx { */ u8 *csum_buf; + /* + * This is for METADATA block group verification, we use this dummy eb + * to utilize all the accessors for extent buffers. + */ + struct extent_buffer *dummy_eb; wait_queue_head_t wait; }; @@ -4599,6 +4608,10 @@ static void scrub_fs_cleanup_for_bg(struct scrub_fs_ctx *sfctx) kfree(sfctx->csum_buf); sfctx->csum_buf = NULL; + if (sfctx->dummy_eb) { + free_extent_buffer_stale(sfctx->dummy_eb); + sfctx->dummy_eb = NULL; + } /* NOTE: block group will only be put inside scrub_fs_iterate_bgs(). 
*/ sfctx->cur_bg = NULL; } @@ -4626,6 +4639,7 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx, ASSERT(!sfctx->pages); ASSERT(!sfctx->sectors); ASSERT(!sfctx->csum_buf); + ASSERT(!sfctx->dummy_eb); read_lock(&map_tree->lock); em = lookup_extent_mapping(map_tree, bg->start, bg->length); @@ -4683,6 +4697,12 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx, if (!sfctx->csum_buf) goto enomem; } + if (bg->flags & (BTRFS_BLOCK_GROUP_METADATA | + BTRFS_BLOCK_GROUP_SYSTEM)) { + sfctx->dummy_eb = alloc_dummy_extent_buffer(fs_info, 0); + if (!sfctx->dummy_eb) + goto enomem; + } sfctx->cur_bg = bg; sfctx->cur_logical = bg->start; return 0; @@ -5101,6 +5121,148 @@ static void submit_stripe_read_bio(struct scrub_fs_ctx *sfctx, submit_bio(bio); } +static void scrub_fs_verify_one_meta(struct scrub_fs_ctx *sfctx, int sector_nr, + int mirror_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct extent_buffer *eb = sfctx->dummy_eb; + const int sectors_per_mirror = BTRFS_STRIPE_LEN >> + fs_info->sectorsize_bits; + const u64 logical = sfctx->cur_logical + (sector_nr << + fs_info->sectorsize_bits); + const u64 expected_gen = sfctx->sectors[sector_nr].eb_generation; + int ret; + int i; + + sfctx->stat.meta_scrubbed += fs_info->nodesize; + + /* No IO is done for the sectors. Just update the accounting and exit. */ + if (sfctx->sectors[sector_nr + mirror_nr * sectors_per_mirror].flags & + (SCRUB_FS_SECTOR_FLAG_DEV_MISSING | SCRUB_FS_SECTOR_FLAG_IO_ERROR)) { + sfctx->stat.meta_io_fail += fs_info->nodesize; + return; + } + + /* Sector_nr should always be the sector number in the first mirror. */ + ASSERT(sector_nr >= 0 && sector_nr < sectors_per_mirror); + ASSERT(eb); + + eb->start = logical; + + /* Copy all the metadata sectors into the dummy eb. */ + for (i = sector_nr + mirror_nr * sectors_per_mirror; + i < sector_nr + mirror_nr * sectors_per_mirror + + (fs_info->nodesize >> fs_info->sectorsize_bits); i++) { + struct page *page = scrub_fs_get_page(sfctx, i); + int page_off = scrub_fs_get_page_offset(sfctx, i); + int off_in_eb = (i - mirror_nr * sectors_per_mirror - + sector_nr) << fs_info->sectorsize_bits; + + write_extent_buffer(eb, page_address(page) + page_off, + off_in_eb, fs_info->sectorsize); + } + + /* Basic extent buffer checks. */ + ret = btrfs_validate_eb_basic(eb, false); + if (ret < 0) { + sfctx->stat.meta_invalid += fs_info->nodesize; + return; + } + /* Csum checks. */ + ret = btrfs_validate_eb_csum(eb, false); + if (ret < 0) { + sfctx->stat.meta_bad_csum += fs_info->nodesize; + return; + } + /* Full tree-checker verification. */ + if (btrfs_header_level(eb) > 0) + ret = btrfs_check_node(eb); + else + ret = btrfs_check_leaf_full(eb); + if (ret < 0) { + sfctx->stat.meta_invalid += fs_info->nodesize; + return; + } + + /* Transid check */ + if (btrfs_header_generation(eb) != expected_gen) { + sfctx->stat.meta_bad_transid += fs_info->nodesize; + return; + } + + /* + * All good, set the involved sectors with the GOOD flag so later we + * can know whether the vertical stripe has any good copy, and thus + * whether corrupted sectors can be recovered. + */ + for (i = sector_nr + mirror_nr * sectors_per_mirror; + i < sector_nr + mirror_nr * sectors_per_mirror + + (fs_info->nodesize >> fs_info->sectorsize_bits); i++) + sfctx->sectors[i].flags |= SCRUB_FS_SECTOR_FLAG_GOOD; +} + +static void scrub_fs_verify_meta(struct scrub_fs_ctx *sfctx, int sector_nr) +{ + struct extent_buffer *eb = sfctx->dummy_eb; + int i; + + /* + * This should be allocated when we enter the metadata block group.
+ * If not allocated, this means the block group flags are unreliable. + * + * All we can do here is bail out with the invalid metadata counter + * increased. + */ + if (!eb) { + btrfs_err_rl(sfctx->fs_info, + "block group %llu flag 0x%llx should not have metadata at %llu", + sfctx->cur_bg->start, sfctx->cur_bg->flags, + sfctx->cur_logical + + (sector_nr << sfctx->fs_info->sectorsize_bits)); + sfctx->stat.meta_invalid += sfctx->fs_info->nodesize; + return; + } + + for (i = 0; i < sfctx->nr_copies; i++) + scrub_fs_verify_one_meta(sfctx, sector_nr, i); +} + +static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const int sectors_per_stripe = BTRFS_STRIPE_LEN >> + fs_info->sectorsize_bits; + int sector_nr; + + for (sector_nr = 0; sector_nr < sectors_per_stripe; sector_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, 0); + + /* + * All sectors in the same vertical stripe should share + * the same UNUSED/DATA/META flags, thus checking the UNUSED + * flag of mirror 0 is enough to determine if this + * vertical stripe needs verification. + */ + if (sector->flags & SCRUB_FS_SECTOR_FLAG_UNUSED) + continue; + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_META) { + scrub_fs_verify_meta(sfctx, sector_nr); + /* + * For metadata, we need to skip to the end of the tree + * block, and don't forget @sector_nr will get + * increased by 1 from the for loop. + */ + sector_nr += (fs_info->nodesize >> + fs_info->sectorsize_bits) - 1; + continue; + } + + /* Place holder for data verification. */ + } + /* Place holder for recoverable checks. */ +} + static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx) { struct btrfs_fs_info *fs_info = sfctx->fs_info; @@ -5144,7 +5306,7 @@ static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx) submit_stripe_read_bio(sfctx, bioc, i); wait_event(sfctx->wait, atomic_read(&sfctx->sectors_under_io) == 0); - /* Place holder to verify the read data.
*/ + scrub_fs_verify_one_stripe(sfctx); out: btrfs_put_bioc(bioc); btrfs_bio_counter_dec(fs_info); From patchwork Sat Sep 3 08:19:28 2022 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH PoC 8/9] btrfs: scrub: implement data verification code for scrub_fs Date: Sat, 3 Sep 2022 16:19:28 +0800 Data verification is much simpler: we only need to do csum verification, and we have a very handy helper for that. Signed-off-by: Qu Wenruo --- fs/btrfs/scrub.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 57 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index f587d0373517..145bd5c9601d 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -5226,6 +5226,60 @@ static void scrub_fs_verify_meta(struct scrub_fs_ctx *sfctx, int sector_nr) scrub_fs_verify_one_meta(sfctx, sector_nr, i); } +/* Convert a sector pointer to its index inside sfctx->sectors[] array.
*/ +static int scrub_fs_sector_to_sector_index(struct scrub_fs_ctx *sfctx, + struct scrub_fs_sector *sector) +{ + int i; + + ASSERT(sector); + for (i = 0; i < sfctx->total_sectors; i++) { + if (&sfctx->sectors[i] == sector) + break; + } + /* As long as @sector is from sfctx->sectors[], we should have found it. */ + ASSERT(i < sfctx->total_sectors); + return i; +} + +static void scrub_fs_verify_one_data(struct scrub_fs_ctx *sfctx, int sector_nr, + int mirror_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + int index = scrub_fs_sector_to_sector_index(sfctx, sector); + struct page *data_page = scrub_fs_get_page(sfctx, index); + unsigned int data_page_off = scrub_fs_get_page_offset(sfctx, index); + u8 csum_result[BTRFS_CSUM_SIZE]; + u8 *csum_expected = sector->csum; + int ret; + + sfctx->stat.data_scrubbed += fs_info->sectorsize; + + /* No IO done, just increase the accounting. */ + if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE)) { + sfctx->stat.data_io_fail += fs_info->sectorsize; + return; + } + + /* + * NODATASUM case, just skip, we will come back later to determine + * the status when checking the sectors inside the same vertical stripe. + */ + if (!csum_expected) + return; + + ret = btrfs_check_sector_csum(fs_info, data_page, data_page_off, + csum_result, csum_expected); + if (ret < 0) { + sfctx->stat.data_csum_mismatch += fs_info->sectorsize; + return; + } + /* All good. */ + sector->flags |= SCRUB_FS_SECTOR_FLAG_GOOD; +} + static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) { struct btrfs_fs_info *fs_info = sfctx->fs_info; @@ -5236,6 +5290,7 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) for (sector_nr = 0; sector_nr < sectors_per_stripe; sector_nr++) { struct scrub_fs_sector *sector = scrub_fs_get_sector(sfctx, sector_nr, 0); + int mirror_nr; /* * All sectors in the same vertical stripe should share @@ -5258,7 +5313,8 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) continue; } - /* Place holder for data verification. */ + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) + scrub_fs_verify_one_data(sfctx, sector_nr, mirror_nr); } /* Place holder for recoverable checks.
*/ } From patchwork Sat Sep 3 08:19:29 2022 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report for scrub_fs Date: Sat, 3 Sep 2022 16:19:29 +0800 Message-Id: <06e4f67a9e50c2b6dfc49a086ee62053cbdcc0ae.1662191784.git.wqu@suse.com> This patch introduces two new functions: - scrub_fs_recover_meta() - scrub_fs_recover_data() Both of them check whether the corrupted data/meta sectors inside a vertical stripe can be recovered, and if so, update the stat->data/meta_recoverable accounting. Neither of them implements the writeback part yet. For scrub_fs_recover_meta() it's not much different from the existing scrub: as long as we have one good copy, all the other copies can be recovered. For scrub_fs_recover_data(), besides the existing csum based verification, for NODATASUM cases we make sure all the copies match each other. If they don't match, we report all the sectors as uncertain.
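[Note: as a quick, self-contained illustration of the recoverability rule described above — one verified good copy makes every other corrupted copy of the same vertical stripe repairable — here is a sketch; the enum and function names are made up for the example and are not part of the patch:

enum copy_state { COPY_GOOD, COPY_BAD, COPY_IO_FAIL };

/*
 * Return how many copies of one vertical stripe could be repaired:
 * zero when no good copy exists, otherwise every non-good copy.
 */
static int count_recoverable(const enum copy_state *copies, int nr_copies)
{
	int nr_good = 0;
	int i;

	for (i = 0; i < nr_copies; i++)
		if (copies[i] == COPY_GOOD)
			nr_good++;

	/* No good copy at all: nothing can be repaired from the mirrors. */
	if (!nr_good)
		return 0;

	/* Every remaining copy can be rewritten from a good one. */
	return nr_copies - nr_good;
}

For the NODATASUM data case this rule does not apply: without checksums there is no way to tell which copy is the good one, so the copies can only be cross-compared, and any mismatch makes all of them uncertain.]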
Signed-off-by: Qu Wenruo Reported-by: kernel test robot --- fs/btrfs/scrub.c | 197 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 196 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 145bd5c9601d..fdac9b8e4cf1 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -212,6 +212,12 @@ struct scrub_ctx { /* This marks if the sector is a good one (aka, passed all checks). */ #define SCRUB_FS_SECTOR_FLAG_GOOD (1 << 7) +/* + * This marks if the sector is recoverable (thus it must have been corrupted + * in the first place). + */ +#define SCRUB_FS_SECTOR_FLAG_RECOVERABLE (1 << 8) + /* * Represent a sector. * @@ -5280,6 +5286,177 @@ static void scrub_fs_verify_one_data(struct scrub_fs_ctx *sfctx, int sector_nr, sector->flags |= SCRUB_FS_SECTOR_FLAG_GOOD; } +static void scrub_fs_recover_meta(struct scrub_fs_ctx *sfctx, int sector_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + int nr_good = 0; + int mirror_nr; + int cur_sector; + + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD) + nr_good++; + } + + /* No good copy found, there is nothing we can repair from. */ + if (!nr_good) + return; + /* + * We have at least one good copy, and we also have bad copies, + * all those bad copies can be recovered. + */ + sfctx->stat.meta_recoverable += (sfctx->nr_copies - nr_good) * + fs_info->nodesize; + + /* Mark those bad sectors recoverable. */ + for (cur_sector = sector_nr; cur_sector < sector_nr + + (fs_info->nodesize >> fs_info->sectorsize_bits); cur_sector++) { + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, cur_sector, mirror_nr); + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD) + continue; + + sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE; + } + } + /* Place holder for writeback. */ +} + +static int scrub_fs_memcmp_sectors(struct scrub_fs_ctx *sfctx, + struct scrub_fs_sector *sector1, + struct scrub_fs_sector *sector2) +{ + struct page *page1; + struct page *page2; + unsigned int page_off1; + unsigned int page_off2; + int sector_nr1; + int sector_nr2; + + sector_nr1 = scrub_fs_sector_to_sector_index(sfctx, sector1); + sector_nr2 = scrub_fs_sector_to_sector_index(sfctx, sector2); + + page1 = scrub_fs_get_page(sfctx, sector_nr1); + page_off1 = scrub_fs_get_page_offset(sfctx, sector_nr1); + + page2 = scrub_fs_get_page(sfctx, sector_nr2); + page_off2 = scrub_fs_get_page_offset(sfctx, sector_nr2); + + return memcmp(page_address(page1) + page_off1, + page_address(page2) + page_off2, + sfctx->fs_info->sectorsize); +} + +/* Find the next sector in the same vertical stripe which has read IO done. */ +static struct scrub_fs_sector *scrub_fs_find_next_working_mirror( + struct scrub_fs_ctx *sfctx, int sector_nr, int mirror_nr) +{ + int i; + + for (i = mirror_nr + 1; i < sfctx->nr_copies; i++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, i); + + /* This copy has IO error, skip it. */ + if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE)) + continue; + + /* Found a good copy.
*/ + return sector; + } + return NULL; +} + +static void scrub_fs_recover_data(struct scrub_fs_ctx *sfctx, int sector_nr) +{ + struct btrfs_fs_info *fs_info = sfctx->fs_info; + const bool has_csum = !!sfctx->sectors[sector_nr].csum; + bool mismatch_found = false; + int nr_good = 0; + int io_fail = 0; + int mirror_nr; + + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD) + nr_good++; + if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE)) + io_fail++; + } + + if (has_csum) { + /* + * Without a single good copy, the corrupted sectors can not + * be repaired from the other mirrors. + */ + if (!nr_good) + return; + /* + * There is at least one good copy, thus all the other + * corrupted sectors can also be recovered. + */ + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD) + continue; + sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE; + sfctx->stat.data_recoverable += fs_info->sectorsize; + } + + /* Place holder for writeback */ + return; + } + + /* + * The NODATASUM case is much harder. + * + * The idea is, we have to compare all the sectors to determine if they + * match. + * + * Firstly rule out sectors which don't have extra working copies. + */ + if (sfctx->nr_copies - io_fail <= 1) { + sfctx->stat.data_nocsum_uncertain += fs_info->sectorsize; + return; + } + + /* + * For now we only handle one case: all the data we read back matches + * each other, or we consider all the copies uncertain. + */ + for (mirror_nr = 0; mirror_nr < sfctx->nr_copies - 1; mirror_nr++) { + struct scrub_fs_sector *sector = + scrub_fs_get_sector(sfctx, sector_nr, mirror_nr); + struct scrub_fs_sector *next_sector; + int ret; + + /* This copy has an IO error, skip to the next mirror. */ + if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE)) + continue; + + next_sector = scrub_fs_find_next_working_mirror(sfctx, + sector_nr, mirror_nr); + /* We're already the last working copy, can break now. */ + if (!next_sector) + break; + + ret = scrub_fs_memcmp_sectors(sfctx, sector, next_sector); + if (ret) + mismatch_found = true; + } + + /* + * We have found mismatched contents, mark all those sectors + * which don't have IO errors as uncertain. + */ + if (mismatch_found) + sfctx->stat.data_nocsum_uncertain += + (sfctx->nr_copies - io_fail) << fs_info->sectorsize_bits; +} + static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) { struct btrfs_fs_info *fs_info = sfctx->fs_info; @@ -5316,7 +5493,25 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx) for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) scrub_fs_verify_one_data(sfctx, sector_nr, mirror_nr); } - /* Place holder for recoverable checks. */ + + /* + * Now all sectors are checked, re-check which sectors are recoverable. + */ + for (sector_nr = 0; sector_nr < sectors_per_stripe; sector_nr++) { + struct scrub_fs_sector *sector = &sfctx->sectors[sector_nr]; + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_UNUSED) + continue; + + if (sector->flags & SCRUB_FS_SECTOR_FLAG_META) { + scrub_fs_recover_meta(sfctx, sector_nr); + sector_nr += (fs_info->nodesize >> + fs_info->sectorsize_bits) - 1; + continue; + } + + scrub_fs_recover_data(sfctx, sector_nr); + } } static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx)