From patchwork Wed Jul 19 05:30:23 2023
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13318236
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 1/4] btrfs: scrub: move write back of repaired sectors into scrub_stripe_read_repair_worker()
Date: Wed, 19 Jul 2023 13:30:23 +0800

Currently scrub_stripe_read_repair_worker() only does reads to rebuild the corrupted sectors; it doesn't do any writeback.

This design mostly keeps the writeback ordered, to cooperate with dev-replace on zoned devices, which require every write to be submitted in bytenr order.

However the writeback of repaired sectors to the original mirror doesn't need such a strong guarantee, as it can only happen on non-zoned devices.

This patch moves the writeback of repaired sectors into scrub_stripe_read_repair_worker().
To do that, we have to move the following functions forward:

- scrub_write_endio()
- scrub_submit_write_bio()
- scrub_write_sectors()

Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 256 +++++++++++++++++++++++------------------------
 1 file changed, 126 insertions(+), 130 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 3b09d359c914..a3aff0296ba4 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -983,6 +983,108 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, spin_unlock(&sctx->stat_lock); } +static void scrub_submit_write_bio(struct scrub_ctx *sctx, + struct scrub_stripe *stripe, + struct btrfs_bio *bbio, bool dev_replace) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + u32 bio_len = bbio->bio.bi_iter.bi_size; + u32 bio_off = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT) - + stripe->logical; + + fill_writer_pointer_gap(sctx, stripe->physical + bio_off); + atomic_inc(&stripe->pending_io); + btrfs_submit_repair_write(bbio, stripe->mirror_num, dev_replace); + if (!btrfs_is_zoned(fs_info)) + return; + /* + * For zoned writeback, queue depth must be 1, thus we must wait for + * the write to finish before the next write. + */ + wait_scrub_stripe_io(stripe); + + /* + * And also need to update the write pointer if write finished + * successfully. + */ + if (!test_bit(bio_off >> fs_info->sectorsize_bits, + &stripe->write_error_bitmap)) + sctx->write_pointer += bio_len; +} + +static void scrub_write_endio(struct btrfs_bio *bbio) +{ + struct scrub_stripe *stripe = bbio->private; + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct bio_vec *bvec; + int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio)); + u32 bio_size = 0; + int i; + + bio_for_each_bvec_all(bvec, &bbio->bio, i) + bio_size += bvec->bv_len; + + if (bbio->bio.bi_status) { + unsigned long flags; + + spin_lock_irqsave(&stripe->write_error_lock, flags); + bitmap_set(&stripe->write_error_bitmap, sector_nr, + bio_size >> fs_info->sectorsize_bits); + spin_unlock_irqrestore(&stripe->write_error_lock, flags); + } + bio_put(&bbio->bio); + + if (atomic_dec_and_test(&stripe->pending_io)) + wake_up(&stripe->io_wait); +} + +/* + * Submit the write bio(s) for the sectors specified by @write_bitmap. + * + * Here we utilize btrfs_submit_repair_write(), which has some extra benefits: + * + * - Only needs logical bytenr and mirror_num + * Just like the scrub read path + * + * - Would only result in writes to the specified mirror + * Unlike the regular writeback path, which would write back to all stripes + * + * - Handle dev-replace and read-repair writeback differently + */ +static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *stripe, + unsigned long write_bitmap, bool dev_replace) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct btrfs_bio *bbio = NULL; + int sector_nr; + + for_each_set_bit(sector_nr, &write_bitmap, stripe->nr_sectors) { + struct page *page = scrub_stripe_get_page(stripe, sector_nr); + unsigned int pgoff = scrub_stripe_get_page_offset(stripe, sector_nr); + int ret; + + /* We should only writeback sectors covered by an extent. */ + ASSERT(test_bit(sector_nr, &stripe->extent_sector_bitmap)); + + /* Cannot merge with previous sector, submit the current one.
*/ + if (bbio && sector_nr && !test_bit(sector_nr - 1, &write_bitmap)) { + scrub_submit_write_bio(sctx, stripe, bbio, dev_replace); + bbio = NULL; + } + if (!bbio) { + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_WRITE, + fs_info, scrub_write_endio, stripe); + bbio->bio.bi_iter.bi_sector = (stripe->logical + + (sector_nr << fs_info->sectorsize_bits)) >> + SECTOR_SHIFT; + } + ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff); + ASSERT(ret == fs_info->sectorsize); + } + if (bbio) + scrub_submit_write_bio(sctx, stripe, bbio, dev_replace); +} + /* * The main entrance for all read related scrub work, including: * @@ -991,12 +1093,16 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, * - Go through the remaining mirrors and try to read as large blocksize as * possible * - Go through all mirrors (including the failed mirror) sector-by-sector + * - Submit the write bio for repaired sectors if needed * - * Writeback does not happen here, it needs extra synchronization. + * Writeback for dev-replace does not happen here, it needs extra + * synchronization to ensure they all happen in correct order (for zoned + * devices). */ static void scrub_stripe_read_repair_worker(struct work_struct *work) { struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work); + struct scrub_ctx *sctx = stripe->sctx; struct btrfs_fs_info *fs_info = stripe->bg->fs_info; int num_copies = btrfs_num_copies(fs_info, stripe->bg->start, stripe->bg->length); @@ -1062,7 +1168,25 @@ static void scrub_stripe_read_repair_worker(struct work_struct *work) goto out; } out: - scrub_stripe_report_errors(stripe->sctx, stripe); + /* + * Submit the repaired sectors. For zoned case, we cannot do repair + * in-place, but queue the bg to be relocated. + */ + if (btrfs_is_zoned(fs_info)) { + if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) { + btrfs_repair_one_zone(fs_info, + sctx->stripes[0].bg->start); + } + } else if (!sctx->readonly) { + unsigned long repaired; + + bitmap_andnot(&repaired, &stripe->init_error_bitmap, + &stripe->error_bitmap, stripe->nr_sectors); + scrub_write_sectors(sctx, stripe, repaired, false); + wait_scrub_stripe_io(stripe); + } + + scrub_stripe_report_errors(sctx, stripe); set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state); wake_up(&stripe->repair_wait); } @@ -1085,108 +1209,6 @@ static void scrub_read_endio(struct btrfs_bio *bbio) } } -static void scrub_write_endio(struct btrfs_bio *bbio) -{ - struct scrub_stripe *stripe = bbio->private; - struct btrfs_fs_info *fs_info = stripe->bg->fs_info; - struct bio_vec *bvec; - int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio)); - u32 bio_size = 0; - int i; - - bio_for_each_bvec_all(bvec, &bbio->bio, i) - bio_size += bvec->bv_len; - - if (bbio->bio.bi_status) { - unsigned long flags; - - spin_lock_irqsave(&stripe->write_error_lock, flags); - bitmap_set(&stripe->write_error_bitmap, sector_nr, - bio_size >> fs_info->sectorsize_bits); - spin_unlock_irqrestore(&stripe->write_error_lock, flags); - } - bio_put(&bbio->bio); - - if (atomic_dec_and_test(&stripe->pending_io)) - wake_up(&stripe->io_wait); -} - -static void scrub_submit_write_bio(struct scrub_ctx *sctx, - struct scrub_stripe *stripe, - struct btrfs_bio *bbio, bool dev_replace) -{ - struct btrfs_fs_info *fs_info = sctx->fs_info; - u32 bio_len = bbio->bio.bi_iter.bi_size; - u32 bio_off = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT) - - stripe->logical; - - fill_writer_pointer_gap(sctx, stripe->physical + bio_off); - 
atomic_inc(&stripe->pending_io); - btrfs_submit_repair_write(bbio, stripe->mirror_num, dev_replace); - if (!btrfs_is_zoned(fs_info)) - return; - /* - * For zoned writeback, queue depth must be 1, thus we must wait for - * the write to finish before the next write. - */ - wait_scrub_stripe_io(stripe); - - /* - * And also need to update the write pointer if write finished - * successfully. - */ - if (!test_bit(bio_off >> fs_info->sectorsize_bits, - &stripe->write_error_bitmap)) - sctx->write_pointer += bio_len; -} - -/* - * Submit the write bio(s) for the sectors specified by @write_bitmap. - * - * Here we utilize btrfs_submit_repair_write(), which has some extra benefits: - * - * - Only needs logical bytenr and mirror_num - * Just like the scrub read path - * - * - Would only result in writes to the specified mirror - * Unlike the regular writeback path, which would write back to all stripes - * - * - Handle dev-replace and read-repair writeback differently - */ -static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *stripe, - unsigned long write_bitmap, bool dev_replace) -{ - struct btrfs_fs_info *fs_info = stripe->bg->fs_info; - struct btrfs_bio *bbio = NULL; - int sector_nr; - - for_each_set_bit(sector_nr, &write_bitmap, stripe->nr_sectors) { - struct page *page = scrub_stripe_get_page(stripe, sector_nr); - unsigned int pgoff = scrub_stripe_get_page_offset(stripe, sector_nr); - int ret; - - /* We should only writeback sectors covered by an extent. */ - ASSERT(test_bit(sector_nr, &stripe->extent_sector_bitmap)); - - /* Cannot merge with previous sector, submit the current one. */ - if (bbio && sector_nr && !test_bit(sector_nr - 1, &write_bitmap)) { - scrub_submit_write_bio(sctx, stripe, bbio, dev_replace); - bbio = NULL; - } - if (!bbio) { - bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_WRITE, - fs_info, scrub_write_endio, stripe); - bbio->bio.bi_iter.bi_sector = (stripe->logical + - (sector_nr << fs_info->sectorsize_bits)) >> - SECTOR_SHIFT; - } - ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff); - ASSERT(ret == fs_info->sectorsize); - } - if (bbio) - scrub_submit_write_bio(sctx, stripe, bbio, dev_replace); -} - /* * Throttling of IO submission, bandwidth-limit based, the timeslice is 1 * second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max. @@ -1701,32 +1723,6 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx) test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state)); } - /* - * Submit the repaired sectors. For zoned case, we cannot do repair - * in-place, but queue the bg to be relocated. - */ - if (btrfs_is_zoned(fs_info)) { - for (int i = 0; i < nr_stripes; i++) { - stripe = &sctx->stripes[i]; - - if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) { - btrfs_repair_one_zone(fs_info, - sctx->stripes[0].bg->start); - break; - } - } - } else if (!sctx->readonly) { - for (int i = 0; i < nr_stripes; i++) { - unsigned long repaired; - - stripe = &sctx->stripes[i]; - - bitmap_andnot(&repaired, &stripe->init_error_bitmap, - &stripe->error_bitmap, stripe->nr_sectors); - scrub_write_sectors(sctx, stripe, repaired, false); - } - } - /* Submit for dev-replace. 
*/ if (sctx->is_dev_replace) { /*

From patchwork Wed Jul 19 05:30:24 2023
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13318232
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 2/4] btrfs: scrub: don't go ordered workqueue for dev-replace
Date: Wed, 19 Jul 2023 13:30:24 +0800

The workqueue fs_info->scrub_workers is allocated as an ordered workqueue if the scrub is for a device-replace operation.

However scrub relies on multiple workers to do data csum verification, and we always submit several read requests in a row. Thus there is no need to use an ordered workqueue just for dev-replace.

We already have extra synchronization (the main thread always submits and waits for the dev-replace writes) to handle zoned devices.
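To illustrate the difference, here is a minimal sketch using the generic workqueue API (illustration only, not part of the patch; "wq" and "max_active" are placeholders):

	/*
	 * An ordered workqueue executes at most one work item at a time,
	 * in queueing order, effectively max_active == 1:
	 */
	wq = alloc_ordered_workqueue("btrfs-scrub", WQ_FREEZABLE | WQ_UNBOUND);

	/*
	 * A plain unbound workqueue lets up to max_active work items run
	 * concurrently, which is what the parallel csum verification wants:
	 */
	wq = alloc_workqueue("btrfs-scrub", WQ_FREEZABLE | WQ_UNBOUND, max_active);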
Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index a3aff0296ba4..9cb1106bdc11 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2741,8 +2741,7 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info) /* * get a reference count on fs_info->scrub_workers. start worker if necessary */ -static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info, - int is_dev_replace) +static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info) { struct workqueue_struct *scrub_workers = NULL; unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND; @@ -2752,10 +2751,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info, if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt)) return 0; - if (is_dev_replace) - scrub_workers = alloc_ordered_workqueue("btrfs-scrub", flags); - else - scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active); + scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active); if (!scrub_workers) return -ENOMEM; @@ -2807,7 +2803,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, if (IS_ERR(sctx)) return PTR_ERR(sctx); - ret = scrub_workers_get(fs_info, is_dev_replace); + ret = scrub_workers_get(fs_info); if (ret) goto out_free_ctx;

From patchwork Wed Jul 19 05:30:25 2023
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13318233
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 3/4] btrfs: scrub: use btrfs workqueue to synchronize the write back for dev-replace
Date: Wed, 19 Jul 2023 13:30:25 +0800
Message-ID: <010815c1c06275ab502647d6863514508a6c4f01.1689744163.git.wqu@suse.com>

Currently we submit the last group of stripes and wait for all stripes in flush_scrub_stripes(), then submit the writes for dev-replace.

This design mostly ensures the writeback to the dev-replace target is correct for zoned devices: on zoned devices we have to ensure not only that all writes are properly submitted and waited for (queue depth = 1), but also that all writes happen in bytenr order.

On the other hand, the repair-from-other-mirrors part can happen in any order, thus we cannot simply let the dev-replace writeback happen in scrub_stripe_read_repair_worker(). That's why we wait for all stripes in flush_scrub_stripes(), then submit the writes for dev-replace in a dedicated loop.

This patch improves the situation by utilizing btrfs_workqueue, as it provides an ordered execution part for the dev-replace writeback, while the async part can be utilized for read-repair.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/fs.h | 2 +-
 fs/btrfs/scrub.c | 193 ++++++++++++++++++++++++-----------------
 2 files changed, 100 insertions(+), 95 deletions(-)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h index 203d2a267828..670f114e8bca 100644 --- a/fs/btrfs/fs.h +++ b/fs/btrfs/fs.h @@ -642,7 +642,7 @@ struct btrfs_fs_info { * running. */ refcount_t scrub_workers_refcnt; - struct workqueue_struct *scrub_workers; + struct btrfs_workqueue *scrub_workers; struct btrfs_subpage_info *subpage_info; struct btrfs_discard_ctl discard_ctl; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 9cb1106bdc11..784fbe2341d4 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -97,10 +97,16 @@ enum scrub_stripe_flags { /* * Set for data stripes if it's triggered from P/Q stripe. - * During such scrub, we should not report errors in data stripes, nor - * update the accounting. + * This flag prevents us from doing the following work: + * - Writeback for dev-replace + * Since we're doing P/Q scrub, we just need the data stripe + * contents, and should not do dev-replace for them. + * + * - No csum error report + * - No reset + * Keep the contents and bitmaps, but cleanup the state. */ - SCRUB_STRIPE_FLAG_NO_REPORT, + SCRUB_STRIPE_FLAG_DATA_STRIPES_FOR_PQ, }; #define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE) @@ -182,7 +188,7 @@ struct scrub_stripe { */ u8 *csums; - struct work_struct work; + struct btrfs_work work; }; struct scrub_ctx { @@ -199,6 +205,8 @@ struct scrub_ctx { ktime_t throttle_deadline; u64 throttle_sent; + /* Should abort the dev-replace. Set when we have metadata errors.
*/ + bool replace_abort; int is_dev_replace; u64 write_pointer; @@ -871,7 +879,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, int nr_repaired_sectors = 0; int sector_nr; - if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state)) + if (test_bit(SCRUB_STRIPE_FLAG_DATA_STRIPES_FOR_PQ, &stripe->state)) return; /* @@ -1099,7 +1107,7 @@ static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *str * synchronization to ensure they all happen in correct order (for zoned * devices). */ -static void scrub_stripe_read_repair_worker(struct work_struct *work) +static void scrub_stripe_read_repair_worker(struct btrfs_work *work) { struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work); struct scrub_ctx *sctx = stripe->sctx; @@ -1202,11 +1210,8 @@ static void scrub_read_endio(struct btrfs_bio *bbio) bitmap_clear(&stripe->io_error_bitmap, 0, stripe->nr_sectors); } bio_put(&bbio->bio); - if (atomic_dec_and_test(&stripe->pending_io)) { + if (atomic_dec_and_test(&stripe->pending_io)) wake_up(&stripe->io_wait); - INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker); - queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work); - } } /* @@ -1629,6 +1634,65 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe) } } +static bool stripe_has_metadata_error(struct scrub_stripe *stripe) +{ + int i; + + for_each_set_bit(i, &stripe->error_bitmap, stripe->nr_sectors) { + if (stripe->sectors[i].is_metadata) { + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + + btrfs_err(fs_info, + "stripe %llu has unrepaired metadata sector at %llu", + stripe->logical, + stripe->logical + (i << fs_info->sectorsize_bits)); + return true; + } + } + return false; +} + +static void scrub_stripe_writeback_worker(struct btrfs_work *work) +{ + struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work); + struct scrub_ctx *sctx = stripe->sctx; + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + + if (sctx->is_dev_replace && + !test_bit(SCRUB_STRIPE_FLAG_DATA_STRIPES_FOR_PQ, &stripe->state)) { + unsigned long good; + + /* + * For dev-replace, if we know there is something wrong with + * metadata, we should immediately abort. + */ + if (stripe_has_metadata_error(stripe)) { + btrfs_err_rl(fs_info, + "stripe logical %llu has metadata error, can not continue dev-replace", + stripe->logical); + sctx->replace_abort = true; + return; + } + ASSERT(stripe->dev == fs_info->dev_replace.srcdev); + + bitmap_andnot(&good, &stripe->extent_sector_bitmap, + &stripe->error_bitmap, stripe->nr_sectors); + scrub_write_sectors(sctx, stripe, good, true); + } + wait_scrub_stripe_io(stripe); +} + +static void scrub_stripe_release_worker(struct btrfs_work *work) +{ + struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work); + + if (!test_bit(SCRUB_STRIPE_FLAG_DATA_STRIPES_FOR_PQ, &stripe->state)) + scrub_reset_stripe(stripe); + else + stripe->state = 0; + wake_up(&stripe->repair_wait); +} + static void scrub_submit_initial_read(struct scrub_ctx *sctx, struct scrub_stripe *stripe) { @@ -1636,13 +1700,17 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx, struct btrfs_bio *bbio; int mirror = stripe->mirror_num; - ASSERT(stripe->bg); - ASSERT(stripe->mirror_num > 0); - ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state)); + /* Not yet initialized.
*/ + if (!test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state)) + return; if (test_and_set_bit(SCRUB_STRIPE_FLAG_READ_SUBMITTED, &stripe->state)) return; + ASSERT(stripe->bg); + ASSERT(stripe->mirror_num > 0); + ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state)); + bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info, scrub_read_endio, stripe); @@ -1671,91 +1739,43 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx, mirror = calc_next_mirror(mirror, num_copies); } btrfs_submit_bio(bbio, mirror); -} - -static bool stripe_has_metadata_error(struct scrub_stripe *stripe) -{ - int i; - - for_each_set_bit(i, &stripe->error_bitmap, stripe->nr_sectors) { - if (stripe->sectors[i].is_metadata) { - struct btrfs_fs_info *fs_info = stripe->bg->fs_info; - - btrfs_err(fs_info, - "stripe %llu has unrepaired metadata sector at %llu", - stripe->logical, - stripe->logical + (i << fs_info->sectorsize_bits)); - return true; - } - } - return false; + btrfs_init_work(&stripe->work, scrub_stripe_read_repair_worker, + scrub_stripe_writeback_worker, + scrub_stripe_release_worker); + btrfs_queue_work(fs_info->scrub_workers, &stripe->work); } static int flush_scrub_stripes(struct scrub_ctx *sctx) { - struct btrfs_fs_info *fs_info = sctx->fs_info; struct scrub_stripe *stripe; const int nr_stripes = sctx->cur_stripe; + const int first_stripe = round_down(nr_stripes - 1, SCRUB_STRIPES_PER_GROUP); struct blk_plug plug; int ret = 0; if (!nr_stripes) return 0; - ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state)); - /* We should only have at most one group to submit. */ scrub_throttle_dev_io(sctx, sctx->stripes[0].dev, btrfs_stripe_nr_to_offset( nr_stripes % SCRUB_STRIPES_PER_GROUP ?: SCRUB_STRIPES_PER_GROUP)); blk_start_plug(&plug); - for (int i = 0; i < nr_stripes; i++) { + for (int i = first_stripe; i < nr_stripes; i++) { stripe = &sctx->stripes[i]; + scrub_submit_initial_read(sctx, stripe); } blk_finish_plug(&plug); + /* Wait for all stripes to finish. */ for (int i = 0; i < nr_stripes; i++) { stripe = &sctx->stripes[i]; - wait_event(stripe->repair_wait, - test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state)); + wait_event(stripe->repair_wait, stripe->state == 0); } - /* Submit for dev-replace. */ - if (sctx->is_dev_replace) { - /* - * For dev-replace, if we know there is something wrong with - * metadata, we should immedately abort. - */ - for (int i = 0; i < nr_stripes; i++) { - if (stripe_has_metadata_error(&sctx->stripes[i])) { - ret = -EIO; - goto out; - } - } - for (int i = 0; i < nr_stripes; i++) { - unsigned long good; - - stripe = &sctx->stripes[i]; - - ASSERT(stripe->dev == fs_info->dev_replace.srcdev); - - bitmap_andnot(&good, &stripe->extent_sector_bitmap, - &stripe->error_bitmap, stripe->nr_sectors); - scrub_write_sectors(sctx, stripe, good, true); - } - } - - /* Wait for the above writebacks to finish. 
*/ - for (int i = 0; i < nr_stripes; i++) { - stripe = &sctx->stripes[i]; - - wait_scrub_stripe_io(stripe); - scrub_reset_stripe(stripe); - } -out: sctx->cur_stripe = 0; return ret; } @@ -1844,7 +1864,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx, btrfs_stripe_nr_to_offset(rot); scrub_reset_stripe(stripe); - set_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state); + set_bit(SCRUB_STRIPE_FLAG_DATA_STRIPES_FOR_PQ, &stripe->state); ret = scrub_find_fill_first_stripe(bg, map->stripes[stripe_index].dev, physical, 1, full_stripe_start + btrfs_stripe_nr_to_offset(i), @@ -1877,35 +1897,19 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx, goto out; } - for (int i = 0; i < data_stripes; i++) { - stripe = &sctx->raid56_data_stripes[i]; - scrub_submit_initial_read(sctx, stripe); - } for (int i = 0; i < data_stripes; i++) { stripe = &sctx->raid56_data_stripes[i]; - wait_event(stripe->repair_wait, - test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state)); + scrub_submit_initial_read(sctx, stripe); } /* For now, no zoned support for RAID56. */ ASSERT(!btrfs_is_zoned(sctx->fs_info)); - /* Writeback for the repaired sectors. */ - for (int i = 0; i < data_stripes; i++) { - unsigned long repaired; - - stripe = &sctx->raid56_data_stripes[i]; - - bitmap_andnot(&repaired, &stripe->init_error_bitmap, - &stripe->error_bitmap, stripe->nr_sectors); - scrub_write_sectors(sctx, stripe, repaired, false); - } - /* Wait for the above writebacks to finish. */ for (int i = 0; i < data_stripes; i++) { stripe = &sctx->raid56_data_stripes[i]; - wait_scrub_stripe_io(stripe); + wait_event(stripe->repair_wait, stripe->state == 0); } /* @@ -2728,13 +2732,13 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info) { if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt, &fs_info->scrub_lock)) { - struct workqueue_struct *scrub_workers = fs_info->scrub_workers; + struct btrfs_workqueue *scrub_workers = fs_info->scrub_workers; fs_info->scrub_workers = NULL; mutex_unlock(&fs_info->scrub_lock); if (scrub_workers) - destroy_workqueue(scrub_workers); + btrfs_destroy_workqueue(scrub_workers); } } @@ -2743,7 +2747,7 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info) */ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info) { - struct workqueue_struct *scrub_workers = NULL; + struct btrfs_workqueue *scrub_workers = NULL; unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND; int max_active = fs_info->thread_pool_size; int ret = -ENOMEM; @@ -2751,7 +2755,8 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info) if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt)) return 0; - scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active); + scrub_workers = btrfs_alloc_workqueue(fs_info, "btrfs-scrub", flags, + max_active, 0); if (!scrub_workers) return -ENOMEM; @@ -2769,7 +2774,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info) ret = 0; - destroy_workqueue(scrub_workers); + btrfs_destroy_workqueue(scrub_workers); return ret; }

From patchwork Wed Jul 19 05:30:26 2023
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13318234
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 4/4] btrfs: scrub: make sctx->stripes[] array work as a ring buffer
Date: Wed, 19 Jul 2023 13:30:26 +0800
Message-ID: <04b653322f5d8b32c22083a4df9baf6d4c9097ea.1689744163.git.wqu@suse.com>

Since all scrub work (repair using other mirrors, writeback to the original mirror, writeback to the dev-replace target) happens inside a btrfs workqueue, there is no need to manually flush all the stripes inside queue_scrub_stripe().

We only need to wait for the current slot to finish its work, queue the stripe, and finally increase the cur_stripe number (modulo the array size).

This should allow scrub to increase its queue depth when submitting read requests.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 784fbe2341d4..e71ee8d2fbf7 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1770,7 +1770,7 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx) blk_finish_plug(&plug); /* Wait for all stripes to finish.
*/ - for (int i = 0; i < nr_stripes; i++) { + for (int i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) { stripe = &sctx->stripes[i]; wait_event(stripe->repair_wait, stripe->state == 0); @@ -1794,40 +1794,46 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group * int ret; /* - * We should always have at least one slot, when full the last one who - * queued a slot should handle the flush. + * We should always have at least one slot, as the stripes[] array + * is used as a ring buffer. */ ASSERT(sctx->cur_stripe < SCRUB_STRIPES_PER_SCTX); stripe = &sctx->stripes[sctx->cur_stripe]; - scrub_reset_stripe(stripe); + + /* Wait if the stripe is still under usage. */ + wait_event(stripe->repair_wait, stripe->state == 0); + ret = scrub_find_fill_first_stripe(bg, dev, physical, mirror_num, logical, length, stripe); /* Either >0 as no more extents or <0 for error. */ if (ret) return ret; + *found_stripe_start_ret = stripe->logical; - sctx->cur_stripe++; - - /* Last slot used, flush them all. */ - if (sctx->cur_stripe == SCRUB_STRIPES_PER_SCTX) - return flush_scrub_stripes(sctx); - - /* We have filled one group, submit them now. */ - if (sctx->cur_stripe % SCRUB_STRIPES_PER_GROUP == 0) { + /* Last slot in the group, submit the group. */ + if ((sctx->cur_stripe + 1) % SCRUB_STRIPES_PER_GROUP == 0) { + const int first_stripe = sctx->cur_stripe + 1 - + SCRUB_STRIPES_PER_GROUP; struct blk_plug plug; - scrub_throttle_dev_io(sctx, sctx->stripes[0].dev, - btrfs_stripe_nr_to_offset(SCRUB_STRIPES_PER_GROUP)); + scrub_throttle_dev_io(sctx, dev, + btrfs_stripe_nr_to_offset(SCRUB_STRIPES_PER_GROUP)); blk_start_plug(&plug); - for (int i = sctx->cur_stripe - SCRUB_STRIPES_PER_GROUP; - i < sctx->cur_stripe; i++) { + for (int i = first_stripe; + i < first_stripe + SCRUB_STRIPES_PER_GROUP; i++) { stripe = &sctx->stripes[i]; + + /* All stripes in the group should have been queued. */ + ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, + &stripe->state)); scrub_submit_initial_read(sctx, stripe); } blk_finish_plug(&plug); } + + sctx->cur_stripe = (sctx->cur_stripe + 1) % SCRUB_STRIPES_PER_SCTX; return 0; }
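To make the ring-buffer pattern this patch moves to concrete, here is a minimal self-contained sketch; all names (NR_SLOTS, SLOTS_PER_GROUP, queue_one(), submit_slot()) are hypothetical stand-ins, not the kernel code:

	/*
	 * Illustrative sketch of the sctx->stripes[] ring buffer described
	 * above. NR_SLOTS plays the role of SCRUB_STRIPES_PER_SCTX and
	 * SLOTS_PER_GROUP that of SCRUB_STRIPES_PER_GROUP.
	 */
	#define NR_SLOTS	128
	#define SLOTS_PER_GROUP	16

	struct slot {
		int state;	/* 0 means the slot is free for reuse */
	};

	static struct slot slots[NR_SLOTS];
	static int cur;

	static void submit_slot(struct slot *s)
	{
		s->state = 0;	/* pretend the I/O completed instantly */
	}

	static void queue_one(void)
	{
		struct slot *s = &slots[cur];

		/* Wait until the previous user of this slot has finished. */
		while (s->state != 0)
			;	/* the patch uses wait_event(stripe->repair_wait, ...) */

		s->state = 1;	/* fill the slot and mark it as queued */

		/* Once the last slot of a group is queued, submit the group. */
		if ((cur + 1) % SLOTS_PER_GROUP == 0) {
			int first = cur + 1 - SLOTS_PER_GROUP;

			for (int i = first; i <= cur; i++)
				submit_slot(&slots[i]);
		}

		/* Advance the cursor, wrapping so early slots get reused. */
		cur = (cur + 1) % NR_SLOTS;
	}

The key difference from the old code is that the wait happens per slot at queue time, instead of flushing the whole array once it is full.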