From patchwork Fri Mar 31 01:20:04 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195171 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E8063C6FD1D for ; Fri, 31 Mar 2023 01:20:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229602AbjCaBUj (ORCPT ); Thu, 30 Mar 2023 21:20:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40480 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229459AbjCaBUh (ORCPT ); Thu, 30 Mar 2023 21:20:37 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C609CCDCF for ; Thu, 30 Mar 2023 18:20:35 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 78CA61FE32; Fri, 31 Mar 2023 01:20:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225634; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=PhfVyTdjvVUPBITmJpPodiRuw6jQsvA/Nl7KFfSiT/o=; b=nuUaHYmCthPHQQ6CC7/eHpmFwOgWeW4YNRhl0ctve81K+DJ4gNZbJ6KRaT4OkxbJg8Tf+f Y/9pr0TQigtx8rMVCaRbEQgK4VDk9XEJAMdn4At2iBuXFp5EQRf5gCvBru3W/39TKTTyP2 v6Fh0nS3T3VjwAH2E2lfPwZZCIVEk84= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 42C3113451; Fri, 31 Mar 2023 01:20:33 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id 4MndBGE1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:33 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: Christoph Hellwig , Anand Jain , David Sterba Subject: [PATCH v8 01/12] btrfs: scrub: use dedicated super block verification function to scrub one super block Date: Fri, 31 Mar 2023 09:20:04 +0800 Message-Id: <3d1f229744c0d6edfa3e6f54599b207471913376.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org There is really no need to go through the super complex scrub_sectors() to just handle super blocks. This patch will introduce a dedicated function (less than 50 lines) to handle super block scrubing. This new function will introduce a behavior change, instead of using the complex but concurrent scrub_bio system, here we just go submit-and-wait. There is really not much sense to care the performance of super block scrubbing. It only has 3 super blocks at most, and they are all scattered around the devices already. Reviewed-by: Christoph Hellwig Reviewed-by: Anand Jain Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Reviewed-by: Johannes Thumshirn --- fs/btrfs/scrub.c | 60 +++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 52 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 3cdf73277e7e..ef4046a2572c 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -4243,18 +4243,62 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, return ret; } +static int scrub_one_super(struct scrub_ctx *sctx, struct btrfs_device *dev, + struct page *page, u64 physical, u64 generation) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + struct bio_vec bvec; + struct bio bio; + struct btrfs_super_block *sb = page_address(page); + int ret; + + bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_READ); + bio.bi_iter.bi_sector = physical >> SECTOR_SHIFT; + __bio_add_page(&bio, page, BTRFS_SUPER_INFO_SIZE, 0); + ret = submit_bio_wait(&bio); + bio_uninit(&bio); + + if (ret < 0) + return ret; + ret = btrfs_check_super_csum(fs_info, sb); + if (ret != 0) { + btrfs_err_rl(fs_info, + "super block at physical %llu devid %llu has bad csum", + physical, dev->devid); + return -EIO; + } + if (btrfs_super_generation(sb) != generation) { + btrfs_err_rl(fs_info, +"super block at physical %llu devid %llu has bad generation, has %llu expect %llu", + physical, dev->devid, + btrfs_super_generation(sb), generation); + return -EUCLEAN; + } + + return btrfs_validate_super(fs_info, sb, -1); +} + static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, struct btrfs_device *scrub_dev) { int i; u64 bytenr; u64 gen; - int ret; + int ret = 0; + struct page *page; struct btrfs_fs_info *fs_info = sctx->fs_info; if (BTRFS_FS_ERROR(fs_info)) return -EROFS; + page = alloc_page(GFP_KERNEL); + if (!page) { + spin_lock(&sctx->stat_lock); + sctx->stat.malloc_errors++; + spin_unlock(&sctx->stat_lock); + return -ENOMEM; + } + /* Seed devices of a new filesystem has their own generation. */ if (scrub_dev->fs_devices != fs_info->fs_devices) gen = scrub_dev->generation; @@ -4269,14 +4313,14 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (!btrfs_check_super_location(scrub_dev, bytenr)) continue; - ret = scrub_sectors(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, - scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, - NULL, bytenr); - if (ret) - return ret; + ret = scrub_one_super(sctx, scrub_dev, page, bytenr, gen); + if (ret) { + spin_lock(&sctx->stat_lock); + sctx->stat.super_errors++; + spin_unlock(&sctx->stat_lock); + } } - wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); - + __free_page(page); return 0; } From patchwork Fri Mar 31 01:20:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195173 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 053EBC77B6C for ; Fri, 31 Mar 2023 01:20:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229634AbjCaBUk (ORCPT ); Thu, 30 Mar 2023 21:20:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229529AbjCaBUi (ORCPT ); Thu, 30 Mar 2023 21:20:38 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B0602CDD1 for ; Thu, 30 Mar 2023 18:20:36 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 636A91FE33 for ; Fri, 31 Mar 2023 01:20:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225635; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5Yl8TKvauQNwCGQ8I4xWnrfQ9fyqWfJyJD/DHblGE7o=; b=FpkReN/v/ADrV5HGJaurgewN20J37YeV3yI3MX07DY7koKFsLE47PYUwGd/oMW/iGBTOqX kc2oJjJHOUG5pdvtB+HA5YJb5sJiCYhtUuu/IH7GFjDXytQrG3pc2nwU/6wVqJYoS58tBP iAx9v8zrKW1NGjdj/nrHsQ9PrwQKEUA= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id C611E13451 for ; Fri, 31 Mar 2023 01:20:34 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id aMfxJGI1JmSKWAAAMHmgww (envelope-from ) for ; Fri, 31 Mar 2023 01:20:34 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH v8 02/12] btrfs: introduce btrfs_bio::fs_info member Date: Fri, 31 Mar 2023 09:20:05 +0800 Message-Id: <9a96394d97db24ad695b7f89f2d28563ea9d6313.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Currently we're doing a lot of work for btrfs_bio: - Checksum verification for data read bios - Bio splits if it crosses stripe boundary - Read repair for data read bios However for the incoming scrub patches, we don't want those extra functionality at all, just pure logical + mirror -> physical mapping ability. Thus here we do the following changes: - Introduce btrfs_bio::fs_info This is for the new scrub specific btrfs_bio, which would not populate btrfs_bio::inode. Thus we need such new member to grab a fs_info This new member would always be populated. - Replace @inode argument with @fs_info for btrfs_bio_init() and its caller Since @inode is no longer a mandatory member, replace it with @fs_info, and let involved users populate @inode. - Skip checksum verification and geneartion if @bbio->inode is NULL - Add extra ASSERT()s To make sure: * bbio->inode is properly set for involved read repair path * if @file_offset is set, bbio->inode is also populated - Grab @fs_info from @bbio directly We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be NULL. This involves: * btrfs_simple_end_io() * should_async_write() * btrfs_wq_submit_bio() * btrfs_use_zone_append() Signed-off-by: Qu Wenruo Reviewed-by: Johannes Thumshirn --- fs/btrfs/bio.c | 42 +++++++++++++++++++++++++----------------- fs/btrfs/bio.h | 12 +++++++++--- fs/btrfs/compression.c | 3 ++- fs/btrfs/extent_io.c | 3 ++- fs/btrfs/inode.c | 13 +++++++++---- fs/btrfs/zoned.c | 4 ++-- 6 files changed, 49 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index cf09c6271edb..fb87f1c35e22 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -31,11 +31,11 @@ struct btrfs_failed_bio { * Initialize a btrfs_bio structure. This skips the embedded bio itself as it * is already initialized by the block layer. */ -void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_inode *inode, +void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info, btrfs_bio_end_io_t end_io, void *private) { memset(bbio, 0, offsetof(struct btrfs_bio, bio)); - bbio->inode = inode; + bbio->fs_info = fs_info; bbio->end_io = end_io; bbio->private = private; atomic_set(&bbio->pending_ios, 1); @@ -49,7 +49,7 @@ void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_inode *inode, * a mempool. */ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf, - struct btrfs_inode *inode, + struct btrfs_fs_info *fs_info, btrfs_bio_end_io_t end_io, void *private) { struct btrfs_bio *bbio; @@ -57,7 +57,7 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf, bio = bio_alloc_bioset(NULL, nr_vecs, opf, GFP_NOFS, &btrfs_bioset); bbio = btrfs_bio(bio); - btrfs_bio_init(bbio, inode, end_io, private); + btrfs_bio_init(bbio, fs_info, end_io, private); return bbio; } @@ -78,8 +78,8 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info, GFP_NOFS, &btrfs_clone_bioset); } bbio = btrfs_bio(bio); - btrfs_bio_init(bbio, orig_bbio->inode, NULL, orig_bbio); - + btrfs_bio_init(bbio, fs_info, NULL, orig_bbio); + bbio->inode = orig_bbio->inode; bbio->file_offset = orig_bbio->file_offset; if (!(orig_bbio->bio.bi_opf & REQ_BTRFS_ONE_ORDERED)) orig_bbio->file_offset += map_length; @@ -230,7 +230,8 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio, bio_add_page(repair_bio, bv->bv_page, bv->bv_len, bv->bv_offset); repair_bbio = btrfs_bio(repair_bio); - btrfs_bio_init(repair_bbio, failed_bbio->inode, NULL, fbio); + btrfs_bio_init(repair_bbio, fs_info, NULL, fbio); + repair_bbio->inode = failed_bbio->inode; repair_bbio->file_offset = failed_bbio->file_offset + bio_offset; mirror = next_repair_mirror(fbio, failed_bbio->mirror_num); @@ -249,6 +250,9 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de struct btrfs_failed_bio *fbio = NULL; u32 offset = 0; + /* Read-repair requires the inode field to be set by the submitter. */ + ASSERT(inode); + /* * Hand off repair bios to the repair code as there is no upper level * submitter for them. @@ -309,17 +313,17 @@ static void btrfs_end_bio_work(struct work_struct *work) struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work); /* Metadata reads are checked and repaired by the submitter. */ - if (bbio->bio.bi_opf & REQ_META) - bbio->end_io(bbio); - else + if (bbio->inode && !(bbio->bio.bi_opf & REQ_META)) btrfs_check_read_bio(bbio, bbio->bio.bi_private); + else + bbio->end_io(bbio); } static void btrfs_simple_end_io(struct bio *bio) { struct btrfs_bio *bbio = btrfs_bio(bio); struct btrfs_device *dev = bio->bi_private; - struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info; + struct btrfs_fs_info *fs_info = bbio->fs_info; btrfs_bio_counter_dec(fs_info); @@ -343,7 +347,8 @@ static void btrfs_raid56_end_io(struct bio *bio) btrfs_bio_counter_dec(bioc->fs_info); bbio->mirror_num = bioc->mirror_num; - if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META)) + if (bio_op(bio) == REQ_OP_READ && bbio->inode && + !(bbio->bio.bi_opf & REQ_META)) btrfs_check_read_bio(bbio, NULL); else btrfs_orig_bbio_end_io(bbio); @@ -565,7 +570,7 @@ static bool should_async_write(struct btrfs_bio *bbio) * in order. */ if (bbio->bio.bi_opf & REQ_META) { - struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info; + struct btrfs_fs_info *fs_info = bbio->fs_info; if (btrfs_is_zoned(fs_info)) return false; @@ -585,7 +590,7 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio, struct btrfs_io_context *bioc, struct btrfs_io_stripe *smap, int mirror_num) { - struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info; + struct btrfs_fs_info *fs_info = bbio->fs_info; struct async_submit_bio *async; async = kmalloc(sizeof(*async), GFP_NOFS); @@ -609,7 +614,7 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio, static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) { struct btrfs_inode *inode = bbio->inode; - struct btrfs_fs_info *fs_info = inode->root->fs_info; + struct btrfs_fs_info *fs_info = bbio->fs_info; struct btrfs_bio *orig_bbio = bbio; struct bio *bio = &bbio->bio; u64 logical = bio->bi_iter.bi_sector << 9; @@ -642,7 +647,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) * Save the iter for the end_io handler and preload the checksums for * data reads. */ - if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) { + if (bio_op(bio) == REQ_OP_READ && inode && !(bio->bi_opf & REQ_META)) { bbio->saved_iter = bio->bi_iter; ret = btrfs_lookup_bio_sums(bbio); if (ret) @@ -662,7 +667,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) * Csum items for reloc roots have already been cloned at this * point, so they are handled as part of the no-checksum case. */ - if (!(inode->flags & BTRFS_INODE_NODATASUM) && + if (inode && !(inode->flags & BTRFS_INODE_NODATASUM) && !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) && !btrfs_is_data_reloc_root(inode->root)) { if (should_async_write(bbio) && @@ -691,6 +696,9 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num) { + /* If bbio->inode is not populated, its file_offset must be 0. */ + ASSERT(bbio->inode || bbio->file_offset == 0); + while (!btrfs_submit_chunk(bbio, mirror_num)) ; } diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h index dbf125f6fa33..e54eaee81f8f 100644 --- a/fs/btrfs/bio.h +++ b/fs/btrfs/bio.h @@ -30,7 +30,10 @@ typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio); * passed to btrfs_submit_bio for mapping to the physical devices. */ struct btrfs_bio { - /* Inode and offset into it that this I/O operates on. */ + /* + * Inode and offset into it that this I/O operates on. + * Only set for data I/O. + */ struct btrfs_inode *inode; u64 file_offset; @@ -58,6 +61,9 @@ struct btrfs_bio { atomic_t pending_ios; struct work_struct end_io_work; + /* File system that this I/O operates on. */ + struct btrfs_fs_info *fs_info; + /* * This member must come last, bio_alloc_bioset will allocate enough * bytes for entire btrfs_bio but relies on bio being last. @@ -73,10 +79,10 @@ static inline struct btrfs_bio *btrfs_bio(struct bio *bio) int __init btrfs_bioset_init(void); void __cold btrfs_bioset_exit(void); -void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_inode *inode, +void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info, btrfs_bio_end_io_t end_io, void *private); struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf, - struct btrfs_inode *inode, + struct btrfs_fs_info *fs_info, btrfs_bio_end_io_t end_io, void *private); static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 44c4276741ce..50183e599213 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -69,7 +69,8 @@ static struct compressed_bio *alloc_compressed_bio(struct btrfs_inode *inode, bbio = btrfs_bio(bio_alloc_bioset(NULL, BTRFS_MAX_COMPRESSED_PAGES, op, GFP_NOFS, &btrfs_compressed_bioset)); - btrfs_bio_init(bbio, inode, end_io, NULL); + btrfs_bio_init(bbio, inode->root->fs_info, end_io, NULL); + bbio->inode = inode; bbio->file_offset = start; return to_compressed_bio(bbio); } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 1221f699ffc5..9ee39c588125 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -898,9 +898,10 @@ static void alloc_new_bio(struct btrfs_inode *inode, struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_bio *bbio; - bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, inode, + bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, fs_info, bio_ctrl->end_io_func, NULL); bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT; + bbio->inode = inode; bbio->file_offset = file_offset; bio_ctrl->bbio = bbio; bio_ctrl->len_to_oe_boundary = U32_MAX; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 76d93b9e94a9..da540fd5b4b7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7754,7 +7754,9 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, container_of(bbio, struct btrfs_dio_private, bbio); struct btrfs_dio_data *dio_data = iter->private; - btrfs_bio_init(bbio, BTRFS_I(iter->inode), btrfs_dio_end_io, bio->bi_private); + btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info, + btrfs_dio_end_io, bio->bi_private); + bbio->inode = BTRFS_I(iter->inode); bbio->file_offset = file_offset; dip->file_offset = file_offset; @@ -9924,6 +9926,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, u64 file_offset, u64 disk_bytenr, u64 disk_io_size, struct page **pages) { + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_encoded_read_private priv = { .pending = ATOMIC_INIT(1), }; @@ -9932,9 +9935,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, init_waitqueue_head(&priv.wait); - bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, inode, - btrfs_encoded_read_endio, &priv); + bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info, + btrfs_encoded_read_endio, &priv); bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT; + bbio->inode = inode; do { size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE); @@ -9943,9 +9947,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, atomic_inc(&priv.pending); btrfs_submit_bio(bbio, 0); - bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, inode, + bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info, btrfs_encoded_read_endio, &priv); bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT; + bbio->inode = inode; continue; } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 45d04092f2f8..a9b32ba6b2ce 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1640,14 +1640,14 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio) { u64 start = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT); struct btrfs_inode *inode = bbio->inode; - struct btrfs_fs_info *fs_info = inode->root->fs_info; + struct btrfs_fs_info *fs_info = bbio->fs_info; struct btrfs_block_group *cache; bool ret = false; if (!btrfs_is_zoned(fs_info)) return false; - if (!is_data_inode(&inode->vfs_inode)) + if (!inode || !is_data_inode(&inode->vfs_inode)) return false; if (btrfs_op(&bbio->bio) != BTRFS_MAP_WRITE) From patchwork Fri Mar 31 01:20:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195172 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 82567C77B6D for ; Fri, 31 Mar 2023 01:20:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229639AbjCaBUl (ORCPT ); Thu, 30 Mar 2023 21:20:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40506 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229603AbjCaBUj (ORCPT ); Thu, 30 Mar 2023 21:20:39 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C9584CDFD for ; Thu, 30 Mar 2023 18:20:37 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 895521FE35; Fri, 31 Mar 2023 01:20:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225636; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Hrk36/oz57F8SHz/NUGlyf3+iTatsSN7XtUduPNctCQ=; b=Y83dDMrE9xxOyxw3dMyDfk/uZas5OWF9fbT2qdN6oDwWXd5v1SHibc933ym4Sy5oreVmEf uKHL2kOM8cV2Crw9PP/f1mTze2KmH52eQFnqjoq0FH+8ZAiai8BFN2uuz1jbV+3Hiddv3E O4OV688E6vfFEHv+kwO/wPcOfNGB+QI= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id BC8D213451; Fri, 31 Mar 2023 01:20:35 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id gD+WImM1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:35 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 03/12] btrfs: introduce a new helper to submit write bio for repair Date: Fri, 31 Mar 2023 09:20:06 +0800 Message-Id: <819f09deb98ecda08057cb74b33e1083bf4568a8.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Both scrub and read-repair is utilizing a special repair writes that: - Only writes back to a single device Even for read-repair on RAID56, we only update the corrupted data stripe itself, not triggering the full RMW path. - Requires a valid @mirror_num For RAID56 case, only @mirror_num == 1 is valid. For non-RAID56 cases, we need @mirror_num to locate our stripe. - No data csum generation needed Those two call sites still have some difference though: - Read-repair goes plain bio It doesn't need a full btrfs_bio, and goes submit_bio_wait(). - New scrub repair would go btrfs_bio To simplify both read and write path. So here this patch would: - Introduce a common helper, btrfs_map_repair_block() Due to the single device nature, we can use an on-stack btrfs_io_stripe to pass device and its physical bytenr. - Introduce a new interface, btrfs_submit_repair_bio(), for later scrub code This is for the incoming scrub code. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/bio.c | 96 +++++++++++++++++++++++++--------------------- fs/btrfs/bio.h | 2 + fs/btrfs/raid56.h | 5 +++ fs/btrfs/volumes.c | 73 +++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 3 ++ 5 files changed, 135 insertions(+), 44 deletions(-) diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index fb87f1c35e22..430acf7142ef 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -717,12 +717,9 @@ int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, u64 length, u64 logical, struct page *page, unsigned int pg_offset, int mirror_num) { - struct btrfs_device *dev; + struct btrfs_io_stripe smap = { 0 }; struct bio_vec bvec; struct bio bio; - u64 map_length = 0; - u64 sector; - struct btrfs_io_context *bioc = NULL; int ret = 0; ASSERT(!(fs_info->sb->s_flags & SB_RDONLY)); @@ -731,68 +728,38 @@ int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, if (btrfs_repair_one_zone(fs_info, logical)) return 0; - map_length = length; - /* * Avoid races with device replace and make sure our bioc has devices * associated to its stripes that don't go away while we are doing the * read repair operation. */ btrfs_bio_counter_inc_blocked(fs_info); - if (btrfs_is_parity_mirror(fs_info, logical, length)) { - /* - * Note that we don't use BTRFS_MAP_WRITE because it's supposed - * to update all raid stripes, but here we just want to correct - * bad stripe, thus BTRFS_MAP_READ is abused to only get the bad - * stripe's dev and sector. - */ - ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, - &map_length, &bioc, 0); - if (ret) - goto out_counter_dec; - ASSERT(bioc->mirror_num == 1); - } else { - ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical, - &map_length, &bioc, mirror_num); - if (ret) - goto out_counter_dec; - /* - * This happens when dev-replace is also running, and the - * mirror_num indicates the dev-replace target. - * - * In this case, we don't need to do anything, as the read - * error just means the replace progress hasn't reached our - * read range, and later replace routine would handle it well. - */ - if (mirror_num != bioc->mirror_num) - goto out_counter_dec; - } + ret = btrfs_map_repair_block(fs_info, &smap, logical, length, mirror_num); + if (ret < 0) + goto out_counter_dec; - sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9; - dev = bioc->stripes[bioc->mirror_num - 1].dev; - btrfs_put_bioc(bioc); - - if (!dev || !dev->bdev || - !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) { + if (!smap.dev->bdev || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, &smap.dev->dev_state)) { ret = -EIO; goto out_counter_dec; } - bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC); - bio.bi_iter.bi_sector = sector; + bio_init(&bio, smap.dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC); + bio.bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT; __bio_add_page(&bio, page, length, pg_offset); btrfsic_check_bio(&bio); ret = submit_bio_wait(&bio); if (ret) { /* try to remap that extent elsewhere? */ - btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); + btrfs_dev_stat_inc_and_print(smap.dev, BTRFS_DEV_STAT_WRITE_ERRS); goto out_bio_uninit; } btrfs_info_rl_in_rcu(fs_info, "read error corrected: ino %llu off %llu (dev %s sector %llu)", - ino, start, btrfs_dev_name(dev), sector); + ino, start, btrfs_dev_name(smap.dev), + smap.physical >> SECTOR_SHIFT); ret = 0; out_bio_uninit: @@ -802,6 +769,47 @@ int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, return ret; } +/* + * Submit a btrfs_bio based repair write. + * + * If @dev_replace is true, the write would be submitted to dev-replace target. + */ +void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, + bool dev_replace) +{ + struct btrfs_fs_info *fs_info = bbio->fs_info; + u64 logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; + u64 length = bbio->bio.bi_iter.bi_size; + struct btrfs_io_stripe smap = { 0 }; + int ret; + + ASSERT(fs_info); + ASSERT(mirror_num > 0); + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE); + ASSERT(!bbio->inode); + + btrfs_bio_counter_inc_blocked(fs_info); + ret = btrfs_map_repair_block(fs_info, &smap, logical, length, mirror_num); + if (ret < 0) + goto fail; + + if (dev_replace) { + if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE && + btrfs_is_zoned(fs_info)) { + bbio->bio.bi_opf &= ~REQ_OP_WRITE; + bbio->bio.bi_opf |= REQ_OP_ZONE_APPEND; + } + ASSERT(smap.dev == fs_info->dev_replace.srcdev); + smap.dev = fs_info->dev_replace.tgtdev; + } + __btrfs_submit_bio(&bbio->bio, NULL, &smap, mirror_num); + return; + +fail: + btrfs_bio_counter_dec(fs_info); + btrfs_bio_end_io(bbio, errno_to_blk_status(ret)); +} + int __init btrfs_bioset_init(void) { if (bioset_init(&btrfs_bioset, BIO_POOL_SIZE, diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h index e54eaee81f8f..b158c920cc58 100644 --- a/fs/btrfs/bio.h +++ b/fs/btrfs/bio.h @@ -95,6 +95,8 @@ static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status) #define REQ_BTRFS_ONE_ORDERED REQ_DRV void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num); +void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, + bool dev_replace); int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, u64 length, u64 logical, struct page *page, unsigned int pg_offset, int mirror_num); diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index df0e0abdeb1f..6583c225b1bd 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -170,6 +170,11 @@ static inline int nr_data_stripes(const struct map_lookup *map) return map->num_stripes - btrfs_nr_parity_stripes(map->type); } +static inline int nr_bioc_data_stripes(const struct btrfs_io_context *bioc) +{ + return bioc->num_stripes - btrfs_nr_parity_stripes(bioc->map_type); +} + #define RAID5_P_STRIPE ((u64)-2) #define RAID6_Q_STRIPE ((u64)-1) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 93bc45001e68..54bee59c1ce8 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -8004,3 +8004,76 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical) return true; } + +static void map_raid56_repair_block(struct btrfs_io_context *bioc, + struct btrfs_io_stripe *smap, + u64 logical) +{ + int data_stripes = nr_bioc_data_stripes(bioc); + int i; + + for (i = 0; i < data_stripes; i++) { + u64 stripe_start = bioc->full_stripe_logical + + (i << BTRFS_STRIPE_LEN_SHIFT); + + if (logical >= stripe_start && + logical < stripe_start + BTRFS_STRIPE_LEN) + break; + } + ASSERT(i < data_stripes); + smap->dev = bioc->stripes[i].dev; + smap->physical = bioc->stripes[i].physical + + ((logical - bioc->full_stripe_logical) & + BTRFS_STRIPE_LEN_MASK); +} + +/* + * Map a repair write into a single device. + * + * A repair write is triggered by read time repair or scrub, which would only + * update the contents of a single device. + * Not update any other mirrors nor go through RMW path. + * + * Callers should ensure: + * - Call btrfs_bio_counter_inc_blocked() first + * - The range does not cross stripe boundary + * - Has a valid @mirror_num passed in. + */ +int btrfs_map_repair_block(struct btrfs_fs_info *fs_info, + struct btrfs_io_stripe *smap, u64 logical, + u32 length, int mirror_num) +{ + struct btrfs_io_context *bioc = NULL; + u64 map_length = length; + int ret; + int mirror_ret = mirror_num; + + ASSERT(mirror_num > 0); + + ret = __btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical, &map_length, + &bioc, smap, &mirror_ret, true); + if (ret < 0) + return ret; + + /* The map range should not cross stripe boundary. */ + ASSERT(map_length >= length); + + /* Already mapped to single stripe. */ + if (!bioc) + goto out; + + /* Map the RAID56 multi-stripe writes to a single one. */ + if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) { + map_raid56_repair_block(bioc, smap, logical); + goto out; + } + + ASSERT(mirror_num <= bioc->num_stripes); + smap->dev = bioc->stripes[mirror_num - 1].dev; + smap->physical = bioc->stripes[mirror_num - 1].physical; +out: + btrfs_put_bioc(bioc); + ASSERT(smap->dev); + return 0; +} + diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 650e131d079e..bf47a1a70813 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -587,6 +587,9 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op, struct btrfs_io_context **bioc_ret, struct btrfs_io_stripe *smap, int *mirror_num_ret, int need_raid_map); +int btrfs_map_repair_block(struct btrfs_fs_info *fs_info, + struct btrfs_io_stripe *smap, u64 logical, + u32 length, int mirror_num); struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info, u64 logical, u64 *length_ret, u32 *num_stripes); From patchwork Fri Mar 31 01:20:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195174 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7F777C761AF for ; Fri, 31 Mar 2023 01:20:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229650AbjCaBUm (ORCPT ); Thu, 30 Mar 2023 21:20:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40526 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229614AbjCaBUk (ORCPT ); Thu, 30 Mar 2023 21:20:40 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E56ECDC8 for ; Thu, 30 Mar 2023 18:20:38 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id AFAAE21A64; Fri, 31 Mar 2023 01:20:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225637; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=+oNh12unZIHVnv23cdGBUh3px2olEjiGOraLDPHKZEg=; b=FjbmiNmql9HgC/+nS+z5wP5Xs85g1zBTkyFrj410JgIaxGsUsFXL8zqEfdA2v5Qw45PDNR nAayVOM9oH+q9USVGkrrcMYbGCKM59rgE1edFV28q4odbqEyAhoxLbzLQWWJbL/zNBNaAu SF8V/zNz9QHLFNZeDiA0T4mFzvMaf2M= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id E146413451; Fri, 31 Mar 2023 01:20:36 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id aBCnK2Q1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:36 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 04/12] btrfs: scrub: introduce the structure for new BTRFS_STRIPE_LEN based interface Date: Fri, 31 Mar 2023 09:20:07 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org This patch introduces the following structures: - scrub_sector_verification Contains all the needed info to verify one sector (data or metadata). - scrub_stripe Contains all needed members (mostly bitmap based) to scrub one stripe (with a length of BTRFS_STRIPE_LEN). The basic idea is, we keep the existing per-device scrub behavior, but merge all the scrub_bio/scrub_bio into one generic structure, and read the full BTRFS_STRIPE_LEN stripe in the first try. This means we will read some sectors which is not scrub target, but that's fine. At dev-replace time we only writeback the utilized and good sectors, and for read-repair we only writeback the repaired sectors. With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio formshaping would be gone. Although to get the same performance of the old scrub behavior, we would need to submit the initial read for two stripes at once. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/scrub.h | 8 +++ 2 files changed, 150 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index ef4046a2572c..c67c98dcd942 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -70,6 +70,94 @@ struct scrub_ctx; */ #define BTRFS_MAX_MIRRORS (4 + 1) +/* Represent one sector and its needed info to verify the content. */ +struct scrub_sector_verification { + bool is_metadata; + + union { + /* + * Csum pointer for data csum verification. + * Should point to a sector csum inside scrub_stripe::csums. + * + * NULL if this data sector has no csum. + */ + u8 *csum; + + /* + * Extra info for metadata verification. + * All sectors inside a tree block shares the same + * geneartion. + */ + u64 generation; + }; +}; + +enum scrub_stripe_flags { + /* Set when @mirror_num, @dev, @physical and @logical is set. */ + SCRUB_STRIPE_FLAG_INITIALIZED, + + /* Set when the read-repair is finished. */ + SCRUB_STRIPE_FLAG_REPAIR_DONE, +}; + +#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE) +/* + * Represent one continuous range with a length of BTRFS_STRIPE_LEN. + */ +struct scrub_stripe { + struct btrfs_block_group *bg; + + struct page *pages[SCRUB_STRIPE_PAGES]; + struct scrub_sector_verification *sectors; + + struct btrfs_device *dev; + u64 logical; + u64 physical; + + u16 mirror_num; + + /* Should be BTRFS_STRIPE_LEN / sectorsize. */ + u16 nr_sectors; + + atomic_t pending_io; + wait_queue_head_t io_wait; + + /* + * Indicates the states of the stripe. + * Bits are defined in scrub_stripe_flags enum. + */ + unsigned long state; + + /* Indicates which sectors are covered by extent items. */ + unsigned long extent_sector_bitmap; + + /* + * The errors hit during the initial read of the stripe. + * + * Would be utilized for error reporting and repair. + */ + unsigned long init_error_bitmap; + + /* + * The following error bitmaps are all for the current status. + * Every time we submit a new read, those bitmaps may be updated. + * + * error_bitmap = io_error_bitmap | csum_error_bitmap | meta_error_bitmap; + * + * IO and csum errors can happen for both metadata and data. + */ + unsigned long error_bitmap; + unsigned long io_error_bitmap; + unsigned long csum_error_bitmap; + unsigned long meta_error_bitmap; + + /* + * Checksum for the whole stripe if this stripe is inside a data block + * group. + */ + u8 *csums; +}; + struct scrub_recover { refcount_t refs; struct btrfs_io_context *bioc; @@ -266,6 +354,60 @@ static void detach_scrub_page_private(struct page *page) #endif } +static void release_scrub_stripe(struct scrub_stripe *stripe) +{ + if (!stripe) + return; + + for (int i = 0; i < SCRUB_STRIPE_PAGES; i++) { + if (stripe->pages[i]) + __free_page(stripe->pages[i]); + stripe->pages[i] = NULL; + } + kfree(stripe->sectors); + kfree(stripe->csums); + stripe->sectors = NULL; + stripe->csums = NULL; + stripe->state = 0; +} + +int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe) +{ + int ret; + + memset(stripe, 0, sizeof(*stripe)); + + stripe->nr_sectors = BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits; + stripe->state = 0; + + init_waitqueue_head(&stripe->io_wait); + atomic_set(&stripe->pending_io, 0); + + ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages); + if (ret < 0) + goto error; + + stripe->sectors = kcalloc(stripe->nr_sectors, + sizeof(struct scrub_sector_verification), + GFP_KERNEL); + if (!stripe->sectors) + goto error; + + stripe->csums = kzalloc((BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits) * + fs_info->csum_size, GFP_KERNEL); + if (!stripe->csums) + goto error; + return 0; +error: + release_scrub_stripe(stripe); + return -ENOMEM; +} + +void wait_scrub_stripe_io(struct scrub_stripe *stripe) +{ + wait_event(stripe->io_wait, atomic_read(&stripe->pending_io) == 0); +} + static struct scrub_block *alloc_scrub_block(struct scrub_ctx *sctx, struct btrfs_device *dev, u64 logical, u64 physical, diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index 7639103ebf9d..e04764f8bb7e 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -13,4 +13,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_device *dev); int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid, struct btrfs_scrub_progress *progress); +/* + * The following functions are temporary exports to avoid warning on unused + * static functions. + */ +struct scrub_stripe; +int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe); +void wait_scrub_stripe_io(struct scrub_stripe *stripe); + #endif From patchwork Fri Mar 31 01:20:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195175 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 34AE1C7619A for ; Fri, 31 Mar 2023 01:20:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229614AbjCaBUn (ORCPT ); Thu, 30 Mar 2023 21:20:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40564 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229529AbjCaBUl (ORCPT ); Thu, 30 Mar 2023 21:20:41 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2BD6DCDCF for ; Thu, 30 Mar 2023 18:20:40 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id D08BD1FE1E; Fri, 31 Mar 2023 01:20:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225638; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=iFXgVFHiAEYTs/1VBpjpGPWGtc7UNKrVAyO9Yqv/fb8=; b=pcQyV8falRymTOHStJR+nZSQLe9jp5d06H8niX9OltI4oUBNCAv/mXbuGBd/H9MNgm+ln2 mNO4PJbLJIHT8qbcGz1TEYyWbVpi8hdGUsxvsc3ov7BAJ5Ve4AaluVCQXLB+WW16Hccs4u HDAdgBNEARTal1MNpr9N3/ts68cVNQI= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 1028A13451; Fri, 31 Mar 2023 01:20:37 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id YIs6NGU1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:37 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 05/12] btrfs: scrub: introduce a helper to find and fill the sector info for a scrub_stripe Date: Fri, 31 Mar 2023 09:20:08 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info fulfilled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid wanring on unused static function. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/file-item.c | 9 ++- fs/btrfs/file-item.h | 3 +- fs/btrfs/raid56.c | 2 +- fs/btrfs/scrub.c | 149 +++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/scrub.h | 4 ++ 5 files changed, 164 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 1ce306cea690..018c711a0bc8 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -597,7 +597,8 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end, * in is large enough to contain all csums. */ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end, - u8 *csum_buf, unsigned long *csum_bitmap) + u8 *csum_buf, unsigned long *csum_bitmap, + bool search_commit) { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_key key; @@ -614,6 +615,12 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end, if (!path) return -ENOMEM; + if (search_commit) { + path->skip_locking = 1; + path->reada = READA_FORWARD; + path->search_commit_root = 1; + } + key.objectid = BTRFS_EXTENT_CSUM_OBJECTID; key.type = BTRFS_EXTENT_CSUM_KEY; key.offset = start; diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h index cd7f2ae515c0..6be8725cd574 100644 --- a/fs/btrfs/file-item.h +++ b/fs/btrfs/file-item.h @@ -57,7 +57,8 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit, bool nowait); int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end, - u8 *csum_buf, unsigned long *csum_bitmap); + u8 *csum_buf, unsigned long *csum_bitmap, + bool search_commit); void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode, const struct btrfs_path *path, struct btrfs_file_extent_item *fi, diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6cbbaa6c06ca..a64b40000d12 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2113,7 +2113,7 @@ static void fill_data_csums(struct btrfs_raid_bio *rbio) } ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1, - rbio->csum_buf, rbio->csum_bitmap); + rbio->csum_buf, rbio->csum_bitmap, false); if (ret < 0) goto error; if (bitmap_empty(rbio->csum_bitmap, len >> fs_info->sectorsize_bits)) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index c67c98dcd942..00f43e3a2471 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3642,6 +3642,155 @@ static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical, return ret; } +static void fill_one_extent_info(struct btrfs_fs_info *fs_info, + struct scrub_stripe *stripe, + u64 extent_start, u64 extent_len, + u64 extent_flags, u64 extent_gen) +{ + u64 cur_logical; + + for (cur_logical = max(stripe->logical, extent_start); + cur_logical < min(stripe->logical + BTRFS_STRIPE_LEN, + extent_start + extent_len); + cur_logical += fs_info->sectorsize) { + const int nr_sector = (cur_logical - stripe->logical) >> + fs_info->sectorsize_bits; + struct scrub_sector_verification *sector = + &stripe->sectors[nr_sector]; + + set_bit(nr_sector, &stripe->extent_sector_bitmap); + if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) { + sector->is_metadata = true; + sector->generation = extent_gen; + } + } +} + +static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe) +{ + stripe->extent_sector_bitmap = 0; + stripe->init_error_bitmap = 0; + stripe->error_bitmap = 0; + stripe->io_error_bitmap = 0; + stripe->csum_error_bitmap = 0; + stripe->meta_error_bitmap = 0; +} + +/* + * Locate one stripe which has at least one extent in its range. + * + * Return 0 if found such stripe, and store its info into @stripe. + * Return >0 if there is no such stripe in the specified range. + * Return <0 for error. + */ +int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, + struct btrfs_device *dev, u64 physical, + int mirror_num, u64 logical_start, + u32 logical_len, struct scrub_stripe *stripe) +{ + struct btrfs_fs_info *fs_info = bg->fs_info; + struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start); + struct btrfs_root *csum_root = btrfs_csum_root(fs_info, bg->start); + const u64 logical_end = logical_start + logical_len; + struct btrfs_path path = { 0 }; + u64 cur_logical = logical_start; + u64 stripe_end; + u64 extent_start; + u64 extent_len; + u64 extent_flags; + u64 extent_gen; + int ret; + + memset(stripe->sectors, 0, sizeof(struct scrub_sector_verification) * + stripe->nr_sectors); + scrub_stripe_reset_bitmaps(stripe); + + /* The range must be inside the bg */ + ASSERT(logical_start >= bg->start && logical_end <= bg->start + bg->length); + + path.search_commit_root = 1; + path.skip_locking = 1; + + ret = find_first_extent_item(extent_root, &path, logical_start, + logical_len); + /* Either error or not found. */ + if (ret) + goto out; + get_extent_info(&path, &extent_start, &extent_len, + &extent_flags, &extent_gen); + cur_logical = max(extent_start, cur_logical); + + /* + * Round down to stripe boundary. + * + * The extra calculation against bg->start is to handle block groups + * whose logical bytenr is not BTRFS_STRIPE_LEN aligned. + */ + stripe->logical = round_down(cur_logical - bg->start, BTRFS_STRIPE_LEN) + + bg->start; + stripe->physical = physical + stripe->logical - logical_start; + stripe->dev = dev; + stripe->bg = bg; + stripe->mirror_num = mirror_num; + stripe_end = stripe->logical + BTRFS_STRIPE_LEN - 1; + + /* Fill the first extent info into stripe->sectors[] array. */ + fill_one_extent_info(fs_info, stripe, extent_start, extent_len, + extent_flags, extent_gen); + cur_logical = extent_start + extent_len; + + /* Fill the extent info for the remaining sectors. */ + while (cur_logical <= stripe_end) { + ret = find_first_extent_item(extent_root, &path, cur_logical, + stripe_end - cur_logical + 1); + if (ret < 0) + goto out; + if (ret > 0) { + ret = 0; + break; + } + get_extent_info(&path, &extent_start, &extent_len, + &extent_flags, &extent_gen); + fill_one_extent_info(fs_info, stripe, extent_start, extent_len, + extent_flags, extent_gen); + cur_logical = extent_start + extent_len; + } + + /* Now fill the data csum. */ + if (bg->flags & BTRFS_BLOCK_GROUP_DATA) { + int sector_nr; + unsigned long csum_bitmap = 0; + + /* Csum space should have already been allocated. */ + ASSERT(stripe->csums); + + /* + * Our csum bitmap should be large enough, as BTRFS_STRIPE_LEN + * should contain at most 16 sectors. + */ + ASSERT(BITS_PER_LONG >= + BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits); + + ret = btrfs_lookup_csums_bitmap(csum_root, stripe->logical, + stripe_end, stripe->csums, + &csum_bitmap, true); + if (ret < 0) + goto out; + if (ret > 0) + ret = 0; + + for_each_set_bit(sector_nr, &csum_bitmap, stripe->nr_sectors) { + stripe->sectors[sector_nr].csum = stripe->csums + + sector_nr * fs_info->csum_size; + } + } + set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state); +out: + btrfs_release_path(&path); + return ret; +} + + /* * Scrub one range which can only has simple mirror based profile. * (Including all range in SINGLE/DUP/RAID1/RAID1C*, and each stripe in diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index e04764f8bb7e..27019d86b539 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -20,5 +20,9 @@ int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid, struct scrub_stripe; int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe); void wait_scrub_stripe_io(struct scrub_stripe *stripe); +int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, + struct btrfs_device *dev, u64 physical, + int mirror_num, u64 logical_start, + u32 logical_len, struct scrub_stripe *stripe); #endif From patchwork Fri Mar 31 01:20:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195176 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B96AC6FD1D for ; Fri, 31 Mar 2023 01:20:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229620AbjCaBUo (ORCPT ); Thu, 30 Mar 2023 21:20:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40574 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229646AbjCaBUm (ORCPT ); Thu, 30 Mar 2023 21:20:42 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3E80D1207F for ; Thu, 30 Mar 2023 18:20:41 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id F281821A86; Fri, 31 Mar 2023 01:20:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225639; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QtbCW/HIDgkew2jFQM+Jlzcn1B/zQ3X3wXCrRf/y9fU=; b=uj2ouYpobYmntv2UFBE487nMrJZovMFjqoqqKs92/uGJJStkV2aLb1yNAA9JSBCf2KP5EO eHd3BDn34qYvqo4knHad5GfVwtwLyyNKZYlSIMHM71ej0gRS2NKuMHyJ9eE+4lDJFkqHHN WUUGX44TF8vgmG4mR6ULlDg6ETm9Y3c= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 3300113451; Fri, 31 Mar 2023 01:20:39 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id WBoPAWc1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:39 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 06/12] btrfs: scrub: introduce a helper to verify one metadata Date: Fri, 31 Mar 2023 09:20:09 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper, scrub_verify_one_metadata(), is almost the same as scrub_checksum_tree_block(). The difference is in how we grab the pages. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/scrub.h | 1 + 2 files changed, 117 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 00f43e3a2471..33ff0dc6a785 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2159,6 +2159,122 @@ static int scrub_checksum_data(struct scrub_block *sblock) return sblock->checksum_error; } +static struct page *scrub_stripe_get_page(struct scrub_stripe *stripe, + int sector_nr) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + int page_index = (sector_nr << fs_info->sectorsize_bits) >> PAGE_SHIFT; + + return stripe->pages[page_index]; +} + +static unsigned int scrub_stripe_get_page_offset(struct scrub_stripe *stripe, + int sector_nr) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + + return offset_in_page(sector_nr << fs_info->sectorsize_bits); +} + +void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + const unsigned int sectors_per_tree = fs_info->nodesize >> + fs_info->sectorsize_bits; + const u64 logical = stripe->logical + (sector_nr << fs_info->sectorsize_bits); + const struct page *first_page = scrub_stripe_get_page(stripe, sector_nr); + const unsigned int first_off = scrub_stripe_get_page_offset(stripe, sector_nr); + SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); + u8 on_disk_csum[BTRFS_CSUM_SIZE]; + u8 calculated_csum[BTRFS_CSUM_SIZE]; + struct btrfs_header *header; + + /* + * Here we don't have a good way to attach the pages (and subpages) + * to a dummy extent buffer, thus we have to directly grab the members + * from pages. + */ + header = (struct btrfs_header *)(page_address(first_page) + first_off); + memcpy(on_disk_csum, header->csum, fs_info->csum_size); + + if (logical != btrfs_stack_header_bytenr(header)) { + bitmap_set(&stripe->csum_error_bitmap, sector_nr, + sectors_per_tree); + bitmap_set(&stripe->error_bitmap, sector_nr, + sectors_per_tree); + btrfs_warn_rl(fs_info, + "tree block %llu mirror %u has bad bytenr, has %llu want %llu", + logical, stripe->mirror_num, + btrfs_stack_header_bytenr(header), logical); + return; + } + if (memcmp(header->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE) != 0) { + bitmap_set(&stripe->meta_error_bitmap, sector_nr, + sectors_per_tree); + bitmap_set(&stripe->error_bitmap, sector_nr, + sectors_per_tree); + btrfs_warn_rl(fs_info, + "tree block %llu mirror %u has bad fsid, has %pU want %pU", + logical, stripe->mirror_num, + header->fsid, fs_info->fs_devices->fsid); + return; + } + if (memcmp(header->chunk_tree_uuid, fs_info->chunk_tree_uuid, + BTRFS_UUID_SIZE) != 0) { + bitmap_set(&stripe->meta_error_bitmap, sector_nr, + sectors_per_tree); + bitmap_set(&stripe->error_bitmap, sector_nr, + sectors_per_tree); + btrfs_warn_rl(fs_info, + "tree block %llu mirror %u has bad chunk tree uuid, has %pU want %pU", + logical, stripe->mirror_num, + header->chunk_tree_uuid, fs_info->chunk_tree_uuid); + return; + } + + /* Now check tree block csum. */ + shash->tfm = fs_info->csum_shash; + crypto_shash_init(shash); + crypto_shash_update(shash, page_address(first_page) + first_off + + BTRFS_CSUM_SIZE, fs_info->sectorsize - BTRFS_CSUM_SIZE); + + for (int i = sector_nr + 1; i < sector_nr + sectors_per_tree; i++) { + struct page *page = scrub_stripe_get_page(stripe, i); + unsigned int page_off = scrub_stripe_get_page_offset(stripe, i); + + crypto_shash_update(shash, page_address(page) + page_off, + fs_info->sectorsize); + } + crypto_shash_final(shash, calculated_csum); + if (memcmp(calculated_csum, on_disk_csum, fs_info->csum_size) != 0) { + bitmap_set(&stripe->meta_error_bitmap, sector_nr, + sectors_per_tree); + bitmap_set(&stripe->error_bitmap, sector_nr, + sectors_per_tree); + btrfs_warn_rl(fs_info, + "tree block %llu mirror %u has bad csum, has " CSUM_FMT " want " CSUM_FMT, + logical, stripe->mirror_num, + CSUM_FMT_VALUE(fs_info->csum_size, on_disk_csum), + CSUM_FMT_VALUE(fs_info->csum_size, calculated_csum)); + return; + } + if (stripe->sectors[sector_nr].generation != + btrfs_stack_header_generation(header)) { + bitmap_set(&stripe->meta_error_bitmap, sector_nr, + sectors_per_tree); + bitmap_set(&stripe->error_bitmap, sector_nr, + sectors_per_tree); + btrfs_warn_rl(fs_info, + "tree block %llu mirror %u has bad geneartion, has %llu want %llu", + logical, stripe->mirror_num, + btrfs_stack_header_generation(header), + stripe->sectors[sector_nr].generation); + } + bitmap_clear(&stripe->error_bitmap, sector_nr, sectors_per_tree); + bitmap_clear(&stripe->csum_error_bitmap, sector_nr, sectors_per_tree); + bitmap_clear(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree); +} + static int scrub_checksum_tree_block(struct scrub_block *sblock) { struct scrub_ctx *sctx = sblock->sctx; diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index 27019d86b539..0d8bdc7df89c 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -24,5 +24,6 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, struct btrfs_device *dev, u64 physical, int mirror_num, u64 logical_start, u32 logical_len, struct scrub_stripe *stripe); +void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr); #endif From patchwork Fri Mar 31 01:20:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195177 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C9B1FC77B6C for ; Fri, 31 Mar 2023 01:20:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229683AbjCaBUp (ORCPT ); Thu, 30 Mar 2023 21:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40614 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229660AbjCaBUn (ORCPT ); Thu, 30 Mar 2023 21:20:43 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 61B0E1204C for ; Thu, 30 Mar 2023 18:20:42 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 223921FE32; Fri, 31 Mar 2023 01:20:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225641; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QIE0GIL6dQj2K/1YLpycaexfwr7ks6BiwGU6zVB8PfM=; b=nNGDpIeMmJkwr6XyM0YpvG28kXapu2sXJKCL73H+8BgaXKU7W+GtiAyj96NZBNFQILCzMc dGZUI2ehmtGtK/Cv45tH14M6fvIHcyBt6h35S5lUySTz7GKXdAxMVmiqmybipHkv3Fx/y/ wUGvVa9drhL00FHKzTZrL4b8LW+lgRU= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 5679013451; Fri, 31 Mar 2023 01:20:40 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id 2L3dCWg1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:40 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 07/12] btrfs: scrub: introduce a helper to verify one scrub_stripe Date: Fri, 31 Mar 2023 09:20:10 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper, scrub_verify_stripe(), shares the same main workflow of the old scrub code. The major differences are: - How pages/page_offset is grabbed Everything can be grabbed from scrub_stripe easily. - When error report happens Currently the helper only verify the sectors, not really doing any error reporting. The error reporting would be done after we have done the repair. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++- fs/btrfs/scrub.h | 2 +- 2 files changed, 80 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 33ff0dc6a785..f6c19e3a35f3 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2176,7 +2176,7 @@ static unsigned int scrub_stripe_get_page_offset(struct scrub_stripe *stripe, return offset_in_page(sector_nr << fs_info->sectorsize_bits); } -void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr) +static void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr) { struct btrfs_fs_info *fs_info = stripe->bg->fs_info; const unsigned int sectors_per_tree = fs_info->nodesize >> @@ -2275,6 +2275,84 @@ void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr) bitmap_clear(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree); } +static void scrub_verify_one_sector(struct scrub_stripe *stripe, + int sector_nr) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct scrub_sector_verification *sector = &stripe->sectors[sector_nr]; + const unsigned int sectors_per_tree = fs_info->nodesize >> + fs_info->sectorsize_bits; + struct page *page = scrub_stripe_get_page(stripe, sector_nr); + unsigned int pgoff = scrub_stripe_get_page_offset(stripe, sector_nr); + u8 csum_buf[BTRFS_CSUM_SIZE]; + int ret; + + ASSERT(sector_nr >= 0 && sector_nr < stripe->nr_sectors); + + /* Sector not utilized, skip it. */ + if (!test_bit(sector_nr, &stripe->extent_sector_bitmap)) + return; + + /* IO error, no need to check. */ + if (test_bit(sector_nr, &stripe->io_error_bitmap)) + return; + + /* Metadata, verify the full tree block. */ + if (sector->is_metadata) { + /* + * Check if the tree block crosses the stripe boudary. + * If crossed the boundary, we can not verify it but only + * gives a warning. + * + * This can only happen in very old fs where chunks are not + * ensured to be stripe aligned. + */ + if (unlikely(sector_nr + sectors_per_tree > stripe->nr_sectors)) { + btrfs_warn_rl(fs_info, + "tree block at %llu crosses stripe boundary %llu", + stripe->logical + + (sector_nr << fs_info->sectorsize_bits), + stripe->logical); + return; + } + scrub_verify_one_metadata(stripe, sector_nr); + return; + } + + /* + * Data is much easier, we just verify the data csum (if we have). + * For cases without csum, we have no other choice but to trust it. + */ + if (!sector->csum) { + clear_bit(sector_nr, &stripe->error_bitmap); + return; + } + + ret = btrfs_check_sector_csum(fs_info, page, pgoff, csum_buf, sector->csum); + if (ret < 0) { + set_bit(sector_nr, &stripe->csum_error_bitmap); + set_bit(sector_nr, &stripe->error_bitmap); + } else { + clear_bit(sector_nr, &stripe->csum_error_bitmap); + clear_bit(sector_nr, &stripe->error_bitmap); + } +} + +/* Verify specified sectors of a stripe. */ +void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + const unsigned int sectors_per_tree = fs_info->nodesize >> + fs_info->sectorsize_bits; + int sector_nr; + + for_each_set_bit(sector_nr, &bitmap, stripe->nr_sectors) { + scrub_verify_one_sector(stripe, sector_nr); + if (stripe->sectors[sector_nr].is_metadata) + sector_nr += sectors_per_tree - 1; + } +} + static int scrub_checksum_tree_block(struct scrub_block *sblock) { struct scrub_ctx *sctx = sblock->sctx; diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index 0d8bdc7df89c..45ff7e149806 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -24,6 +24,6 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, struct btrfs_device *dev, u64 physical, int mirror_num, u64 logical_start, u32 logical_len, struct scrub_stripe *stripe); -void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr); +void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap); #endif From patchwork Fri Mar 31 01:20:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195178 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5FF3AC761AF for ; Fri, 31 Mar 2023 01:20:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229690AbjCaBUq (ORCPT ); Thu, 30 Mar 2023 21:20:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229665AbjCaBUp (ORCPT ); Thu, 30 Mar 2023 21:20:45 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9648F12BF7 for ; Thu, 30 Mar 2023 18:20:43 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 45DA01FE1E; Fri, 31 Mar 2023 01:20:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225642; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=J9ZPGYVFCJ3I1+cwR/N+w1YgppzIXz3GQs/G2dFf0T8=; b=oLEYVU7HmxMaR/4vnXwdTobGVxdxRuw2cW3Kszu6v8fRgGCMXBw7JlAcGtlSYEpjcihZLk kKMG59K/ENHIV2RSJAWv0IpblWjktZJu4XFFV3O85RjdwFk8dp/3WZseM92/xn/xs1DEJX 4LNHuLtME7lTWD4q2gnvuaj4Rck33hI= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 7A2AE13451; Fri, 31 Mar 2023 01:20:41 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id oEuEEmk1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:41 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 08/12] btrfs: scrub: introduce the main read repair worker for scrub_stripe Date: Fri, 31 Mar 2023 09:20:11 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper, scrub_stripe_read_repair_worker(), would handle the read-repair part: - Wait for the previous submitted read IO to finish - Verify the contents of the stripe - Go through the remaining mirrors, using as large blocksize as possible At this stage, we just read out all the failed sectors from each mirror and re-verify. If no more failed sector, we can exit. - Go through all mirrors again, sector-by-sector this time This time, we read sector by sector, this is to address cases where one bad sector mismatches the drive's internal checksum, and cause the whole read range to fail. We put this recovery method as the last resort, as sector-by-sector reading is slow, and read from other mirrors may have already fixed the errors. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 210 ++++++++++++++++++++++++++++++++++++++++++++++- fs/btrfs/scrub.h | 3 +- 2 files changed, 209 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index f6c19e3a35f3..a89051dd0d1c 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -121,6 +121,7 @@ struct scrub_stripe { atomic_t pending_io; wait_queue_head_t io_wait; + wait_queue_head_t repair_wait; /* * Indicates the states of the stripe. @@ -156,6 +157,8 @@ struct scrub_stripe { * group. */ u8 *csums; + + struct work_struct work; }; struct scrub_recover { @@ -381,6 +384,7 @@ int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe stripe->state = 0; init_waitqueue_head(&stripe->io_wait); + init_waitqueue_head(&stripe->repair_wait); atomic_set(&stripe->pending_io, 0); ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages); @@ -403,7 +407,7 @@ int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe return -ENOMEM; } -void wait_scrub_stripe_io(struct scrub_stripe *stripe) +static void wait_scrub_stripe_io(struct scrub_stripe *stripe) { wait_event(stripe->io_wait, atomic_read(&stripe->pending_io) == 0); } @@ -2339,7 +2343,8 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, } /* Verify specified sectors of a stripe. */ -void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap) +static void scrub_verify_one_stripe(struct scrub_stripe *stripe, + unsigned long bitmap) { struct btrfs_fs_info *fs_info = stripe->bg->fs_info; const unsigned int sectors_per_tree = fs_info->nodesize >> @@ -2353,6 +2358,207 @@ void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap) } } +static int calc_sector_number(struct scrub_stripe *stripe, + struct bio_vec *first_bvec) +{ + int i; + + for (i = 0; i < stripe->nr_sectors; i++) { + if (scrub_stripe_get_page(stripe, i) == first_bvec->bv_page && + scrub_stripe_get_page_offset(stripe, i) == first_bvec->bv_offset) + break; + } + ASSERT(i < stripe->nr_sectors); + return i; +} + +/* + * Repair read is different to the regular read by: + * + * - Only reads the failed sectors + * - May have extra blocksize limits + */ +static void scrub_repair_read_endio(struct btrfs_bio *bbio) +{ + struct scrub_stripe *stripe = bbio->private; + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct bio_vec *bvec; + int sector_nr = calc_sector_number(stripe, + bio_first_bvec_all(&bbio->bio)); + int bio_size = 0; + int i; + + ASSERT(sector_nr < stripe->nr_sectors); + + bio_for_each_bvec_all(bvec, &bbio->bio, i) + bio_size += bvec->bv_len; + + if (bbio->bio.bi_status) { + bitmap_set(&stripe->io_error_bitmap, sector_nr, + bio_size >> fs_info->sectorsize_bits); + bitmap_set(&stripe->error_bitmap, sector_nr, + bio_size >> fs_info->sectorsize_bits); + } else { + bitmap_clear(&stripe->io_error_bitmap, sector_nr, + bio_size >> fs_info->sectorsize_bits); + } + bio_put(&bbio->bio); + if (atomic_dec_and_test(&stripe->pending_io)) + wake_up(&stripe->io_wait); +} + +static int calc_next_mirror(int mirror, int num_copies) +{ + ASSERT(mirror <= num_copies); + return (mirror + 1 > num_copies) ? 1 : mirror + 1; +} + +static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe, + int mirror, int blocksize, + bool wait) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct btrfs_bio *bbio = NULL; + const unsigned long old_error_bitmap = stripe->error_bitmap; + int i; + + ASSERT(stripe->mirror_num >= 1); + ASSERT(atomic_read(&stripe->pending_io) == 0); + + for_each_set_bit(i, &old_error_bitmap, stripe->nr_sectors) { + struct page *page; + int pgoff; + int ret; + + page = scrub_stripe_get_page(stripe, i); + pgoff = scrub_stripe_get_page_offset(stripe, i); + + /* The current sector can not be merged, submit the bio. */ + if (bbio && ((i > 0 && !test_bit(i - 1, &stripe->error_bitmap)) || + bbio->bio.bi_iter.bi_size >= blocksize)) { + ASSERT(bbio->bio.bi_iter.bi_size); + atomic_inc(&stripe->pending_io); + btrfs_submit_bio(bbio, mirror); + if (wait) + wait_scrub_stripe_io(stripe); + bbio = NULL; + } + + if (!bbio) { + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ, + fs_info, scrub_repair_read_endio, stripe); + bbio->bio.bi_iter.bi_sector = (stripe->logical + + (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT; + } + + ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff); + ASSERT(ret == fs_info->sectorsize); + } + if (bbio) { + ASSERT(bbio->bio.bi_iter.bi_size); + atomic_inc(&stripe->pending_io); + btrfs_submit_bio(bbio, mirror); + if (wait) + wait_scrub_stripe_io(stripe); + } +} + +/* + * The main entrance for all read related scrub work, including: + * + * - Wait for the initial read to finish + * - Verify and locate any bad sectors + * - Go through the remaining mirrors and try to read as large blocksize as + * possible + * + * - Go through all mirrors (including the failed mirror) sector-by-sector + * + * Writeback does not happen here, they need extra synchronization. + */ +static void scrub_stripe_read_repair_worker(struct work_struct *work) +{ + struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, + work); + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + int num_copies = btrfs_num_copies(fs_info, stripe->bg->start, + stripe->bg->length); + int mirror; + int i; + + ASSERT(stripe->mirror_num > 0); + + wait_scrub_stripe_io(stripe); + scrub_verify_one_stripe(stripe, stripe->extent_sector_bitmap); + /* Save the initial failed bitmap for later repair and report usage. */ + stripe->init_error_bitmap = stripe->error_bitmap; + + if (bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) + goto out; + + /* + * Try all remaining mirrors. + * + * Here we still try read as large block as possible, as this is faster + * and we have extra safe nets to rely on. + */ + for (mirror = calc_next_mirror(stripe->mirror_num, num_copies); + mirror != stripe->mirror_num; + mirror = calc_next_mirror(mirror, num_copies)) { + const unsigned long old_error_bitmap = stripe->error_bitmap; + + scrub_stripe_submit_repair_read(stripe, mirror, + BTRFS_STRIPE_LEN, false); + wait_scrub_stripe_io(stripe); + scrub_verify_one_stripe(stripe, old_error_bitmap); + if (bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) + goto out; + } + + /* + * Last safenet, try re-check all mirrors, including the failed one, + * sector-by-sector. + * + * As if one sector failed the drive's internal csum, the whole read + * containing the offending sector would be marked error. + * Thus here we do sector-by-sector read. + * + * This can be slow, thus we only try it as the last resort. + */ + + for (i = 0, mirror = stripe->mirror_num; i < num_copies; + i++, mirror = calc_next_mirror(mirror, num_copies)) { + const unsigned long old_error_bitmap = stripe->error_bitmap; + + scrub_stripe_submit_repair_read(stripe, mirror, + fs_info->sectorsize, true); + wait_scrub_stripe_io(stripe); + scrub_verify_one_stripe(stripe, old_error_bitmap); + if (bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) + goto out; + } +out: + set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state); + wake_up(&stripe->repair_wait); +} + +void scrub_read_endio(struct btrfs_bio *bbio) +{ + struct scrub_stripe *stripe = bbio->private; + + if (bbio->bio.bi_status) { + bitmap_set(&stripe->io_error_bitmap, 0, stripe->nr_sectors); + bitmap_set(&stripe->error_bitmap, 0, stripe->nr_sectors); + } else { + bitmap_clear(&stripe->io_error_bitmap, 0, stripe->nr_sectors); + } + bio_put(&bbio->bio); + if (atomic_dec_and_test(&stripe->pending_io)) { + wake_up(&stripe->io_wait); + INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker); + queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work); + } +} + static int scrub_checksum_tree_block(struct scrub_block *sblock) { struct scrub_ctx *sctx = sblock->sctx; diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index 45ff7e149806..bcc9d398fe07 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -19,11 +19,10 @@ int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid, */ struct scrub_stripe; int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe); -void wait_scrub_stripe_io(struct scrub_stripe *stripe); int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, struct btrfs_device *dev, u64 physical, int mirror_num, u64 logical_start, u32 logical_len, struct scrub_stripe *stripe); -void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap); +void scrub_read_endio(struct btrfs_bio *bbio); #endif From patchwork Fri Mar 31 01:20:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195179 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 70440C7619A for ; Fri, 31 Mar 2023 01:20:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229708AbjCaBUs (ORCPT ); Thu, 30 Mar 2023 21:20:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40724 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229660AbjCaBUp (ORCPT ); Thu, 30 Mar 2023 21:20:45 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AAB301882A for ; Thu, 30 Mar 2023 18:20:44 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 68BBC1FE32; Fri, 31 Mar 2023 01:20:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225643; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=7yRAyvR0TXk+x59am6gGKBimhYuWFIhcb5m4OqR2ou4=; b=VXosfdCmBawLTbVdaE+kJg54ZQ7LwvTeDDpr3sbMon7o9EpJsbikeosNLKSsNwNpl/KBaX 7/BuR8HxOrEv2cHLBPTfh8aT9oDDn72YQ9JXMsIWk3MLHabelXKY6Y/iTHRrCO/pzjV3s2 ctPQzl/7NZ6AM+8jvMZQeBgjxKktIVU= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 9EBCE13451; Fri, 31 Mar 2023 01:20:42 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id eBtlG2o1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:42 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 09/12] btrfs: scrub: introduce a writeback helper for scrub_stripe Date: Fri, 31 Mar 2023 09:20:12 +0800 Message-Id: <6956a0e6585da55226a27fd6f052aa120fafe0d8.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Add a new helper, scrub_write_sectors(), to submit write bios for specified sectors to the target disk There are several differences compared to read path: - Utilize btrfs_submit_scrub_write() Now we still rely on the @mirror_num based writeback, but the requirement is also a little different than regular writeback or read, thus we have to call btrfs_submit_scrub_write(). - We can not write the full stripe back We can only write the sectors we have. There will be two call sites later, one for repaired sectors, one for all utilized sectors of dev-replace. Thus the callers should specify their own write_bitmap. This function only submit the bios, will not wait for them unless for zoned case. Caller must explicitly wait for the IO to finish. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/scrub.h | 3 ++ 2 files changed, 98 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index a89051dd0d1c..cb361560c8f4 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -152,6 +152,12 @@ struct scrub_stripe { unsigned long csum_error_bitmap; unsigned long meta_error_bitmap; + /* For writeback (repair or replace) error report. */ + unsigned long write_error_bitmap; + + /* Writeback can be concurrent, thus we need to protect the bitmap. */ + spinlock_t write_error_lock; + /* * Checksum for the whole stripe if this stripe is inside a data block * group. @@ -386,6 +392,7 @@ int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe init_waitqueue_head(&stripe->io_wait); init_waitqueue_head(&stripe->repair_wait); atomic_set(&stripe->pending_io, 0); + spin_lock_init(&stripe->write_error_lock); ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages); if (ret < 0) @@ -2559,6 +2566,94 @@ void scrub_read_endio(struct btrfs_bio *bbio) } } +static void scrub_write_endio(struct btrfs_bio *bbio) +{ + struct scrub_stripe *stripe = bbio->private; + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct bio_vec *bvec; + unsigned long flags; + int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio)); + unsigned int bio_size = 0; + int i; + + bio_for_each_bvec_all(bvec, &bbio->bio, i) + bio_size += bvec->bv_len; + + if (bbio->bio.bi_status) { + spin_lock_irqsave(&stripe->write_error_lock, flags); + bitmap_set(&stripe->write_error_bitmap, sector_nr, + bio_size >> fs_info->sectorsize_bits); + spin_unlock_irqrestore(&stripe->write_error_lock, flags); + } + bio_put(&bbio->bio); + + if (atomic_dec_and_test(&stripe->pending_io)) + wake_up(&stripe->io_wait); +} + +/* + * Submit the write bio(s) for the sectors specified by @write_bitmap. + * + * Here we utilize btrfs_submit_repair_write(), which has some extra benefits: + * + * - Only needs logical bytenr and mirror_num + * Just like the scrub read path + * + * - Would only result writes to the specified mirror + * Unlike the regular writeback path, which would write back to all stripes + * + * - Handle dev-replace and read-repair writeback differently + */ +void scrub_write_sectors(struct scrub_ctx *sctx, + struct scrub_stripe *stripe, + unsigned long write_bitmap, bool dev_replace) +{ + struct btrfs_fs_info *fs_info = stripe->bg->fs_info; + struct btrfs_bio *bbio = NULL; + bool zoned = btrfs_is_zoned(fs_info); + int sector_nr; + + for_each_set_bit(sector_nr, &write_bitmap, stripe->nr_sectors) { + struct page *page = scrub_stripe_get_page(stripe, sector_nr); + unsigned int pgoff = scrub_stripe_get_page_offset(stripe, + sector_nr); + int ret; + + /* We should only writeback sectors covered by an extent. */ + ASSERT(test_bit(sector_nr, &stripe->extent_sector_bitmap)); + + /* Can not merge with previous sector, submit the current one. */ + if (bbio && sector_nr && !test_bit(sector_nr - 1, &write_bitmap)) { + fill_writer_pointer_gap(sctx, stripe->physical + + (sector_nr << fs_info->sectorsize_bits)); + atomic_inc(&stripe->pending_io); + btrfs_submit_repair_write(bbio, stripe->mirror_num, + dev_replace); + /* For zoned writeback, queue depth must be 1. */ + if (zoned) + wait_scrub_stripe_io(stripe); + bbio = NULL; + } + if (!bbio) { + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_WRITE, + fs_info, scrub_write_endio, stripe); + bbio->bio.bi_iter.bi_sector = (stripe->logical + + (sector_nr << fs_info->sectorsize_bits)) >> + SECTOR_SHIFT; + } + ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff); + ASSERT(ret == fs_info->sectorsize); + } + if (bbio) { + fill_writer_pointer_gap(sctx, bbio->bio.bi_iter.bi_sector << + SECTOR_SHIFT); + atomic_inc(&stripe->pending_io); + btrfs_submit_repair_write(bbio, stripe->mirror_num, dev_replace); + if (zoned) + wait_scrub_stripe_io(stripe); + } +} + static int scrub_checksum_tree_block(struct scrub_block *sblock) { struct scrub_ctx *sctx = sblock->sctx; diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index bcc9d398fe07..3027d4c23ee8 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -24,5 +24,8 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, int mirror_num, u64 logical_start, u32 logical_len, struct scrub_stripe *stripe); void scrub_read_endio(struct btrfs_bio *bbio); +void scrub_write_sectors(struct scrub_ctx *sctx, + struct scrub_stripe *stripe, + unsigned long write_bitmap, bool dev_replace); #endif From patchwork Fri Mar 31 01:20:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195180 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6AE14C6FD1D for ; Fri, 31 Mar 2023 01:20:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229502AbjCaBUt (ORCPT ); Thu, 30 Mar 2023 21:20:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40748 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229665AbjCaBUr (ORCPT ); Thu, 30 Mar 2023 21:20:47 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E5C12CDD1 for ; Thu, 30 Mar 2023 18:20:45 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 96D1E1FE33; Fri, 31 Mar 2023 01:20:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225644; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=E+1dF7vFBDrF9o6ojQ2ZlEx4XEE8v1n3GsJCGzrQBnc=; b=unEHZzjPlm9WSH1fDnKZVYiSdsIHdHgfJP3W7YcRFPRzqN/f1fYbYd+HkJPewJGVDrQQfn Ru551hw3zwi1DIhpGQCReq0wOtBYhWBPu17g1md+LlbiCn0E2CDhDdGrAhRREYVA6GMfzW dnnZCnqkL+TftSAavsohjesCTm9PI0E= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id C136E13451; Fri, 31 Mar 2023 01:20:43 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id eEnUI2s1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:43 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 10/12] btrfs: scrub: introduce error reporting functionality for scrub_stripe Date: Fri, 31 Mar 2023 09:20:13 +0800 Message-Id: <5797718b1a6cd6c366b838dd78a276c30c580b28.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper, scrub_stripe_report_errors(), will report the result of the scrub to dmesg. The main reporting is done by introducing a new helper, scrub_print_common_warning(), which is mostly the same content from scrub_print_wanring(), but without the need for a scrub_block. Since we're reporting the errors, it's the perfect timing to update the scrub stat too. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 167 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 158 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index cb361560c8f4..1e256714620c 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -105,6 +105,7 @@ enum scrub_stripe_flags { * Represent one continuous range with a length of BTRFS_STRIPE_LEN. */ struct scrub_stripe { + struct scrub_ctx *sctx; struct btrfs_block_group *bg; struct page *pages[SCRUB_STRIPE_PAGES]; @@ -119,6 +120,13 @@ struct scrub_stripe { /* Should be BTRFS_STRIPE_LEN / sectorsize. */ u16 nr_sectors; + /* + * How many data/meta extents are in this stripe. + * Only for scrub stat report purpose. + */ + u16 nr_data_extents; + u16 nr_meta_extents; + atomic_t pending_io; wait_queue_head_t io_wait; wait_queue_head_t repair_wait; @@ -377,6 +385,7 @@ static void release_scrub_stripe(struct scrub_stripe *stripe) kfree(stripe->csums); stripe->sectors = NULL; stripe->csums = NULL; + stripe->sctx = NULL; stripe->state = 0; } @@ -1046,9 +1055,9 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes, return 0; } -static void scrub_print_warning(const char *errstr, struct scrub_block *sblock) +static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev, + bool is_super, u64 logical, u64 physical) { - struct btrfs_device *dev; struct btrfs_fs_info *fs_info; struct btrfs_path *path; struct btrfs_key found_key; @@ -1062,22 +1071,20 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock) u8 ref_level = 0; int ret; - WARN_ON(sblock->sector_count < 1); - dev = sblock->dev; - fs_info = sblock->sctx->fs_info; + fs_info = dev->fs_info; /* Super block error, no need to search extent tree. */ - if (sblock->sectors[0]->flags & BTRFS_EXTENT_FLAG_SUPER) { + if (is_super) { btrfs_warn_in_rcu(fs_info, "%s on device %s, physical %llu", - errstr, btrfs_dev_name(dev), sblock->physical); + errstr, btrfs_dev_name(dev), physical); return; } path = btrfs_alloc_path(); if (!path) return; - swarn.physical = sblock->physical; - swarn.logical = sblock->logical; + swarn.physical = physical; + swarn.logical = logical; swarn.errstr = errstr; swarn.dev = NULL; @@ -1126,6 +1133,13 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock) btrfs_free_path(path); } +static void scrub_print_warning(const char *errstr, struct scrub_block *sblock) +{ + scrub_print_common_warning(errstr, sblock->dev, + sblock->sectors[0]->flags & BTRFS_EXTENT_FLAG_SUPER, + sblock->logical, sblock->physical); +} + static inline void scrub_get_recover(struct scrub_recover *recover) { refcount_inc(&recover->refs); @@ -2470,6 +2484,132 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe, } } +static void scrub_stripe_report_errors(struct scrub_ctx *sctx, + struct scrub_stripe *stripe) +{ + static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + struct btrfs_fs_info *fs_info = sctx->fs_info; + struct btrfs_device *dev = NULL; + u64 physical = 0; + int nr_data_sectors = 0; + int nr_meta_sectors = 0; + int nr_nodatacsum_sectors = 0; + int nr_repaired_sectors = 0; + int sector_nr; + + /* + * Init needed infos for error reporting. + * + * Although our scrub_stripe infrastucture is mostly based on btrfs_submit_bio() + * thus no need for dev/physical, error reporting still needs dev and physical. + */ + if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) { + u64 mapped_len = fs_info->sectorsize; + struct btrfs_io_context *bioc = NULL; + int stripe_index = stripe->mirror_num - 1; + int ret; + + /* For scrub, our mirror_num should always start at 1. */ + ASSERT(stripe->mirror_num >= 1); + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, + stripe->logical, &mapped_len, &bioc); + /* + * If we failed, dev will be NULL, and later detailed reports + * will just be skipped. + */ + if (ret < 0) + goto skip; + physical = bioc->stripes[stripe_index].physical; + dev = bioc->stripes[stripe_index].dev; + btrfs_put_bioc(bioc); + } + +skip: + for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, + stripe->nr_sectors) { + bool repaired = false; + + if (stripe->sectors[sector_nr].is_metadata) { + nr_meta_sectors++; + } else { + nr_data_sectors++; + if (!stripe->sectors[sector_nr].csum) + nr_nodatacsum_sectors++; + } + + if (test_bit(sector_nr, &stripe->init_error_bitmap) && + !test_bit(sector_nr, &stripe->error_bitmap)) { + nr_repaired_sectors++; + repaired = true; + } + + /* Good sector from the beginning, nothing need to be done. */ + if (!test_bit(sector_nr, &stripe->init_error_bitmap)) + continue; + + /* + * Report error for the corrupted sectors. + * If repaired, just output the message of repaired message. + */ + if (repaired) { + if (dev) + btrfs_err_rl_in_rcu(fs_info, + "fixed up error at logical %llu on dev %s physical %llu", + stripe->logical, btrfs_dev_name(dev), + physical); + else + btrfs_err_rl_in_rcu(fs_info, + "fixed up error at logical %llu on mirror %u", + stripe->logical, stripe->mirror_num); + continue; + } + + /* The remaining are all for unrepaired. */ + if (dev) + btrfs_err_rl_in_rcu(fs_info, + "unable to fixup (regular) error at logical %llu on dev %s physical %llu", + stripe->logical, btrfs_dev_name(dev), + physical); + else + btrfs_err_rl_in_rcu(fs_info, + "unable to fixup (regular) error at logical %llu on mirror %u", + stripe->logical, stripe->mirror_num); + + if (test_bit(sector_nr, &stripe->io_error_bitmap)) + if (__ratelimit(&rs) && dev) + scrub_print_common_warning("i/o error", dev, false, + stripe->logical, physical); + if (test_bit(sector_nr, &stripe->csum_error_bitmap)) + if (__ratelimit(&rs) && dev) + scrub_print_common_warning("checksum error", dev, false, + stripe->logical, physical); + if (test_bit(sector_nr, &stripe->meta_error_bitmap)) + if (__ratelimit(&rs) && dev) + scrub_print_common_warning("header error", dev, false, + stripe->logical, physical); + } + + spin_lock(&sctx->stat_lock); + sctx->stat.data_extents_scrubbed += stripe->nr_data_extents; + sctx->stat.tree_extents_scrubbed += stripe->nr_meta_extents; + sctx->stat.data_bytes_scrubbed += nr_data_sectors << + fs_info->sectorsize_bits; + sctx->stat.tree_bytes_scrubbed += nr_meta_sectors << + fs_info->sectorsize_bits; + sctx->stat.no_csum += nr_nodatacsum_sectors; + sctx->stat.read_errors += + bitmap_weight(&stripe->io_error_bitmap, stripe->nr_sectors); + sctx->stat.csum_errors += + bitmap_weight(&stripe->csum_error_bitmap, stripe->nr_sectors); + sctx->stat.verify_errors += + bitmap_weight(&stripe->meta_error_bitmap, stripe->nr_sectors); + sctx->stat.uncorrectable_errors += + bitmap_weight(&stripe->error_bitmap, stripe->nr_sectors); + sctx->stat.corrected_errors += nr_repaired_sectors; + spin_unlock(&sctx->stat_lock); +} + /* * The main entrance for all read related scrub work, including: * @@ -2544,6 +2684,7 @@ static void scrub_stripe_read_repair_worker(struct work_struct *work) goto out; } out: + scrub_stripe_report_errors(stripe->sctx, stripe); set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state); wake_up(&stripe->repair_wait); } @@ -4213,6 +4354,10 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, goto out; get_extent_info(&path, &extent_start, &extent_len, &extent_flags, &extent_gen); + if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) + stripe->nr_meta_extents++; + if (extent_flags & BTRFS_EXTENT_FLAG_DATA) + stripe->nr_data_extents++; cur_logical = max(extent_start, cur_logical); /* @@ -4246,6 +4391,10 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, } get_extent_info(&path, &extent_start, &extent_len, &extent_flags, &extent_gen); + if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) + stripe->nr_meta_extents++; + if (extent_flags & BTRFS_EXTENT_FLAG_DATA) + stripe->nr_data_extents++; fill_one_extent_info(fs_info, stripe, extent_start, extent_len, extent_flags, extent_gen); cur_logical = extent_start + extent_len; From patchwork Fri Mar 31 01:20:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195181 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2881AC77B6C for ; Fri, 31 Mar 2023 01:20:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229575AbjCaBUu (ORCPT ); Thu, 30 Mar 2023 21:20:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229710AbjCaBUs (ORCPT ); Thu, 30 Mar 2023 21:20:48 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 07CF1D301 for ; Thu, 30 Mar 2023 18:20:47 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id BB6CF21A64; Fri, 31 Mar 2023 01:20:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225645; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qerVdpjagv3sK2gGTYHW6qjqQCtXxlLYfUJmKXtj8Fw=; b=b7C0J2E2PLUHeO+tU6gLgVnP8VU3itvBThhvQxikk2ZUrFQVG4OQsQWJYSrEAvzsDH7qhl dqSNTTHKOutiaSPm1NR+kL/PtCYOtpbAKopS8G++gTjlZHY6y8QO20VwEqk1jowxR8QGR1 VUZitnwvRW83hhja8SPL/AM/PFgQSFs= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id E4F3913451; Fri, 31 Mar 2023 01:20:44 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id KOJkLGw1JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:44 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 11/12] btrfs: scrub: introduce the helper to queue a stripe for scrub Date: Fri, 31 Mar 2023 09:20:14 +0800 Message-Id: <60358ac2e75219d5ecf1dcfd0d5fdd0d26393e3a.1680225140.git.wqu@suse.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The new helper, queue_scrub_stripe(), would try to queue a stripe for scrub. If all stripes are already in use, we will submit all the existing stripes and wait them to finish. Currently we would queue up to 8 stripes, to enlarge the blocksize to 512KiB to improve the performance. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 187 ++++++++++++++++++++++++++++++++++++++++++++--- fs/btrfs/scrub.h | 13 +--- 2 files changed, 182 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 1e256714620c..ee727f0cfab9 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -50,6 +50,7 @@ struct scrub_ctx; */ #define SCRUB_SECTORS_PER_BIO 32 /* 128KiB per bio for 4KiB pages */ #define SCRUB_BIOS_PER_SCTX 64 /* 8MiB per device in flight for 4KiB pages */ +#define SCRUB_STRIPES_PER_SCTX 8 /* That would be 8 64K stripe per-device. */ /* * The following value times PAGE_SIZE needs to be large enough to match the @@ -277,9 +278,11 @@ struct scrub_parity { struct scrub_ctx { struct scrub_bio *bios[SCRUB_BIOS_PER_SCTX]; + struct scrub_stripe stripes[SCRUB_STRIPES_PER_SCTX]; struct btrfs_fs_info *fs_info; int first_free; int curr; + int cur_stripe; atomic_t bios_in_flight; atomic_t workers_pending; spinlock_t list_lock; @@ -389,7 +392,8 @@ static void release_scrub_stripe(struct scrub_stripe *stripe) stripe->state = 0; } -int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe) +static int init_scrub_stripe(struct btrfs_fs_info *fs_info, + struct scrub_stripe *stripe) { int ret; @@ -895,6 +899,9 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx) kfree(sbio); } + for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) + release_scrub_stripe(&sctx->stripes[i]); + kfree(sctx->wr_curr_bio); scrub_free_csums(sctx); kfree(sctx); @@ -939,6 +946,14 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx( else sctx->bios[i]->next_free = -1; } + for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) { + int ret; + + ret = init_scrub_stripe(fs_info, &sctx->stripes[i]); + if (ret < 0) + goto nomem; + sctx->stripes[i].sctx = sctx; + } sctx->first_free = 0; atomic_set(&sctx->bios_in_flight, 0); atomic_set(&sctx->workers_pending, 0); @@ -2689,7 +2704,7 @@ static void scrub_stripe_read_repair_worker(struct work_struct *work) wake_up(&stripe->repair_wait); } -void scrub_read_endio(struct btrfs_bio *bbio) +static void scrub_read_endio(struct btrfs_bio *bbio) { struct scrub_stripe *stripe = bbio->private; @@ -2745,9 +2760,9 @@ static void scrub_write_endio(struct btrfs_bio *bbio) * * - Handle dev-replace and read-repair writeback differently */ -void scrub_write_sectors(struct scrub_ctx *sctx, - struct scrub_stripe *stripe, - unsigned long write_bitmap, bool dev_replace) +static void scrub_write_sectors(struct scrub_ctx *sctx, + struct scrub_stripe *stripe, + unsigned long write_bitmap, bool dev_replace) { struct btrfs_fs_info *fs_info = stripe->bg->fs_info; struct btrfs_bio *bbio = NULL; @@ -4319,10 +4334,11 @@ static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe) * Return >0 if there is no such stripe in the specified range. * Return <0 for error. */ -int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, - struct btrfs_device *dev, u64 physical, - int mirror_num, u64 logical_start, - u32 logical_len, struct scrub_stripe *stripe) +static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, + struct btrfs_device *dev, u64 physical, + int mirror_num, u64 logical_start, + u32 logical_len, + struct scrub_stripe *stripe) { struct btrfs_fs_info *fs_info = bg->fs_info; struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start); @@ -4434,6 +4450,159 @@ int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, return ret; } +static void scrub_reset_stripe(struct scrub_stripe *stripe) +{ + scrub_stripe_reset_bitmaps(stripe); + + stripe->nr_meta_extents = 0; + stripe->nr_data_extents = 0; + stripe->state = 0; + + for (int i = 0; i < stripe->nr_sectors; i++) { + stripe->sectors[i].is_metadata = false; + stripe->sectors[i].csum = NULL; + stripe->sectors[i].generation = 0; + } +} + +static void scrub_submit_initial_read(struct scrub_ctx *sctx, + struct scrub_stripe *stripe) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + struct btrfs_bio *bbio; + int mirror = stripe->mirror_num; + + ASSERT(stripe->bg); + ASSERT(stripe->mirror_num > 0); + ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state)); + + bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info, + scrub_read_endio, stripe); + + /* Read the whole stripe. */ + bbio->bio.bi_iter.bi_sector = stripe->logical >> SECTOR_SHIFT; + for (int i = 0; i < BTRFS_STRIPE_LEN >> PAGE_SHIFT; i++) { + int ret; + + ret = bio_add_page(&bbio->bio, stripe->pages[i], PAGE_SIZE, 0); + /* We should have allocated enough bio vectors. */ + ASSERT(ret == PAGE_SIZE); + } + atomic_inc(&stripe->pending_io); + + /* + * For dev-replace, either user asks to avoid the source dev, or + * the device is missing, we try the next mirror instead. + */ + if (sctx->is_dev_replace && + (fs_info->dev_replace.cont_reading_from_srcdev_mode == + BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID || + !stripe->dev->bdev)) { + int num_copies = btrfs_num_copies(fs_info, stripe->bg->start, + stripe->bg->length); + mirror = calc_next_mirror(mirror, num_copies); + } + btrfs_submit_bio(bbio, mirror); +} + +static void flush_scrub_stripes(struct scrub_ctx *sctx) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + struct scrub_stripe *stripe; + const int nr_stripes = sctx->cur_stripe; + + if (!nr_stripes) + return; + + ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state)); + for (int i = 0; i < nr_stripes; i++) { + stripe = &sctx->stripes[i]; + scrub_submit_initial_read(sctx, stripe); + } + + for (int i = 0; i < nr_stripes; i++) { + stripe = &sctx->stripes[i]; + + wait_event(stripe->repair_wait, + test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, + &stripe->state)); + } + + /* + * Submit the repaired sectors. + * For zoned case, we can not do repair in-place, but + * queue the bg to be relocated. + */ + if (btrfs_is_zoned(fs_info)) { + for (int i = 0; i < nr_stripes; i++) { + stripe = &sctx->stripes[i]; + + if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) { + btrfs_repair_one_zone(fs_info, sctx->stripes[0].bg->start); + break; + } + } + } else { + for (int i = 0; i < nr_stripes; i++) { + unsigned long repaired; + + stripe = &sctx->stripes[i]; + + bitmap_andnot(&repaired, &stripe->init_error_bitmap, + &stripe->error_bitmap, stripe->nr_sectors); + scrub_write_sectors(sctx, stripe, repaired, false); + } + } + + /* Submit for dev-replace. */ + if (sctx->is_dev_replace) { + for (int i = 0; i < nr_stripes; i++) { + unsigned long good; + + stripe = &sctx->stripes[i]; + + ASSERT(stripe->dev == fs_info->dev_replace.srcdev); + + bitmap_andnot(&good, &stripe->extent_sector_bitmap, + &stripe->error_bitmap, stripe->nr_sectors); + scrub_write_sectors(sctx, stripe, good, true); + } + } + + /* Wait for above writebacks to finish. */ + for (int i = 0; i < nr_stripes; i++) { + stripe = &sctx->stripes[i]; + + wait_scrub_stripe_io(stripe); + scrub_reset_stripe(stripe); + } + sctx->cur_stripe = 0; +} + +int queue_scrub_stripe(struct scrub_ctx *sctx, + struct btrfs_block_group *bg, + struct btrfs_device *dev, int mirror_num, + u64 logical, u32 length, u64 physical) +{ + struct scrub_stripe *stripe; + int ret; + + /* No available slot, submit all stripes and wait for them. */ + if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX) + flush_scrub_stripes(sctx); + + stripe = &sctx->stripes[sctx->cur_stripe]; + + /* We can queue one stripe using the remaining slot. */ + scrub_reset_stripe(stripe); + ret = scrub_find_fill_first_stripe(bg, dev, physical, mirror_num, + logical, length, stripe); + /* Either >0 as no more extent or <0 for error. */ + if (ret) + return ret; + sctx->cur_stripe++; + return 0; +} /* * Scrub one range which can only has simple mirror based profile. diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index 3027d4c23ee8..fb9d906f5a17 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -18,14 +18,9 @@ int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid, * static functions. */ struct scrub_stripe; -int init_scrub_stripe(struct btrfs_fs_info *fs_info, struct scrub_stripe *stripe); -int scrub_find_fill_first_stripe(struct btrfs_block_group *bg, - struct btrfs_device *dev, u64 physical, - int mirror_num, u64 logical_start, - u32 logical_len, struct scrub_stripe *stripe); -void scrub_read_endio(struct btrfs_bio *bbio); -void scrub_write_sectors(struct scrub_ctx *sctx, - struct scrub_stripe *stripe, - unsigned long write_bitmap, bool dev_replace); +int queue_scrub_stripe(struct scrub_ctx *sctx, + struct btrfs_block_group *bg, + struct btrfs_device *dev, int mirror_num, + u64 logical, u32 length, u64 physical); #endif From patchwork Fri Mar 31 01:20:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 13195182 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E9CE6C7619A for ; Fri, 31 Mar 2023 01:20:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229527AbjCaBUv (ORCPT ); Thu, 30 Mar 2023 21:20:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40904 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229453AbjCaBUu (ORCPT ); Thu, 30 Mar 2023 21:20:50 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4F0B218817 for ; Thu, 30 Mar 2023 18:20:48 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 0C03921A86; Fri, 31 Mar 2023 01:20:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1680225647; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lI8qUBqlY8lM1i0cHzNpmOdoHEV7jdRcVAwKuDEpm9U=; b=VrzDo/OWq3q2gGCijq99THoDbUoFpttf+joH9zxG3DDMLaO5buryW3hXXH44oHGz33Lil6 npCvv2ETzk9PmTdPn9JuuvMl0b0zxMMzD0c29s8f/cYQ5la+6jcVutFl81h36kQWDYwBd3 9t+QePi+prx/WXrQh3oNvoXa6UkApgA= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 144BE13451; Fri, 31 Mar 2023 01:20:45 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id UI1CNW01JmSKWAAAMHmgww (envelope-from ); Fri, 31 Mar 2023 01:20:45 +0000 From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: David Sterba Subject: [PATCH v8 12/12] btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure Date: Fri, 31 Mar 2023 09:20:15 +0800 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Switch scrub_simple_mirror() to the new scrub_stripe infrastructure. Since scrub_simple_mirror() is the core part of scrub (only RAID56 P/Q stripes doesn't utilize it), we can get rid of a big hunk of code, mostly scrub_extent() and scrub_sectors(). There is a functionality change: - Scrub speed throttle now only affects read on the scrubbing device Writes (for repair and replace), and reads from other mirrors won't be limited by the limits. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/scrub.c | 494 +++-------------------------------------------- fs/btrfs/scrub.h | 10 - 2 files changed, 30 insertions(+), 474 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index ee727f0cfab9..e9c76f84482b 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -582,10 +582,6 @@ static void scrub_sector_get(struct scrub_sector *sector); static void scrub_sector_put(struct scrub_sector *sector); static void scrub_parity_get(struct scrub_parity *sparity); static void scrub_parity_put(struct scrub_parity *sparity); -static int scrub_sectors(struct scrub_ctx *sctx, u64 logical, u32 len, - u64 physical, struct btrfs_device *dev, u64 flags, - u64 gen, int mirror_num, u8 *csum, - u64 physical_for_dev_replace); static void scrub_bio_end_io(struct bio *bio); static void scrub_bio_end_io_worker(struct work_struct *work); static void scrub_block_complete(struct scrub_block *sblock); @@ -2975,22 +2971,16 @@ static void scrub_sector_put(struct scrub_sector *sector) kfree(sector); } -/* - * Throttling of IO submission, bandwidth-limit based, the timeslice is 1 - * second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max. - */ -static void scrub_throttle(struct scrub_ctx *sctx) +static void scrub_throttle_dev_io(struct scrub_ctx *sctx, + struct btrfs_device *device, + unsigned int bio_size) { const int time_slice = 1000; - struct scrub_bio *sbio; - struct btrfs_device *device; s64 delta; ktime_t now; u32 div; u64 bwlimit; - sbio = sctx->bios[sctx->curr]; - device = sbio->dev; bwlimit = READ_ONCE(device->scrub_speed_max); if (bwlimit == 0) return; @@ -3012,7 +3002,7 @@ static void scrub_throttle(struct scrub_ctx *sctx) /* Still in the time to send? */ if (ktime_before(now, sctx->throttle_deadline)) { /* If current bio is within the limit, send it */ - sctx->throttle_sent += sbio->bio->bi_iter.bi_size; + sctx->throttle_sent += bio_size; if (sctx->throttle_sent <= div_u64(bwlimit, div)) return; @@ -3034,6 +3024,17 @@ static void scrub_throttle(struct scrub_ctx *sctx) sctx->throttle_deadline = 0; } +/* + * Throttling of IO submission, bandwidth-limit based, the timeslice is 1 + * second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max. + */ +static void scrub_throttle(struct scrub_ctx *sctx) +{ + struct scrub_bio *sbio = sctx->bios[sctx->curr]; + + scrub_throttle_dev_io(sctx, sbio->dev, sbio->bio->bi_iter.bi_size); +} + static void scrub_submit(struct scrub_ctx *sctx) { struct scrub_bio *sbio; @@ -3118,202 +3119,6 @@ static int scrub_add_sector_to_rd_bio(struct scrub_ctx *sctx, return 0; } -static void scrub_missing_raid56_end_io(struct bio *bio) -{ - struct scrub_block *sblock = bio->bi_private; - struct btrfs_fs_info *fs_info = sblock->sctx->fs_info; - - btrfs_bio_counter_dec(fs_info); - if (bio->bi_status) - sblock->no_io_error_seen = 0; - - bio_put(bio); - - queue_work(fs_info->scrub_workers, &sblock->work); -} - -static void scrub_missing_raid56_worker(struct work_struct *work) -{ - struct scrub_block *sblock = container_of(work, struct scrub_block, work); - struct scrub_ctx *sctx = sblock->sctx; - struct btrfs_fs_info *fs_info = sctx->fs_info; - u64 logical; - struct btrfs_device *dev; - - logical = sblock->logical; - dev = sblock->dev; - - if (sblock->no_io_error_seen) - scrub_recheck_block_checksum(sblock); - - if (!sblock->no_io_error_seen) { - spin_lock(&sctx->stat_lock); - sctx->stat.read_errors++; - spin_unlock(&sctx->stat_lock); - btrfs_err_rl_in_rcu(fs_info, - "IO error rebuilding logical %llu for dev %s", - logical, btrfs_dev_name(dev)); - } else if (sblock->header_error || sblock->checksum_error) { - spin_lock(&sctx->stat_lock); - sctx->stat.uncorrectable_errors++; - spin_unlock(&sctx->stat_lock); - btrfs_err_rl_in_rcu(fs_info, - "failed to rebuild valid logical %llu for dev %s", - logical, btrfs_dev_name(dev)); - } else { - scrub_write_block_to_dev_replace(sblock); - } - - if (sctx->is_dev_replace && sctx->flush_all_writes) { - mutex_lock(&sctx->wr_lock); - scrub_wr_submit(sctx); - mutex_unlock(&sctx->wr_lock); - } - - scrub_block_put(sblock); - scrub_pending_bio_dec(sctx); -} - -static void scrub_missing_raid56_pages(struct scrub_block *sblock) -{ - struct scrub_ctx *sctx = sblock->sctx; - struct btrfs_fs_info *fs_info = sctx->fs_info; - u64 length = sblock->sector_count << fs_info->sectorsize_bits; - u64 logical = sblock->logical; - struct btrfs_io_context *bioc = NULL; - struct bio *bio; - struct btrfs_raid_bio *rbio; - int ret; - int i; - - btrfs_bio_counter_inc_blocked(fs_info); - ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical, - &length, &bioc); - if (ret || !bioc) - goto bioc_out; - - if (WARN_ON(!sctx->is_dev_replace || - !(bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK))) { - /* - * We shouldn't be scrubbing a missing device. Even for dev - * replace, we should only get here for RAID 5/6. We either - * managed to mount something with no mirrors remaining or - * there's a bug in scrub_find_good_copy()/btrfs_map_block(). - */ - goto bioc_out; - } - - bio = bio_alloc(NULL, BIO_MAX_VECS, REQ_OP_READ, GFP_NOFS); - bio->bi_iter.bi_sector = logical >> 9; - bio->bi_private = sblock; - bio->bi_end_io = scrub_missing_raid56_end_io; - - rbio = raid56_alloc_missing_rbio(bio, bioc); - if (!rbio) - goto rbio_out; - - for (i = 0; i < sblock->sector_count; i++) { - struct scrub_sector *sector = sblock->sectors[i]; - - raid56_add_scrub_pages(rbio, scrub_sector_get_page(sector), - scrub_sector_get_page_offset(sector), - sector->offset + sector->sblock->logical); - } - - INIT_WORK(&sblock->work, scrub_missing_raid56_worker); - scrub_block_get(sblock); - scrub_pending_bio_inc(sctx); - raid56_submit_missing_rbio(rbio); - btrfs_put_bioc(bioc); - return; - -rbio_out: - bio_put(bio); -bioc_out: - btrfs_bio_counter_dec(fs_info); - btrfs_put_bioc(bioc); - spin_lock(&sctx->stat_lock); - sctx->stat.malloc_errors++; - spin_unlock(&sctx->stat_lock); -} - -static int scrub_sectors(struct scrub_ctx *sctx, u64 logical, u32 len, - u64 physical, struct btrfs_device *dev, u64 flags, - u64 gen, int mirror_num, u8 *csum, - u64 physical_for_dev_replace) -{ - struct scrub_block *sblock; - const u32 sectorsize = sctx->fs_info->sectorsize; - int index; - - sblock = alloc_scrub_block(sctx, dev, logical, physical, - physical_for_dev_replace, mirror_num); - if (!sblock) { - spin_lock(&sctx->stat_lock); - sctx->stat.malloc_errors++; - spin_unlock(&sctx->stat_lock); - return -ENOMEM; - } - - for (index = 0; len > 0; index++) { - struct scrub_sector *sector; - /* - * Here we will allocate one page for one sector to scrub. - * This is fine if PAGE_SIZE == sectorsize, but will cost - * more memory for PAGE_SIZE > sectorsize case. - */ - u32 l = min(sectorsize, len); - - sector = alloc_scrub_sector(sblock, logical); - if (!sector) { - spin_lock(&sctx->stat_lock); - sctx->stat.malloc_errors++; - spin_unlock(&sctx->stat_lock); - scrub_block_put(sblock); - return -ENOMEM; - } - sector->flags = flags; - sector->generation = gen; - if (csum) { - sector->have_csum = 1; - memcpy(sector->csum, csum, sctx->fs_info->csum_size); - } else { - sector->have_csum = 0; - } - len -= l; - logical += l; - physical += l; - physical_for_dev_replace += l; - } - - WARN_ON(sblock->sector_count == 0); - if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) { - /* - * This case should only be hit for RAID 5/6 device replace. See - * the comment in scrub_missing_raid56_pages() for details. - */ - scrub_missing_raid56_pages(sblock); - } else { - for (index = 0; index < sblock->sector_count; index++) { - struct scrub_sector *sector = sblock->sectors[index]; - int ret; - - ret = scrub_add_sector_to_rd_bio(sctx, sector); - if (ret) { - scrub_block_put(sblock); - return ret; - } - } - - if (flags & BTRFS_EXTENT_FLAG_SUPER) - scrub_submit(sctx); - } - - /* last one frees, either here or in bio completion for last page */ - scrub_block_put(sblock); - return 0; -} - static void scrub_bio_end_io(struct bio *bio) { struct scrub_bio *sbio = bio->bi_private; @@ -3498,179 +3303,6 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u8 *csum) return 1; } -static bool should_use_device(struct btrfs_fs_info *fs_info, - struct btrfs_device *dev, - bool follow_replace_read_mode) -{ - struct btrfs_device *replace_srcdev = fs_info->dev_replace.srcdev; - struct btrfs_device *replace_tgtdev = fs_info->dev_replace.tgtdev; - - if (!dev->bdev) - return false; - - /* - * We're doing scrub/replace, if it's pure scrub, no tgtdev should be - * here. If it's replace, we're going to write data to tgtdev, thus - * the current data of the tgtdev is all garbage, thus we can not use - * it at all. - */ - if (dev == replace_tgtdev) - return false; - - /* No need to follow replace read mode, any existing device is fine. */ - if (!follow_replace_read_mode) - return true; - - /* Need to follow the mode. */ - if (fs_info->dev_replace.cont_reading_from_srcdev_mode == - BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID) - return dev != replace_srcdev; - return true; -} -static int scrub_find_good_copy(struct btrfs_fs_info *fs_info, - u64 extent_logical, u32 extent_len, - u64 *extent_physical, - struct btrfs_device **extent_dev, - int *extent_mirror_num) -{ - u64 mapped_length; - struct btrfs_io_context *bioc = NULL; - int ret; - int i; - - mapped_length = extent_len; - ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS, - extent_logical, &mapped_length, &bioc, 0); - if (ret || !bioc || mapped_length < extent_len) { - btrfs_put_bioc(bioc); - btrfs_err_rl(fs_info, "btrfs_map_block() failed for logical %llu: %d", - extent_logical, ret); - return -EIO; - } - - /* - * First loop to exclude all missing devices and the source device if - * needed. And we don't want to use target device as mirror either, as - * we're doing the replace, the target device range contains nothing. - */ - for (i = 0; i < bioc->num_stripes - bioc->replace_nr_stripes; i++) { - struct btrfs_io_stripe *stripe = &bioc->stripes[i]; - - if (!should_use_device(fs_info, stripe->dev, true)) - continue; - goto found; - } - /* - * We didn't find any alternative mirrors, we have to break our replace - * read mode, or we can not read at all. - */ - for (i = 0; i < bioc->num_stripes - bioc->replace_nr_stripes; i++) { - struct btrfs_io_stripe *stripe = &bioc->stripes[i]; - - if (!should_use_device(fs_info, stripe->dev, false)) - continue; - goto found; - } - - btrfs_err_rl(fs_info, "failed to find any live mirror for logical %llu", - extent_logical); - return -EIO; - -found: - *extent_physical = bioc->stripes[i].physical; - *extent_mirror_num = i + 1; - *extent_dev = bioc->stripes[i].dev; - btrfs_put_bioc(bioc); - return 0; -} - -static bool scrub_need_different_mirror(struct scrub_ctx *sctx, - struct map_lookup *map, - struct btrfs_device *dev) -{ - /* - * For RAID56, all the extra mirrors are rebuilt from other P/Q, - * cannot utilize other mirrors directly. - */ - if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) - return false; - - if (!dev->bdev) - return true; - - return sctx->fs_info->dev_replace.cont_reading_from_srcdev_mode == - BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID; -} - -/* scrub extent tries to collect up to 64 kB for each bio */ -static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map, - u64 logical, u32 len, - u64 physical, struct btrfs_device *dev, u64 flags, - u64 gen, int mirror_num) -{ - struct btrfs_device *src_dev = dev; - u64 src_physical = physical; - int src_mirror = mirror_num; - int ret; - u8 csum[BTRFS_CSUM_SIZE]; - u32 blocksize; - - if (flags & BTRFS_EXTENT_FLAG_DATA) { - if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) - blocksize = BTRFS_STRIPE_LEN; - else - blocksize = sctx->fs_info->sectorsize; - spin_lock(&sctx->stat_lock); - sctx->stat.data_extents_scrubbed++; - sctx->stat.data_bytes_scrubbed += len; - spin_unlock(&sctx->stat_lock); - } else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) { - if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) - blocksize = BTRFS_STRIPE_LEN; - else - blocksize = sctx->fs_info->nodesize; - spin_lock(&sctx->stat_lock); - sctx->stat.tree_extents_scrubbed++; - sctx->stat.tree_bytes_scrubbed += len; - spin_unlock(&sctx->stat_lock); - } else { - blocksize = sctx->fs_info->sectorsize; - WARN_ON(1); - } - - /* - * For dev-replace case, we can have @dev being a missing device, or - * we want to avoid reading from the source device if possible. - */ - if (sctx->is_dev_replace && scrub_need_different_mirror(sctx, map, dev)) { - ret = scrub_find_good_copy(sctx->fs_info, logical, len, - &src_physical, &src_dev, &src_mirror); - if (ret < 0) - return ret; - } - while (len) { - u32 l = min(len, blocksize); - int have_csum = 0; - - if (flags & BTRFS_EXTENT_FLAG_DATA) { - /* push csums to sbio */ - have_csum = scrub_find_csum(sctx, logical, csum); - if (have_csum == 0) - ++sctx->stat.no_csum; - } - ret = scrub_sectors(sctx, logical, l, src_physical, src_dev, - flags, gen, src_mirror, - have_csum ? csum : NULL, physical); - if (ret) - return ret; - len -= l; - logical += l; - physical += l; - src_physical += l; - } - return 0; -} - static int scrub_sectors_for_parity(struct scrub_parity *sparity, u64 logical, u32 len, u64 physical, struct btrfs_device *dev, @@ -4253,20 +3885,6 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, return ret < 0 ? ret : 0; } -static void sync_replace_for_zoned(struct scrub_ctx *sctx) -{ - if (!btrfs_is_zoned(sctx->fs_info)) - return; - - sctx->flush_all_writes = true; - scrub_submit(sctx); - mutex_lock(&sctx->wr_lock); - scrub_wr_submit(sctx); - mutex_unlock(&sctx->wr_lock); - - wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); -} - static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical, u64 physical, u64 physical_end) { @@ -4515,6 +4133,9 @@ static void flush_scrub_stripes(struct scrub_ctx *sctx) return; ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state)); + + scrub_throttle_dev_io(sctx, sctx->stripes[0].dev, + nr_stripes << BTRFS_STRIPE_LEN_SHIFT); for (int i = 0; i < nr_stripes; i++) { stripe = &sctx->stripes[i]; scrub_submit_initial_read(sctx, stripe); @@ -4579,10 +4200,10 @@ static void flush_scrub_stripes(struct scrub_ctx *sctx) sctx->cur_stripe = 0; } -int queue_scrub_stripe(struct scrub_ctx *sctx, - struct btrfs_block_group *bg, - struct btrfs_device *dev, int mirror_num, - u64 logical, u32 length, u64 physical) +static int queue_scrub_stripe(struct scrub_ctx *sctx, + struct btrfs_block_group *bg, + struct btrfs_device *dev, int mirror_num, + u64 logical, u32 length, u64 physical) { struct scrub_stripe *stripe; int ret; @@ -4620,11 +4241,8 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx, u64 physical, int mirror_num) { struct btrfs_fs_info *fs_info = sctx->fs_info; - struct btrfs_root *csum_root = btrfs_csum_root(fs_info, bg->start); - struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start); const u64 logical_end = logical_start + logical_length; /* An artificial limit, inherit from old scrub behavior */ - const u32 max_length = SZ_64K; struct btrfs_path path = { 0 }; u64 cur_logical = logical_start; int ret; @@ -4636,11 +4254,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx, path.skip_locking = 1; /* Go through each extent items inside the logical range */ while (cur_logical < logical_end) { - u64 extent_start; - u64 extent_len; - u64 extent_flags; - u64 extent_gen; - u64 scrub_len; + u64 cur_physical = physical + cur_logical - logical_start; /* Canceled? */ if (atomic_read(&fs_info->scrub_cancel_req) || @@ -4670,8 +4284,9 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx, } spin_unlock(&bg->lock); - ret = find_first_extent_item(extent_root, &path, cur_logical, - logical_end - cur_logical); + ret = queue_scrub_stripe(sctx, bg, device, mirror_num, + cur_logical, logical_end - cur_logical, + cur_physical); if (ret > 0) { /* No more extent, just update the accounting */ sctx->stat.last_physical = physical + logical_length; @@ -4680,52 +4295,11 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx, } if (ret < 0) break; - get_extent_info(&path, &extent_start, &extent_len, - &extent_flags, &extent_gen); - /* Skip hole range which doesn't have any extent */ - cur_logical = max(extent_start, cur_logical); - /* - * Scrub len has three limits: - * - Extent size limit - * - Scrub range limit - * This is especially imporatant for RAID0/RAID10 to reuse - * this function - * - Max scrub size limit - */ - scrub_len = min(min(extent_start + extent_len, - logical_end), cur_logical + max_length) - - cur_logical; + ASSERT(sctx->cur_stripe > 0); + cur_logical = sctx->stripes[sctx->cur_stripe - 1].logical + + BTRFS_STRIPE_LEN; - if (extent_flags & BTRFS_EXTENT_FLAG_DATA) { - ret = btrfs_lookup_csums_list(csum_root, cur_logical, - cur_logical + scrub_len - 1, - &sctx->csum_list, 1, false); - if (ret) - break; - } - if ((extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) && - does_range_cross_boundary(extent_start, extent_len, - logical_start, logical_length)) { - btrfs_err(fs_info, -"scrub: tree block %llu spanning boundaries, ignored. boundary=[%llu, %llu)", - extent_start, logical_start, logical_end); - spin_lock(&sctx->stat_lock); - sctx->stat.uncorrectable_errors++; - spin_unlock(&sctx->stat_lock); - cur_logical += scrub_len; - continue; - } - ret = scrub_extent(sctx, map, cur_logical, scrub_len, - cur_logical - logical_start + physical, - device, extent_flags, extent_gen, - mirror_num); - scrub_free_csums(sctx); - if (ret) - break; - if (sctx->is_dev_replace) - sync_replace_for_zoned(sctx); - cur_logical += scrub_len; /* Don't hold CPU for too long time */ cond_resched(); } @@ -4810,7 +4384,6 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, int stripe_index) { struct btrfs_fs_info *fs_info = sctx->fs_info; - struct blk_plug plug; struct map_lookup *map = em->map_lookup; const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK; const u64 chunk_logical = bg->start; @@ -4832,12 +4405,6 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, atomic_read(&sctx->bios_in_flight) == 0); scrub_blocked_if_needed(fs_info); - /* - * collect all data csums for the stripe to avoid seeking during - * the scrub. This might currently (crc32) end up to be about 1MB - */ - blk_start_plug(&plug); - if (sctx->is_dev_replace && btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) { mutex_lock(&sctx->wr_lock); @@ -4939,8 +4506,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, mutex_lock(&sctx->wr_lock); scrub_wr_submit(sctx); mutex_unlock(&sctx->wr_lock); - - blk_finish_plug(&plug); + flush_scrub_stripes(sctx); if (sctx->is_dev_replace && ret >= 0) { int ret2; diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h index fb9d906f5a17..7639103ebf9d 100644 --- a/fs/btrfs/scrub.h +++ b/fs/btrfs/scrub.h @@ -13,14 +13,4 @@ int btrfs_scrub_cancel_dev(struct btrfs_device *dev); int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid, struct btrfs_scrub_progress *progress); -/* - * The following functions are temporary exports to avoid warning on unused - * static functions. - */ -struct scrub_stripe; -int queue_scrub_stripe(struct scrub_ctx *sctx, - struct btrfs_block_group *bg, - struct btrfs_device *dev, int mirror_num, - u64 logical, u32 length, u64 physical); - #endif