From patchwork Fri Oct 14 07:17:09 2022
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13006712
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range()
Date: Fri, 14 Oct 2022 15:17:09 +0800

The refactor involves the following parts:

- Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers
  We have quite a few open-coded calculations for this, some of them
  even split into two assignments just to fit the 80-character limit.

- Remove the @csum_size parameter from max_ordered_sum_bytes()
  The csum size can be fetched from @fs_info, and we will use the
  csum_size_to_bytes() helper anyway.

- Add a comment explaining how we got the first search result

- Use the newly introduced helpers to clean up btrfs_lookup_csums_range()

- Move variable declarations to the minimal scope

- Never mix the number of sectors with bytes
  There are several locations doing things like:

	size = min_t(size_t, csum_end - start,
		     max_ordered_sum_bytes(fs_info));
	...
	size >>= fs_info->sectorsize_bits;

  Or:

	offset = (start - key.offset) >> fs_info->sectorsize_bits;
	offset *= csum_size;

  Make sure such variables only ever represent bytes inside the
  function, by using the above bytes_to_csum_size() helper.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/file-item.c | 69 ++++++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6bb9fa961a6a..8d9c4488d86a 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -121,12 +121,26 @@ int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
 		       start + len - 1, EXTENT_DIRTY, NULL);
 }
 
-static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
-					u16 csum_size)
+static size_t bytes_to_csum_size(struct btrfs_fs_info *fs_info, u32 bytes)
 {
-	u32 ncsums = (PAGE_SIZE - sizeof(struct btrfs_ordered_sum)) / csum_size;
+	ASSERT(IS_ALIGNED(bytes, fs_info->sectorsize));
 
-	return ncsums * fs_info->sectorsize;
+	return (bytes >> fs_info->sectorsize_bits) * fs_info->csum_size;
+}
+
+static size_t csum_size_to_bytes(struct btrfs_fs_info *fs_info, u32 csum_size)
+{
+	ASSERT(IS_ALIGNED(csum_size, fs_info->csum_size));
+
+	return (csum_size / fs_info->csum_size) << fs_info->sectorsize_bits;
+}
+
+static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info)
+{
+	int max_csum_size = round_down(PAGE_SIZE -
+				       sizeof(struct btrfs_ordered_sum),
+				       fs_info->csum_size);
+
+	return csum_size_to_bytes(fs_info, max_csum_size);
 }
 
 /*
@@ -135,9 +149,8 @@ static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
  */
 static int btrfs_ordered_sum_size(struct btrfs_fs_info *fs_info, unsigned long bytes)
 {
-	int num_sectors = (int)DIV_ROUND_UP(bytes, fs_info->sectorsize);
-
-	return sizeof(struct btrfs_ordered_sum) + num_sectors * fs_info->csum_size;
+	return sizeof(struct btrfs_ordered_sum) +
+	       bytes_to_csum_size(fs_info, bytes);
 }
 
 int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
@@ -521,11 +534,7 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_csum_item *item;
 	LIST_HEAD(tmplist);
-	unsigned long offset;
 	int ret;
-	size_t size;
-	u64 csum_end;
-	const u32 csum_size = fs_info->csum_size;
 
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(end + 1, fs_info->sectorsize));
@@ -551,16 +560,33 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	if (ret > 0 && path->slots[0] > 0) {
 		leaf = path->nodes[0];
 		btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+		/*
+		 * There are two cases we can hit here for the previous
+		 * csum item.
+		 *
+		 *		|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Or
+		 *				|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Check if the previous csum item covers the leading part
+		 * of the search range.
+		 * If so we have to start from the previous csum item.
+		 */
 		if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
 		    key.type == BTRFS_EXTENT_CSUM_KEY) {
-			offset = (start - key.offset) >> fs_info->sectorsize_bits;
-			if (offset * csum_size <
+			if (bytes_to_csum_size(fs_info, start - key.offset) <
 			    btrfs_item_size(leaf, path->slots[0] - 1))
 				path->slots[0]--;
 		}
 	}
 
 	while (start <= end) {
+		u64 csum_end;
+
 		leaf = path->nodes[0];
 		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(root, path);
@@ -580,8 +606,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 		if (key.offset > start)
 			start = key.offset;
 
-		size = btrfs_item_size(leaf, path->slots[0]);
-		csum_end = key.offset + (size / csum_size) * fs_info->sectorsize;
+		csum_end = key.offset + csum_size_to_bytes(fs_info,
+				btrfs_item_size(leaf, path->slots[0]));
 		if (csum_end <= start) {
 			path->slots[0]++;
 			continue;
@@ -591,8 +617,11 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
 				      struct btrfs_csum_item);
 		while (start < csum_end) {
+			unsigned long offset;
+			size_t size;
+
 			size = min_t(size_t, csum_end - start,
-				     max_ordered_sum_bytes(fs_info, csum_size));
+				     max_ordered_sum_bytes(fs_info));
 			sums = kzalloc(btrfs_ordered_sum_size(fs_info, size),
 				       GFP_NOFS);
 			if (!sums) {
@@ -603,16 +632,14 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			sums->bytenr = start;
 			sums->len = (int)size;
 
-			offset = (start - key.offset) >> fs_info->sectorsize_bits;
-			offset *= csum_size;
-			size >>= fs_info->sectorsize_bits;
+			offset = bytes_to_csum_size(fs_info, start - key.offset);
 
 			read_extent_buffer(path->nodes[0], sums->sums,
 					   ((unsigned long)item) + offset,
-					   csum_size * size);
+					   bytes_to_csum_size(fs_info, size));
 
-			start += fs_info->sectorsize * size;
+			start += size;
 			list_add_tail(&sums->list, &tmplist);
 		}
 		path->slots[0]++;

From patchwork Fri Oct 14 07:17:10 2022
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13006713
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 2/5] btrfs: raid56: refactor __raid_recover_end_io()
Date: Fri, 14 Oct 2022 15:17:10 +0800
Message-Id: <6cbdf72be1c531f62dba95803c4ed7c32018506d.1665730948.git.wqu@suse.com>

This refactor includes the following behavior change first:

- Don't error out if only P/Q is corrupted
  The old code will directly error out if only P/Q is corrupted.
  Although it is a logical error if we go into the rebuild path with
  only P/Q corrupted, there is no need to error out.
  Just skip the rebuild and return the already good data.

Then comes the following refactor, which shouldn't cause behavior
changes:

- Introduce a helper to do vertical stripe recovery
  This not only reduces one indent level, but also paves the road for
  later data checksum verification in RMW cycles.

- Sort rbio->faila/b before recovery
  So we don't need to do the same swap for every vertical stripe.

- Replace a BUG_ON() with ASSERT()
  Or checkpatch won't let me pass.

- Mark recovered sectors uptodate after the recovery loop

- Do the cleanup for pointers unconditionally
  We only need to initialize @pointers and @unmap_array to NULL, so
  we can safely free them unconditionally.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/raid56.c | 277 ++++++++++++++++++++++++----------------------
 1 file changed, 146 insertions(+), 131 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c009c0a2081e..1b2899173ae1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1885,6 +1885,127 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	bio_endio(bio);
 }
 
+/*
+ * Recover a vertical stripe specified by @sector_nr.
+ * @*pointers are the pre-allocated pointers by the caller, so we don't
+ * need to allocate/free the pointers again and again.
+ */
+static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			     void **pointers, void **unmap_array)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	struct sector_ptr *sector;
+	const u32 sectorsize = fs_info->sectorsize;
+	const int faila = rbio->faila;
+	const int failb = rbio->failb;
+	int stripe_nr;
+
+	/*
+	 * Now we just use bitmap to mark the horizontal stripes in
+	 * which we have data when doing parity scrub.
+	 */
+	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+	    !test_bit(sector_nr, &rbio->dbitmap))
+		return;
+
+	/*
+	 * Setup our array of pointers with sectors from each stripe
+	 *
+	 * NOTE: store a duplicate array of pointers to preserve the
+	 * pointer order.
+ */ + for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) { + /* + * If we're rebuilding a read, we have to use + * pages from the bio list + */ + if ((rbio->operation == BTRFS_RBIO_READ_REBUILD || + rbio->operation == BTRFS_RBIO_REBUILD_MISSING) && + (stripe_nr == faila || stripe_nr == failb)) { + sector = sector_in_rbio(rbio, stripe_nr, sector_nr, 0); + } else { + sector = rbio_stripe_sector(rbio, stripe_nr, sector_nr); + } + ASSERT(sector->page); + pointers[stripe_nr] = kmap_local_page(sector->page) + + sector->pgoff; + unmap_array[stripe_nr] = pointers[stripe_nr]; + } + + /* All raid6 handling here */ + if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) { + /* Single failure, rebuild from parity raid5 style */ + if (failb < 0) { + if (faila == rbio->nr_data) + /* + * Just the P stripe has failed, without + * a bad data or Q stripe. + * We have nothing to do, just skip the + * recovery for this stripe. + */ + goto cleanup; + /* + * a single failure in raid6 is rebuilt + * in the pstripe code below + */ + goto pstripe; + } + + /* + * If the q stripe is failed, do a pstripe reconstruction from + * the xors. + * If both the q stripe and the P stripe are failed, we're + * here due to a crc mismatch and we can't give them the + * data they want. + */ + if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) { + if (rbio->bioc->raid_map[faila] == + RAID5_P_STRIPE) + /* + * Only P and Q are corrupted. + * We only care about data stripes recovery, + * can skip this vertical stripe. + */ + goto cleanup; + /* + * Otherwise we have one bad data stripe and + * a good P stripe. raid5! + */ + goto pstripe; + } + + if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) { + raid6_datap_recov(rbio->real_stripes, sectorsize, + faila, pointers); + } else { + raid6_2data_recov(rbio->real_stripes, sectorsize, + faila, failb, pointers); + } + } else { + void *p; + + /* Rebuild from P stripe here (raid5 or raid6). */ + ASSERT(failb == -1); +pstripe: + /* Copy parity block into failed block to start with */ + memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize); + + /* Rearrange the pointer array */ + p = pointers[faila]; + for (stripe_nr = faila; stripe_nr < rbio->nr_data - 1; + stripe_nr++) + pointers[stripe_nr] = pointers[stripe_nr + 1]; + pointers[rbio->nr_data - 1] = p; + + /* Xor in the rest */ + run_xor(pointers, rbio->nr_data - 1, sectorsize); + } + +cleanup: + for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--) + kunmap_local(unmap_array[stripe_nr]); +} + /* * all parity reconstruction happens here. 
We've read in everything * we can find from the drives and this does the heavy lifting of @@ -1892,11 +2013,9 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc) */ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) { - const u32 sectorsize = rbio->bioc->fs_info->sectorsize; - int sectornr, stripe; - void **pointers; - void **unmap_array; - int faila = -1, failb = -1; + int sectornr; + void **pointers = NULL; + void **unmap_array = NULL; blk_status_t err; int i; @@ -1907,7 +2026,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS); if (!pointers) { err = BLK_STS_RESOURCE; - goto cleanup_io; + goto cleanup; } /* @@ -1917,11 +2036,12 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS); if (!unmap_array) { err = BLK_STS_RESOURCE; - goto cleanup_pointers; + goto cleanup; } - faila = rbio->faila; - failb = rbio->failb; + /* Make sure faila and fail b are in order. */ + if (rbio->faila >= 0 && rbio->failb >= 0 && rbio->faila > rbio->failb) + swap(rbio->faila, rbio->failb); if (rbio->operation == BTRFS_RBIO_READ_REBUILD || rbio->operation == BTRFS_RBIO_REBUILD_MISSING) { @@ -1932,138 +2052,33 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); - for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) { - struct sector_ptr *sector; - - /* - * Now we just use bitmap to mark the horizontal stripes in - * which we have data when doing parity scrub. - */ - if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB && - !test_bit(sectornr, &rbio->dbitmap)) - continue; - - /* - * Setup our array of pointers with sectors from each stripe - * - * NOTE: store a duplicate array of pointers to preserve the - * pointer order - */ - for (stripe = 0; stripe < rbio->real_stripes; stripe++) { - /* - * If we're rebuilding a read, we have to use - * pages from the bio list - */ - if ((rbio->operation == BTRFS_RBIO_READ_REBUILD || - rbio->operation == BTRFS_RBIO_REBUILD_MISSING) && - (stripe == faila || stripe == failb)) { - sector = sector_in_rbio(rbio, stripe, sectornr, 0); - } else { - sector = rbio_stripe_sector(rbio, stripe, sectornr); - } - ASSERT(sector->page); - pointers[stripe] = kmap_local_page(sector->page) + - sector->pgoff; - unmap_array[stripe] = pointers[stripe]; - } - - /* All raid6 handling here */ - if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) { - /* Single failure, rebuild from parity raid5 style */ - if (failb < 0) { - if (faila == rbio->nr_data) { - /* - * Just the P stripe has failed, without - * a bad data or Q stripe. - * TODO, we should redo the xor here. - */ - err = BLK_STS_IOERR; - goto cleanup; - } - /* - * a single failure in raid6 is rebuilt - * in the pstripe code below - */ - goto pstripe; - } - - /* make sure our ps and qs are in order */ - if (faila > failb) - swap(faila, failb); + for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) + recover_vertical(rbio, sectornr, pointers, unmap_array); - /* if the q stripe is failed, do a pstripe reconstruction - * from the xors. - * If both the q stripe and the P stripe are failed, we're - * here due to a crc mismatch and we can't give them the - * data they want - */ - if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) { - if (rbio->bioc->raid_map[faila] == - RAID5_P_STRIPE) { - err = BLK_STS_IOERR; - goto cleanup; - } - /* - * otherwise we have one bad data stripe and - * a good P stripe. 
raid5!
-			 */
-			goto pstripe;
-		}
-
-		if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
-			raid6_datap_recov(rbio->real_stripes,
-					  sectorsize, faila, pointers);
-		} else {
-			raid6_2data_recov(rbio->real_stripes,
-					  sectorsize, faila, failb,
-					  pointers);
-		}
-	} else {
-		void *p;
-
-		/* rebuild from P stripe here (raid5 or raid6) */
-		BUG_ON(failb != -1);
-pstripe:
-		/* Copy parity block into failed block to start with */
-		memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
-
-		/* rearrange the pointer array */
-		p = pointers[faila];
-		for (stripe = faila; stripe < rbio->nr_data - 1; stripe++)
-			pointers[stripe] = pointers[stripe + 1];
-		pointers[rbio->nr_data - 1] = p;
+	/*
+	 * No matter if this is a RMW or recovery, we should have all
+	 * failed sectors repaired, thus they are now uptodate.
+	 * Especially if we determine to cache the rbio, we need to
+	 * have at least all data sectors uptodate.
+	 */
+	for (i = 0; i < rbio->stripe_nsectors; i++) {
+		struct sector_ptr *sector;
 
-		/* xor in the rest */
-		run_xor(pointers, rbio->nr_data - 1, sectorsize);
+		if (rbio->faila != -1) {
+			sector = rbio_stripe_sector(rbio, rbio->faila, i);
+			sector->uptodate = 1;
 		}
-
-		/*
-		 * No matter if this is a RMW or recovery, we should have all
-		 * failed sectors repaired, thus they are now uptodate.
-		 * Especially if we determine to cache the rbio, we need to
-		 * have at least all data sectors uptodate.
-		 */
-		for (i = 0; i < rbio->stripe_nsectors; i++) {
-			if (faila != -1) {
-				sector = rbio_stripe_sector(rbio, faila, i);
-				sector->uptodate = 1;
-			}
-			if (failb != -1) {
-				sector = rbio_stripe_sector(rbio, failb, i);
-				sector->uptodate = 1;
-			}
+		if (rbio->failb != -1) {
+			sector = rbio_stripe_sector(rbio, rbio->failb, i);
+			sector->uptodate = 1;
 		}
-		for (stripe = rbio->real_stripes - 1; stripe >= 0; stripe--)
-			kunmap_local(unmap_array[stripe]);
 	}
 
-	err = BLK_STS_OK;
+
 cleanup:
 	kfree(unmap_array);
-cleanup_pointers:
 	kfree(pointers);
-cleanup_io:
 	/*
 	 * Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
 	 * valid rbio which is consistent with ondisk content, thus such a

From patchwork Fri Oct 14 07:17:11 2022
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13006715
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 3/5] btrfs: introduce a bitmap based csum range search
 function
Date: Fri, 14 Oct 2022 15:17:11 +0800
Message-Id: <0c4cd30d323b4e25937849652985391dd666d32e.1665730948.git.wqu@suse.com>

Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.

For the incoming RAID56 data checksum verification at RMW time, we
don't want to waste time allocating that temporary memory.

So this patch introduces a new helper, btrfs_lookup_csums_bitmap().
The new helper returns a bitmap-based result, which is a perfect fit
for the later RAID56 usage.
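
As a quick illustration of the intended calling convention (a minimal
sketch only, not part of the patch; the function name
example_lookup_one_range() is made up, and the buffer sizing follows
the helper's contract of one csum slot and one bitmap bit per sector):

	/*
	 * Sketch: look up data csums for @nsectors sectors starting at
	 * @start.  Sectors with no csum simply keep their bit clear.
	 */
	static int example_lookup_one_range(struct btrfs_root *csum_root,
					    u64 start, u32 nsectors)
	{
		struct btrfs_fs_info *fs_info = csum_root->fs_info;
		const u64 end = start +
			((u64)nsectors << fs_info->sectorsize_bits) - 1;
		unsigned long *csum_bitmap;
		u8 *csum_buf;
		int ret;

		csum_buf = kzalloc(nsectors * fs_info->csum_size, GFP_NOFS);
		csum_bitmap = bitmap_zalloc(nsectors, GFP_NOFS);
		if (!csum_buf || !csum_bitmap) {
			ret = -ENOMEM;
			goto out;
		}

		ret = btrfs_lookup_csums_bitmap(csum_root, start, end,
						csum_buf, csum_bitmap);
	out:
		kfree(csum_buf);
		bitmap_free(csum_bitmap);
		return ret;
	}

This mirrors how the later RAID56 patch in this series sizes and frees
its rbio->csum_buf and rbio->csum_bitmap.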
Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h      |   8 ++-
 fs/btrfs/file-item.c  | 127 +++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/inode.c      |   6 +-
 fs/btrfs/relocation.c |   4 +-
 fs/btrfs/scrub.c      |   8 +--
 fs/btrfs/tree-log.c   |  16 +++---
 6 files changed, 146 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a8b629a166be..7cf011b28c71 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2990,9 +2990,11 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 				u64 offset, bool one_ordered);
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
-			     struct list_head *list, int search_commit,
-			     bool nowait);
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+			    struct list_head *list, int search_commit,
+			    bool nowait);
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+			      u8 *csum_buf, unsigned long *csum_bitmap);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 				     const struct btrfs_path *path,
 				     struct btrfs_file_extent_item *fi,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 8d9c4488d86a..b52f13a1bb20 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -523,9 +523,9 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	return ret;
 }
 
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
-			     struct list_head *list, int search_commit,
-			     bool nowait)
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+			    struct list_head *list, int search_commit,
+			    bool nowait)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
@@ -657,6 +657,127 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	return ret;
 }
 
+/*
+ * Do the same work as btrfs_lookup_csums_list(), the difference is in
+ * how we return the result.
+ *
+ * This version will set the corresponding bits in @csum_bitmap to
+ * represent that there is a csum found.
+ * Each bit represents a sector. Thus the caller should ensure the
+ * @csum_buf passed in is large enough to contain all the csums.
+ */
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+			      u8 *csum_buf, unsigned long *csum_bitmap)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_csum_item *item;
+	const u64 orig_start = start;
+	int ret;
+
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+	       IS_ALIGNED(end + 1, fs_info->sectorsize));
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+	key.offset = start;
+	key.type = BTRFS_EXTENT_CSUM_KEY;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto fail;
+	if (ret > 0 && path->slots[0] > 0) {
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+		/*
+		 * There are two cases we can hit here for the previous
+		 * csum item.
+		 *
+		 *		|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Or
+		 *				|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Check if the previous csum item covers the leading part
+		 * of the search range.
+		 * If so we have to start from the previous csum item.
+ */ + if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID && + key.type == BTRFS_EXTENT_CSUM_KEY) { + if (bytes_to_csum_size(fs_info, start - key.offset) < + btrfs_item_size(leaf, path->slots[0] - 1)) + path->slots[0]--; + } + } + + while (start <= end) { + u64 csum_end; + + leaf = path->nodes[0]; + if (path->slots[0] >= btrfs_header_nritems(leaf)) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + goto fail; + if (ret > 0) + break; + leaf = path->nodes[0]; + } + + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID || + key.type != BTRFS_EXTENT_CSUM_KEY || + key.offset > end) + break; + + if (key.offset > start) + start = key.offset; + + csum_end = key.offset + csum_size_to_bytes(fs_info, + btrfs_item_size(leaf, path->slots[0])); + if (csum_end <= start) { + path->slots[0]++; + continue; + } + + csum_end = min(csum_end, end + 1); + item = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_csum_item); + while (start < csum_end) { + unsigned long offset; + size_t size; + u8 *csum_dest = csum_buf + bytes_to_csum_size(fs_info, + start - orig_start); + + size = min_t(size_t, csum_end - start, end + 1 - start); + + offset = bytes_to_csum_size(fs_info, start - key.offset); + + read_extent_buffer(path->nodes[0], csum_dest, + ((unsigned long)item) + offset, + bytes_to_csum_size(fs_info, size)); + + bitmap_set(csum_bitmap, + (start - orig_start) >> fs_info->sectorsize_bits, + size >> fs_info->sectorsize_bits); + + start += size; + } + path->slots[0]++; + } + ret = 0; +fail: + btrfs_free_path(path); + return ret; +} + /** * Calculate checksums of the data contained inside a bio * diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d347362a87d3..69ea6bbda293 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1695,9 +1695,9 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info, int ret; LIST_HEAD(list); - ret = btrfs_lookup_csums_range(csum_root, bytenr, - bytenr + num_bytes - 1, &list, 0, - nowait); + ret = btrfs_lookup_csums_list(csum_root, bytenr, + bytenr + num_bytes - 1, &list, 0, + nowait); if (ret == 0 && list_empty(&list)) return 0; diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 216a4485d914..3cbcf7f010be 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4343,8 +4343,8 @@ int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len) disk_bytenr = file_pos + inode->index_cnt; csum_root = btrfs_csum_root(fs_info, disk_bytenr); - ret = btrfs_lookup_csums_range(csum_root, disk_bytenr, - disk_bytenr + len - 1, &list, 0, false); + ret = btrfs_lookup_csums_list(csum_root, disk_bytenr, + disk_bytenr + len - 1, &list, 0, false); if (ret) goto out; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 9e3b2e60e571..ac59493977a9 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3238,9 +3238,9 @@ static int scrub_raid56_data_stripe_for_parity(struct scrub_ctx *sctx, extent_dev = bioc->stripes[0].dev; btrfs_put_bioc(bioc); - ret = btrfs_lookup_csums_range(csum_root, extent_start, - extent_start + extent_size - 1, - &sctx->csum_list, 1, false); + ret = btrfs_lookup_csums_list(csum_root, extent_start, + extent_start + extent_size - 1, + &sctx->csum_list, 1, false); if (ret) { scrub_parity_mark_sectors_error(sparity, extent_start, extent_size); @@ -3464,7 +3464,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx, cur_logical; if (extent_flags & BTRFS_EXTENT_FLAG_DATA) { - ret = btrfs_lookup_csums_range(csum_root, cur_logical, + ret = 
btrfs_lookup_csums_list(csum_root, cur_logical,
 					      cur_logical + scrub_len - 1,
 					      &sctx->csum_list, 1, false);
 			if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 813986e38258..faf783a3c00b 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -799,7 +799,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 				btrfs_file_extent_num_bytes(eb, item);
 		}
 
-		ret = btrfs_lookup_csums_range(root->log_root,
+		ret = btrfs_lookup_csums_list(root->log_root,
 				csum_start, csum_end - 1,
 				&ordered_sums, 0, false);
 		if (ret)
@@ -4400,9 +4400,9 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 		csum_root = btrfs_csum_root(trans->fs_info, disk_bytenr);
 		disk_bytenr += extent_offset;
-		ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
-					       disk_bytenr + extent_num_bytes - 1,
-					       &ordered_sums, 0, false);
+		ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+					      disk_bytenr + extent_num_bytes - 1,
+					      &ordered_sums, 0, false);
 		if (ret)
 			goto out;
 
@@ -4595,10 +4595,10 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 	/* block start is already adjusted for the file extent offset. */
 	csum_root = btrfs_csum_root(trans->fs_info, em->block_start);
-	ret = btrfs_lookup_csums_range(csum_root,
-				       em->block_start + csum_offset,
-				       em->block_start + csum_offset +
-				       csum_len - 1, &ordered_sums, 0, false);
+	ret = btrfs_lookup_csums_list(csum_root,
+				      em->block_start + csum_offset,
+				      em->block_start + csum_offset +
+				      csum_len - 1, &ordered_sums, 0, false);
 	if (ret)
 		return ret;

From patchwork Fri Oct 14 07:17:12 2022
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13006714
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 4/5] btrfs: raid56: prepare data checksums for later
 sub-stripe verification
Date: Fri, 14 Oct 2022 15:17:12 +0800
Message-Id: <1e3f5e809daa4e819e140b260e47755357b4a3dd.1665730948.git.wqu@suse.com>

This is for the later data verification at RMW time.

This patch will try to allocate the needed memory for a locked rbio if
the rbio is exclusively for data (we don't want to handle mixed block
groups yet).

The memory will be released when the rbio is unlocked.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/raid56.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/raid56.h | 12 ++++++++
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1b2899173ae1..8f7e25001a2b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
+#include "ordered-data.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT	1
@@ -854,6 +855,16 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 	struct bio *cur = bio_list_get(&rbio->bio_list);
 	struct bio *extra;
 
+	/*
+	 * Also free up the rbio->csum_*.
+	 * Every sub-stripe write should allocate its own csum buffer and
+	 * bitmap, no cached result.
+	 */
+	kfree(rbio->csum_buf);
+	bitmap_free(rbio->csum_bitmap);
+	rbio->csum_buf = NULL;
+	rbio->csum_bitmap = NULL;
+
 	/*
 	 * Clear the data bitmap, as the rbio may be cached for later usage.
 	 * do this before before unlock_stripe() so there will be no new bio
@@ -1675,6 +1686,63 @@ static int full_stripe_write(struct btrfs_raid_bio *rbio)
 	return 0;
 }
 
+static void fill_data_csums(struct btrfs_raid_bio *rbio)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
+						       rbio->bioc->raid_map[0]);
+	const u64 start = rbio->bioc->raid_map[0];
+	const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
+			fs_info->sectorsize_bits;
+	int ret;
+
+	/* The rbio should not have its csum buffer initialized. */
+	ASSERT(!rbio->csum_buf && !rbio->csum_bitmap);
+
+	/*
+	 * Skip the csum search if:
+	 *
+	 * - The rbio doesn't belong to data block groups
+	 *   Then we are doing IO for tree blocks, no need to
+	 *   search csums.
+	 *
+	 * - The rbio belongs to mixed block groups
+	 *   This is to avoid deadlock, as we're already holding
+	 *   the full stripe lock, if we trigger a metadata read, and
+	 *   it needs to do raid56 recovery, we will deadlock.
+	 */
+	if (!(rbio->bioc->map_type & BTRFS_BLOCK_GROUP_DATA) ||
+	    rbio->bioc->map_type & BTRFS_BLOCK_GROUP_METADATA)
+		return;
+
+	rbio->csum_buf = kzalloc(rbio->nr_data * rbio->stripe_nsectors *
+				 fs_info->csum_size, GFP_NOFS);
+	rbio->csum_bitmap = bitmap_zalloc(rbio->nr_data * rbio->stripe_nsectors,
+					  GFP_NOFS);
+	if (!rbio->csum_buf || !rbio->csum_bitmap)
+		goto enomem;
+
+	ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1,
+					rbio->csum_buf, rbio->csum_bitmap);
+	if (ret < 0)
+		goto enomem;
+	return;
+
+enomem:
+	/*
+	 * We failed to allocate memory for rbio csum verification,
+	 * but it's not the end of the day, we can still continue.
+	 * But better to warn users that RMW is no longer safe.
+	 */
+	btrfs_warn_rl(fs_info,
+"failed to allocate memory, sub-stripe write for full stripe %llu is not safe",
+		      rbio->bioc->raid_map[0]);
+	kfree(rbio->csum_buf);
+	bitmap_free(rbio->csum_bitmap);
+	rbio->csum_buf = NULL;
+	rbio->csum_bitmap = NULL;
+}
+
 /*
  * partial stripe writes get handed over to async helpers.
  * We're really hoping to merge a few more writes into this
@@ -1685,8 +1753,10 @@ static int partial_stripe_write(struct btrfs_raid_bio *rbio)
 	int ret;
 
 	ret = lock_stripe_add(rbio);
-	if (ret == 0)
+	if (ret == 0) {
+		fill_data_csums(rbio);
 		start_async_work(rbio, rmw_work);
+	}
 	return 0;
 }
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 91d5c0adad15..fa82ca158899 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -126,6 +126,18 @@ struct btrfs_raid_bio {
 
 	/* Allocated with real_stripes-many pointers for finish_*() calls */
 	void **finish_pointers;
+
+	/*
+	 * Checksum buffer if the rbio is for data.
+	 * The buffer should cover all data sectors (excluding P/Q sectors).
+	 */
+	u8 *csum_buf;
+
+	/*
+	 * Each bit represents if the corresponding sector has data csum found.
+	 * Should only cover data sectors (excluding P/Q sectors).
+	 */
+	unsigned long *csum_bitmap;
 };
 
 /*

From patchwork Fri Oct 14 07:17:13 2022
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 13006716
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum
 verification before RMW
Date: Fri, 14 Oct 2022 15:17:13 +0800
Message-Id: <9389bb7901990ef7c0d0c58dc4c1168fd4240d27.1665730948.git.wqu@suse.com>

[BUG]
For the following small script, btrfs will be unable to recover the
content of file1:

  mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3

  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
  md5sum $mnt/file1
  umount $mnt

  # Corrupt the above 64K data stripe.
  xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
  mount $dev1 $mnt

  # Write a new 64K, which should be in the other data stripe.
  # This is a sub-stripe write, which will cause RMW.
  xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
  md5sum $mnt/file1
  umount $mnt

The above md5sum would fail.
[CAUSE]
There is a long-standing problem for raid56 (not limited to btrfs
raid56): if we already have some corrupted on-disk data, and then
trigger a sub-stripe write (which needs an RMW cycle), it can cause
further damage to the P/Q stripe.

  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000|
  Disk 3: parity |0xffffffffffff|

In the above case, data 1 is already corrupted; the original data
should be 64KiB of 0xff.

At this stage, if we read data 1, and it has a data checksum, we can
still recover by going through the regular RAID56 recovery path.

But if now we decide to write some data into data 2, then we need to
go RMW.
Say we want to write 64KiB of '0x00' into data 2; then we read the
on-disk data of data 1, calculate the new parity, resulting in the
following layout:

  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000| <- New '0x00' writes
  Disk 3: parity |0x000000000000| <- New parity

But the new parity is calculated using the *corrupted* data 1, so we
can no longer recover the correct data of data 1. Thus the corruption
is there forever.

[FIX]
To solve the above problem, this patch will do a full stripe data
checksum verification at RMW time.

This involves the following changes:

- For raid56_rmw_stripe() we need to read the full stripe
  Unlike the old behavior, which will only read out the missing part,
  now we have to read out the full stripe (including data and P/Q), so
  we can do recovery later.
  All the reads will go directly into the rbio stripe sectors.

  This will not affect the cached case, as a cached rbio will always
  have all its data sectors cached.
  And since cached sectors are already after an RMW (which implies the
  same data csum check), they should always be correct.

- Do data checksum verification for the full stripe
  The target is only the rbio stripe sectors.
  The checksum is already filled before we reach finish_rmw().
  Currently we only do the verification if there is no missing device.

  The checksum verification will do the following work:

  * Verify the data checksums for the sectors of a vertical stripe
    which have a data checksum.
    Sectors which don't have a csum are considered fine.

  * Check if we need to repair the vertical stripe
    If there is no checksum mismatch, we can continue to the next
    vertical stripe.
    If there are more corrupted sectors than the tolerance, we error
    out, marking all the bios failed, to avoid further corruption.

  * Repair the vertical stripe

  * Recheck the content of the failed sectors
    If the repaired sectors cannot pass the checksum verification, we
    error out.

This is a much stricter workflow, meaning now we will not write a full
stripe if it cannot pass full stripe data verification.

Obviously this will have a performance impact, as we are doing more
work during the RMW cycle:

- Fetch the data checksums
- Do checksum verification for all data stripes
- Do recovery if needed
- Do checksum verification again

But for a full stripe write we won't do any of this, thus for a fully
optimized RAID56 workload (always full stripe writes), there should be
no extra overhead.

To me, the extra overhead looks reasonable, as data consistency is way
more important than performance.
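To make the XOR arithmetic in [CAUSE] concrete, here is a tiny
standalone sketch (illustration only, with single bytes standing in
for whole 64KiB stripes) showing how RMW bakes existing corruption
into the new parity:

	#include <assert.h>

	int main(void)
	{
		unsigned char data1 = 0x00;  /* on-disk, corrupted; should be 0xff */
		unsigned char data2 = 0x00;  /* new sub-stripe write */

		/* RMW reads the corrupted data1 to compute the new parity. */
		unsigned char parity = data1 ^ data2;  /* 0x00, poisoned */

		/*
		 * A later rebuild of data1 from parity can only return the
		 * corrupted value, never the original 0xff.
		 */
		assert((unsigned char)(parity ^ data2) == data1);
		return 0;
	}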
Signed-off-by: Qu Wenruo
---
 fs/btrfs/raid56.c | 279 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 251 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 8f7e25001a2b..e51c07bd4c7b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1203,6 +1203,150 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
 	trace_info->stripe_nr = -1;
 }
 
+static void assign_one_failed(int failed, int *faila, int *failb)
+{
+	if (*faila < 0)
+		*faila = failed;
+	else if (*failb < 0)
+		*failb = failed;
+}
+
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    int extra_faila, int extra_failb,
+			    void **pointers, void **unmap_array);
+
+static int rmw_validate_data_csums(struct btrfs_raid_bio *rbio)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	int sector_nr;
+	int tolerance;
+	void **pointers = NULL;
+	void **unmap_array = NULL;
+	int ret = 0;
+
+	/* No checksum, no need to check. */
+	if (!rbio->csum_buf || !rbio->csum_bitmap)
+		return 0;
+
+	/* No datacsum in the range. */
+	if (bitmap_empty(rbio->csum_bitmap,
+			 rbio->nr_data * rbio->stripe_nsectors))
+		return 0;
+
+	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	if (!pointers || !unmap_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
+		tolerance = 1;
+	else
+		tolerance = 2;
+
+	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
+		int stripe_nr;
+		int found_error = 0;
+		int faila = -1;
+		int failb = -1;
+
+		/* Check the datacsum of this vertical stripe. */
+		for (stripe_nr = 0; stripe_nr < rbio->nr_data; stripe_nr++) {
+			struct sector_ptr *sector = rbio_stripe_sector(rbio,
+						stripe_nr, sector_nr);
+			int total_sector = stripe_nr * rbio->stripe_nsectors +
+					   sector_nr;
+			u8 *csum_expected = rbio->csum_buf + total_sector *
+					    fs_info->csum_size;
+			u8 csum[BTRFS_CSUM_SIZE];
+			int ret;
+
+			/*
+			 * No datacsum for this sector, by default consider
+			 * the sector correct.
+			 */
+			if (!test_bit(total_sector, rbio->csum_bitmap))
+				continue;
+
+			ret = btrfs_check_sector_csum(fs_info, sector->page,
+					sector->pgoff, csum, csum_expected);
+			if (ret < 0) {
+				assign_one_failed(stripe_nr, &faila, &failb);
+				found_error++;
+			}
+		}
+
+		/* No error found, everything is fine. */
+		if (found_error == 0)
+			continue;
+
+		/* Check if we can handle all the errors we found above. */
+		if (found_error + atomic_read(&rbio->error) > tolerance) {
+			btrfs_err_rl(fs_info,
+"full stripe %llu vertical stripe %d has too many corruptions, tolerance %d corruptions %d",
+				     rbio->bioc->raid_map[0], sector_nr,
+				     tolerance,
+				     found_error + atomic_read(&rbio->error));
+			ret = -EIO;
+			goto out;
+		}
+
+		/* Try to recover the corrupted sectors in the vertical stripe. */
+		ret = recover_vertical(rbio, sector_nr, faila, failb,
+				       pointers, unmap_array);
+		if (ret < 0) {
+			btrfs_err_rl(fs_info,
+		"failed to recover full stripe %llu vertical stripe %d: %d",
+				     rbio->bioc->raid_map[0], sector_nr, ret);
+			goto out;
+		}
+
+		/* Recheck the failed sectors.
*/ + found_error = 0; + for (stripe_nr = 0; stripe_nr < rbio->nr_data; stripe_nr++) { + struct sector_ptr *sector = rbio_stripe_sector(rbio, + stripe_nr, sector_nr); + int total_sector = stripe_nr * rbio->stripe_nsectors + + sector_nr; + u8 *csum_expected = rbio->csum_buf + total_sector * + fs_info->csum_size; + u8 csum[BTRFS_CSUM_SIZE]; + int ret; + + /* + * No datacsum for this sector, by default consider + * the sector correct. + */ + if (!test_bit(total_sector, rbio->csum_bitmap)) + continue; + + /* We only care about the failed sector. */ + if (stripe_nr != faila && stripe_nr != failb) + continue; + + ret = btrfs_check_sector_csum(fs_info, sector->page, + sector->pgoff, csum, csum_expected); + if (ret < 0) + found_error++; + } + /* + * TODO: We may want to rollback to the old content. For now + * we will error out, thus no big deal. + */ + if (found_error) { + ret = -EIO; + btrfs_err_rl(fs_info, + "full stripe %llu vertical stripe %d still has error after repair", + rbio->bioc->raid_map[0], sector_nr); + goto out; + } + } +out: + kfree(pointers); + kfree(unmap_array); + return ret; +} + /* * this is called from one of two situations. We either * have a full stripe from the higher layers, or we've read all @@ -1227,6 +1371,14 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) struct bio *bio; int ret; + ret = rmw_validate_data_csums(rbio); + if (ret < 0) { + btrfs_err_rl(bioc->fs_info, + "full stripe %llu already has corrupted data, abort RMW cycle for it", + bioc->raid_map[0]); + goto cleanup; + } + bio_list_init(&bio_list); if (rbio->real_stripes - rbio->nr_data == 1) @@ -1583,6 +1735,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) const int nr_data_sectors = rbio->stripe_nsectors * rbio->nr_data; int ret; int total_sector_nr; + bool need_read = false; struct bio *bio; bio_list_init(&bio_list); @@ -1594,46 +1747,49 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); atomic_set(&rbio->error, 0); - /* Build a list of bios to read all the missing data sectors. */ + + /* First check if we need to read anything from the disk. */ for (total_sector_nr = 0; total_sector_nr < nr_data_sectors; total_sector_nr++) { - struct sector_ptr *sector; int stripe = total_sector_nr / rbio->stripe_nsectors; int sectornr = total_sector_nr % rbio->stripe_nsectors; + struct sector_ptr *sector; - /* - * We want to find all the sectors missing from the rbio and - * read them from the disk. If sector_in_rbio() finds a page - * in the bio list we don't need to read it off the stripe. - */ + /* This sector is already covered by bio. */ sector = sector_in_rbio(rbio, stripe, sectornr, 1); if (sector) continue; sector = rbio_stripe_sector(rbio, stripe, sectornr); - /* - * The bio cache may have handed us an uptodate page. If so, - * use it. - */ + /* This sector is already covered by cached sector. */ if (sector->uptodate) continue; + /* + * Any sector not covered by bio nor cache means we need to + * read out the full stripe for later full stripe verification. 
+		 */
+		need_read = true;
+		break;
+	}
+	if (!need_read)
+		goto finish;
+
+	/* Build a list of all sectors (including data and P/Q) */
+	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
+	     total_sector_nr++) {
+		struct sector_ptr *sector;
+		int stripe = total_sector_nr / rbio->stripe_nsectors;
+		int sectornr = total_sector_nr % rbio->stripe_nsectors;
+
+		sector = rbio_stripe_sector(rbio, stripe, sectornr);
 		ret = rbio_add_io_sector(rbio, &bio_list, sector,
 					 stripe, sectornr, REQ_OP_READ);
 		if (ret)
 			goto cleanup;
 	}
 
-	bios_to_read = bio_list_size(&bio_list);
-	if (!bios_to_read) {
-		/*
-		 * this can happen if others have merged with
-		 * us, it means there is nothing left to read.
-		 * But if there are missing devices it may not be
-		 * safe to do the full stripe write yet.
-		 */
-		goto finish;
-	}
+	ASSERT(bio_list_size(&bio_list));
 
 	/*
 	 * The bioc may be freed once we submit the last bio. Make sure not to
@@ -1955,28 +2111,87 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	bio_endio(bio);
 }
 
+/*
+ * We can have two sources of faila/b, one from rbio->faila/b, the other
+ * from @extra_faila/b (caused by data verification at RMW time).
+ *
+ * So we need to assign them properly using this helper.
+ *
+ * Return the total number of failed stripes (excluding the case where
+ * extra_faila/b is the same as rbio->faila/b)
+ */
+static int calculate_failab(struct btrfs_raid_bio *rbio,
+			    int extra_faila, int extra_failb,
+			    int *faila, int *failb)
+{
+	int failed = 0;
+
+	*faila = -1;
+	*failb = -1;
+
+	/* Check rbio fails first. */
+	if (rbio->faila >= 0) {
+		assign_one_failed(rbio->faila, faila, failb);
+		failed++;
+	}
+	if (rbio->failb >= 0) {
+		assign_one_failed(rbio->failb, faila, failb);
+		failed++;
+	}
+
+	/* Then the extra ones. */
+	if (extra_faila >= 0) {
+		assign_one_failed(extra_faila, faila, failb);
+		if (extra_faila != *faila && extra_faila != *failb)
+			failed++;
+	}
+	if (extra_failb >= 0) {
+		assign_one_failed(extra_failb, faila, failb);
+		if (extra_failb != *faila && extra_failb != *failb)
+			failed++;
+	}
+	return failed;
+}
+
 /*
  * Recover a vertical stripe specified by @sector_nr.
  * @*pointers are the pre-allocated pointers by the caller, so we don't
  * need to allocate/free the pointers again and again.
+ *
+ * @extra_faila and @extra_failb are for the RMW data verification case,
+ * where we do a scrub-level check to find out the corrupted sectors in a
+ * vertical stripe.
  */
-static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
-			     void **pointers, void **unmap_array)
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    int extra_faila, int extra_failb,
+			    void **pointers, void **unmap_array)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	struct sector_ptr *sector;
 	const u32 sectorsize = fs_info->sectorsize;
-	const int faila = rbio->faila;
-	const int failb = rbio->failb;
+	int faila = -1;
+	int failb = -1;
+	int failed = 0;
+	int tolerance;
 	int stripe_nr;
 
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
+		tolerance = 1;
+	else
+		tolerance = 2;
+
+	failed = calculate_failab(rbio, extra_faila, extra_failb,
+				  &faila, &failb);
+
+	if (failed > tolerance)
+		return -EIO;
+
 	/*
 	 * Now we just use bitmap to mark the horizontal stripes in
 	 * which we have data when doing parity scrub.
*/ if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB && !test_bit(sector_nr, &rbio->dbitmap)) - return; + return 0; /* * Setup our array of pointers with sectors from each stripe @@ -2074,6 +2289,7 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr, cleanup: for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--) kunmap_local(unmap_array[stripe_nr]); + return 0; } /* @@ -2122,8 +2338,15 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); - for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) - recover_vertical(rbio, sectornr, pointers, unmap_array); + for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) { + int ret; + + ret = recover_vertical(rbio, sectornr, -1, -1, pointers, unmap_array); + if (ret < 0) { + err = BLK_STS_IOERR; + goto cleanup; + } + } /* * No matter if this is a RMW or recovery, we should have all