From patchwork Thu Oct 1 18:36:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812187 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C22091668 for ; Thu, 1 Oct 2020 18:38:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A0069207DE for ; Thu, 1 Oct 2020 18:38:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="lZaAfXcr" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732643AbgJASiC (ORCPT ); Thu, 1 Oct 2020 14:38:02 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24678 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729927AbgJASiB (ORCPT ); Thu, 1 Oct 2020 14:38:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577481; x=1633113481; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0B7GyYvxBZKiLZNQGeYgaqd1Sn4PpO+kC4na0Gsi84M=; b=lZaAfXcrnqU4WEMOI3awKIfXQxk3UDHRjoWmTvIZ3LHW9ZrLF+cpdA62 NFqb6M2SlayxioCb9hfWUp7oYMFb+nTSXjGN0hFIGN2AaMH6JcJeLtFZo k5PmrKfLn5VDcM1kksVdel5gNY/75GjMGOfE0R/hVdrDPispFB8KXALcw TUKET+qXABK/B14vodGTXnSVPUO6LIY4TVSGm3aVvjsi3EHWhed0NKTiw Pxe22vXAs8jUv27Iu6e9YWRxyB9l9wy2MuIM5qggFXTI8rCrcbauCufmQ yfwVM7eCzpcmgYnvvU+w7TYvl9v8XiwhEV+V7qfh9IPwLtMTd2oIgv+rc A==; IronPort-SDR: yhfxsAloTbWh554jJDGTUtI543nYddA/9CdtYYsJoYjDUU0/UpDYrb0+1j9UWgdhEdk7Ik2pyI jmekZzF4TpzZHW6VM13FMkJFuI4KJSrnT2/a/7uD7Q4k4qZ8H72mLBbIj3Ac3DAzCczVaC8uH5 1QvB7FfqPBXRu2WWJiQ6u8O8pG48WlPxS2CqfayLf3JjwOsWBQBH5iAIlRnNsxsEEdIn2t9w3T 
bKNrNMaCHk3aJLVazaUuJVNw0Bckw5BGUwEkovFX+FeXBjuFRMOEAQ6Aej3xbi8T/p35HqNEdV L3g= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036761" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:01 +0800 IronPort-SDR: iRHX3kRgGvNQ1RQqAT9E24T8KP4P2O30ZhDaH2hqNBxW2Ku9nkllvUPFWW5ovRvp+7n3JnEYCM /F/NSBqezLGQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:23:57 -0700 IronPort-SDR: 1+jpWQkcqoZkTTcxMrVytEuFdlDteWcFKJGAXztXvlGGHbgvnJ8LaZ+rjKa+96IOJsMPGrP8vp K3gl469s1uBg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:00 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Johannes Thumshirn Subject: [PATCH v8 01/41] block: add bio_add_zone_append_page Date: Fri, 2 Oct 2020 03:36:08 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Johannes Thumshirn Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which is intended to be used by file systems that directly add pages to a bio instead of using bio_iov_iter_get_pages(). 
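For illustration only (not part of this patch), the checks made by the new helper can be modelled in user space. This is a minimal sketch under stated assumptions: `struct queue_model`, `struct bio_model` and `model_add_zone_append_page()` are hypothetical names, the size test mirrors what bio_add_hw_page() is understood to do with its max_sectors argument, and bvec merging and segment limits are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these names exist in the kernel. */
struct queue_model {
	bool zoned;                           /* models blk_queue_is_zoned(q) */
	unsigned int max_zone_append_sectors; /* limit in 512-byte sectors */
};

struct bio_model {
	bool zone_append;            /* models bio_op(bio) == REQ_OP_ZONE_APPEND */
	const struct queue_model *q; /* models bio->bi_disk->queue, may be NULL */
	unsigned int size;           /* bytes already queued in the bio */
};

/*
 * Returns len when the data fits and 0 otherwise. Assumption: this matches
 * the return convention of bio_add_hw_page(). Only the sanity checks and
 * the zone-append size limit are modelled here.
 */
static unsigned int model_add_zone_append_page(struct bio_model *bio,
					       unsigned int len)
{
	assert(bio != NULL);

	if (!bio->zone_append)		/* wrong operation type */
		return 0;
	if (!bio->q)			/* no device attached yet */
		return 0;
	if (!bio->q->zoned)		/* target is not a zoned device */
		return 0;

	/* Reject the page if the bio would exceed the append limit. */
	if (((bio->size + len) >> 9) > bio->q->max_zone_append_sectors)
		return 0;

	bio->size += len;
	return len;
}
```

A caller in this model has to treat a return value of 0 as "start a new bio", which is the same contract a file system would follow with the real helper.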
Signed-off-by: Johannes Thumshirn --- block/bio.c | 36 ++++++++++++++++++++++++++++++++++++ include/linux/bio.h | 2 ++ 2 files changed, 38 insertions(+) diff --git a/block/bio.c b/block/bio.c index e865ea55b9f9..e0d41ccc4e90 100644 --- a/block/bio.c +++ b/block/bio.c @@ -853,6 +853,42 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, } EXPORT_SYMBOL(bio_add_pc_page); +/** + * bio_add_zone_append_page - attempt to add page to zone-append bio + * @bio: destination bio + * @page: page to add + * @len: vec entry length + * @offset: vec entry offset + * + * Attempt to add a page to the bio_vec maplist of a bio that will be submitted + * for a zone-append request. This can fail for a number of reasons, such as the + * bio being full or the target block device is not a zoned block device or + * other limitations of the target block device. The target block device must + * allow bio's up to PAGE_SIZE, so it is always possible to add a single page + * to an empty bio. + */ +int bio_add_zone_append_page(struct bio *bio, struct page *page, + unsigned int len, unsigned int offset) +{ + struct request_queue *q; + bool same_page = false; + + if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND)) + return 0; + + if (WARN_ON_ONCE(!bio->bi_disk)) + return 0; + + q = bio->bi_disk->queue; + + if (WARN_ON_ONCE(!blk_queue_is_zoned(q))) + return 0; + + return bio_add_hw_page(q, bio, page, len, offset, + queue_max_zone_append_sectors(q), &same_page); +} +EXPORT_SYMBOL_GPL(bio_add_zone_append_page); + /** * __bio_try_merge_page - try appending data to an existing bvec. 
* @bio: destination bio diff --git a/include/linux/bio.h b/include/linux/bio.h index c6d765382926..7ef300cb4e9a 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -442,6 +442,8 @@ void bio_chain(struct bio *, struct bio *); extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int); extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *, unsigned int, unsigned int); +int bio_add_zone_append_page(struct bio *bio, struct page *page, + unsigned int len, unsigned int offset); bool __bio_try_merge_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off, bool *same_page); void __bio_add_page(struct bio *bio, struct page *page, From patchwork Thu Oct 1 18:36:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812223 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 88C86139F for ; Thu, 1 Oct 2020 18:38:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62BCB208C7 for ; Thu, 1 Oct 2020 18:38:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="gcD1I6Jq" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732755AbgJASiJ (ORCPT ); Thu, 1 Oct 2020 14:38:09 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729927AbgJASiD (ORCPT ); Thu, 1 Oct 2020 14:38:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577483; x=1633113483; h=from:to:cc:subject:date:message-id:in-reply-to: 
references:mime-version:content-transfer-encoding; bh=5ETJSKP96iJAWlgibSI+CXwv6npNrB2/5xTsR6UalTM=; b=gcD1I6Jq1lBr4NXqro6Qbc7RmE8Gr/W4SuCtSbo0hwoK8pyo9vh4ABqJ p7zXpSHhB8/WFQ827NMk2X8PByL5SYEfqbR7C2PK1gRmC03/RZtTezpjf y/iuc1yJ+RT9MHthjaB58QMHxf4CmQvYKY4lMAFFHyXT8Q9FewJ1vaLkA twVDo1Vb7wJ08wNywi6wz1ihHSAbgui9hfzYZr5BH8wMrHYcHxPfQ6SgN fOOa9Qfa+GOVGmJFXFoXu8gDFTvWA3FBXuL+rvF173VHdP8aDRx+j5Ex+ /eWpoXkqNAzY895+bVYe7RslEiE7RKqAka3HMs3WkE+5uuoxWrctRNvmC w==; IronPort-SDR: Ti7coKoozM74JVoq4/2+3dU7s9wzUJQ/9/T9tnFu2aJq+QxkJ0wt4wX5WyJBnePHs0A6hLjBV5 Ylbl1RNN8XcEhJjelzkD3iCZTd/9ZFI83+JxMA76WHWrSG5wrjDv5j50OkIMeuuIxyxVKvwx3D ke5pizpWAShr2qcKrFtI97mGcOMPx/mhlgYLcCZo249enIiNImSSSH6rlJs4OGoJSx/SsraySH WcvPuQfaMhovkmJHRo9TI4GwhSFbtBjJgr6lG0VL7MJE+N5ZoHgx9pbhwY/r2+mnIHtyY8FfGw Nsw= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036762" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:02 +0800 IronPort-SDR: r8O3PXw9zc4GhH46WBRGcHA8vd3I6KrXERnmOScctQgKBQSB/Qz/sWV7XxTG13Oq/D+qCiMsS7 BpwMjbP4atpQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:23:59 -0700 IronPort-SDR: zfnoi9Wi2Zuarstdwa+QDGGK6Z3dCgtYMAOtCtEQB8dMGBkSWfYKLZgZc6Pdu57aenIL4f4gek G59jHh21XYBA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:01 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota , Anand Jain , Johannes Thumshirn , Damien Le Moal Subject: [PATCH v8 02/41] btrfs: introduce ZONED feature flag Date: Fri, 2 Oct 2020 03:36:09 +0900 Message-Id: <778806c32892a82a27a9252f8d33003fa95a20ee.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: 
X-Mailing-List: linux-fsdevel@vger.kernel.org This patch introduces the ZONED incompat flag. The flag indicates that the volume management will satisfy the constraints imposed by host-managed zoned block devices. Reviewed-by: Anand Jain Reviewed-by: Johannes Thumshirn Signed-off-by: Damien Le Moal Signed-off-by: Naohiro Aota --- fs/btrfs/sysfs.c | 2 ++ include/uapi/linux/btrfs.h | 1 + 2 files changed, 3 insertions(+) diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 279d9262b676..828006020bbd 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -263,6 +263,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES); BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID); BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE); BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34); +BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED); static struct attribute *btrfs_supported_feature_attrs[] = { BTRFS_FEAT_ATTR_PTR(mixed_backref), @@ -278,6 +279,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = { BTRFS_FEAT_ATTR_PTR(metadata_uuid), BTRFS_FEAT_ATTR_PTR(free_space_tree), BTRFS_FEAT_ATTR_PTR(raid1c34), + BTRFS_FEAT_ATTR_PTR(zoned), NULL }; diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 2c39d15a2beb..5df73001aad4 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -307,6 +307,7 @@ struct btrfs_ioctl_fs_info_args { #define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9) #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID (1ULL << 10) #define BTRFS_FEATURE_INCOMPAT_RAID1C34 (1ULL << 11) +#define BTRFS_FEATURE_INCOMPAT_ZONED (1ULL << 12) struct btrfs_ioctl_feature_flags { __u64 compat_flags; From patchwork Thu Oct 1 18:36:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812201 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP 
id 8749E1668 for ; Thu, 1 Oct 2020 18:38:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62D71207DE for ; Thu, 1 Oct 2020 18:38:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="M8Caa9YO" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732832AbgJASiK (ORCPT ); Thu, 1 Oct 2020 14:38:10 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24684 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732679AbgJASiE (ORCPT ); Thu, 1 Oct 2020 14:38:04 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577484; x=1633113484; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=6933gJYkUZb6rpYa++9VsaLU3hfttuFha33Dqtn/XD4=; b=M8Caa9YOCMa16th/Zm0mpqMhptNUTq8ffw9slHaUz0E7t1OCdfLuQbKl cXQ62uQAW2Fk2AWMS5HCTHZUNeT55kWFYQruMr10PGAoM58M85MpePn1q I4csp4b3/GwpnA+fgEDUK/HNwK7pPCmhpEHCXeC3uL86HeZ/L0ZKw946Y atxul2tLJKDdEkAzkyTXHcGVO9eWcywtl8xVSvpggknj6Y5XqOeyXQP+w da0TMVHwwNJgeKlRbWmP+pyqgH62D7Fd4to/aJGgjzrAkaOCsgyVQTc0L QkdY6Jh1LDFgoOHzUlyO+KWW2b8/w8leQvAl7Ofy7CrkcVn9u1aXqbhTL Q==; IronPort-SDR: ENpYM1rbtehHWErhDokXcmPMJYIiyfstqhsZhkiEtKphTfxHj3kbtsGVwcge4pNXhFTFhn0CHn lrAxBDkUMV7bRf/6JiZ7/ZShmSZGEnCcJY2cvA57H4MRl81SGQ40LDB2n5TLK1playoKEgxudP rUHpnAFts3J/HNxHT6aQ9ojj9wK2HUA3dU6iWNyPXro2veYuiBve4bcnLnKQUZjV1SlN8Z1C5h bfB4trKivBE7su1eT7jwNeDPS67OG5MG3vCOSwI+Zz01gXn0ZFDAkiH+UUjGj4EDpjcNoXpLny ujc= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036767" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:04 +0800 IronPort-SDR: JrS//CD8N5dwdEn/QktGT5jVO6lYiYLnRgaz5P0O3TV2j8w9FwFyijKcqQBe8/z7wD3Vag0fcX wMtxWSyKigmA== 
Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:00 -0700
IronPort-SDR: cUig9x8u+j0ynbVd+KT2oyjKcMda57pp80UtG5UCYEloiTwDOR/TGHHxHFg0+6kSrElDH4P63J h4VAzgd9od1A==
WDCIronportException: Internal
Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:03 -0700
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik, Damien Le Moal
Subject: [PATCH v8 03/41] btrfs: Get zone information of zoned block devices
Date: Fri, 2 Oct 2020 03:36:10 +0900
Message-Id:
X-Mailer: git-send-email 2.27.0
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

If a zoned block device is found, get its zone information (number of zones and zone size) using the new helper function btrfs_get_dev_zone_info(). To avoid costly run-time zone report commands to test the device zone type during block allocation, attach the seq_zones bitmap to the device structure to indicate whether a zone is sequential or accepts random writes. Also attach the empty_zones bitmap to indicate whether a zone is empty.

This patch also introduces the helper function btrfs_dev_is_sequential() to test if the zone storing a block is a sequential write required zone, and btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
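For illustration only (not part of this patch), the bitmap lookup described above can be modelled in user space. This is a minimal sketch under stated assumptions: `struct zone_info_model` and every `model_*` function are hypothetical stand-ins for struct btrfs_zoned_device_info and its helpers, and plain `calloc()` replaces bitmap_zalloc(). As in the patch, the lookups do no bounds checking, so callers must pass a position inside the device.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical user-space stand-in for struct btrfs_zoned_device_info. */
struct zone_info_model {
	uint64_t zone_size;           /* bytes, must be a power of two */
	unsigned int zone_size_shift; /* log2(zone_size) */
	uint32_t nr_zones;
	unsigned long *seq_zones;     /* bit set: sequential write required */
	unsigned long *empty_zones;   /* bit set: zone is empty */
};

static void model_set_bit(unsigned long *map, uint64_t nr)
{
	map[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
}

static bool model_test_bit(const unsigned long *map, uint64_t nr)
{
	return (map[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}

static void model_free(struct zone_info_model *zi)
{
	if (!zi)
		return;
	free(zi->seq_zones);
	free(zi->empty_zones);
	free(zi);
}

/* Size both bitmaps for a device of dev_bytes; a partial last zone counts. */
static struct zone_info_model *model_alloc(uint64_t dev_bytes,
					   uint64_t zone_size)
{
	struct zone_info_model *zi;
	size_t words;

	assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);

	zi = calloc(1, sizeof(*zi));
	if (!zi)
		return NULL;

	zi->zone_size = zone_size;
	while ((UINT64_C(1) << zi->zone_size_shift) < zone_size)
		zi->zone_size_shift++;
	zi->nr_zones = (uint32_t)(dev_bytes >> zi->zone_size_shift);
	if (dev_bytes & (zone_size - 1))
		zi->nr_zones++;

	words = (zi->nr_zones + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
	zi->seq_zones = calloc(words ? words : 1, sizeof(unsigned long));
	zi->empty_zones = calloc(words ? words : 1, sizeof(unsigned long));
	if (!zi->seq_zones || !zi->empty_zones) {
		model_free(zi);
		return NULL;
	}
	return zi;
}

/* A device without zone info is treated as not sequential, as in the patch. */
static bool model_is_sequential(const struct zone_info_model *zi, uint64_t pos)
{
	if (!zi)
		return false;
	return model_test_bit(zi->seq_zones, pos >> zi->zone_size_shift);
}

/* A device without zone info is treated as always empty, as in the patch. */
static bool model_is_empty_zone(const struct zone_info_model *zi, uint64_t pos)
{
	if (!zi)
		return true;
	return model_test_bit(zi->empty_zones, pos >> zi->zone_size_shift);
}
```

Because the zone size is a power of two, a byte offset maps to its zone with a single shift, and each query is one bit test. That is why the bitmaps are cheaper than issuing a zone report during block allocation.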
Reviewed-by: Josef Bacik Signed-off-by: Damien Le Moal Signed-off-by: Naohiro Aota --- fs/btrfs/Makefile | 1 + fs/btrfs/dev-replace.c | 5 ++ fs/btrfs/volumes.c | 18 ++++- fs/btrfs/volumes.h | 4 + fs/btrfs/zoned.c | 179 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 92 +++++++++++++++++++++ 6 files changed, 297 insertions(+), 2 deletions(-) create mode 100644 fs/btrfs/zoned.c create mode 100644 fs/btrfs/zoned.h diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index e738f6206ea5..0497fdc37f90 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o +btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \ tests/extent-buffer-tests.o tests/btrfs-tests.o \ diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 20ce1970015f..6f6d77224c2b 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -21,6 +21,7 @@ #include "rcu-string.h" #include "dev-replace.h" #include "sysfs.h" +#include "zoned.h" /* * Device replace overview @@ -291,6 +292,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); device->fs_devices = fs_info->fs_devices; + ret = btrfs_get_dev_zone_info(device); + if (ret) + goto error; + mutex_lock(&fs_info->fs_devices->device_list_mutex); list_add(&device->dev_list, &fs_info->fs_devices->devices); fs_info->fs_devices->num_devices++; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 58b9c419a2b6..b0a7a820222e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -31,6 +31,7 @@ #include "space-info.h" #include "block-group.h" #include "discard.h" +#include "zoned.h" const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { [BTRFS_RAID_RAID10] = { @@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device) rcu_string_free(device->name); extent_io_tree_release(&device->alloc_state); bio_put(device->flush_bio); + btrfs_destroy_dev_zone_info(device); kfree(device); } @@ -667,6 +669,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state); device->mode = flags; + /* Get zone type information of zoned block devices */ + ret = btrfs_get_dev_zone_info(device); + if (ret != 0) + goto error_free_page; + fs_devices->open_devices++; if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) { @@ -2543,6 +2550,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path } rcu_assign_pointer(device->name, name); + device->fs_info = fs_info; + device->bdev = bdev; + + /* Get zone type information of zoned block devices */ + ret = btrfs_get_dev_zone_info(device); + if (ret) + goto error_free_device; + trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -2559,8 +2574,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path fs_info->sectorsize); device->disk_total_bytes = device->total_bytes; device->commit_total_bytes = device->total_bytes; - device->fs_info = fs_info; - device->bdev = bdev; set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state); clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state); device->mode = FMODE_EXCL; @@ -2707,6 +2720,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path sb->s_flags |= SB_RDONLY; if (trans) btrfs_end_transaction(trans); + btrfs_destroy_dev_zone_info(device); error_free_device: btrfs_free_device(device); error: diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 48bdca01e237..4bbb15c4161f 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h 
@@ -51,6 +51,8 @@ struct btrfs_io_geometry { #define BTRFS_DEV_STATE_REPLACE_TGT (3) #define BTRFS_DEV_STATE_FLUSH_SENT (4) +struct btrfs_zoned_device_info; + struct btrfs_device { struct list_head dev_list; /* device_list_mutex */ struct list_head dev_alloc_list; /* chunk mutex */ @@ -64,6 +66,8 @@ struct btrfs_device { struct block_device *bdev; + struct btrfs_zoned_device_info *zone_info; + /* the mode sent to blkdev_get */ fmode_t mode; diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c new file mode 100644 index 000000000000..0c908f0e9469 --- /dev/null +++ b/fs/btrfs/zoned.c @@ -0,0 +1,179 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Western Digital Corporation or its affiliates. + * Authors: + * Naohiro Aota + * Damien Le Moal + */ + +#include +#include +#include "ctree.h" +#include "volumes.h" +#include "zoned.h" +#include "rcu-string.h" + +/* Maximum number of zones to report per blkdev_report_zones() call */ +#define BTRFS_REPORT_NR_ZONES 4096 + +static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, + void *data) +{ + struct blk_zone *zones = data; + + memcpy(&zones[idx], zone, sizeof(*zone)); + + return 0; +} + +static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, + struct blk_zone *zones, unsigned int *nr_zones) +{ + int ret; + + if (!*nr_zones) + return 0; + + ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones, + copy_zone_info_cb, zones); + if (ret < 0) { + btrfs_err_in_rcu(device->fs_info, + "get zone at %llu on %s failed %d", pos, + rcu_str_deref(device->name), ret); + return ret; + } + *nr_zones = ret; + if (!ret) + return -EIO; + + return 0; +} + +int btrfs_get_dev_zone_info(struct btrfs_device *device) +{ + struct btrfs_zoned_device_info *zone_info = NULL; + struct block_device *bdev = device->bdev; + sector_t nr_sectors = bdev->bd_part->nr_sects; + sector_t sector = 0; + struct blk_zone *zones = NULL; + unsigned int i, nreported = 0, nr_zones; + unsigned int zone_sectors; + 
int ret; + char devstr[sizeof(device->fs_info->sb->s_id) + + sizeof(" (device )") - 1]; + + if (!bdev_is_zoned(bdev)) + return 0; + + zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL); + if (!zone_info) + return -ENOMEM; + + zone_sectors = bdev_zone_sectors(bdev); + ASSERT(is_power_of_2(zone_sectors)); + zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT; + zone_info->zone_size_shift = ilog2(zone_info->zone_size); + zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev)); + if (!IS_ALIGNED(nr_sectors, zone_sectors)) + zone_info->nr_zones++; + + zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL); + if (!zone_info->seq_zones) { + ret = -ENOMEM; + goto out; + } + + zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL); + if (!zone_info->empty_zones) { + ret = -ENOMEM; + goto out; + } + + zones = kcalloc(BTRFS_REPORT_NR_ZONES, + sizeof(struct blk_zone), GFP_KERNEL); + if (!zones) { + ret = -ENOMEM; + goto out; + } + + /* Get zones type */ + while (sector < nr_sectors) { + nr_zones = BTRFS_REPORT_NR_ZONES; + ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones, + &nr_zones); + if (ret) + goto out; + + for (i = 0; i < nr_zones; i++) { + if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ) + set_bit(nreported, zone_info->seq_zones); + if (zones[i].cond == BLK_ZONE_COND_EMPTY) + set_bit(nreported, zone_info->empty_zones); + nreported++; + } + sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len; + } + + if (nreported != zone_info->nr_zones) { + btrfs_err_in_rcu(device->fs_info, + "inconsistent number of zones on %s (%u / %u)", + rcu_str_deref(device->name), nreported, + zone_info->nr_zones); + ret = -EIO; + goto out; + } + + kfree(zones); + + device->zone_info = zone_info; + + devstr[0] = 0; + if (device->fs_info) + snprintf(devstr, sizeof(devstr), " (device %s)", + device->fs_info->sb->s_id); + + rcu_read_lock(); + pr_info( +"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors", + 
devstr, + bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware", + rcu_str_deref(device->name), zone_info->nr_zones, + zone_info->zone_size >> SECTOR_SHIFT); + rcu_read_unlock(); + + return 0; + +out: + kfree(zones); + bitmap_free(zone_info->empty_zones); + bitmap_free(zone_info->seq_zones); + kfree(zone_info); + + return ret; +} + +void btrfs_destroy_dev_zone_info(struct btrfs_device *device) +{ + struct btrfs_zoned_device_info *zone_info = device->zone_info; + + if (!zone_info) + return; + + bitmap_free(zone_info->seq_zones); + bitmap_free(zone_info->empty_zones); + kfree(zone_info); + device->zone_info = NULL; +} + +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, + struct blk_zone *zone) +{ + unsigned int nr_zones = 1; + int ret; + + ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones); + if (ret != 0 || !nr_zones) + return ret ? ret : -EIO; + + return 0; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h new file mode 100644 index 000000000000..e4a08ae0a96b --- /dev/null +++ b/fs/btrfs/zoned.h @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2019 Western Digital Corporation or its affiliates. + * Authors: + * Naohiro Aota + * Damien Le Moal + */ + +#ifndef BTRFS_ZONED_H +#define BTRFS_ZONED_H + +struct btrfs_zoned_device_info { + /* + * Number of zones, zone size and types of zones if bdev is a + * zoned block device. 
+ */ + u64 zone_size; + u8 zone_size_shift; + u32 nr_zones; + unsigned long *seq_zones; + unsigned long *empty_zones; +}; + +#ifdef CONFIG_BLK_DEV_ZONED +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, + struct blk_zone *zone); +int btrfs_get_dev_zone_info(struct btrfs_device *device); +void btrfs_destroy_dev_zone_info(struct btrfs_device *device); +#else /* CONFIG_BLK_DEV_ZONED */ +static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, + struct blk_zone *zone) +{ + return 0; +} +static inline int btrfs_get_dev_zone_info(struct btrfs_device *device) +{ + return 0; +} +static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { } +#endif + +static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) +{ + struct btrfs_zoned_device_info *zone_info = device->zone_info; + + if (!zone_info) + return false; + + return test_bit(pos >> zone_info->zone_size_shift, + zone_info->seq_zones); +} + +static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos) +{ + struct btrfs_zoned_device_info *zone_info = device->zone_info; + + if (!zone_info) + return true; + + return test_bit(pos >> zone_info->zone_size_shift, + zone_info->empty_zones); +} + +static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device, + u64 pos, bool set) +{ + struct btrfs_zoned_device_info *zone_info = device->zone_info; + unsigned int zno; + + if (!zone_info) + return; + + zno = pos >> zone_info->zone_size_shift; + if (set) + set_bit(zno, zone_info->empty_zones); + else + clear_bit(zno, zone_info->empty_zones); +} + +static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device, + u64 pos) +{ + btrfs_dev_set_empty_zone_bit(device, pos, true); +} + +static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device, + u64 pos) +{ + btrfs_dev_set_empty_zone_bit(device, pos, false); +} + +#endif From patchwork Thu Oct 1 18:36:11 2020 Content-Type: text/plain; charset="utf-8" 
MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812203 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B3269112E for ; Thu, 1 Oct 2020 18:38:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 80DAA208C7 for ; Thu, 1 Oct 2020 18:38:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="O+oQMatP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732792AbgJASiK (ORCPT ); Thu, 1 Oct 2020 14:38:10 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24694 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732702AbgJASiF (ORCPT ); Thu, 1 Oct 2020 14:38:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577486; x=1633113486; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=wMuHH6RI/+F0H8vO8jzqBgtlIttUzV9YrJ752d/15AA=; b=O+oQMatPG2DVTxzRX3dcxMAY21Jxmf5AUf+Vry+xPrgJIXHWGQSwVtxC X9VPoLuAgTefUFMU+ogaZwfkr4BqmoA33Vm2g8/ZEHwxQd803lB2ZD4Wy X5VSREl7ND6hxOnVMttqjDTzY+7oEe7aBaxWpOL0CBgiAmequdKW4v0+Q VIRYOe5Ot4GlkgsIYMS8cOe1pYSC/9Z3gH6WXUet58P/qHbbBqN7uXecq 4DjcC6AViJjLYsS+1iAEniFUa7+yBF24tR2MLcDF7AgSmbG791JhnCf1B 4XNNOLSUeXI6PVerQzTv6D+o95T+uBJ+CqkjMOF7USkBXnZge7cWAMyfY g==; IronPort-SDR: lwPYyPTARD+geqpjrUIiNb8yzzHITng6L9/5eZfKx69iTmvQtsI0t1FvpIRDkXkrIaqHfP5ElI ENcG0CLglekJvfnPcyuQa2D0iTeLdndlQuKqB4dFPh+DyYLbAv5UMkHkQa7LcNtWwyzRyD8UZb U82UErAXX9gzzSb91iH5iAJD7aqgVjncgg7ifarY6hJ9rJgWP5BUc5VXSrQ1e9jyhYPKePJQ7f aNEe3ho/Sj4MLr+b8QNzh+cDmQnu3+XQFdfSIwavem+en8wLV8lhExPa2K6oH277dio9xdsErL jjE= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; 
d="scan'208";a="150036773"
Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:05 +0800
IronPort-SDR: PHbAU0W3CWfEqpbQLCFTcsOTG6deFVSSIlmNCyUlV1vWGKexTox31r4C0Zz+n2Xunp+xJYMJ3d HsIWemKTMndA==
Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:02 -0700
IronPort-SDR: lhMMbaBgpAvAavCjR+c9IueLD07m+H0cy+S+KBXncglhz9Xz1wp/N2oMpVKhHWbeo68U47NxAT HP0/QUluePoA==
WDCIronportException: Internal
Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:04 -0700
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik, Damien Le Moal
Subject: [PATCH v8 04/41] btrfs: Check and enable ZONED mode
Date: Fri, 2 Oct 2020 03:36:11 +0900
Message-Id:
X-Mailer: git-send-email 2.27.0
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

This commit introduces the function btrfs_check_zoned_mode() to check if the ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone sizes.
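For illustration only (not part of this patch), the mount-time decision made by btrfs_check_zoned_mode() can be modelled in user space. This is a minimal sketch under stated assumptions: `struct dev_model`, `enum zoned_model` and `model_check_zoned_mode()` are hypothetical names, `MODEL_STRIPE_LEN` assumes BTRFS_STRIPE_LEN is 64 KiB, and every failure is reported as -EINVAL without the log messages.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: BTRFS_STRIPE_LEN is 64 KiB. */
#define MODEL_STRIPE_LEN (64 * 1024ULL)

/* Hypothetical stand-ins for the block layer's zoned model values. */
enum zoned_model { MODEL_NONE, MODEL_HOST_AWARE, MODEL_HOST_MANAGED };

struct dev_model {
	enum zoned_model model;
	uint64_t zone_size; /* bytes; ignored for MODEL_NONE */
};

/*
 * Returns 0 and stores the common zone size (0 for a regular file system),
 * or -EINVAL when the device set does not match the ZONED flag.
 */
static int model_check_zoned_mode(const struct dev_model *devs, size_t n,
				  bool incompat_zoned, uint64_t *zone_size_out)
{
	uint64_t zoned_devices = 0;
	uint64_t zone_size = 0;
	size_t i;

	assert(zone_size_out != NULL);
	*zone_size_out = 0;

	for (i = 0; i < n; i++) {
		/* Host-aware devices only count as zoned on a ZONED fs. */
		if (devs[i].model == MODEL_HOST_MANAGED ||
		    (devs[i].model == MODEL_HOST_AWARE && incompat_zoned)) {
			zoned_devices++;
			if (!zone_size)
				zone_size = devs[i].zone_size;
			else if (devs[i].zone_size != zone_size)
				return -EINVAL; /* unequal zone sizes */
		}
	}

	if (!zoned_devices && !incompat_zoned)
		return 0;               /* regular file system */
	if (!zoned_devices || !incompat_zoned)
		return -EINVAL;         /* flag and devices disagree */
	if (zoned_devices != n)
		return -EINVAL;         /* zoned and regular devices mixed */
	if (zone_size % MODEL_STRIPE_LEN)
		return -EINVAL;         /* zone size not stripe aligned */

	*zone_size_out = zone_size;
	return 0;
}
```

In this model a host-aware device on a file system without the ZONED flag is counted as a regular device, so such a mount succeeds with a zone size of 0.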
Reviewed-by: Josef Bacik Signed-off-by: Damien Le Moal Signed-off-by: Naohiro Aota --- fs/btrfs/ctree.h | 3 ++ fs/btrfs/dev-replace.c | 7 ++++ fs/btrfs/disk-io.c | 9 +++++ fs/btrfs/super.c | 1 + fs/btrfs/volumes.c | 5 +++ fs/btrfs/zoned.c | 78 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 26 ++++++++++++++ 7 files changed, 129 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index aac3d6f4e35b..1a51aeb15574 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -588,6 +588,9 @@ struct btrfs_fs_info { struct btrfs_root *free_space_root; struct btrfs_root *data_reloc_root; + /* Zone size when in ZONED mode */ + u64 zone_size; + /* the log root tree is a directory of all the other log roots */ struct btrfs_root *log_root_tree; diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 6f6d77224c2b..5e3554482af1 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, return PTR_ERR(bdev); } + if (!btrfs_check_device_zone_type(fs_info, bdev)) { + btrfs_err(fs_info, + "zone type of target device mismatch with the filesystem!"); + ret = -EINVAL; + goto error; + } + sync_blockdev(bdev); list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) { diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 764001609a15..acf374b6e1ab 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -42,6 +42,7 @@ #include "block-group.h" #include "discard.h" #include "space-info.h" +#include "zoned.h" #define BTRFS_SUPER_FLAG_SUPP (BTRFS_HEADER_FLAG_WRITTEN |\ BTRFS_HEADER_FLAG_RELOC |\ @@ -3130,7 +3131,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device btrfs_free_extra_devids(fs_devices, 1); + ret = btrfs_check_zoned_mode(fs_info); + if (ret) { + btrfs_err(fs_info, "failed to init ZONED mode: %d", + ret); + goto fail_block_groups; + } + ret = btrfs_sysfs_add_fsid(fs_devices); + if (ret) { 
btrfs_err(fs_info, "failed to init sysfs fsid interface: %d", ret); diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8840a4fa81eb..1b2399c9c94e 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -44,6 +44,7 @@ #include "backref.h" #include "space-info.h" #include "sysfs.h" +#include "zoned.h" #include "tests/btrfs-tests.h" #include "block-group.h" #include "discard.h" diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index b0a7a820222e..dc21cb8cdea9 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2517,6 +2517,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path if (IS_ERR(bdev)) return PTR_ERR(bdev); + if (!btrfs_check_device_zone_type(fs_info, bdev)) { + ret = -EINVAL; + goto error; + } + if (fs_devices->seeding) { seeding_dev = 1; down_write(&sb->s_umount); diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 0c908f0e9469..7509888b457a 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -177,3 +177,81 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, return 0; } + +int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) +{ + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; + struct btrfs_device *device; + u64 hmzoned_devices = 0; + u64 nr_devices = 0; + u64 zone_size = 0; + int incompat_zoned = btrfs_fs_incompat(fs_info, ZONED); + int ret = 0; + + /* Count zoned devices */ + list_for_each_entry(device, &fs_devices->devices, dev_list) { + enum blk_zoned_model model; + + if (!device->bdev) + continue; + + model = bdev_zoned_model(device->bdev); + if (model == BLK_ZONED_HM || + (model == BLK_ZONED_HA && incompat_zoned)) { + hmzoned_devices++; + if (!zone_size) { + zone_size = device->zone_info->zone_size; + } else if (device->zone_info->zone_size != zone_size) { + btrfs_err(fs_info, + "Zoned block devices must have equal zone sizes"); + ret = -EINVAL; + goto out; + } + } + nr_devices++; + } + + if (!hmzoned_devices && !incompat_zoned) + goto out; + + if (!hmzoned_devices && 
incompat_zoned) {
+		/* No zoned block device found on ZONED FS */
+		btrfs_err(fs_info,
+			  "ZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices && !incompat_zoned) {
+		btrfs_err(fs_info,
+			  "Enable ZONED mode to mount HMZONED device");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fs_info->zone_size = zone_size;
+
+	btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e4a08ae0a96b..4341630cb756 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_ZONED_H
 #define BTRFS_ZONED_H
 
+#include
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -26,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -37,6 +40,14 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	return 0;
 }
 static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	btrfs_err(fs_info,
"Zoned block devices support is not enabled");
+	return -EOPNOTSUPP;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -89,4 +100,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }
 
+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, ZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif
From patchwork Thu Oct 1 18:36:12 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812221
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 05/41] btrfs: introduce max_zone_append_size
Date: Fri, 2 Oct 2020 03:36:12 +0900
Message-Id: <16ed33b15dfb6dd2268cb1fa95a9595cdc23982a.1601574234.git.naohiro.aota@wdc.com>

The zone append write command restricts the maximum IO size it accepts.
Introduce max_zone_append_size in zone_info and fs_info to track that
limit.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/zoned.c | 17 +++++++++++++++--
 fs/btrfs/zoned.h |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1a51aeb15574..e6f0fe1920e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -590,6 +590,8 @@ struct btrfs_fs_info {
 
 	/* Zone size when in ZONED mode */
 	u64 zone_size;
+	/* max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7509888b457a..2e12fce81abf 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -53,6 +53,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
 	struct btrfs_zoned_device_info *zone_info = NULL;
 	struct block_device *bdev = device->bdev;
+	struct request_queue *q = bdev_get_queue(bdev);
 	sector_t nr_sectors = bdev->bd_part->nr_sects;
 	sector_t sector = 0;
 	struct blk_zone *zones = NULL;
@@ -73,6 +74,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	ASSERT(is_power_of_2(zone_sectors));
 	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->max_zone_append_size =
+		(u64)queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
 	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
@@ -185,6 +188,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 hmzoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	int incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
@@ -198,15 +202,23 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info
*fs_info)
 		model = bdev_zoned_model(device->bdev);
 		if (model == BLK_ZONED_HM ||
 		    (model == BLK_ZONED_HA && incompat_zoned)) {
+			struct btrfs_zoned_device_info *zone_info =
+				device->zone_info;
+
 			hmzoned_devices++;
 			if (!zone_size) {
-				zone_size = device->zone_info->zone_size;
-			} else if (device->zone_info->zone_size != zone_size) {
+				zone_size = zone_info->zone_size;
+			} else if (zone_info->zone_size != zone_size) {
 				btrfs_err(fs_info,
 					  "Zoned block devices must have equal zone sizes");
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -249,6 +261,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;
 
 	btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 4341630cb756..f200b46a71fb 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -18,6 +18,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8 zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;
From patchwork Thu Oct 1 18:36:13 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812215
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 06/41] btrfs: disallow space_cache in ZONED mode
Date: Fri, 2 Oct 2020 03:36:13 +0900
Message-Id: <74608b65bb5c80387169b21b7b4e7c58f06883d6.1601574234.git.naohiro.aota@wdc.com>

Because updates to the space cache v1 are done in place, the space cache
cannot be located over sequential zones, and there is no guarantee that
the device has enough conventional zones to store it. Resolve this
problem by completely disabling the space cache v1. This does not
introduce any problems with sequential block groups: all the free space
is located after the allocation pointer and there is no free space
before it, so such a cache is not needed.

Note: we can technically use the free-space-tree (space cache v2) in
ZONED mode. But since ZONED mode now always allocates extents in a block
group sequentially, regardless of the underlying device zone type, there
is no point in enabling and maintaining the tree.

For the same reason, NODATACOW is also disabled.

INODE_MAP_CACHE is disabled as well, to avoid preallocation in the
INODE_MAP_CACHE inode.

In summary, ZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID/Dup          | Cannot handle two zone append writes to different   |
|                   | zones                                               |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
| INODE_MAP_CACHE   | Need pre-allocation. (and will be deprecated?)
|
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   | data writes                                         |

Signed-off-by: Naohiro Aota
---
 fs/btrfs/super.c | 12 ++++++++++--
 fs/btrfs/zoned.c | 18 ++++++++++++++++++
 fs/btrfs/zoned.h |  5 +++++
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 1b2399c9c94e..dfdd4f161d16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -525,8 +525,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_fs_incompat(info, ZONED)) {
+			btrfs_info(info,
+			"clearing existing space cache in ZONED mode");
+			btrfs_set_super_cache_generation(info->super_copy, 0);
+		} else
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -985,6 +991,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		ret = -EINVAL;
 
 	}
+	if (!ret)
+		ret = btrfs_check_mountopts_zoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2e12fce81abf..1629e585ba8c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -268,3 +268,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_fs_incompat(info, ZONED))
+		return 0;
+
+	/*
+	 * SPACE CACHE writing is not CoWed. Disable that to avoid write
+	 * errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+			  "space cache v1 not supported in ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index f200b46a71fb..2e1983188e6f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -30,6 +30,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -49,6 +50,10 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	btrfs_err(fs_info, "Zoned block devices support is not enabled");
 	return -EOPNOTSUPP;
 }
+static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
From patchwork Thu Oct 1 18:36:14 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812209
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com,
 linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn, Josef Bacik
Subject: [PATCH v8 07/41] btrfs: disallow NODATACOW in ZONED mode
Date: Fri, 2 Oct 2020 03:36:14 +0900

NODATACOW implies overwriting file data in place on the device, which is
impossible in sequential write required zones. Disable NODATACOW both
globally, by rejecting the mount option, and per file, by masking
FS_NOCOW_FL out of the file attributes.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ioctl.c | 3 +++
 fs/btrfs/zoned.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ab408a23ba32..5d592da4e2ff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -91,6 +91,9 @@ struct btrfs_ioctl_send_args_32 {
 static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 		unsigned int flags)
 {
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), ZONED))
+		flags &= ~FS_NOCOW_FL;
+
 	if (S_ISDIR(inode->i_mode))
 		return flags;
 	else if (S_ISREG(inode->i_mode))
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1629e585ba8c..6bce654bb0e8 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -284,5 +284,11 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info,
+		  "cannot enable nodatacow with ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
From patchwork Thu Oct 1 18:36:15 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812193
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn, Josef Bacik
Subject: [PATCH v8 08/41] btrfs: disable fallocate in ZONED mode
Date: Fri, 2 Oct 2020 03:36:15 +0900
Message-Id: <3743047aa305f5592b972013d60739b0c4ba77b6.1601574234.git.naohiro.aota@wdc.com>

fallocate() is implemented by reserving actual extents rather than by
making reservations. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break
the POSIX semantics of the fallocated file. To avoid this, report
fallocate() as not supported in ZONED mode for now.

In the future, we may be able to implement an "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or something similar.
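
For illustration only (not part of this patch): a minimal userspace
sketch of how an application might cope with fallocate(2) failing with
EOPNOTSUPP on a ZONED filesystem. The helper name try_preallocate() is
hypothetical, and treating EOPNOTSUPP as "carry on without
preallocation" is an assumption about what the caller wants.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate(2) in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try to preallocate len bytes at offset 0 of fd.
 *
 * Returns  0 if the space was preallocated,
 *          1 if the filesystem reports fallocate as unsupported
 *            (as a ZONED btrfs would with this patch applied),
 *         -1 on any other error, with errno left as set by fallocate().
 */
static int try_preallocate(int fd, off_t len)
{
	if (fallocate(fd, 0, 0, len) == 0)
		return 0;
	if (errno == EOPNOTSUPP)
		return 1;
	return -1;
}
```

A caller could treat a return value of 1 as "no preallocation available"
and simply write the data sequentially instead of failing.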

Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 038e0afaf3d0..60a01e1347ba 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3341,6 +3341,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), ZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
From patchwork Thu Oct 1 18:36:16 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812195
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik
Subject: [PATCH v8 09/41] btrfs: disallow mixed-bg in ZONED mode
Date: Fri, 2 Oct 2020 03:36:16 +0900
Message-Id: <4e567db24f9a949e104687a0c467c488409ba14e.1601574234.git.naohiro.aota@wdc.com>

Placing both data and metadata in a block
group is impossible in ZONED mode. For data, we can allocate space and
write it immediately after the allocation. For metadata, however, we
cannot do so, because the logical addresses are recorded in other
metadata buffers to build up the trees. As a result, a data buffer can be
placed after a metadata buffer that has not been written yet. Writing out
the data buffer would then break the sequential write rule.

Check for MIXED_BG and disallow it in ZONED mode.

Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/zoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6bce654bb0e8..8cd43d2d5611 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -260,6 +260,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "ZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
 
From patchwork Thu Oct 1 18:36:17 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812207
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik
Subject: [PATCH v8
10/41] btrfs: disallow inode_cache in ZONED mode Date: Fri, 2 Oct 2020 03:36:17 +0900 Message-Id: <4aad45e8c087490facbd24fc037b6ab374295cbe.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org inode_cache uses pre-allocation to write its cache data. However, pre-allocation is completely disabled in ZONED mode. We can technically enable inode_cache in the same way as relocation. However, inode_cache is rarely used and the man page discourages using it. So, let's just disable it for now. Reviewed-by: Josef Bacik Signed-off-by: Naohiro Aota --- fs/btrfs/zoned.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 8cd43d2d5611..e47698d313a5 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -297,5 +297,11 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) return -EOPNOTSUPP; } + if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) { + btrfs_err(info, + "cannot enable inode map caching with ZONED mode"); + return -EOPNOTSUPP; + } + return 0; } From patchwork Thu Oct 1 18:36:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812237 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8FB7D139F for ; Thu, 1 Oct 2020 18:38:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4EC0F20B1F for ; Thu, 1 Oct 2020 18:38:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="J1c8u4au" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733010AbgJASii (ORCPT ); Thu, 1 Oct 2020 14:38:38
-0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732864AbgJASiP (ORCPT ); Thu, 1 Oct 2020 14:38:15 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577495; x=1633113495; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=pmClhOPtDrFea7ar7p3G7l33bFEnRhz6PBpq4eUFmdI=; b=J1c8u4auvGqZJObDHMH1tnr7UH/+wSDUxPwRumZpBZKIBRngX/J93ACh nFkUw8nOf9rtBhMOzNfIOEheS0JhoWq0y6dMIX3qdFkbmjvbQwJRvFr+U H0I9I1yqnqKm0O0tCPY9IM2ndgs2RPO1eklE27guaA/n6oxJUt/mJAKN7 J/CW9RywxwWKBIXQmV3LRn5MCPurH6uz79ehe6AjxQgs+cxJhoxCRTzwZ GSdHOStaCSdC4iH0CfAaemQwI/y3fR7dM5IN5oNABoTaWrXwvSU8TmWU2 C8dPNWAAY3+r9vFAhQa52fTVSo9YN28ki6FbqiMAafrrQstvtqRQt4wHT A==; IronPort-SDR: gCgoDBRk3eKAJMjREhT+GrCjntOkfccI0nLzcHz+SOSVcJU4gg9fIWs4Cwzuyf0GiGNQb2QPbh SElWUDBsg6OhtMtNCG7p9Q8HxX9TTcgEavgD+Ey0aAd0nfd6URhDlzYn6fNZLixcwKugKJv5F3 J+HJaJRpY+jMQreUgrUlmbL/8BSdQ1oIeneQn4h6VIpBTud0N6LUfbgsS4ckPKfyc5zOd7IYyy jPMDn6FlovncsLpLDUxRd5MiuEWBDCr7LCYSUfz3qMOON5DLwmS6iJf1tWsJ56e0N4pFe2r7q0 8mY= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036793" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:14 +0800 IronPort-SDR: HBd9FXbXBSVC5OEZYWShYev7O/Kgey/A+InlwfVZwbuv0nfjW5H7EHuGsCnfB5x+x+6mzIFee7 DtYmOywopNsA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:11 -0700 IronPort-SDR: 4fEC6BYik2znt+i27daZJzh1Dgvs6UhVzhbIvGU3yLOSyg0eFHcP/fvn4BOluL4H9/nxIdYjeJ 8YA9xlv4jTlw== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:13 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, 
linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 11/41] btrfs: implement log-structured superblock for ZONED mode Date: Fri, 2 Oct 2020 03:36:18 +0900 Message-Id: <6e76f3f26ab5ba9973b81d0db069dd0001b63b25.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org The superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite data in a sequential write required zone, we cannot place the superblock in such a zone. One easy solution is to limit the superblock and its copies to conventional zones. However, this method has two downsides. The first is a reduced number of superblock copies: the second copy of the superblock is located at 256GB, which is in a sequential write required zone on typical devices on the market today, so the number of superblocks (primary plus copies) is limited to two. The second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two zones as a circular buffer to write updated superblocks. Once the first zone is filled up, it starts writing into the second zone. Then, when both zones are filled up and before it starts writing to the first zone again, it resets the first zone. We can determine the position of the latest superblock by reading the write pointer information from the device. One corner case is when both zones are full. In this situation, we read out the last superblock of each zone and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever comes first, and the zone next to it
If these reserved zones are conventional, the superblock is written at a fixed location at the start of the zone, without logging.
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 9 ++ fs/btrfs/disk-io.c | 41 +++++- fs/btrfs/scrub.c | 3 + fs/btrfs/volumes.c | 21 ++- fs/btrfs/zoned.c | 304 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 40 ++++++ 6 files changed, 406 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index c0f1d6818df7..0ce68aef2dd7 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1723,6 +1723,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, static int exclude_super_stripes(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; + bool zoned = btrfs_fs_incompat(fs_info, ZONED); u64 bytenr; u64 *logical; int stripe_len; @@ -1744,6 +1745,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache) if (ret) return ret; + /* shouldn't have super stripes in sequential zones */ + if (zoned && nr) { + btrfs_err(fs_info, + "Zoned btrfs's block group %llu should not have super blocks", + cache->start); + return -EUCLEAN; + } + while (nr--) { u64 len = min_t(u64, stripe_len, cache->start + cache->length - logical[nr]); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index acf374b6e1ab..61b50a7df27b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3421,10 +3421,17 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, { struct btrfs_super_block *super; struct page *page; - u64 bytenr; + u64 bytenr, bytenr_orig; struct address_space *mapping = bdev->bd_inode->i_mapping; + int ret; + + bytenr_orig = btrfs_sb_offset(copy_num); + ret = btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr); + if (ret == -ENOENT) + return ERR_PTR(-EINVAL); +
else if (ret) + return ERR_PTR(ret); - bytenr = btrfs_sb_offset(copy_num); if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode)) return ERR_PTR(-EINVAL); @@ -3438,7 +3445,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, return ERR_PTR(-ENODATA); } - if (btrfs_super_bytenr(super) != bytenr) { + if (btrfs_super_bytenr(super) != bytenr_orig) { btrfs_release_disk_super(super); return ERR_PTR(-EINVAL); } @@ -3493,7 +3500,8 @@ static int write_dev_supers(struct btrfs_device *device, SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); int i; int errors = 0; - u64 bytenr; + int ret; + u64 bytenr, bytenr_orig; if (max_mirrors == 0) max_mirrors = BTRFS_SUPER_MIRROR_MAX; @@ -3505,12 +3513,21 @@ static int write_dev_supers(struct btrfs_device *device, struct bio *bio; struct btrfs_super_block *disk_super; - bytenr = btrfs_sb_offset(i); + bytenr_orig = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, WRITE, &bytenr); + if (ret == -ENOENT) + continue; + else if (ret < 0) { + btrfs_err(device->fs_info, "couldn't get super block location for mirror %d", + i); + errors++; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; - btrfs_set_super_bytenr(sb, bytenr); + btrfs_set_super_bytenr(sb, bytenr_orig); crypto_shash_digest(shash, (const char *)sb + BTRFS_CSUM_SIZE, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE, @@ -3555,6 +3572,7 @@ static int write_dev_supers(struct btrfs_device *device, bio->bi_opf |= REQ_FUA; btrfsic_submit_bio(bio); + btrfs_advance_sb_log(device, i); } return errors < i ? 
0 : -1; } @@ -3571,6 +3589,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) int i; int errors = 0; bool primary_failed = false; + int ret; u64 bytenr; if (max_mirrors == 0) @@ -3579,7 +3598,15 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) for (i = 0; i < max_mirrors; i++) { struct page *page; - bytenr = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, READ, &bytenr); + if (ret == -ENOENT) + break; + else if (ret < 0) { + errors++; + if (i == 0) + primary_failed = true; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index cf63f1e27a27..aa1b36cf5c88 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -20,6 +20,7 @@ #include "rcu-string.h" #include "raid56.h" #include "block-group.h" +#include "zoned.h" /* * This is only the first step towards a full-features scrub. It reads all @@ -3704,6 +3705,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->commit_total_bytes) break; + if (!btrfs_check_super_location(scrub_dev, bytenr)) + continue; ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index dc21cb8cdea9..9fd5a2b0a0a7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1281,7 +1281,8 @@ void btrfs_release_disk_super(struct btrfs_super_block *super) } static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev, - u64 bytenr) + u64 bytenr, + u64 bytenr_orig) { struct btrfs_super_block *disk_super; struct page *page; @@ -1312,7 +1313,7 @@ static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev /* align our pointer to the offset of the super block */ disk_super = p + offset_in_page(bytenr); - if (btrfs_super_bytenr(disk_super) != bytenr || + if (btrfs_super_bytenr(disk_super) != 
bytenr_orig || btrfs_super_magic(disk_super) != BTRFS_MAGIC) { btrfs_release_disk_super(p); return ERR_PTR(-EINVAL); @@ -1347,7 +1348,8 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, bool new_device_added = false; struct btrfs_device *device = NULL; struct block_device *bdev; - u64 bytenr; + u64 bytenr, bytenr_orig; + int ret; lockdep_assert_held(&uuid_mutex); @@ -1357,14 +1359,18 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, * So, we need to add a special mount option to scan for * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ - bytenr = btrfs_sb_offset(0); flags |= FMODE_EXCL; bdev = blkdev_get_by_path(path, flags, holder); if (IS_ERR(bdev)) return ERR_CAST(bdev); - disk_super = btrfs_read_disk_super(bdev, bytenr); + bytenr_orig = btrfs_sb_offset(0); + ret = btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr); + if (ret) + return ERR_PTR(ret); + + disk_super = btrfs_read_disk_super(bdev, bytenr, bytenr_orig); if (IS_ERR(disk_super)) { device = ERR_CAST(disk_super); goto error_bdev_put; @@ -2028,6 +2034,11 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, if (IS_ERR(disk_super)) continue; + if (bdev_is_zoned(bdev)) { + btrfs_reset_sb_log_zones(bdev, copy_num); + continue; + } + memset(&disk_super->magic, 0, sizeof(disk_super->magic)); page = virt_to_page(disk_super); diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index e47698d313a5..897ce30cf1a1 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -26,6 +26,27 @@ static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, return 0; } +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zone, + u64 *wp_ret); + +static inline u32 sb_zone_number(u8 shift, int mirror) +{ + ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX); + + switch (mirror) { + case 0: + return 0; + case 1: + return 16; + case 2: + return min(btrfs_sb_offset(mirror) >> shift, 1024ULL); + default: + BUG(); + } + + return 0; +} + static int 
btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, struct blk_zone *zones, unsigned int *nr_zones) { @@ -126,6 +147,40 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) goto out; } + nr_zones = 2; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + u32 sb_zone = sb_zone_number(zone_info->zone_size_shift, i); + u64 sb_wp; + + if (sb_zone + 1 >= zone_info->nr_zones) + continue; + + sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT); + ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, + &zone_info->sb_zones[2 * i], + &nr_zones); + if (ret) + goto out; + if (nr_zones != 2) { + btrfs_err_in_rcu(device->fs_info, + "failed to read SB log zone info at device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EIO; + goto out; + } + + ret = sb_write_pointer(device->bdev, + &zone_info->sb_zones[2 * i], &sb_wp); + if (ret != -ENOENT && ret) { + btrfs_err_in_rcu(device->fs_info, + "SB log zone corrupted: device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EUCLEAN; + goto out; + } + } + + kfree(zones); device->zone_info = zone_info; @@ -305,3 +360,252 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) return 0; } + +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones, + u64 *wp_ret) +{ + bool empty[2]; + bool full[2]; + sector_t sector; + + ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL && + zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL); + + empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY; + empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY; + full[0] = zones[0].cond == BLK_ZONE_COND_FULL; + full[1] = zones[1].cond == BLK_ZONE_COND_FULL; + + /* + * Possible state of log buffer zones + * + * E I F + * E * x 0 + * I 0 x 0 + * F 1 1 C + * + * Row: zones[0] + * Col: zones[1] + * State: + * E: Empty, I: In-Use, F: Full + * Log position: + * *: Special case, no superblock is written + * 0: Use write pointer of zones[0] + * 1: Use write pointer of zones[1] + * C: Compare SBs from 
zones[0] and zones[1], use the newer one + * x: Invalid state + */ + + if (empty[0] && empty[1]) { + /* special case to distinguish no superblock to read */ + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } else if (full[0] && full[1]) { + /* Compare two super blocks */ + struct address_space *mapping = bdev->bd_inode->i_mapping; + struct page *page[2]; + struct btrfs_super_block *super[2]; + int i; + + for (i = 0; i < 2; i++) { + u64 bytenr = ((zones[i].start + zones[i].len) << SECTOR_SHIFT) - + BTRFS_SUPER_INFO_SIZE; + + page[i] = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS); + if (IS_ERR(page[i])) { + if (i == 1) + btrfs_release_disk_super(super[0]); + return PTR_ERR(page[i]); + } + super[i] = page_address(page[i]); + } + + if (super[0]->generation > super[1]->generation) + sector = zones[1].start; + else + sector = zones[0].start; + + for (i = 0; i < 2; i++) + btrfs_release_disk_super(super[i]); + } else if (!full[0] && (empty[1] || full[1])) { + sector = zones[0].wp; + } else if (full[0]) { + sector = zones[1].wp; + } else { + return -EUCLEAN; + } + *wp_ret = sector << SECTOR_SHIFT; + return 0; +} + +static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, + int rw, u64 *bytenr_ret) +{ + u64 wp; + int ret; + + if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { + *bytenr_ret = zones[0].start << SECTOR_SHIFT; + return 0; + } + + ret = sb_write_pointer(bdev, zones, &wp); + if (ret != -ENOENT && ret < 0) + return ret; + + if (rw == WRITE) { + struct blk_zone *reset = NULL; + + if (wp == zones[0].start << SECTOR_SHIFT) + reset = &zones[0]; + else if (wp == zones[1].start << SECTOR_SHIFT) + reset = &zones[1]; + + if (reset) { + ASSERT(reset->cond == BLK_ZONE_COND_FULL); + + ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + reset->start, reset->len, + GFP_NOFS); + if (ret) + return ret; + + reset->cond = BLK_ZONE_COND_EMPTY; + reset->wp = reset->start; + } + } else if (ret != -ENOENT) { + /* For READ, we want the 
previous one */ + if (wp == zones[0].start << SECTOR_SHIFT) + wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT; + wp -= BTRFS_SUPER_INFO_SIZE; + } + + *bytenr_ret = wp; + return 0; + +} + +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret) +{ + struct blk_zone zones[2]; + unsigned int zone_sectors; + u32 sb_zone; + int ret; + u64 zone_size; + u8 zone_sectors_shift; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u32 nr_zones; + + if (!bdev_is_zoned(bdev)) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + ASSERT(rw == READ || rw == WRITE); + + zone_sectors = bdev_zone_sectors(bdev); + if (!is_power_of_2(zone_sectors)) + return -EINVAL; + zone_size = zone_sectors << SECTOR_SHIFT; + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift, 2, + copy_zone_info_cb, zones); + if (ret < 0) + return ret; + if (ret != 2) + return -EIO; + + return sb_log_location(bdev, zones, rw, bytenr_ret); +} + +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u32 zone_num; + + if (!zinfo) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + zone_num = sb_zone_number(zinfo->zone_size_shift, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return -ENOENT; + + return sb_log_location(device->bdev, &zinfo->sb_zones[2 * mirror], rw, + bytenr_ret); +} + +static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo, + int mirror) +{ + u32 zone_num; + + if (!zinfo) + return false; + + zone_num = sb_zone_number(zinfo->zone_size_shift, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return false; + + if (!test_bit(zone_num, zinfo->seq_zones)) + return false; + + return true; +} + +void
btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + struct blk_zone *zone; + + if (!is_sb_log_zone(zinfo, mirror)) + return; + + zone = &zinfo->sb_zones[2 * mirror]; + if (zone->cond != BLK_ZONE_COND_FULL) { + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; + return; + } + + zone++; + ASSERT(zone->cond != BLK_ZONE_COND_FULL); + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; +} + +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) +{ + sector_t zone_sectors; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u8 zone_sectors_shift; + u32 sb_zone; + u32 nr_zones; + + zone_sectors = bdev_zone_sectors(bdev); + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + sb_zone << zone_sectors_shift, zone_sectors * 2, + GFP_NOFS); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 2e1983188e6f..60651040532a 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -10,6 +10,8 @@ #define BTRFS_ZONED_H #include +#include "volumes.h" +#include "disk-io.h" struct btrfs_zoned_device_info { /* @@ -22,6 +24,7 @@ struct btrfs_zoned_device_info { u32 nr_zones; unsigned long *seq_zones; unsigned long *empty_zones; + struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX]; }; #ifdef CONFIG_BLK_DEV_ZONED @@ -31,6 +34,12 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int 
btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info); +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret); +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret); +void btrfs_advance_sb_log(struct btrfs_device *device, int mirror); +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -54,6 +63,26 @@ static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) { return 0; } +static inline int btrfs_sb_log_location_bdev(struct block_device *bdev, + int mirror, int rw, + u64 *bytenr_ret) +{ + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; +} +static inline int btrfs_sb_log_location(struct btrfs_device *device, int mirror, + int rw, u64 *bytenr_ret) +{ + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; +} +static inline void btrfs_advance_sb_log(struct btrfs_device *device, + int mirror) { } +static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, + int mirror) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -121,4 +150,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info, return bdev_zoned_model(bdev) != BLK_ZONED_HM; } +static inline bool btrfs_check_super_location(struct btrfs_device *device, + u64 pos) +{ + /* + * On a non-zoned device, any address is OK. On a zoned device, + * non-SEQUENTIAL WRITE REQUIRED zones are capable. 
+ */ + return device->zone_info == NULL || + !btrfs_dev_is_sequential(device, pos); +} + #endif From patchwork Thu Oct 1 18:36:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812253 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1DFAD112E for ; Thu, 1 Oct 2020 18:38:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EBCC9207DE for ; Thu, 1 Oct 2020 18:38:52 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="IEMikTw5" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732984AbgJASih (ORCPT ); Thu, 1 Oct 2020 14:38:37 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732912AbgJASiQ (ORCPT ); Thu, 1 Oct 2020 14:38:16 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577496; x=1633113496; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=+RRWXz7tmlJ1tNqPrQJ4RdxdUt9Zlw9AI0j6pnlOGs0=; b=IEMikTw50nlrgb7qMGvNOAVXbLtsaGc3MGrP9sriSsdFOG/2ZImvCw7W eos8trBCzPrNcPmLAEcv4oF3ZuNbnMD5pnF/IbSrGpSZs66zlaca4kZG3 bl2hRGgi2zpeSFheH4ywCg46aIm8R8lt9drl/CufxEatVrchZF5bU0p7J kT34gKaAjdqMV47R653fAMtIG9uLg96iDIA944E79+Gntad9gh7NZY39D 6ZpHYESEjnf+OkIGrs7u76Q3ZmR2YW41KikUdCovrygxpoYtA+Qls7T2k Q3pMCAZMhZ7XZshHvMeBi5F2y6H1bp6C9uZv5G47aeSEXItt3JqHZgxIY g==; IronPort-SDR: mSK2kJkRfceXAqGaS9jQhvJe4I2aeF67K6QBXZhfFLXpMSqnrB0C8ye2/R20fvpskoHBUeoX3+ S+BalYFhXqbyJCGJkrvGT/xE2lWG7GCqEdBNKMckVzRiL0B3Wf1XpDYZ91tB9DblOESerL5zik 
AZEdsENifQkFZULVY4zNps/hAUCgRIHJ4lJUFCjtSepUh7W1TAD0ls+6UNWPoiM5/ecqHXYZHe w+wPg0OtO2ITDls6fDrTZvO8xdP0FOgVi5ZKqMYegLqsTFiogEGrdUY4aKTdq6v1/V9zkoGtGI +iM= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036795" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:15 +0800 IronPort-SDR: RbCtpu8P+sJsICofwq4WV5S9IfbTOMDHjjHT2YVOquvFd/m/bMfioX9AjyyAYzJ6D3jyH1KeEY Re20qt/CoOVw== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:12 -0700 IronPort-SDR: xYMP02sZWFYd/YB/8EJ95uoXSUvfu43B7GMgvwMXduCbn/vkWi9cBF8ou8+yeqNNtS/SpPG8r/ m3BdV63GLpCQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:15 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 12/41] btrfs: implement zoned chunk allocator Date: Fri, 2 Oct 2020 03:36:19 +0900 Message-Id: <93fd46b0d81119f60ff3ad5fd1d7a0df73d8a16a.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit implements a zoned chunk/dev_extent allocator. The zoned allocator aligns the device extents to zone boundaries so that a zone reset affects only that device extent and does not change the state of blocks in neighboring device extents. It also checks that an allocated region does not overlap any superblock zones, and ensures the region is empty.
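As a rough illustration of the two checks described in this commit message (not code from this series), the sketch below rounds a candidate position up to a zone boundary and tests whether a range of zones touches a two-zone superblock log. The helper names zone_align_up() and range_hits_sb_log() are made up for this example, zone sizes are assumed to be powers of two, and the range is treated as half-open [begin, end), which is an assumption of this sketch; the test in the patch itself is slightly more conservative at the upper edge.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: round pos up to the next zone boundary.
 * Assumes zone_size is a power of two.
 */
static uint64_t zone_align_up(uint64_t pos, uint64_t zone_size)
{
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/*
 * Hypothetical helper: return true if the zone range [begin, end)
 * overlaps the two zones (sb_zone and sb_zone + 1) reserved for one
 * superblock log.
 */
static bool range_hits_sb_log(uint64_t begin, uint64_t end, uint64_t sb_zone)
{
	return begin < sb_zone + 2 && sb_zone < end;
}
```

With 256 MiB zones, for example, a search starting at 1 MiB would be moved to the first zone boundary at 256 MiB, and a request covering zones 15 and 16 would be rejected because zone 16 belongs to the log of the first superblock copy.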
Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 133 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + fs/btrfs/zoned.c | 126 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 30 ++++++++++ 4 files changed, 290 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9fd5a2b0a0a7..3b6f07330553 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1415,6 +1415,14 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start, return false; } +static inline u64 dev_extent_search_start_zoned(struct btrfs_device *device, + u64 start) +{ + start = max_t(u64, start, + max_t(u64, device->zone_info->zone_size, SZ_1M)); + return btrfs_zone_align(device, start); +} + static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) { switch (device->fs_devices->chunk_alloc_policy) { @@ -1425,11 +1433,57 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) * make sure to start at an offset of at least 1MB. 
*/ return max_t(u64, start, SZ_1M); + case BTRFS_CHUNK_ALLOC_ZONED: + return dev_extent_search_start_zoned(device, start); default: BUG(); } } +static bool dev_extent_hole_check_zoned(struct btrfs_device *device, + u64 *hole_start, u64 *hole_size, + u64 num_bytes) +{ + u64 zone_size = device->zone_info->zone_size; + u64 pos; + int ret; + int changed = 0; + + ASSERT(IS_ALIGNED(*hole_start, zone_size)); + + while (*hole_size > 0) { + pos = btrfs_find_allocatable_zones(device, *hole_start, + *hole_start + *hole_size, + num_bytes); + if (pos != *hole_start) { + *hole_size = *hole_start + *hole_size - pos; + *hole_start = pos; + changed = 1; + if (*hole_size < num_bytes) + break; + } + + ret = btrfs_ensure_empty_zones(device, pos, num_bytes); + + /* range is ensured to be empty */ + if (!ret) + return changed; + + /* given hole range was invalid (outside of device) */ + if (ret == -ERANGE) { + *hole_start += *hole_size; + *hole_size = 0; + return 1; + } + + *hole_start += zone_size; + *hole_size -= zone_size; + changed = 1; + } + + return changed; +} + /** * dev_extent_hole_check - check if specified hole is suitable for allocation * @device: the device which we have the hole @@ -1462,6 +1516,10 @@ static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start, case BTRFS_CHUNK_ALLOC_REGULAR: /* No extra check */ break; + case BTRFS_CHUNK_ALLOC_ZONED: + changed |= dev_extent_hole_check_zoned(device, hole_start, + hole_size, num_bytes); + break; default: BUG(); } @@ -1516,6 +1574,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device, search_start = dev_extent_search_start(device, search_start); + WARN_ON(device->zone_info && + !IS_ALIGNED(num_bytes, device->zone_info->zone_size)); + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -4906,6 +4967,39 @@ static void init_alloc_chunk_ctl_policy_regular( ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes; } +static void +init_alloc_chunk_ctl_policy_zoned(struct btrfs_fs_devices 
*fs_devices, + struct alloc_chunk_ctl *ctl) +{ + u64 zone_size = fs_devices->fs_info->zone_size; + u64 limit; + int min_num_stripes = ctl->devs_min * ctl->dev_stripes; + int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies; + u64 min_chunk_size = min_data_stripes * zone_size; + u64 type = ctl->type; + + ctl->max_stripe_size = zone_size; + if (type & BTRFS_BLOCK_GROUP_DATA) { + ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE, + zone_size); + } else if (type & BTRFS_BLOCK_GROUP_METADATA) { + ctl->max_chunk_size = ctl->max_stripe_size; + } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { + ctl->max_chunk_size = 2 * ctl->max_stripe_size; + ctl->devs_max = min_t(int, ctl->devs_max, + BTRFS_MAX_DEVS_SYS_CHUNK); + } else { + BUG(); + } + + /* We don't want a chunk larger than 10% of writable space */ + limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1), + zone_size), + min_chunk_size); + ctl->max_chunk_size = min(limit, ctl->max_chunk_size); + ctl->dev_extent_min = zone_size * ctl->dev_stripes; +} + static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl) { @@ -4926,6 +5020,9 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, case BTRFS_CHUNK_ALLOC_REGULAR: init_alloc_chunk_ctl_policy_regular(fs_devices, ctl); break; + case BTRFS_CHUNK_ALLOC_ZONED: + init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl); + break; default: BUG(); } @@ -5052,6 +5149,40 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl, return 0; } +static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl, + struct btrfs_device_info *devices_info) +{ + u64 zone_size = devices_info[0].dev->zone_info->zone_size; + int data_stripes; /* number of stripes that count for + block group size */ + + /* + * It should hold because: + * dev_extent_min == dev_extent_want == zone_size * dev_stripes + */ + ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min); + + ctl->stripe_size = 
zone_size; + ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + + /* + * stripe_size is fixed in ZONED. Reduce ndevs instead. + */ + if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) { + ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies, + ctl->stripe_size) + ctl->nparity, + ctl->dev_stripes); + ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size); + } + + ctl->chunk_size = ctl->stripe_size * data_stripes; + + return 0; +} + static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl, struct btrfs_device_info *devices_info) @@ -5079,6 +5210,8 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, switch (fs_devices->chunk_alloc_policy) { case BTRFS_CHUNK_ALLOC_REGULAR: return decide_stripe_size_regular(ctl, devices_info); + case BTRFS_CHUNK_ALLOC_ZONED: + return decide_stripe_size_zoned(ctl, devices_info); default: BUG(); } diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 4bbb15c4161f..c01dd5e40ec8 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -213,6 +213,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used); enum btrfs_chunk_allocation_policy { BTRFS_CHUNK_ALLOC_REGULAR, + BTRFS_CHUNK_ALLOC_ZONED, }; struct btrfs_fs_devices { diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 897ce30cf1a1..b7cf837293e3 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -6,12 +6,14 @@ * Damien Le Moal */ +#include #include #include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" +#include "disk-io.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -324,6 +326,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) fs_info->zone_size = zone_size; fs_info->max_zone_append_size = 
max_zone_append_size; + fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED; btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B", fs_info->zone_size); @@ -609,3 +612,126 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) sb_zone << zone_sectors_shift, zone_sectors * 2, GFP_NOFS); } + +/* + * btrfs_find_allocatable_zones - find allocatable zones within a given region + * @device: the device to allocate a region on + * @hole_start: the position of the hole to allocate the region + * @hole_end: the end position of the hole + * @num_bytes: the size of the wanted region + * + * Allocatable region should not contain any superblock locations. + */ +u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, + u64 hole_end, u64 num_bytes) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u8 shift = zinfo->zone_size_shift; + u64 nzones = num_bytes >> shift; + u64 pos = hole_start; + u64 begin, end; + u64 sb_pos; + bool have_sb; + int i; + + ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size)); + ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size)); + + while (pos < hole_end) { + begin = pos >> shift; + end = begin + nzones; + + if (end > zinfo->nr_zones) + return hole_end; + + /* check if zones in the region are all empty */ + if (btrfs_dev_is_sequential(device, pos) && + find_next_zero_bit(zinfo->empty_zones, end, begin) != end) { + pos += zinfo->zone_size; + continue; + } + + have_sb = false; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + sb_pos = sb_zone_number(zinfo->zone_size, i); + if (!(end < sb_pos || sb_pos + 1 < begin)) { + have_sb = true; + pos = (sb_pos + 2) << shift; + break; + } + } + if (!have_sb) + break; + } + + return pos; +} + +int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, + u64 length, u64 *bytes) +{ + int ret; + + *bytes = 0; + ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_RESET, + physical >> SECTOR_SHIFT, length >> SECTOR_SHIFT, + GFP_NOFS); + if (ret) + return ret; + + 
*bytes = length; + while (length) { + btrfs_dev_set_zone_empty(device, physical); + physical += device->zone_info->zone_size; + length -= device->zone_info->zone_size; + } + + return 0; +} + +int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u8 shift = zinfo->zone_size_shift; + unsigned long begin = start >> shift; + unsigned long end = (start + size) >> shift; + u64 pos; + int ret; + + ASSERT(IS_ALIGNED(start, zinfo->zone_size)); + ASSERT(IS_ALIGNED(size, zinfo->zone_size)); + + if (end > zinfo->nr_zones) + return -ERANGE; + + /* all the zones are conventional */ + if (find_next_bit(zinfo->seq_zones, end, begin) == end) + return 0; + + /* all the zones are sequential and empty */ + if (find_next_zero_bit(zinfo->seq_zones, end, begin) == end && + find_next_zero_bit(zinfo->empty_zones, end, begin) == end) + return 0; + + for (pos = start; pos < start + size; pos += zinfo->zone_size) { + u64 reset_bytes; + + if (!btrfs_dev_is_sequential(device, pos) || + btrfs_dev_is_empty_zone(device, pos)) + continue; + + /* free regions should be empty */ + btrfs_warn_in_rcu( + device->fs_info, + "resetting device %s zone %llu for allocation", + rcu_str_deref(device->name), pos >> shift); + WARN_ON_ONCE(1); + + ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size, + &reset_bytes); + if (ret) + return ret; + } + + return 0; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 60651040532a..02baed605752 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -40,6 +40,11 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, u64 *bytenr_ret); void btrfs_advance_sb_log(struct btrfs_device *device, int mirror); int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); +u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, + u64 hole_end, u64 num_bytes); +int btrfs_reset_device_zone(struct btrfs_device *device, u64 
physical, + u64 length, u64 *bytes); +int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -83,6 +88,23 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, { return 0; } +static inline u64 btrfs_find_allocatable_zones(struct btrfs_device *device, + u64 hole_start, u64 hole_end, + u64 num_bytes) +{ + return hole_start; +} +static inline int btrfs_reset_device_zone(struct btrfs_device *device, + u64 physical, u64 length, u64 *bytes) +{ + *bytes = 0; + return 0; +} +static inline int btrfs_ensure_empty_zones(struct btrfs_device *device, + u64 start, u64 size) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -161,4 +183,12 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device, !btrfs_dev_is_sequential(device, pos); } +static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) +{ + if (!device->zone_info) + return pos; + + return ALIGN(pos, device->zone_info->zone_size); +} + #endif From patchwork Thu Oct 1 18:36:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812249 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4D0AE112E for ; Thu, 1 Oct 2020 18:38:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 25A88208C7 for ; Thu, 1 Oct 2020 18:38:52 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="OZuMxQCa" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732998AbgJASii (ORCPT 
); Thu, 1 Oct 2020 14:38:38 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732931AbgJASiR (ORCPT ); Thu, 1 Oct 2020 14:38:17 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577497; x=1633113497; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=pTkyVP7Td2jDa9Gu75FvxlDevtrMsYCgV6qHyrQB9qk=; b=OZuMxQCaj1H4l87cibGCUj32GCkZgsXlshT43c0Qb/FRrU6xB8NYIP38 ptl/qH6mCWsgOIjUTAP3THyT9sZkzYAivgoXmkhQ6eIk8Z4dXU8HCwtin Pac7W7zUk5M/4JuqYP1eD75Alamn8Zj/uF7D2TP/siFy3zsLFs+hiq33E CVh6PkQFKQHzfPHcuix/3g5bN0t9YdWGbv5WRe+beqDNiETTb37wGhRws J9gIvFJm3nZe0WSizNwgcZC1vf3bG2oXFhmlsaY1d+Y4LNI46RmXgZszj u3vnNhYdnK8A5047OEmPIwwF64/V/EaWUCIBsyZnRL1iCp9JLZJej13tw A==; IronPort-SDR: omaPjMOYPJeOUz1oKOTIh1I6fiOMRWulGXmbO4pjs/JnoPZWpcsXAgwA4ojaaFoiV6U27xa+1A pbbqQFcn98T0w3ApBMtTIOD+h4GXVp+hbECsQspmn0cCRyY85sa95aCVi8qO2h+6K6ffmu3IeA 0OlmpTnsmf8xO4UQucrUS5On2bMwgCMlfvaPFYdGDIBAyy0Phn5muvkDxZXnuzaR36Sx3Sd3wW PluNQBXP6KGpB3a7RBqCepXJ3LJCZvWld/fjIoGL1rWSiosNmPwJlz9F1JZ0S0rgWTF/bpFud3 By8= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036798" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:17 +0800 IronPort-SDR: 5SwzOdgMmjguY66vk5NW5cj+eRqSh0Kc+tIfFabJSNEu32AK1lmRDncoHo50aY2L6qX+syzFoU GIuCBF6/9xRA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:13 -0700 IronPort-SDR: O7kSsBPHNqhzF7Iy9fhEfJWxQWu3OMLqL1Mnt+j+vG/nF5lNx832fhZEOIeD3pvGqd7bqHRBvF xGn7iEv8MkFQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:16 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, 
dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 13/41] btrfs: verify device extent is aligned to zone Date: Fri, 2 Oct 2020 03:36:20 +0900 Message-Id: <7dd11a4a91107e78eab80932ff5ac3c89288a44e.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch adds a check to verify_one_dev_extent() that the device extent is aligned to a zone boundary. Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 3b6f07330553..c22ea7f0551f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7779,6 +7779,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info, ret = -EUCLEAN; goto out; } + + if (dev->zone_info) { + u64 zone_size = dev->zone_info->zone_size; + + if (!IS_ALIGNED(physical_offset, zone_size) || + !IS_ALIGNED(physical_len, zone_size)) { + btrfs_err(fs_info, +"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone", + devid, physical_offset, physical_len); + ret = -EUCLEAN; + goto out; + } + } + out: free_extent_map(em); return ret; From patchwork Thu Oct 1 18:36:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812227 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EF95F139F for ; Thu, 1 Oct 2020 18:38:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C8798207DE for ; Thu, 1 Oct 2020 18:38:39 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com 
header.i=@wdc.com header.b="T4vEYcbR" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733026AbgJASij (ORCPT ); Thu, 1 Oct 2020 14:38:39 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732681AbgJASii (ORCPT ); Thu, 1 Oct 2020 14:38:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577518; x=1633113518; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ZEehZj6Tcg90RXqyTM+IuyHkAyU2szt4bIq7ngLhT/U=; b=T4vEYcbR1HN3jkhcTQm1I5sfmZzCTxgkEYj+ImpzJsPGG9nRFm4PLLUB XdP51+EK4QMdg2HguWkZzmkEjcsRhWNGZazbbAlnQMnVX8pwlWtAfS0EA lVuhoWr4EV+SVCqNSpq0muSWECIXwMi/JbY+6t/U55KZBDjtXgWEjh8iU WhnQ8quj022aziDKZo6uKmFxEtZn7T9T0dC4+avzvUtgr7/V0Ng1YhZLJ qOhCD6MfZi3x8aU/oO8/BhslcIi0hSDLhcgjP2OIaliS4z98JemvY9Tib DjMYML4mVjCaIOzlUJN/atrTe/18BV3+jvK0Rm3Y/G21vnRCv6abC2KrS g==; IronPort-SDR: 4A2mRftsIDf1NusW5g5yaQLLepmixysk7LSKX0Gvr7fJpLW1ykta9QWlj2GE4HtsCRvN0e3mZG putbWZDQz3IassEi+SP/wOQqdkXkILote/iNrM5mzefvLQGCug5TZ8aczlWKFnMMD3+dgIxQLD D64ZHiwvYX9VYRL6tFy5UXxLJGF9NffyPVvLt1BKnXhWkpbqKqdWQ2nwi0ccKWANbbwTp98FZo 2iPLvl2IwWQYBaEjwEJRDccqm5SSs4qckfOSDM0uCQKjnP0BqMGlan/MlKM+/Zx077DDweHjiG ueA= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036799" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:18 +0800 IronPort-SDR: ObzevoTtJwvkXEPiVvoDkZAr0Vs+hrhoemQ1ndqISDcz8vdDLh6lJrOsssUzMpSru4HeSbabaT EuZ2kjMAvYtA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:14 -0700 IronPort-SDR: R70Bz3FVvPBpbw7cqzFWvYWJZNKXPjuSDedroM8wTWvlw0q9Xubmhztc99Y5/FaAUDRp6pE5CZ V9hYC54C7uIg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com 
([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:17 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 14/41] btrfs: load zone's allocation offset Date: Fri, 2 Oct 2020 03:36:21 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Zoned btrfs must allocate blocks at the zones' write pointer. The device's write pointer position can be mapped to a logical address within a block group. This commit adds "alloc_offset" to track the logical address. This logical address is populated in btrfs_load_block_group_zone_info() from the write pointers of the corresponding zones. For now, zoned btrfs only supports the SINGLE profile. Supporting non-SINGLE profiles with zone append writing is not trivial. For example, in the DUP profile, we send a zone append write IO to two zones on a device. The device replies with the written LBAs for the IOs. If the offsets of the returned addresses from the beginning of their zones differ, the two copies end up at different logical addresses. Supporting such diverging physical addresses needs a fine-grained logical-to-physical mapping. Since that would require an additional metadata type, disable non-SINGLE profiles for now. 
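To illustrate the mapping described above, here is a minimal standalone sketch. It is not the kernel code: the enum, the function names, and the userspace types are hypothetical stand-ins, and only the arithmetic mirrors the patch. Zone start and write pointer are given in 512-byte sectors, the returned offset is in bytes. The second helper shows the DUP problem: two zone append writes produce the same logical address only when both land at the same offset within their zones.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the block layer */

/* Hypothetical, simplified stand-in for the block layer's zone conditions. */
enum zone_cond { ZONE_EMPTY, ZONE_PARTIAL, ZONE_FULL };

/*
 * Byte offset within the zone at which the next write must land.
 * start and wp are in sectors, zone_size is in bytes.
 */
static uint64_t zone_alloc_offset(enum zone_cond cond, uint64_t start,
				  uint64_t wp, uint64_t zone_size)
{
	switch (cond) {
	case ZONE_EMPTY:
		return 0;
	case ZONE_FULL:
		return zone_size;
	default:
		/* Partially written zone: distance of wp from the zone start. */
		assert(wp >= start);
		return (wp - start) << SECTOR_SHIFT;
	}
}

/*
 * Nonzero if two zone append completions (written LBA per zone, in sectors)
 * sit at the same offset from their zone starts, i.e. both copies would map
 * to a single logical address.
 */
static int dup_offsets_agree(uint64_t start_a, uint64_t written_a,
			     uint64_t start_b, uint64_t written_b)
{
	return (written_a - start_a) == (written_b - start_b);
}
```

For a zone starting at sector 2048 with its write pointer at sector 2056, the sketch yields an allocation offset of 4096 bytes.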
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 15 ++++ fs/btrfs/block-group.h | 6 ++ fs/btrfs/zoned.c | 153 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++ 4 files changed, 180 insertions(+) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 0ce68aef2dd7..6de3d95ab9d9 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -15,6 +15,7 @@ #include "delalloc-space.h" #include "discard.h" #include "raid56.h" +#include "zoned.h" /* * Return target flags in extended format or 0 if restripe for this chunk_type @@ -1935,6 +1936,13 @@ static int read_one_block_group(struct btrfs_fs_info *info, goto error; } + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_err(info, "failed to load zone info of bg %llu", + cache->start); + goto error; + } + /* * We need to exclude the super stripes now so that the space info has * super bytes accounted for, otherwise we'll think we have more space @@ -2161,6 +2169,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; cache->needs_free_space = 1; + + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_put_block_group(cache); + return ret; + } + ret = exclude_super_stripes(cache); if (ret) { /* We may have excluded something, so call this just in case */ diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index adfd7583a17b..14e3043c9ce7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -183,6 +183,12 @@ struct btrfs_block_group { /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; + + /* + * Allocation offset for the block group to implement sequential + * allocation. This is used only with ZONED mode enabled. 
+ */ + u64 alloc_offset; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index b7cf837293e3..33853f4d5a8b 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -9,14 +9,20 @@ #include #include #include +#include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" #include "disk-io.h" +#include "block-group.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +/* Invalid allocation pointer value for missing devices */ +#define WP_MISSING_DEV ((u64)-1) +/* Pseudo write pointer value for conventional zone */ +#define WP_CONVENTIONAL ((u64)-2) static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data) @@ -735,3 +741,150 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } + +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_device *device; + u64 logical = cache->start; + u64 length = cache->length; + u64 physical = 0; + int ret; + int i; + unsigned int nofs_flag; + u64 *alloc_offsets = NULL; + u32 num_sequential = 0, num_conventional = 0; + + if (!btrfs_fs_incompat(fs_info, ZONED)) + return 0; + + /* Sanity check */ + if (!IS_ALIGNED(length, fs_info->zone_size)) { + btrfs_err(fs_info, "unaligned block group at %llu + %llu", + logical, length); + return -EIO; + } + + /* Get the chunk mapping */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, logical, length); + read_unlock(&em_tree->lock); + + if (!em) + return -EINVAL; + + map = em->map_lookup; + + /* + * Get the zone type: if the group is mapped to a non-sequential zone, + * there is no need for the allocation offset (fit allocation is OK). 
+ */ + alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), + GFP_NOFS); + if (!alloc_offsets) { + free_extent_map(em); + return -ENOMEM; + } + + for (i = 0; i < map->num_stripes; i++) { + bool is_sequential; + struct blk_zone zone; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) { + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } + + is_sequential = btrfs_dev_is_sequential(device, physical); + if (is_sequential) + num_sequential++; + else + num_conventional++; + + if (!is_sequential) { + alloc_offsets[i] = WP_CONVENTIONAL; + continue; + } + + /* + * This zone will be used for allocation, so mark this + * zone non-empty. + */ + btrfs_dev_clear_zone_empty(device, physical); + + /* + * The group is mapped to a sequential zone. Get the zone write + * pointer to determine the allocation offset within the zone. + */ + WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size)); + nofs_flag = memalloc_nofs_save(); + ret = btrfs_get_dev_zone(device, physical, &zone); + memalloc_nofs_restore(nofs_flag); + if (ret == -EIO || ret == -EOPNOTSUPP) { + ret = 0; + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } else if (ret) { + goto out; + } + + switch (zone.cond) { + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + btrfs_err(fs_info, "Offline/readonly zone %llu", + physical >> device->zone_info->zone_size_shift); + alloc_offsets[i] = WP_MISSING_DEV; + break; + case BLK_ZONE_COND_EMPTY: + alloc_offsets[i] = 0; + break; + case BLK_ZONE_COND_FULL: + alloc_offsets[i] = fs_info->zone_size; + break; + default: + /* Partially used zone */ + alloc_offsets[i] = + ((zone.wp - zone.start) << SECTOR_SHIFT); + break; + } + } + + if (num_conventional > 0) { + /* + * Since conventional zones does not have write pointer, we + * cannot determine alloc_offset from the pointer + */ + ret = -EINVAL; + goto out; + } + + switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case 0: /* single */ + cache->alloc_offset = 
alloc_offsets[0]; + break; + case BTRFS_BLOCK_GROUP_DUP: + case BTRFS_BLOCK_GROUP_RAID1: + case BTRFS_BLOCK_GROUP_RAID0: + case BTRFS_BLOCK_GROUP_RAID10: + case BTRFS_BLOCK_GROUP_RAID5: + case BTRFS_BLOCK_GROUP_RAID6: + /* non-SINGLE profiles are not supported yet */ + default: + btrfs_err(fs_info, "Unsupported profile on ZONED %s", + btrfs_bg_type_to_raid_name(map->type)); + ret = -EINVAL; + goto out; + } + +out: + kfree(alloc_offsets); + free_extent_map(em); + + return ret; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 02baed605752..3e05a526f0fa 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -45,6 +45,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -105,6 +106,11 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device, { return 0; } +static inline int btrfs_load_block_group_zone_info( + struct btrfs_block_group *cache) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Thu Oct 1 18:36:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812243 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E8B26174A for ; Thu, 1 Oct 2020 18:38:47 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CB120207DE for ; Thu, 1 Oct 2020 18:38:47 +0000 (UTC) Authentication-Results: 
mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="Y2JJn5Iq" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733051AbgJASiq (ORCPT ); Thu, 1 Oct 2020 14:38:46 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730079AbgJASii (ORCPT ); Thu, 1 Oct 2020 14:38:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577518; x=1633113518; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=70huaWDX98UzfF79GMzQzDz1K+t8l0FjZgA50bJ/GYI=; b=Y2JJn5IqS0HRQsjJNgXhiqk6uHWE13q8v3wnjC314lQ1jBSlnEkk28Rq 0cLjmhIrcEKU+PAwJOkQTc99c7Vs45/i22oKloFR0OEOHbzUsb8/gMVdT aIsZZ3hPTmNpzRtL2HA4puDYuhoPmJfYO5kEXuNixSSP7wR6gs0r7xIIF Ad/3HXzaPb9SPY2AnVCxBkPKo0akK8n2mSDn1k8cquyodHKT30Annr7MR VpbYMUpIURFdtGiMQjbeSpGeTKrryO0/FqsqHiCnR+2pNmI4U2mwX1n+q UCK3hVfgJM0KA8kc+InP6Hze6zkacGcRNJTNWhyg34FzV73Y+D1Sjk0R0 w==; IronPort-SDR: heP25J9XF5+q2jcgRCK/2tYfYJVVGVeBmqrFsi6M0mOU3NYwiWbQMNsUj1yb0VCAcPk/wpJZ9S 9nG6nVsiA+492i2gMWr6ItP9C+va10njWUoe+WjTPiZLvyoxmvtbCKGMaBjDacmLYGPAZLRQRk xEyuqvNxgsJ1GgexLzcjSwC0hN0CYW34MzCne4c09kxRpPuoHjuZ6afkQSOWSvRdHpi+vEa+oB qzneLb/WXD/Mzyvahgw/j7HlOj5+HKlBE3eIfdiV9wcOfL5PskcsKb1yDXzRV0MorVivwAVQYw DaE= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036800" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:19 +0800 IronPort-SDR: tHsrkk+tiFRskOdHUu7eHQqbu6Hkmwp07Gy4tpycOqRttBkvbEINaV5UmayP8Ob1W9u46WStUS ZDpwdq8xFT6A== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:16 -0700 IronPort-SDR: XX39Q6I4OcTGB6pLfxNwA032JYVjh3rW4EjlS2yXoTxRbYqJuQbnfx5mOFYccWOy3/dHY3hPN1 
6CtPvu0t58PA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:18 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 15/41] btrfs: emulate write pointer for conventional zones Date: Fri, 2 Oct 2020 03:36:22 +0900 Message-Id: <62d6c1774cf7ecffeacd66caec23d96dd4fdd70a.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Conventional zones do not have a write pointer. So, we cannot use it to determine the allocation offset if a block group contains a conventional zone. Instead, we can consider the end of the last allocated extent in the block group as an allocation offset. Signed-off-by: Naohiro Aota --- fs/btrfs/zoned.c | 119 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 113 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 33853f4d5a8b..3f65da0c4942 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -742,6 +742,104 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } +static int emulate_write_pointer(struct btrfs_block_group *cache, + u64 *offset_ret) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_key search_key; + struct btrfs_key found_key; + int slot; + int ret; + u64 length; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + search_key.objectid = cache->start + cache->length; + search_key.type = 0; + search_key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); + if (ret < 0) + goto out; + ASSERT(ret != 0); + slot = path->slots[0]; + leaf = path->nodes[0]; + 
ASSERT(slot != 0); + slot--; + btrfs_item_key_to_cpu(leaf, &found_key, slot); + + if (found_key.objectid < cache->start) { + *offset_ret = 0; + } else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) { + struct btrfs_key extent_item_key; + + if (found_key.objectid != cache->start) { + ret = -EUCLEAN; + goto out; + } + + length = 0; + + /* metadata may have METADATA_ITEM_KEY */ + if (slot == 0) { + btrfs_set_path_blocking(path); + ret = btrfs_prev_leaf(root, path); + if (ret < 0) + goto out; + if (ret == 0) { + slot = btrfs_header_nritems(leaf) - 1; + btrfs_item_key_to_cpu(leaf, &extent_item_key, + slot); + } + } else { + btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1); + ret = 0; + } + + if (ret == 0 && + extent_item_key.objectid == cache->start) { + if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY) + length = fs_info->nodesize; + else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY) + length = extent_item_key.offset; + else { + ret = -EUCLEAN; + goto out; + } + } + + *offset_ret = length; + } else if (found_key.type == BTRFS_EXTENT_ITEM_KEY || + found_key.type == BTRFS_METADATA_ITEM_KEY) { + + if (found_key.type == BTRFS_EXTENT_ITEM_KEY) + length = found_key.offset; + else + length = fs_info->nodesize; + + if (!(found_key.objectid >= cache->start && + found_key.objectid + length <= + cache->start + cache->length)) { + ret = -EUCLEAN; + goto out; + } + *offset_ret = found_key.objectid + length - cache->start; + } else { + ret = -EUCLEAN; + goto out; + } + ret = 0; + +out: + btrfs_free_path(path); + return ret; +} + int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; @@ -756,6 +854,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) int i; unsigned int nofs_flag; u64 *alloc_offsets = NULL; + u64 emulated_offset = 0; u32 num_sequential = 0, num_conventional = 0; if (!btrfs_fs_incompat(fs_info, ZONED)) @@ -856,12 +955,12 @@ int 
btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } if (num_conventional > 0) { - /* - * Since conventional zones does not have write pointer, we - * cannot determine alloc_offset from the pointer - */ - ret = -EINVAL; - goto out; + ret = emulate_write_pointer(cache, &emulated_offset); + if (ret || map->num_stripes == num_conventional) { + if (!ret) + cache->alloc_offset = emulated_offset; + goto out; + } } switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { @@ -883,6 +982,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } out: + /* an extent is allocated after the write pointer */ + if (num_conventional && emulated_offset > cache->alloc_offset) { + btrfs_err(fs_info, + "got wrong write pointer in BG %llu: %llu > %llu", + logical, emulated_offset, cache->alloc_offset); + ret = -EIO; + } + kfree(alloc_offsets); free_extent_map(em); From patchwork Thu Oct 1 18:36:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812231 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8D6541668 for ; Thu, 1 Oct 2020 18:38:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4CE88207DE for ; Thu, 1 Oct 2020 18:38:44 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="bjg31Ggi" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733042AbgJASij (ORCPT ); Thu, 1 Oct 2020 14:38:39 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732643AbgJASij (ORCPT ); Thu, 1 Oct 2020 14:38:39 -0400 DKIM-Signature: v=1; a=rsa-sha256; 
c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577518; x=1633113518; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Jr2Oj7xfeM2ehlrNgynuTkCcETlJVvIbv5RDfGLlmSQ=; b=bjg31GgiK9CTJS7pO5UIn+KpIapYR6O+u5WY/hCdFBh2+NwCC4wB8Smr OiZWckayDHgGAr6zayvwFSbW32YC1PYhaUZLwp5BVRPz3+2Bncqa7Q0gb mR/4Pz7SnRc/GBfEYKr5E9/p+vR07fxutl1KQF+JRbfKPmZ0Vzh6ushpc dCSExe3zQJtgY4LEF50lVhKGV+hR2JC4h+BMYGGs491sV6t6AqOuDEdp9 iJylpEmzANihxZKtSAVQmgAtmMnqVKJ1WhSNj0BAHbSHgcUlUrCWa5BrF 0/ROS16TNIQrGOkq+XtxNBmm17kqBNYt1rCfDYKqKParmhpKrGrB2lU0q Q==; IronPort-SDR: Nv9rBvCByaqDZRA6BDPVmlHpeDkAErIgCQEwbnRQ7CszOnDjGEmKc3dKxlbreWysNnGuC/zS+5 gnq2Rbhy8DvJ1fB7OLWPOOqpVTRLddaXWNK0hQApLLdUV+EMvpQf9Ltef5prmApRdzYV6cS5qr T1govyj2l9I6AsZt0cczGC+jJMih1KlLstybuZ0brXc1A0nYpc9VS6KyfeWfTGbOooDB1G3hCf 5Hy9EIVGMTH+X+SKk2IAKXdXlgUY3HmxXZ2f515x6+GP/m+lrmSAThscfsrtn7wRKW775FR1rh qYo= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036803" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:20 +0800 IronPort-SDR: qn7278Kf1RgpdNtdEQ7Z10VAl3zkmqLcI06i5HxtzhzQsAk6kvKzKpXq0QTT7NTVWm7NzlRHl9 R9qHR9jYaoUw== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:17 -0700 IronPort-SDR: qp3BV39J9Hwlraz8nMlWtUKnfv9ZW/ovlUrVeCuUKt/VZQI1rXcpuDXksDrAtybvDdbjUcjZuu R3mVGAgBk4VA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:20 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 16/41] btrfs: track unusable bytes for zones Date: Fri, 2 Oct 2020 03:36:23 +0900 Message-Id: <97c566b91d4cd1542a5c87a2a15137d325dedb7a.1601574234.git.naohiro.aota@wdc.com> X-Mailer: 
git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org In zoned btrfs, a region that has been written and then freed is not usable again until the underlying zones are reset. We need to distinguish such unusable space from usable free space. So, this commit introduces "zone_unusable" to the block group, and "bytes_zone_unusable" to space_info to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before it is used, e.g., because the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For this situation, this commit introduces "btrfs_add_free_space_unused". It behaves the same as btrfs_add_free_space() on regular btrfs. On zoned btrfs, it rewinds the allocation offset. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 19 +++++++++----- fs/btrfs/block-group.h | 1 + fs/btrfs/extent-tree.c | 15 ++++++++--- fs/btrfs/free-space-cache.c | 52 +++++++++++++++++++++++++++++++++++++ fs/btrfs/free-space-cache.h | 4 +++ fs/btrfs/space-info.c | 13 ++++++---- fs/btrfs/space-info.h | 4 ++- fs/btrfs/sysfs.c | 2 ++ fs/btrfs/zoned.c | 22 ++++++++++++++++ fs/btrfs/zoned.h | 2 ++ 10 files changed, 118 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 6de3d95ab9d9..e68e477d9160 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1080,12 +1080,15 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, WARN_ON(block_group->space_info->total_bytes < block_group->length); WARN_ON(block_group->space_info->bytes_readonly - < block_group->length); + < block_group->length - block_group->zone_unusable); + WARN_ON(block_group->space_info->bytes_zone_unusable + < block_group->zone_unusable); WARN_ON(block_group->space_info->disk_total < block_group->length * factor); } block_group->space_info->total_bytes -= 
block_group->length; - block_group->space_info->bytes_readonly -= block_group->length; + block_group->space_info->bytes_readonly -= + (block_group->length - block_group->zone_unusable); block_group->space_info->disk_total -= block_group->length * factor; spin_unlock(&block_group->space_info->lock); @@ -1229,7 +1232,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force) } num_bytes = cache->length - cache->reserved - cache->pinned - - cache->bytes_super - cache->used; + cache->bytes_super - cache->zone_unusable - cache->used; /* * Data never overcommits, even in mixed mode, so do just the straight @@ -1973,6 +1976,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, btrfs_free_excluded_extents(cache); } + btrfs_calc_zone_unusable(cache); + ret = btrfs_add_block_group_cache(info, cache); if (ret) { btrfs_remove_free_space_cache(cache); @@ -1980,7 +1985,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, } trace_btrfs_add_block_group(info, cache, 0); btrfs_update_space_info(info, cache->flags, cache->length, - cache->used, cache->bytes_super, &space_info); + cache->used, cache->bytes_super, + cache->zone_unusable, &space_info); cache->space_info = space_info; @@ -2217,7 +2223,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, */ trace_btrfs_add_block_group(fs_info, cache, 1); btrfs_update_space_info(fs_info, cache->flags, size, bytes_used, - cache->bytes_super, &cache->space_info); + cache->bytes_super, 0, &cache->space_info); btrfs_update_global_block_rsv(fs_info); link_block_group(cache); @@ -2325,7 +2331,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache) spin_lock(&cache->lock); if (!--cache->ro) { num_bytes = cache->length - cache->reserved - - cache->pinned - cache->bytes_super - cache->used; + cache->pinned - cache->bytes_super - + cache->zone_unusable - cache->used; sinfo->bytes_readonly -= num_bytes; list_del_init(&cache->ro_list); } diff --git 
a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 14e3043c9ce7..5be47f4bfea7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -189,6 +189,7 @@ struct btrfs_block_group { * allocation. This is used only with ZONED mode enabled. */ u64 alloc_offset; + u64 zone_unusable; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3b21fee13e77..051e61f16cbe 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -34,6 +34,7 @@ #include "block-group.h" #include "discard.h" #include "rcu-string.h" +#include "zoned.h" #undef SCRAMBLE_DELAYED_REFS @@ -2807,9 +2808,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, cache = btrfs_lookup_block_group(fs_info, start); BUG_ON(!cache); /* Logic error */ - cluster = fetch_cluster_info(fs_info, - cache->space_info, - &empty_cluster); + if (!btrfs_fs_incompat(fs_info, ZONED)) + cluster = fetch_cluster_info(fs_info, + cache->space_info, + &empty_cluster); + empty_cluster <<= 1; } @@ -2846,7 +2849,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, space_info->max_extent_size = 0; percpu_counter_add_batch(&space_info->total_bytes_pinned, -len, BTRFS_TOTAL_BYTES_PINNED_BATCH); - if (cache->ro) { + if (btrfs_fs_incompat(fs_info, ZONED)) { + /* need reset before reusing in zoned Block Group */ + space_info->bytes_zone_unusable += len; + readonly = true; + } else if (cache->ro) { space_info->bytes_readonly += len; readonly = true; } diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index af0013d3df63..65dd1538692a 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2467,6 +2467,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, int ret = 0; u64 filter_bytes = bytes; + ASSERT(!btrfs_fs_incompat(fs_info, ZONED)); + info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS); if (!info) return -ENOMEM; @@ -2524,11 +2526,44 @@ 
int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, return ret; } +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used) +{ + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 offset = bytenr - block_group->start; + u64 to_free, to_unusable; + + spin_lock(&ctl->tree_lock); + if (!used) + to_free = size; + else if (offset >= block_group->alloc_offset) + to_free = size; + else if (offset + size <= block_group->alloc_offset) + to_free = 0; + else + to_free = offset + size - block_group->alloc_offset; + to_unusable = size - to_free; + + ctl->free_space += to_free; + block_group->zone_unusable += to_unusable; + spin_unlock(&ctl->tree_lock); + if (!used) { + spin_lock(&block_group->lock); + block_group->alloc_offset -= size; + spin_unlock(&block_group->lock); + } + return 0; +} + int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size) { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2537,6 +2572,16 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group, bytenr, size, trim_state); } +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size) +{ + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + false); + + return btrfs_add_free_space(block_group, bytenr, size); +} + /* * This is a subtle distinction because when adding free space back in general, * we want it to be added as untrimmed for async. 
But in the case where we add @@ -2547,6 +2592,10 @@ int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) || btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2564,6 +2613,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group, int ret; bool re_search = false; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return 0; + spin_lock(&ctl->tree_lock); again: diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index e3d5e0ad8f8e..7081216257a8 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -114,8 +114,12 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space_ctl *ctl, u64 bytenr, u64 size, enum btrfs_trim_state trim_state); +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used); int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size); +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size); int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, u64 bytenr, u64 size); int btrfs_remove_free_space(struct btrfs_block_group *block_group, diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index 64099565ab8f..bbbf3c1412a4 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -163,6 +163,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info, ASSERT(s_info); return s_info->bytes_used + s_info->bytes_reserved + s_info->bytes_pinned + s_info->bytes_readonly + + s_info->bytes_zone_unusable + (may_use_included ? 
s_info->bytes_may_use : 0); } @@ -257,7 +258,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info) void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info) { struct btrfs_space_info *found; @@ -273,6 +274,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, found->bytes_used += bytes_used; found->disk_used += bytes_used * factor; found->bytes_readonly += bytes_readonly; + found->bytes_zone_unusable += bytes_zone_unusable; if (total_bytes > 0) found->full = 0; btrfs_try_granting_tickets(info, found); @@ -422,10 +424,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info, info->total_bytes - btrfs_space_info_used(info, true), info->full ? "" : "not "); btrfs_info(fs_info, - "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu", + "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu", info->total_bytes, info->bytes_used, info->bytes_pinned, info->bytes_reserved, info->bytes_may_use, - info->bytes_readonly); + info->bytes_readonly, info->bytes_zone_unusable); DUMP_BLOCK_RSV(fs_info, global_block_rsv); DUMP_BLOCK_RSV(fs_info, trans_block_rsv); @@ -454,9 +456,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, list_for_each_entry(cache, &info->block_groups[index], list) { spin_lock(&cache->lock); btrfs_info(fs_info, - "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s", + "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s", cache->start, cache->length, cache->used, cache->pinned, - cache->reserved, cache->ro ? "[readonly]" : ""); + cache->reserved, cache->zone_unusable, + cache->ro ? 
"[readonly]" : ""); spin_unlock(&cache->lock); btrfs_dump_free_space(cache, bytes); } diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h index 5646393b928c..ee003ffba956 100644 --- a/fs/btrfs/space-info.h +++ b/fs/btrfs/space-info.h @@ -17,6 +17,8 @@ struct btrfs_space_info { u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ u64 bytes_readonly; /* total bytes that are read only */ + u64 bytes_zone_unusable; /* total bytes that are unusable until + resetting the device zone */ u64 max_extent_size; /* This will hold the maximum extent size of the space info if we had an ENOSPC in the @@ -119,7 +121,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned"); int btrfs_init_space_info(struct btrfs_fs_info *fs_info); void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info); struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info, u64 flags); diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 828006020bbd..ea679803da9b 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -635,6 +635,7 @@ SPACE_INFO_ATTR(bytes_pinned); SPACE_INFO_ATTR(bytes_reserved); SPACE_INFO_ATTR(bytes_may_use); SPACE_INFO_ATTR(bytes_readonly); +SPACE_INFO_ATTR(bytes_zone_unusable); SPACE_INFO_ATTR(disk_used); SPACE_INFO_ATTR(disk_total); BTRFS_ATTR(space_info, total_bytes_pinned, @@ -648,6 +649,7 @@ static struct attribute *space_info_attrs[] = { BTRFS_ATTR_PTR(space_info, bytes_reserved), BTRFS_ATTR_PTR(space_info, bytes_may_use), BTRFS_ATTR_PTR(space_info, bytes_readonly), + BTRFS_ATTR_PTR(space_info, bytes_zone_unusable), BTRFS_ATTR_PTR(space_info, disk_used), BTRFS_ATTR_PTR(space_info, disk_total), BTRFS_ATTR_PTR(space_info, total_bytes_pinned), diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 3f65da0c4942..266daf7d60b7 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -995,3 
+995,25 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) return ret; } + +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) +{ + u64 unusable, free; + + if (!btrfs_fs_incompat(cache->fs_info, ZONED)) + return; + + WARN_ON(cache->bytes_super != 0); + unusable = cache->alloc_offset - cache->used; + free = cache->length - cache->alloc_offset; + /* we only need ->free_space in ALLOC_SEQ BGs */ + cache->last_byte_to_unpin = (u64)-1; + cache->cached = BTRFS_CACHE_FINISHED; + cache->free_space_ctl->free_space = free; + cache->zone_unusable = unusable; + /* + * Should not have any excluded extents. Just + * in case, though. + */ + btrfs_free_excluded_extents(cache); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 3e05a526f0fa..ab048176a397 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -46,6 +46,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -111,6 +112,7 @@ static inline int btrfs_load_block_group_zone_info( { return 0; } +static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Thu Oct 1 18:36:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812239 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 050431668 for ; Thu, 1 Oct 2020 18:38:47 +0000 (UTC) Received: from vger.kernel.org 
(vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D449B20B1F for ; Thu, 1 Oct 2020 18:38:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="nUpNdl8U" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733047AbgJASip (ORCPT ); Thu, 1 Oct 2020 14:38:45 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733006AbgJASii (ORCPT ); Thu, 1 Oct 2020 14:38:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577519; x=1633113519; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=cNblnJKvEcyik+Noh5LEork4Tx0gPmOKp5pglcVxp74=; b=nUpNdl8UkKl26rD5E3SnxAdn58yUvunFRBm50NrGsSAbQsKuRJRQ6U17 G0S2/41SmWpajliDABhPfL9XcwO1hXQeRruOULT+VWhVPBjGeisxUMvgI IBnUjBSmVBI1roMAQNIYgrH/KE/M4+uSHSwySqK+iB5oa3iL86e057PHY lbPWZWGhy4zNirYn0zSQ3772cfcU47I/GrxNNk2eBjfBQhk4ML714Y9xM sIEbXnQTlaHIpFTUMUDWvyrw0BJeDJF1l4xsqu7S5ODiKsEiNe/kONhiB kifvoVWB5Izks1G9K31vnUnLyhhNGWS4BmL3ctD1SQhOveYmhI1ZzdCYQ g==; IronPort-SDR: ioOsWv/jNcNZ4mcVP7WFoLD93a4nx2H8KXlyveI/1Pak74To2yDyM+ViaG8F37jHqir8gh1uaJ QIxzsq6Tknz3L9HF5wy18+oeAPCRNwPRcWQgJsdHXywaVaHu0yW1UfIytWQnFLxcNOhyGxXn9B 4sURnyXHzLzNPA4JANWWroqkCViQgEYirneO3XiS7lm2H8n7fYn0gcY8XV+DNzeHXqEzQc70Yn cc0LLHJADKh9QuKTwMe3tTDlghXeG+yNVld4bS7jWCGPiTj17LLyymC/qdfTrBVnxN57lQmgFu 1vw= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036804" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:22 +0800 IronPort-SDR: IC19x5HXaswEZ8tSmcgoQTZZRme6HEqKVjDOFqYQV4eaX0+/MoMeOAJHnMyEYcHI28lkX1H0N6 YyhlzswpudAQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:18 -0700 IronPort-SDR: QDeEemlrT54oZ6cmR5Zrc6XCKHYOLKd4Mx9DwqkQtyg+Te0SwNj377z6vhf753OYl6t8zoYPFd FRPW/XgLeuPA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:21 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 17/41] btrfs: do sequential extent allocation in ZONED mode Date: Fri, 2 Oct 2020 03:36:24 +0900 Message-Id: <0fd14437a7a5e7d979611542475f5d294953565d.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit implements a sequential extent allocator for ZONED mode. The allocator only needs to check whether there is enough space in the block group. Since the allocator never manages bitmaps or clusters, this commit also adds ASSERTs to the corresponding functions. Actually, with zone append writing, it is unnecessary to track the allocation offset; the allocator only needs to check space availability. But by tracking the offset and returning it as the start of the allocated region, we can skip modifying ordered extents and checksum information when there is no IO reordering. 
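The allocation rule described above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel code: `struct zoned_bg` and `zoned_alloc()` are hypothetical names, locking and the read-only check are omitted, and the fields merely mirror `start`, `length`, `alloc_offset` and `free_space` as they are used in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a zoned block group. */
struct zoned_bg {
	uint64_t start;        /* logical start address of the block group */
	uint64_t length;       /* size of the block group in bytes */
	uint64_t alloc_offset; /* next allocation position, relative to start */
	uint64_t free_space;   /* bytes that can still be allocated */
};

/*
 * Sequential allocation: succeed only if num_bytes still fit behind the
 * allocation offset, then hand out that position and advance the offset.
 * No bitmap, cluster or tree lookup is involved.
 */
bool zoned_alloc(struct zoned_bg *bg, uint64_t num_bytes,
		 uint64_t *found_offset)
{
	uint64_t avail = bg->length - bg->alloc_offset;

	if (avail < num_bytes)
		return false;

	*found_offset = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	bg->free_space -= num_bytes;
	return true;
}
```

In this model the offset only moves forward, so successive allocations are contiguous and the returned address can be used directly as the start of the allocated region.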
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 4 ++ fs/btrfs/extent-tree.c | 82 ++++++++++++++++++++++++++++++++++--- fs/btrfs/free-space-cache.c | 6 +++ 3 files changed, 86 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index e68e477d9160..f07c03445390 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -683,6 +683,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only struct btrfs_caching_control *caching_ctl; int ret = 0; + /* Allocator for ZONED btrfs do not use the cache at all */ + if (btrfs_fs_incompat(fs_info, ZONED)) + return 0; + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS); if (!caching_ctl) return -ENOMEM; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 051e61f16cbe..2be93d0f5978 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3562,6 +3562,7 @@ btrfs_release_block_group(struct btrfs_block_group *cache, enum btrfs_extent_allocation_policy { BTRFS_EXTENT_ALLOC_CLUSTERED, + BTRFS_EXTENT_ALLOC_ZONED, }; /* @@ -3814,6 +3815,55 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group, return find_free_extent_unclustered(block_group, ffe_ctl); } +/* + * Simple allocator for sequential only block group. It only allows + * sequential allocation. No need to play with trees. This function + * also reserve the bytes as in btrfs_add_reserved_bytes. 
+ */ +static int do_allocation_zoned(struct btrfs_block_group *block_group, + struct find_free_extent_ctl *ffe_ctl, + struct btrfs_block_group **bg_ret) +{ + struct btrfs_space_info *space_info = block_group->space_info; + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 start = block_group->start; + u64 num_bytes = ffe_ctl->num_bytes; + u64 avail; + int ret = 0; + + ASSERT(btrfs_fs_incompat(block_group->fs_info, ZONED)); + + spin_lock(&space_info->lock); + spin_lock(&block_group->lock); + + if (block_group->ro) { + ret = 1; + goto out; + } + + avail = block_group->length - block_group->alloc_offset; + if (avail < num_bytes) { + ffe_ctl->max_extent_size = avail; + ret = 1; + goto out; + } + + ffe_ctl->found_offset = start + block_group->alloc_offset; + block_group->alloc_offset += num_bytes; + spin_lock(&ctl->tree_lock); + ctl->free_space -= num_bytes; + spin_unlock(&ctl->tree_lock); + + ASSERT(IS_ALIGNED(ffe_ctl->found_offset, + block_group->fs_info->stripesize)); + ffe_ctl->search_start = ffe_ctl->found_offset; + +out: + spin_unlock(&block_group->lock); + spin_unlock(&space_info->lock); + return ret; +} + static int do_allocation(struct btrfs_block_group *block_group, struct find_free_extent_ctl *ffe_ctl, struct btrfs_block_group **bg_ret) @@ -3821,6 +3871,8 @@ static int do_allocation(struct btrfs_block_group *block_group, switch (ffe_ctl->policy) { case BTRFS_EXTENT_ALLOC_CLUSTERED: return do_allocation_clustered(block_group, ffe_ctl, bg_ret); + case BTRFS_EXTENT_ALLOC_ZONED: + return do_allocation_zoned(block_group, ffe_ctl, bg_ret); default: BUG(); } @@ -3835,6 +3887,9 @@ static void release_block_group(struct btrfs_block_group *block_group, ffe_ctl->retry_clustered = false; ffe_ctl->retry_unclustered = false; break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3863,6 +3918,9 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl, case BTRFS_EXTENT_ALLOC_CLUSTERED: 
found_extent_clustered(ffe_ctl, ins); break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3878,6 +3936,9 @@ static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl) */ ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; return 0; + case BTRFS_EXTENT_ALLOC_ZONED: + /* give up here */ + return -ENOSPC; default: BUG(); } @@ -4046,6 +4107,9 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info, case BTRFS_EXTENT_ALLOC_CLUSTERED: return prepare_allocation_clustered(fs_info, ffe_ctl, space_info, ins); + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + return 0; default: BUG(); } @@ -4109,6 +4173,9 @@ static noinline int find_free_extent(struct btrfs_root *root, ffe_ctl.last_ptr = NULL; ffe_ctl.use_cluster = true; + if (btrfs_fs_incompat(fs_info, ZONED)) + ffe_ctl.policy = BTRFS_EXTENT_ALLOC_ZONED; + ins->type = BTRFS_EXTENT_ITEM_KEY; ins->objectid = 0; ins->offset = 0; @@ -4251,20 +4318,23 @@ static noinline int find_free_extent(struct btrfs_root *root, /* move on to the next group */ if (ffe_ctl.search_start + num_bytes > block_group->start + block_group->length) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } if (ffe_ctl.found_offset < ffe_ctl.search_start) - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - ffe_ctl.search_start - ffe_ctl.found_offset); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + ffe_ctl.search_start - ffe_ctl.found_offset); ret = btrfs_add_reserved_bytes(block_group, ram_bytes, num_bytes, delalloc); if (ret == -EAGAIN) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } btrfs_inc_block_group_reservations(block_group); diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 65dd1538692a..d6bc1f97cc78 100644 --- 
a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2903,6 +2903,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group, u64 align_gap_len = 0; enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + spin_lock(&ctl->tree_lock); entry = find_free_space(ctl, &offset, &bytes_search, block_group->full_stripe_len, max_extent_size); @@ -3034,6 +3036,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group, struct rb_node *node; u64 ret = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + spin_lock(&cluster->lock); if (bytes > cluster->max_size) goto out; @@ -3810,6 +3814,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group, int ret; u64 rem = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + *trimmed = 0; spin_lock(&block_group->lock); From patchwork Thu Oct 1 18:36:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812311 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5057D139F for ; Thu, 1 Oct 2020 18:39:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2DC97208C7 for ; Thu, 1 Oct 2020 18:39:28 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="jvtk/ToH" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730065AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733019AbgJASij (ORCPT ); Thu, 1 Oct 2020 14:38:39 -0400 DKIM-Signature: v=1; 
a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577519; x=1633113519; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=e3ZRvgpBpiKkHapBfHOfsgCp4wI4McFaHSm5NK2vFyc=; b=jvtk/ToHzWQaZfTXXtDTRTUgnNGjkY7dEGFTsbUHE4xGP1OwNKLkFtDd EJmAsE7UOsf8tAfcZa4amJImdA703bZjLC8emhhLRZtZvmZJdwYbSUQ6O TmOpwIAg9sZktIrz2NzcFQAPr05axlnX8yPK1wvMchnSNXJ0B5ck2PxEk p6sLebQopAe4t/6hzm9TWB4LEgO51gzwWpPOgKyATo+yoaYpnfM6SyxGu EFu4Zm4KbyuM9YSsD/nn2ShjNhJuPU92apkKpV7n/afFvz8b8tQRRx19X nPnWXmpnD9TrMBFj+vsyL2kbFVN5YsGrTumPRfohyl9qmU+68S8KC1+XY g==; IronPort-SDR: UgzBMboBuYcFYZr9C+IYOtrucNQ3uOGojp17XqiCewazgJKbd9GmwjI0g4/PIFQXg/KxiItZLt VdxI3fi689s1bdw0PcXj30qkKW/OxJgFq+KH3Nvs2S3swO/vnvqavVek1uhDyFqqeL2v7LEqHA 3rtmQTpn02BYT/0LAE8uu31+H8jdUxyxGHVyO7lN76NLqB3kZAAzxQuFyOD08Di2vatndhlDJV +WeW9sXc1HyV9jJj3563lfm/VqO08Lgtje6FLtUHPLSbMHVSnSarFvg3fAEE9P2/cxKI4BO/wE t/Y= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036806" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:23 +0800 IronPort-SDR: EnoYaXcVrxGdNJ7miS/RvSZo81lFp15Q55NUlMrh4/FyKmng9l2IyOuvFinj5CVkGN6cHaTAGH brGyalSolv6A== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:19 -0700 IronPort-SDR: ZFXTKx4UVT9rRXEAwRn86DnwwvDOkFzabmhtYlDiTifzaXXX5cMsJlS1oiaT7ZgjLWVuuZjIJX WplZLfWSOVtQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:22 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 18/41] btrfs: reset zones of unused block groups Date: Fri, 2 Oct 2020 03:36:25 +0900 Message-Id: 
<118ef50059dbbef6ea9febb30949716a26da6ef3.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org For a ZONED volume, a block group maps to a zone of the device. For deleted unused block groups, the zone of the block group can be reset to rewind the zone write pointer to the start of the zone. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 8 ++++++-- fs/btrfs/extent-tree.c | 17 ++++++++++++----- fs/btrfs/zoned.h | 16 ++++++++++++++++ 3 files changed, 34 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index f07c03445390..2241d04ad4aa 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1468,8 +1468,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC)) goto flip_async; - /* DISCARD can flip during remount */ - trimming = btrfs_test_opt(fs_info, DISCARD_SYNC); + /* + * DISCARD can flip during remount. In ZONED mode, we need + * to reset sequential required zones. + */ + trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) || + btrfs_fs_incompat(fs_info, ZONED); /* Implicit trim during transaction commit. 
*/ if (trimming) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2be93d0f5978..dbf178fee7c7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1331,6 +1331,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, stripe = bbio->stripes; for (i = 0; i < bbio->num_stripes; i++, stripe++) { + struct btrfs_device *dev = stripe->dev; + u64 physical = stripe->physical; + u64 length = stripe->length; u64 bytes; struct request_queue *req_q; @@ -1338,14 +1341,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, ASSERT(btrfs_test_opt(fs_info, DEGRADED)); continue; } + req_q = bdev_get_queue(stripe->dev->bdev); - if (!blk_queue_discard(req_q)) + /* zone reset in ZONED mode */ + if (btrfs_can_zone_reset(dev, physical, length)) + ret = btrfs_reset_device_zone(dev, physical, + length, &bytes); + else if (blk_queue_discard(req_q)) + ret = btrfs_issue_discard(dev->bdev, physical, + length, &bytes); + else continue; - ret = btrfs_issue_discard(stripe->dev->bdev, - stripe->physical, - stripe->length, - &bytes); if (!ret) { discarded_bytes += bytes; } else if (ret != -EOPNOTSUPP) { diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index ab048176a397..e388189b28f0 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -199,4 +199,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) return ALIGN(pos, device->zone_info->zone_size); } +static inline bool btrfs_can_zone_reset(struct btrfs_device *device, + u64 physical, u64 length) +{ + u64 zone_size; + + if (!btrfs_dev_is_sequential(device, physical)) + return false; + + zone_size = device->zone_info->zone_size; + if (!IS_ALIGNED(physical, zone_size) || + !IS_ALIGNED(length, zone_size)) + return false; + + return true; +} + #endif From patchwork Thu Oct 1 18:36:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812315 Return-Path: 
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C6761112E for ; Thu, 1 Oct 2020 18:39:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9DB9A208C7 for ; Thu, 1 Oct 2020 18:39:29 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="JCfcoAWt" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733081AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733030AbgJASik (ORCPT ); Thu, 1 Oct 2020 14:38:40 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577519; x=1633113519; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=uizKWFRSDOYPjgfFw88rW+cljtTsLWamHVwuFZBtHBA=; b=JCfcoAWtWFQYJms19C/8aGuXnrmRJ9TvIeWW5QW7ugtvACzDeRAC4oeo eovbH0Y6oHx7RGOWd+w8QcdOvjOAeM/cOy+3X6MVyxdGlClRo2lyyrp8a sj7c1sJq+2mTrU4butGDI2UVlmHsoZupZPoXxjAhEL8Xootq3W2qjtu/h oUjnDSHsLjCnF6vn9ddG1corUn32gsgUScED4/Kd0g8tvW0os5NlJkeWF PRGuubOjP559FkcavoDmqGl2wYm0uonrvfDZ6LcZQ/lUmRt2ch1ZhIgCw H2F/RlYGqcAq/YOgkdf5TdVp81EflHtTBkAJe2f/DTh+l3uQTuQpdqMPs w==; IronPort-SDR: ViTsX5nmWNdJYsYRnKtvPPgWrv0wDhRTw+FaghjT9j4L/Mwf1+p76KRLywpi3WRoFEL9gntfC6 OWxFrzRABm+FPcirv3nuwRjMwNMO3RDctkjaNTKzc9VZA8TmWRiPF5UvUHgweF+Vl0ycODO+ut 2YjB0pHNKw9Et6alDCv2qG/TrFbFpeon2SO5CsJVpi5Dg5cIpuOHKUPA4gj33Q8rIO5EFe/TyM Zh/ZfXLSz7WSN33YxsiOelIxTCi46S+I62asTdcPy8ibpw0yvWMnBEY1sO8/pfY7okO0AMS0hS aTA= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036808" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com 
with ESMTP; 02 Oct 2020 02:38:24 +0800 IronPort-SDR: f72EYRq2gadNDZkKzbLy21cAFrSrreM4dW2yEUSGveHanIRuxtfhUwSbPhDsJbcn/x2ko3MOSi gm5EoHMwFwpQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:20 -0700 IronPort-SDR: rH5PeDQzLOrVBWNtFV5gFn0VqRmqPo1DUrnV5SntOE/8uPgKj6ieeHJypW5fRHuqqiFgJObtmB 4dN7yV0qgUUQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:23 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 19/41] btrfs: redirty released extent buffers in ZONED mode Date: Fri, 2 Oct 2020 03:36:26 +0900 Message-Id: <7c0ff07e2a2c5c98824348f7afe8f944637b4f90.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Tree manipulating operations like merging nodes often release once-allocated tree nodes. Btrfs cleans such nodes so that pages in the node are not uselessly written out. On ZONED volumes, however, this optimization blocks the following IOs, because cancelling the write out of the freed blocks breaks the sequential write sequence expected by the device. This patch introduces a list of clean and unwritten extent buffers that have been released in a transaction. Btrfs redirties these buffers so that btree_write_cache_pages() can send proper bios to the devices. In addition, it clears the entire content of the extent buffer so as not to confuse raw block scanners, e.g. btrfsck. Once the content is cleared, csum_dirty_buffer() complains about a bytenr mismatch, so skip the check and the checksum using the newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. 
Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 8 ++++++++ fs/btrfs/extent-tree.c | 12 +++++++++++- fs/btrfs/extent_io.c | 4 ++++ fs/btrfs/extent_io.h | 2 ++ fs/btrfs/transaction.c | 10 ++++++++++ fs/btrfs/transaction.h | 3 +++ fs/btrfs/tree-log.c | 6 ++++++ fs/btrfs/zoned.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++++++ 9 files changed, 87 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 61b50a7df27b..c872f051b0a5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -462,6 +462,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page) return 0; found_start = btrfs_header_bytenr(eb); + + if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) { + WARN_ON(found_start != 0); + return 0; + } + /* * Please do not consolidate these warnings into a single if. * It is useful to know what went wrong. @@ -4614,6 +4620,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans, EXTENT_DIRTY); btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents); + btrfs_free_redirty_list(cur_trans); + cur_trans->state =TRANS_STATE_COMPLETED; wake_up(&cur_trans->commit_wait); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dbf178fee7c7..c0e4a577c61c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3421,8 +3421,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { ret = check_ref_cleanup(trans, buf->start); - if (!ret) + if (!ret) { + btrfs_redirty_list_add(trans->transaction, buf); goto out; + } } pin = 0; @@ -3434,6 +3436,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, goto out; } + if (btrfs_fs_incompat(fs_info, ZONED)) { + btrfs_redirty_list_add(trans->transaction, buf); + pin_down_extent(trans, cache, buf->start, buf->len, 1); + btrfs_put_block_group(cache); + goto out; + } + WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)); 
btrfs_add_free_space(cache, buf->start, buf->len); @@ -4767,6 +4776,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, __btrfs_tree_lock(buf, nest); btrfs_clean_tree_block(buf); clear_bit(EXTENT_BUFFER_STALE, &buf->bflags); + clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags); btrfs_set_lock_blocking_write(buf); set_extent_buffer_uptodate(buf); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 60f5f68d892d..e91c504fe973 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -24,6 +24,7 @@ #include "rcu-string.h" #include "backref.h" #include "disk-io.h" +#include "zoned.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -4959,6 +4960,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list, &fs_info->allocated_ebs); + INIT_LIST_HEAD(&eb->release_list); spin_lock_init(&eb->refs_lock); atomic_set(&eb->refs, 1); @@ -5744,6 +5746,8 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv, char *src = (char *)srcv; unsigned long i = start >> PAGE_SHIFT; + WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)); + if (check_eb_range(eb, start, len)) return; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index f39d02e7f7ef..5f2ccfd0205e 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -30,6 +30,7 @@ enum { EXTENT_BUFFER_IN_TREE, /* write IO error */ EXTENT_BUFFER_WRITE_ERR, + EXTENT_BUFFER_NO_CHECK, }; /* these are flags for __process_pages_contig */ @@ -107,6 +108,7 @@ struct extent_buffer { */ wait_queue_head_t read_lock_wq; struct page *pages[INLINE_EXTENT_BUFFER_PAGES]; + struct list_head release_list; #ifdef CONFIG_BTRFS_DEBUG int spinning_writers; atomic_t spinning_readers; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 84b506de2c91..fb02668026c9 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -22,6 +22,7 @@ 
#include "qgroup.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" #define BTRFS_ROOT_TRANS_TAG 0 @@ -336,6 +337,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info, spin_lock_init(&cur_trans->dirty_bgs_lock); INIT_LIST_HEAD(&cur_trans->deleted_bgs); spin_lock_init(&cur_trans->dropped_roots_lock); + INIT_LIST_HEAD(&cur_trans->releasing_ebs); + spin_lock_init(&cur_trans->releasing_ebs_lock); list_add_tail(&cur_trans->list, &fs_info->trans_list); extent_io_tree_init(fs_info, &cur_trans->dirty_pages, IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode); @@ -2346,6 +2349,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) goto scrub_continue; } + /* + * At this point, we should have written all the tree blocks + * allocated in this transaction. So it's now safe to free the + * redirtied extent buffers. + */ + btrfs_free_redirty_list(cur_trans); + ret = write_all_supers(fs_info, 0); /* * the super is written, we can safely allow the tree-loggers diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 858d9153a1cd..380e0aaa15b3 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -92,6 +92,9 @@ struct btrfs_transaction { */ atomic_t pending_ordered; wait_queue_head_t pending_wait; + + spinlock_t releasing_ebs_lock; + struct list_head releasing_ebs; }; #define __TRANS_FREEZABLE (1U << 0) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 56cbc1706b6f..5f585cf57383 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -20,6 +20,7 @@ #include "inode-map.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" /* magic values for the inode_only field in btrfs_log_inode: * @@ -2742,6 +2743,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans, free_extent_buffer(next); return ret; } + btrfs_redirty_list_add( + trans->transaction, next); } else { if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags)) clear_extent_buffer_dirty(next); @@
-3277,6 +3280,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); extent_io_tree_release(&log->log_csum_range); + + if (trans && log->node) + btrfs_redirty_list_add(trans->transaction, log->node); btrfs_put_root(log); } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 266daf7d60b7..88b45af60e4f 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -16,6 +16,7 @@ #include "rcu-string.h" #include "disk-io.h" #include "block-group.h" +#include "transaction.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -1017,3 +1018,39 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) */ btrfs_free_excluded_extents(cache); } + +void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + + if (!btrfs_fs_incompat(fs_info, ZONED) || + btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) || + !list_empty(&eb->release_list)) + return; + + set_extent_buffer_dirty(eb); + set_extent_bits_nowait(&trans->dirty_pages, eb->start, + eb->start + eb->len - 1, EXTENT_DIRTY); + memzero_extent_buffer(eb, 0, eb->len); + set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags); + + spin_lock(&trans->releasing_ebs_lock); + list_add_tail(&eb->release_list, &trans->releasing_ebs); + spin_unlock(&trans->releasing_ebs_lock); + atomic_inc(&eb->refs); +} + +void btrfs_free_redirty_list(struct btrfs_transaction *trans) +{ + spin_lock(&trans->releasing_ebs_lock); + while (!list_empty(&trans->releasing_ebs)) { + struct extent_buffer *eb; + + eb = list_first_entry(&trans->releasing_ebs, + struct extent_buffer, release_list); + list_del_init(&eb->release_list); + free_extent_buffer(eb); + } + spin_unlock(&trans->releasing_ebs_lock); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index e388189b28f0..32446135e882 100644 --- a/fs/btrfs/zoned.h +++ 
b/fs/btrfs/zoned.h @@ -47,6 +47,9 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); +void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb); +void btrfs_free_redirty_list(struct btrfs_transaction *trans); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -113,6 +116,9 @@ static inline int btrfs_load_block_group_zone_info( return 0; } static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } +static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb) { } +static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Thu Oct 1 18:36:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812331 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 47C7A139F for ; Thu, 1 Oct 2020 18:39:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2C0AF208C7 for ; Thu, 1 Oct 2020 18:39:38 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="AegbhGo1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733067AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" 
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733038AbgJASik (ORCPT ); Thu, 1 Oct 2020 14:38:40 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577520; x=1633113520; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=75fAmRUO5nDwJooOw48asWhVJ81jKELWO2GhE82J9JA=; b=AegbhGo1wardOJUaZgdHecU6zYnz0UiN5Hyt5rI+uDCvsEpB/ouep3gj svJk4PbsMcBSQ05MFYbCACNFWk03xX1+zNBI60UGH/5QQZXFOQUjDPkkC itHBZ9f3i1iMgCMFZqrc7QY+b0l302ZKO9O1WNaz3JQIXy4nNZsShsfU2 vqWogBR9q09RMJ2agmPY0VYrNyZmgowjjwOLN6Izg5qtzbrgXQ9rFkAn5 RZhEwv2CZwNgAIawg3Qus62N7sEPlGByL+q9sAaa95TGOUdmNRwMzcodd nneLoKtwJeMHDsMQxzURs+IrjbRZ7Tw+lKemcueBiHsVGOBohyDeEH8kC A==; IronPort-SDR: /sxDnl23UZDdJUyNU7EwUJUDZiSwluFWZTOEE2BdUeqtAtlNCNWVQ7N3XD+moYyQEFwlC/96pe U3Q37Vv7JYSGbbtERqAyBrmzOZAgynOAV/IGrDIqQe4c+8tt53zJoTdVNoxXMpU8fQTV74vpQN 9JV6sfeVcWSaEHTUD5njYuvDoWhHvpKNq9er5BhTXj9jBQeY5Gnj0syR0vMVZ1ycB1pWaUd0wX jzPWxPJK6Wq9jY0U/2XOAbVXoQgrmOJ3SXa/oN09+YAWvgfy5Gx6zaktNiis1CWOBR2MUM/uoJ 0KE= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036810" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:25 +0800 IronPort-SDR: xPW/v05qe/jKWyKkQDt2LhoUXK7dtagmZeNr21AUQrJCVwb+OaAl7xT12E2/tab8qj3ORhkJ1Q b6tn8pwcU23w== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:22 -0700 IronPort-SDR: DjflxTwUD3s9OdhVRIcgebWMmVY0MHbzF9Iz3SJ6Lb+61ZK68J0Fo4PQ0MWj4rda0ojA00FzA7 f1qTu4vdoDIw== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:24 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 20/41] btrfs: extract page adding 
function Date: Fri, 2 Oct 2020 03:36:27 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This commit extracts the part of submit_extent_page() that adds a page to a bio into its own function. The page is added only when the bio_flags are the same, the page is contiguous with the bio, and the added page fits in the same stripe as the pages already in the bio. The condition checks are reordered to allow an early return that avoids a possibly heavy btrfs_bio_fits_in_stripe() call.

Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++------------ 1 file changed, 40 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e91c504fe973..17285048fb5a 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3012,6 +3012,43 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size) return bio; } +/** + * btrfs_bio_add_page - attempt to add a page to bio + * @bio: destination bio + * @page: page to add to the bio + * @logical: offset of the new bio or to check whether we are adding + * a contiguous page to the previous one + * @pg_offset: starting offset in the page + * @size: portion of page that we want to write + * @prev_bio_flags: flags of previous bio to see if we can merge the current one + * @bio_flags: flags of the current bio to see if we can merge them + * + * Attempt to add a page to bio considering stripe alignment etc. Return + * true if the page was successfully added. Otherwise, return false.
+ */ +bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, + unsigned int size, unsigned int pg_offset, + unsigned long prev_bio_flags, unsigned long bio_flags) +{ + sector_t sector = logical >> SECTOR_SHIFT; + bool contig; + + if (prev_bio_flags != bio_flags) + return false; + + if (prev_bio_flags & EXTENT_BIO_COMPRESSED) + contig = bio->bi_iter.bi_sector == sector; + else + contig = bio_end_sector(bio) == sector; + if (!contig) + return false; + + if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags)) + return false; + + return bio_add_page(bio, page, size, pg_offset) == size; +} + /* * @opf: bio REQ_OP_* and REQ_* flags as one value * @wbc: optional writeback control for io accounting @@ -3040,27 +3077,15 @@ static int submit_extent_page(unsigned int opf, int ret = 0; struct bio *bio; size_t page_size = min_t(size_t, size, PAGE_SIZE); - sector_t sector = offset >> 9; struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; ASSERT(bio_ret); if (*bio_ret) { - bool contig; - bool can_merge = true; - bio = *bio_ret; - if (prev_bio_flags & EXTENT_BIO_COMPRESSED) - contig = bio->bi_iter.bi_sector == sector; - else - contig = bio_end_sector(bio) == sector; - - if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags)) - can_merge = false; - - if (prev_bio_flags != bio_flags || !contig || !can_merge || - force_bio_submit || - bio_add_page(bio, page, page_size, pg_offset) < page_size) { + if (force_bio_submit || + !btrfs_bio_add_page(bio, page, offset, page_size, pg_offset, + prev_bio_flags, bio_flags)) { ret = submit_one_bio(bio, mirror_num, prev_bio_flags); if (ret < 0) { *bio_ret = NULL; From patchwork Thu Oct 1 18:36:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812255 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with 
ESMTP id ADBE2112E for ; Thu, 1 Oct 2020 18:39:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8943C208C7 for ; Thu, 1 Oct 2020 18:39:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="quBXaI4d" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733074AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24779 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730079AbgJASis (ORCPT ); Thu, 1 Oct 2020 14:38:48 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577529; x=1633113529; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7uXZKRaKpQXl1PgUFHO2cQUccxPb0XkkDSSLIJVs/vg=; b=quBXaI4dlxB3cL5UgT1wt/QMD6Ewf2wkE/uylHdRnQLMhHZBA3qiBXVK Nd+U8zduLPoAUHmyAA5tg46wvig4DXJxYQy2Ixe2xhJvIZdHqcWlUiRF2 OcBdNbN7+qpXek1yp1AgpI/PBKJOCqRHcurMnsRbvxqR/rOFAQ7h+Xd1x Ns+yXkrEK2Q0HCpvoUCMqT1tXvN/rQmxINRIzXKY3vpuUSMu4865mg/8G sLJ6mtA5ceqPIxkEi/OtNQoY0Mtg0k/5rjOzZrmY5idenJMU8cncZaKQk +pE5ihLMUqMqhjOO7nP2IEA31j5U9xFD4EAanFD9uHsFK4Ki2t+FncCQV A==; IronPort-SDR: heYcixwjIXWJfEdC9Dd1sfA6ZtXL07YrDh2tpO0xXOkkMfzjdEXZNrw8TfPXs56vgIUCmWVe6q 4/DjM8IK2y2gQev34+Pya8Sy+RACMwVq2wBCb+k2rnf2NtVizCzRqXTIm39sZ0WUO2sk/As80k JsNgNh11Bfed8EXKjG9Zt8av1vVn6798EJrL2iYlu/Stn59JWQjDftTMVTRmKdVZmWlEafXFGz wEyt+9awmKM167UgbybYwYBnpAt2yJtUGwgsmemmXZ2tQaNgtillUHZsSrB9ve0qSHTth1fbCw rGg= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036812" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:26 +0800 IronPort-SDR: 1JgJnueFmHdn5UJ8oAmwVm2N1DAmzSK0oAngoAPmuu8ms/Tm+9nObQvnWpqi0cJLufVh083mqu +6Zh6Int0O3w== 
Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:23 -0700 IronPort-SDR: +eoH4BA8kPoFE77+VleXHjwWGRjpzzwzAgkIkO50K2YtbQleC0y5kI1qG4uq1kCpMPy+gD2hNj USZg1iRth9bQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:25 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs Date: Fri, 2 Oct 2020 03:36:28 +0900 Message-Id: <6ed1ae6271e25066ff179c94e638958e1cdd23f5.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

A zoned device has its own hardware restrictions, e.g. max_zone_append_size, when using REQ_OP_ZONE_APPEND. To follow the restrictions, use bio_add_zone_append_page() instead of bio_add_page(). We need the target device to use bio_add_zone_append_page(), so this commit reads the chunk information and caches the target device in btrfs_io_bio(bio)->device.

Currently, zoned btrfs only supports the SINGLE profile. In the future, btrfs_io_bio can hold an extent_map and check the restrictions for all the devices the bio will be mapped to.
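As a rough illustration of why the zone-append path needs its own add helper, the sketch below models a bio as a byte counter with an optional per-device append limit. All names (struct bio_model, bio_model_add(), bio_model_add_page(), max_append_bytes) are hypothetical, and the single byte cap is an assumption made for brevity; the real block layer checks more than that.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a bio: only the payload size matters here. */
struct bio_model {
	unsigned int size;             /* bytes already queued in the bio */
	bool zone_append;              /* models bio_op(bio) == REQ_OP_ZONE_APPEND */
	unsigned int max_append_bytes; /* assumed device limit for zone append */
};

/*
 * Try to add len bytes. Returns the number of bytes added: len on success,
 * 0 when a zone-append bio would exceed the assumed device limit.
 */
unsigned int bio_model_add(struct bio_model *bio, unsigned int len)
{
	if (bio->zone_append &&
	    (len > bio->max_append_bytes ||
	     bio->size > bio->max_append_bytes - len))
		return 0;

	bio->size += len;
	return len;
}

/* Same "ret == size" convention as the patch: true only if everything fit. */
bool bio_model_add_page(struct bio_model *bio, unsigned int len)
{
	return bio_model_add(bio, len) == len;
}
```

With an 8192-byte limit, a zone-append bio in this model accepts two 4096-byte pages and rejects the third, at which point the caller would submit the bio and start a new one. A regular write bio is not capped by this limit.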
Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 17285048fb5a..5ee94a2ffa22 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3032,6 +3032,7 @@ bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, { sector_t sector = logical >> SECTOR_SHIFT; bool contig; + int ret; if (prev_bio_flags != bio_flags) return false; @@ -3046,7 +3047,19 @@ bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags)) return false; - return bio_add_page(bio, page, size, pg_offset) == size; + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct bio orig_bio; + + memset(&orig_bio, 0, sizeof(orig_bio)); + bio_copy_dev(&orig_bio, bio); + bio_set_dev(bio, btrfs_io_bio(bio)->device->bdev); + ret = bio_add_zone_append_page(bio, page, size, pg_offset); + bio_copy_dev(bio, &orig_bio); + } else { + ret = bio_add_page(bio, page, size, pg_offset); + } + + return ret == size; } /* @@ -3077,7 +3090,9 @@ static int submit_extent_page(unsigned int opf, int ret = 0; struct bio *bio; size_t page_size = min_t(size_t, size, PAGE_SIZE); - struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; + struct btrfs_inode *inode = BTRFS_I(page->mapping->host); + struct extent_io_tree *tree = &inode->io_tree; + struct btrfs_fs_info *fs_info = inode->root->fs_info; ASSERT(bio_ret); @@ -3108,11 +3123,27 @@ static int submit_extent_page(unsigned int opf, if (wbc) { struct block_device *bdev; - bdev = BTRFS_I(page->mapping->host)->root->fs_info->fs_devices->latest_bdev; + bdev = fs_info->fs_devices->latest_bdev; bio_set_dev(bio, bdev); wbc_init_bio(wbc, bio); wbc_account_cgroup_owner(wbc, page, page_size); } + if (btrfs_fs_incompat(fs_info, ZONED) && + bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct extent_map *em; + struct map_lookup *map; + + em = 
btrfs_get_chunk_map(fs_info, offset, page_size); + if (IS_ERR(em)) + return PTR_ERR(em); + + map = em->map_lookup; + /* only support SINGLE profile for now */ + ASSERT(map->num_stripes == 1); + btrfs_io_bio(bio)->device = map->stripes[0].dev; + + free_extent_map(em); + } *bio_ret = bio; From patchwork Thu Oct 1 18:36:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812329 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3DA9B112E for ; Thu, 1 Oct 2020 18:39:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2314A208C7 for ; Thu, 1 Oct 2020 18:39:37 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="G1+7h1U5" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733184AbgJASjf (ORCPT ); Thu, 1 Oct 2020 14:39:35 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730073AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577541; x=1633113541; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=/dS61ZHd3vP9uYD0p4gcGPRjl47fX+Beo6W1L/zt6wM=; b=G1+7h1U5paAqYDSHBUI1n+Yp/klAStViBttOZMo83RwIhW/jYWcp1uxR yq96L/aa8sebwWEim5nNgIi20tDr6glAMtB4wIDesJYkn6TpaRD058OGj VBscy8+fQpaEzUZAZwUbeGHhgD0aSPG8z11XQOj2Nxh4F9/IsKDI39iC5 q1cT8WSKsOlgWuNf69dCB5ThXOlSLTqOexukQZnZQ0kqPsSrsA5ocRQMl rSDRBzN2KPJwmo+L0849gWtsvPvmrRi6F96/+8IdNgv1Qyak5KQ0C6yC5 +D/K5FBosJboLaVI0SDbCfyWP7j7udUE5loNcIKfKrJSxz7bxuSRYhKgr Q==; 
IronPort-SDR: LEcm32Px8HFUI1Jxjig/RkQ3fSqH4BGOAaTrifwlFtjfVokskAuMs4yUBn84IGJKQWquyP3NmU FSnY1ZSGflA0tkQR+J//CONVcNhUayotesL+Ug4IOrJJ22BWsFTvVpeimMLB9+TJuTuJR+aU8c uFO7UJZmFhXmfaRdmWRyAP3ESCN5vdAyrlwE8XpWHu8scd6o79QpztiVVQg/X1D00ftyZkZh8V /s3MywjIAqY4mFXxbLmC/gFEhvUpKcO9GtSihR7kxe52DX58eyOClrfdm9XEBhWVsElLONtdb2 rYM= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036813" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:27 +0800 IronPort-SDR: xiS/rxJy5sWThY7wMU8PCTvX+rdM4V0S7UJOCyx8T7eKI/P8HsBkHEnZaidY/xI+3t6KcuOVSz 8DJ/IY0By/tA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:24 -0700 IronPort-SDR: iFL/yOvYoy7BUSIjoVqLHnOVori+z8LF7b87sA8ugcHfpHOtDIHgchWRbiArK+QO/leCz9BJJ8 LfaWkLSBUrxw== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:27 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing Date: Fri, 2 Oct 2020 03:36:29 +0900 Message-Id: <27a2688a096eda2be0fa491a3cd15837912f4dd1.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

ZONED btrfs uses REQ_OP_ZONE_APPEND for bios going to the actual devices. Make btrfs_end_bio() and btrfs_op(), which handle those bios, aware of it.
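The classification this patch is after can be sketched in userspace as follows. The enum names and values are hypothetical and do not match the kernel's REQ_OP_* or BTRFS_MAP_* constants; the point is only that a zone-append request is grouped with ordinary writes, both for mapping and for error accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request opcodes; the values are not the kernel's. */
enum req_op_model {
	OP_MODEL_READ,
	OP_MODEL_WRITE,
	OP_MODEL_DISCARD,
	OP_MODEL_ZONE_APPEND,
};

/* Hypothetical mapping classes, modelled on enum btrfs_map_op. */
enum map_op_model {
	MAP_MODEL_READ,
	MAP_MODEL_WRITE,
	MAP_MODEL_DISCARD,
};

/* Zone append shares the write mapping; anything else falls back to read. */
enum map_op_model map_op_for(enum req_op_model op)
{
	switch (op) {
	case OP_MODEL_DISCARD:
		return MAP_MODEL_DISCARD;
	case OP_MODEL_WRITE:
	case OP_MODEL_ZONE_APPEND:
		return MAP_MODEL_WRITE;
	default:
		return MAP_MODEL_READ;
	}
}

/* A failed zone append is accounted as a write error, like a plain write. */
bool counts_as_write_error(enum req_op_model op)
{
	return op == OP_MODEL_WRITE || op == OP_MODEL_ZONE_APPEND;
}
```

Without the extra case, a zone-append bio in this model would fall through to the read class, which is the misclassification the patch avoids.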
Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 3 ++- fs/btrfs/volumes.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c22ea7f0551f..44ef7b2fb46c 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6447,7 +6447,8 @@ static void btrfs_end_bio(struct bio *bio) struct btrfs_device *dev = btrfs_io_bio(bio)->device; ASSERT(dev->bdev); - if (bio_op(bio) == REQ_OP_WRITE) + if (bio_op(bio) == REQ_OP_WRITE || + bio_op(bio) == REQ_OP_ZONE_APPEND) btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); else if (!(bio->bi_opf & REQ_RAHEAD)) diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index c01dd5e40ec8..f8fc3debd5e0 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -410,6 +410,7 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio) case REQ_OP_DISCARD: return BTRFS_MAP_DISCARD; case REQ_OP_WRITE: + case REQ_OP_ZONE_APPEND: return BTRFS_MAP_WRITE; default: WARN_ON_ONCE(1); From patchwork Thu Oct 1 18:36:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812257 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 90076139F for ; Thu, 1 Oct 2020 18:39:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 72B8C208C7 for ; Thu, 1 Oct 2020 18:39:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="PXmA4SoP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733098AbgJASjD (ORCPT ); Thu, 1 Oct 2020 14:39:03 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S1732431AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577541; x=1633113541; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=8/xQVD50/d8enj2n99W6cqIQvZ/32vR2MauPROXbtf0=; b=PXmA4SoP+jqynUCiQsJkkni0cfBC2aaREGNsa1qWYjBlIU5xHLps8Esc uyF8JWyCSW+8X+8R7MVPOQGpORc4LScDSWE7jLTnfmthW7xgsjNqdGbBJ azuL/4trRc9KWcIRBZQA8go4wsmxN8VG558RKQ8EbgLwVE5jXH6mKqiWu vntHoJl7vhmscJ0r1iZK1PfQmMSgCXtW9+mpOtQSQ8NuwGRpQnEAN5ksI GoazID8otYAhG0LQu0ba5jNgbVYFQA2gTCjZ9MoF3gx9nTJVjyl2SAgup So8vGzmHz9nPNt4IoCvzQs1cArqZSiOJfNYaedw9nakU8AFZpk/HNUAKV w==; IronPort-SDR: PlzSZ4+CL92Yeu4bjZN7rI2kbJoKUIzqVBI+0JGNtExwCuhXKLD7BFJqGg0wIYahuE8IZ1qYAB GwNLdH0QCSB190SvC9B+8is1H7ng+ZXr142Q5jYXscxPI+Xe+CdsO6XCC5XE2mriub8iUv+WHt WKUT+OVzhIQUDwPQF1du+8dShxaer9SC1KhMcwoGgBsCWJdWsDht/tyGzp5D9z0YFtuQeT1bdl q9mZA41aAX2y/A8u+jcRyyhF5NQN3VwhWHIFYbEjxju2JxwPtuMYxqZY3ZchpGTXB7hkUzJASY YQ0= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036816" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:28 +0800 IronPort-SDR: gS+wUv/+2L0lcO+vfZKVGIXQoM6K+QqbTdTFh/+ykPbisqZK2gDPCf0W6jnSkSlJhosJtIzBdb NWVhD55XRhrg== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:25 -0700 IronPort-SDR: c4lfygpGO0mILy/fNtyFqlQWnbzT1LG+0+kvmhlLcLWmmEGuEB89gZUzKcaH0glTztr6rBuHND MVQBa1fg+eHA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:28 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 23/41] btrfs: split ordered extent when bio is sent Date: Fri, 2 Oct 2020 03:36:30 +0900 
Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

The device decides where a ZONE_APPEND write IO actually lands. Thus, we cannot guarantee that two bios are written contiguously on the device. So, we need to follow the "one bio == one ordered extent" rule to ensure that the region of an ordered extent maps to a contiguous region on the disk.

This commit implements splitting of an ordered extent and an extent map when a bio is submitted, to follow the rule.

Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 87 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/ordered-data.c | 73 ++++++++++++++++++++++++++++++++++ fs/btrfs/ordered-data.h | 2 + 3 files changed, 162 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 36efed0a24de..4bc975eafbd8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2158,6 +2158,87 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio, return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); } +int extract_ordered_extent(struct inode *inode, struct bio *bio) +{ + struct btrfs_ordered_extent *ordered; + struct extent_map *em = NULL, *em_new = NULL; + struct page *page = bio_first_bvec_all(bio)->bv_page; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT; + u64 len = bio->bi_iter.bi_size; + u64 end = start + len; + u64 ordered_end; + u64 pre, post; + int ret = 0; + + ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), + page_offset(page)); + if (WARN_ON_ONCE(!ordered)) + return -EIO; + + /* no need to split */ + if (ordered->disk_num_bytes == len) + goto out; + + /* cannot split once end_bio'd ordered extent */ + if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) { + ret = -EINVAL; + goto out; + } + + /* we cannot split compressed ordered extent */ + if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) { + ret =
-EINVAL; + goto out; + } + + /* cannot split an ordered extent that has waiters */ + if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) { + ret = -EINVAL; + goto out; + } + + ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; + /* bio must be in one ordered extent */ + if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) { + ret = -EINVAL; + goto out; + } + + /* checksum list should be empty */ + if (WARN_ON_ONCE(!list_empty(&ordered->list))) { + ret = -EINVAL; + goto out; + } + + pre = start - ordered->disk_bytenr; + post = ordered_end - end; + + btrfs_split_ordered_extent(ordered, pre, post); + + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, ordered->file_offset, len); + if (!em) { + read_unlock(&em_tree->lock); + ret = -EIO; + goto out; + } + read_unlock(&em_tree->lock); + + ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)); + em_new = create_io_em(BTRFS_I(inode), em->start + pre, len, + em->start + pre, em->block_start + pre, len, + len, len, BTRFS_COMPRESS_NONE, + BTRFS_ORDERED_REGULAR); + free_extent_map(em_new); + +out: + free_extent_map(em); + btrfs_put_ordered_extent(ordered); + + return ret; +} + /* * extent_io.c submission hook. This does the right thing for csum calculation * on write, or reading the csums from the tree before a read.
@@ -2192,6 +2273,12 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, if (btrfs_is_free_space_inode(BTRFS_I(inode))) metadata = BTRFS_WQ_ENDIO_FREE_SPACE; + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + ret = extract_ordered_extent(inode, bio); + if (ret) + goto out; + } + if (bio_op(bio) != REQ_OP_WRITE) { ret = btrfs_bio_wq_end_io(fs_info, bio, metadata); if (ret) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 87bac9ecdf4c..7cd22cb07f26 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -943,6 +943,79 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, } } +static void clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, + u64 len) +{ + struct inode *inode = ordered->inode; + u64 file_offset = ordered->file_offset + pos; + u64 disk_bytenr = ordered->disk_bytenr + pos; + u64 num_bytes = len; + u64 disk_num_bytes = len; + int type; + unsigned long flags_masked = + ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT); + int compress_type = ordered->compress_type; + unsigned long weight; + + weight = hweight_long(flags_masked); + WARN_ON_ONCE(weight > 1); + if (!weight) + type = 0; + else + type = __ffs(flags_masked); + + ASSERT(!test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)); + if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) { + WARN_ON_ONCE(1); + btrfs_add_ordered_extent_compress(BTRFS_I(inode), file_offset, + disk_bytenr, num_bytes, + disk_num_bytes, type, + compress_type); + } else { + btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, + disk_bytenr, num_bytes, disk_num_bytes, + type); + } +} + +void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, + u64 post) +{ + struct inode *inode = ordered->inode; + struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree; + struct rb_node *node; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + + spin_lock_irq(&tree->lock); + /* remove from tree once */ + node = 
&ordered->rb_node; + rb_erase(node, &tree->tree); + RB_CLEAR_NODE(node); + if (tree->last == node) + tree->last = NULL; + + ordered->file_offset += pre; + ordered->disk_bytenr += pre; + ordered->num_bytes -= (pre + post); + ordered->disk_num_bytes -= (pre + post); + ordered->bytes_left -= (pre + post); + + /* re-insert the node */ + node = tree_insert(&tree->tree, ordered->file_offset, + &ordered->rb_node); + if (node) + btrfs_panic(fs_info, -EEXIST, + "inconsistency in ordered tree at offset %llu", + ordered->file_offset); + + spin_unlock_irq(&tree->lock); + + if (pre) + clone_ordered_extent(ordered, 0, pre); + if (post) + clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post); +} + int __init ordered_data_init(void) { btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent", diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index c3a2325e64a4..e346b03bd66a 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -193,6 +193,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr, void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, struct extent_state **cached_state); +void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, + u64 post); int __init ordered_data_init(void); void __cold ordered_data_exit(void); From patchwork Thu Oct 1 18:36:31 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812291 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 30B1C112E for ; Thu, 1 Oct 2020 18:39:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 12FC7207DE for ; Thu, 1 Oct 2020 18:39:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature 
verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="oX0heP7v" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733094AbgJASjD (ORCPT ); Thu, 1 Oct 2020 14:39:03 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24779 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733057AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577541; x=1633113541; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=yI/KJxjAm8LeiXiMXlzP2TfXUDSKNUEh2w5iz0S3Gls=; b=oX0heP7v2+PReOi3p2psxg4u+2pCLIvpzwJnJOyjcb8H65HtHgciRqV+ wZj/64UOU7BEi2kNBV+C9LpukGWvprCstl5OW0Vx+8StZ46O3rhFqcG+d aIj4uXa/qbk0uHZJNEIQowGv0W4WrAfkSf+A+2g5ngovWkUb5O1+mPVcH 2gPUrT/9IzZBKGWObvqLigrBDoYi6Wf0iOBqnRitAg7R9q3MlUNdpb4Xq h/7grm3sSqaV6ggo+ed1omoAl/rCQHxjMp742K7KzoWvS4iBUlYcKQMXt uSpxSIFveK7HLdbvgJf4S4FD/lmkn60y1NtcNJoNBK2MUVXAvSS8K6S/9 w==; IronPort-SDR: ArGNP25jA6J2ZFr5Z070WqpHjIqbp9MAJaFtfyPUuOd9EDX8TjolPXZOEW6G4Iotq6nd+Dptfc E4zLP4IX2HSkyp5f+Wh+FLXgnkGvUgv/R4/SLgrxz0JeVSRbV24CwT3gHo/8MllYp4/LoAj0g5 dMmyq6RQbVy0YdeRi63c8sQw3Tl7SZVtnmGnoA4JGM2BJoAt9kFCtWBE2gaYwXvpuUURVQxSu8 FfdmXD3ISt9hE0fMWBwW7swGmslyfgX7KIGRn/yYMydafVt4akaC88rEHLaJFsz4XOsztipWTe llU= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036817" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:30 +0800 IronPort-SDR: 88v3Oipava5ZdtcbH6Ob05O2Ge+Y79PP3huOW1s5bocWCY3riH8LxVRoNdd+2ORLPs3LSb5z60 69/41yEj/RUg== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:26 -0700 IronPort-SDR: Yi+gMHXjJNq0ZQjOo14n2BGU19IXkQpa2Jsw1ED9OESpitmu0V14bHW9QvfKzlilyAef3shTPy ZBuZJxeVU5aQ== WDCIronportException: Internal 
Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:29 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 24/41] btrfs: extend btrfs_rmap_block for specifying a device Date: Fri, 2 Oct 2020 03:36:31 +0900 Message-Id: <8bc03463cdb421fa62230aa4d8de5f93b5d502e2.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org btrfs_rmap_block currently reverse-maps the physical address on all devices to logical addresses. This commit extends the function to match only a specified device. You can still query all devices by specifying NULL as the device. This commit also exports the function for later use. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 23 ++++++++++++++++++----- fs/btrfs/block-group.h | 3 +++ 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 2241d04ad4aa..d39fa80d3d90 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1655,9 +1655,9 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) * Used primarily to exclude those portions of a block group that contain super * block copies. 
*/ -EXPORT_FOR_TESTS -int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, - u64 physical, u64 **logical, int *naddrs, int *stripe_len) +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len) { struct extent_map *em; struct map_lookup *map; @@ -1675,6 +1675,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, map = em->map_lookup; data_stripe_length = em->orig_block_len; io_stripe_size = map->stripe_len; + chunk_start = em->start; /* For RAID5/6 adjust to a full IO stripe length */ if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) @@ -1689,14 +1690,18 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, for (i = 0; i < map->num_stripes; i++) { bool already_inserted = false; u64 stripe_nr; + u64 offset; int j; if (!in_range(physical, map->stripes[i].physical, data_stripe_length)) continue; + if (bdev && map->stripes[i].dev->bdev != bdev) + continue; + stripe_nr = physical - map->stripes[i].physical; - stripe_nr = div64_u64(stripe_nr, map->stripe_len); + stripe_nr = div64_u64_rem(stripe_nr, map->stripe_len, &offset); if (map->type & BTRFS_BLOCK_GROUP_RAID10) { stripe_nr = stripe_nr * map->num_stripes + i; @@ -1710,7 +1715,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, * instead of map->stripe_len */ - bytenr = chunk_start + stripe_nr * io_stripe_size; + bytenr = chunk_start + stripe_nr * io_stripe_size + offset; /* Ensure we don't add duplicate addresses */ for (j = 0; j < nr; j++) { @@ -1732,6 +1737,14 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, return ret; } +EXPORT_FOR_TESTS +int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + u64 physical, u64 **logical, int *naddrs, int *stripe_len) +{ + return __btrfs_rmap_block(fs_info, chunk_start, NULL, physical, logical, + naddrs, stripe_len); +} + static int exclude_super_stripes(struct 
btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 5be47f4bfea7..401e9bcefaec 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -275,6 +275,9 @@ void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type); u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); int btrfs_free_block_groups(struct btrfs_fs_info *info); +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len); static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info) { From patchwork Thu Oct 1 18:36:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812333 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D0F52139F for ; Thu, 1 Oct 2020 18:39:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B0C3C207DE for ; Thu, 1 Oct 2020 18:39:40 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="QHk+3SM3" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733189AbgJASjh (ORCPT ); Thu, 1 Oct 2020 14:39:37 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732380AbgJASjB (ORCPT ); Thu, 1 Oct 2020 14:39:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577541; x=1633113541; 
h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=TqYRZkjI+eHOJGvn2Y+h5fswysSdJ99elTXem5Z7Fr4=; b=QHk+3SM3PoUmiszRXSuMc2tHs0/z66IOBJHzQ8GtGxVsLz1ch+Zqc/nB xwJc762aea+TlJWbztRPNeGgnyG1FPbMOL1kXJHsHUEIuveHCoxfhIpB5 /luuXiWhmcyJU1BQdZVb+8BnSV0pA2+yWgS84exQ6MLjdvJ8pGrSAlKn4 bbJQriBJliAY0bjWmnmex+d4+gO8/GkVM/U+HXrj/G8Zvuxn900G/oWBM PwZJrcNLQER++35qN+c57/Q3q+7O9mOgpQ3aXjpdMcKTyIV7ky+JemUrw SotAs5vNNB6DLdrHMz9tmmPCqHK8+gia5OFD2c69XavATOhex08yteynl Q==; IronPort-SDR: 6pxq+0Z+Klu518EDboleO9catZThsp5vTQ3KXU58si867ZdNc+ZJe3X6I9j4mus/5r/lb2M+9i y+uDkT44DbkTDwxru/j9dxZsYpjyQX9oEuZWv/9UW4FtULqYnIU3zrOSRyI3stUVNDl3K9GDOO DZLKkoW1RTepZMR33yilt8lA27FQ9uZH5Cj35RSdQof+P2qx3X8k+FSmb1m8KvVZXJsKihpCG3 +MdMw3QxLXs/oS20PlHNQnQJZMMv3u/RBFBkJ9t4xYel3K1egeqvYTNb+4bh6KsS5q95ljscBw vqs= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036818" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:31 +0800 IronPort-SDR: v/XmTWKam1q7rJQWGA4IM3DOdM4lMinxeiOczV5KtqXNe4YFZGpHFFCjz4GNPWq6SsSFBbV4Nh uVToN7KLYzeg== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:27 -0700 IronPort-SDR: vwCR39jI9LF2+/E8z7NMEyGOCcPB37n9eaMYhFGz87akTpMfsQGManYSJpMea3BA98xKIdhO/7 9JGZFPRd3cBg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:30 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs Date: Fri, 2 Oct 2020 03:36:32 +0900 Message-Id: <1727a2fbaa17db7c5d3447d2f547b98cb5f9bf32.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: 
X-Mailing-List: linux-fsdevel@vger.kernel.org This commit enables zone append writing for zoned btrfs. Three parts are necessary to enable it. First, it modifies the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjusts the bi_sector to point to the beginning of the zone. Second, it records the returned physical address (and disk/partno) in the ordered extent in end_bio_extent_writepage(). Finally, it rewrites the logical addresses of the extent mapping and checksum data according to the physical address (using __btrfs_rmap_block). If the returned address matches the originally allocated address, we can skip the rewriting process. [Johannes] fixed bvec handling Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 12 +++++++- fs/btrfs/inode.c | 6 +++- fs/btrfs/ordered-data.c | 3 ++ fs/btrfs/ordered-data.h | 4 +++ fs/btrfs/volumes.c | 9 ++++++ fs/btrfs/zoned.c | 68 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 9 ++++++ 7 files changed, 109 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 5ee94a2ffa22..bbcdc8dfbd45 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2743,6 +2743,7 @@ static void end_bio_extent_writepage(struct bio *bio) u64 start; u64 end; struct bvec_iter_all iter_all; + bool first_bvec = true; ASSERT(!bio_flagged(bio, BIO_CLONED)); bio_for_each_segment_all(bvec, bio, iter_all) { @@ -2769,6 +2770,11 @@ static void end_bio_extent_writepage(struct bio *bio) start = page_offset(page); end = start + bvec->bv_offset + bvec->bv_len - 1; + if (first_bvec) { + btrfs_record_physical_zoned(inode, start, bio); + first_bvec = false; + } + end_extent_writepage(page, error, start, end); end_page_writeback(page); } @@ -3531,6 +3537,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, size_t blocksize; int ret = 0; int nr = 0; + int opf = REQ_OP_WRITE; const unsigned int write_flags = wbc_to_write_flags(wbc); bool compressed; @@ -3543,6 +3550,9 @@ static 
noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, return 1; } + if (btrfs_fs_incompat(inode->root->fs_info, ZONED)) + opf = REQ_OP_ZONE_APPEND; + /* * we don't want to touch the inode after unlocking the page, * so we update the mapping writeback index now @@ -3603,7 +3613,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, page->index, cur, end); } - ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, + ret = submit_extent_page(opf | write_flags, wbc, page, offset, iosize, pg_offset, &epd->bio, end_bio_extent_writepage, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4bc975eafbd8..b5fdc93b319f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -51,6 +51,7 @@ #include "delalloc-space.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" struct btrfs_iget_args { u64 ino; @@ -2279,7 +2280,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, goto out; } - if (bio_op(bio) != REQ_OP_WRITE) { + if (btrfs_op(bio) != BTRFS_MAP_WRITE) { ret = btrfs_bio_wq_end_io(fs_info, bio, metadata); if (ret) goto out; @@ -2674,6 +2675,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool clear_reserved_extent = true; unsigned int clear_bits; + if (ordered_extent->disk) + btrfs_rewrite_logical_zoned(ordered_extent); + start = ordered_extent->file_offset; end = start + ordered_extent->num_bytes - 1; diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 7cd22cb07f26..b32cdbfd408d 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -199,6 +199,9 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset entry->compress_type = compress_type; entry->truncated_len = (u64)-1; entry->qgroup_rsv = ret; + entry->physical = (u64)-1; + entry->disk = NULL; + entry->partno = (u8)-1; if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, &entry->flags); diff --git 
a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index e346b03bd66a..1d5a8dc00373 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -127,6 +127,10 @@ struct btrfs_ordered_extent { struct completion completion; struct btrfs_work flush_work; struct list_head work_list; + + u64 physical; + struct gendisk *disk; + u8 partno; }; /* diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 44ef7b2fb46c..924ba96dc8fa 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6500,6 +6500,15 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio, btrfs_io_bio(bio)->device = dev; bio->bi_end_io = btrfs_end_bio; bio->bi_iter.bi_sector = physical >> 9; + /* + * For zone append writing, bi_sector must point the beginning of the + * zone + */ + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + u64 zone_start = round_down(physical, fs_info->zone_size); + + bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT; + } btrfs_debug_in_rcu(fs_info, "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u", bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector, diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 88b45af60e4f..9e1056e2c2c8 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1054,3 +1054,71 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans) } spin_unlock(&trans->releasing_ebs_lock); } + +void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, + struct bio *bio) +{ + struct btrfs_ordered_extent *ordered; + u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT; + + if (bio_op(bio) != REQ_OP_ZONE_APPEND) + return; + + ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset); + if (WARN_ON(!ordered)) + return; + + ordered->physical = physical; + ordered->disk = bio->bi_disk; + ordered->partno = bio->bi_partno; + + btrfs_put_ordered_extent(ordered); +} + +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) +{ + struct extent_map_tree *em_tree; + struct 
extent_map *em; + struct inode *inode = ordered->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_sum *sum; + struct block_device *bdev; + u64 orig_logical = ordered->disk_bytenr; + u64 *logical = NULL; + int nr, stripe_len; + + bdev = bdget_disk(ordered->disk, ordered->partno); + if (WARN_ON(!bdev)) + return; + + if (WARN_ON(__btrfs_rmap_block(fs_info, orig_logical, bdev, + ordered->physical, &logical, &nr, + &stripe_len))) + goto out; + + WARN_ON(nr != 1); + + if (orig_logical == *logical) + goto out; + + ordered->disk_bytenr = *logical; + + em_tree = &BTRFS_I(inode)->extent_tree; + write_lock(&em_tree->lock); + em = search_extent_mapping(em_tree, ordered->file_offset, + ordered->num_bytes); + em->block_start = *logical; + free_extent_map(em); + write_unlock(&em_tree->lock); + + list_for_each_entry(sum, &ordered->list, list) { + if (*logical < orig_logical) + sum->bytenr -= orig_logical - *logical; + else + sum->bytenr += *logical - orig_logical; + } + +out: + kfree(logical); + bdput(bdev); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 32446135e882..f6263a893a07 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -50,6 +50,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb); void btrfs_free_redirty_list(struct btrfs_transaction *trans); +void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, + struct bio *bio); +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -119,6 +122,12 @@ static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb) { } static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } 
+static inline void btrfs_record_physical_zoned(struct inode *inode, + u64 file_offset, struct bio *bio) +{ +} +static inline void +btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Thu Oct 1 18:36:33 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812323 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 19063112E for ; Thu, 1 Oct 2020 18:39:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EC27D208C7 for ; Thu, 1 Oct 2020 18:39:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="N4yTsJZc" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733171AbgJASj2 (ORCPT ); Thu, 1 Oct 2020 14:39:28 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733038AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577542; x=1633113542; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=SUpU5ae8xi0dcVpozR41vTzLShl0EEP65BEoTYKUtaI=; b=N4yTsJZcTLxJZZimQeFrVgB+o4Mbyt4sl3rVA9SF/ofMnmJJSVvMP6bT 8kIN0ONfmNGPl/MwT40cjgF7XuWxGzyWutaAaZOPHdj4SNOGopFxdkRVr f1BEb/ZJk0NMvSsJ2Mqu7ti3TZu0MSKBxPX5H2gfg1a6aUOxY6DFSnCAL 0z6o4rB0t/88dk+zU+nf71OTsGMzJoPUgE/Vp4EvUWyZfcln7uxeuf3rl JaA1Cv8gjk5XoiqtgB10qCXrss1amklEYBwhjseGIxw9x+Xu18YLN5ceW CC9gTlmxh3BGw1ISEoUZHU0cmaF6Xw/ns07R11k9ABZNETaSx+Xfn2aIm w==; 
IronPort-SDR: H/dA3BJXYjIE+4if2wcGA1j9xREa0ICsDJMwSP76zO2gzFPSwPz4X33hrY12zw3tc/PNVl+76L 7TZjvBVyoo8YvupqlvNpWXbYyjsU2igsFFHPOEUEgqRzo0EBEUmvXWD7IULXuGlPhT1Gy2M0Nw 2u17257irUkQYGg0Bd9T+rWHE5Z5aK1a7z7Ba7ZPFynOZx2mL3YjeEJXKuHGMJ642ieNMxeonY yfnadRypD6XVCNC/FPpx+08L5sbvWUAcNwGU5P1Uw4MdsMG+RIPM3LpBbOtenN+fSEvp/VQ+SF 734= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036819" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:32 +0800 IronPort-SDR: TvYmX/LlNkHrRWeSfCz23MbQPdzlRM+2C/sil269fHOtDep5YMHPDwtdn43RVtq97HHpryF5HO kY9TiNp3IMMA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:28 -0700 IronPort-SDR: cwrxMZXafPIYIbL8x+MToL9/9O4Geb+o3b4eHB7ggPbx1goK6a+UeCYC9w5IdVXCaY77j9KfPc N7+ITzmnfa1w== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:31 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 26/41] btrfs: enable zone append writing for direct IO Date: Fri, 2 Oct 2020 03:36:33 +0900 Message-Id: <6f42df516fd9a7a8fc5dddbe73a048749e12ebea.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit enables zone append writing as same as in buffered write. 
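For illustration only, here is a standalone userspace sketch (not part of this patch) of the length clamp that the direct IO write path needs so that one bio never exceeds the zone append limit. The helper name dio_clamp_zone_append() is hypothetical; the patch itself uses min_t() on fs_info->max_zone_append_size.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the clamp applied to a direct write.
 * A limit of 0 is treated as "no zone append limit" (non-zoned case),
 * so the requested length is returned unchanged.
 */
static uint64_t dio_clamp_zone_append(uint64_t length,
				      uint64_t max_zone_append_size)
{
	if (max_zone_append_size && length > max_zone_append_size)
		return max_zone_append_size;
	return length;
}
```

A write longer than the limit is then issued as several smaller mappings, each of which can be sent as one zone append bio.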
Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b5fdc93b319f..37d85c062f3a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7422,6 +7422,11 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, current->journal_info == BTRFS_DIO_SYNC_STUB); current->journal_info = NULL; + if (write && fs_info->max_zone_append_size) { + length = min_t(u64, length, fs_info->max_zone_append_size); + len = length; + } + if (!write) len = min_t(u64, len, fs_info->sectorsize); @@ -7777,6 +7782,8 @@ static void btrfs_end_dio_bio(struct bio *bio) if (err) dip->dio_bio->bi_status = err; + btrfs_record_physical_zoned(dip->inode, dip->logical_offset, bio); + bio_put(bio); btrfs_dio_private_put(dip); } @@ -7786,7 +7793,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_dio_private *dip = bio->bi_private; - bool write = bio_op(bio) == REQ_OP_WRITE; + bool write = bio_op(bio) == REQ_OP_WRITE || + bio_op(bio) == REQ_OP_ZONE_APPEND; blk_status_t ret; /* Check btrfs_submit_bio_hook() for rules about async submit. 
*/ @@ -7931,6 +7939,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, bio->bi_end_io = btrfs_end_dio_bio; btrfs_io_bio(bio)->logical = file_offset; + if (write && btrfs_fs_incompat(fs_info, ZONED) && + fs_info->max_zone_append_size) { + bio->bi_opf &= ~REQ_OP_MASK; + bio->bi_opf |= REQ_OP_ZONE_APPEND; + } + ASSERT(submit_len >= clone_len); submit_len -= clone_len; From patchwork Thu Oct 1 18:36:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812319 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B7BAD112E for ; Thu, 1 Oct 2020 18:39:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 96E37207DE for ; Thu, 1 Oct 2020 18:39:30 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="a0P/R2rD" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733177AbgJASj3 (ORCPT ); Thu, 1 Oct 2020 14:39:29 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24779 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733072AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577542; x=1633113542; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=smnfPZ2PHSaLwMiHjwkNcdokXv96GnZf7zgeO0D351k=; b=a0P/R2rDPlAwFOjLkpC8TlCdY5R4XPElCHZSnUzS+YHyJ7FWxJAN3BYv OdjG1Wd+Kt+7s620AOfSi4s9w9mD1pii50BtGO/JjA/ONCOHGhs5ykULg A/XabMPT4Z6AdArlLxlsf3VOUALncpbw0r2xBGsolBAmvuWnGEeB7lL6h AvDiXnF7GmIeaQ4B9YiKduRIXp+IOTKngD75lt2PBM9GebZuWmanYDo42 
h7FwIyDoLr4I+lQh0xPJ3aQfOOin7UBVWKQ9GTdhUeNg1F902ryhf2Ha/ 6/6XHek4BIV56Ufx/voU43q9ll09xS5x7aew5/vybaVn97gFBpG/aKAR7 Q==; IronPort-SDR: BZKqcrLb/7Eu0KdHfq3djlX7gaj0/HYEQSzF7iAVWMzuM0eIInLhrrtOp9ElCpoKpK7iJrHRXK XXu+j5dLKmvpqGi8NalEgiMg9js47GhCPSExHv2kGS6FGQbUJEHIt/mSZ034nNFCMWrVwYDQT2 pg7/oSxQJD1wuN+Jq2WwdDKlDRe/jkgrMUGFSauNVTSXbympaPOkMX5hjMqeMB+JQlALrm1wsW 1B14oN8UflS+8cK7E+0ZZrtD9CXFLiEFdG4H5S3nVadx/tRqumBHqI9J864VmXRzrh9rwfLUON RL0= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036821" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:33 +0800 IronPort-SDR: yCMFQnJLSlp3zLn4VbYWTmL+9lC5b2B/ZJiwkyvAikGFx6iPbRe5sErKabgWp+9HdKL7AOO91f yGcLelSjFezQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:30 -0700 IronPort-SDR: SSMVERy9fPXzbSNZXrF/YmA0F3j+hHugr7nc8MZbOeoJzd1TACMeDxo2sym+WsZPJm9vl8E1F/ C+l3RsjiizOQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:32 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 27/41] btrfs: introduce dedicated data write path for ZONED mode Date: Fri, 2 Oct 2020 03:36:34 +0900 Message-Id: <2a95e45089e9f9b1425e8fbe0acac966ee93d07b.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org If more than one IO is issued for one file extent, these IO can be written in a separate region on a device. Since we cannot map one file extent to such a separate area, we need to follow the "one IO == one ordered extent" rule. 
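The rule can be illustrated with a small userspace model (a sketch under assumptions, not kernel code; struct zone, zone_append() and contiguous() are hypothetical names). With zone append the device chooses the write location, so two IOs that belong to the same file extent may end up separated by an unrelated IO:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a sequential zone: data lands at the write pointer. */
struct zone {
	uint64_t start;	/* physical start of the zone */
	uint64_t wp;	/* write pointer, as an offset from start */
};

/* Model of a zone append: returns where the data was actually written. */
static uint64_t zone_append(struct zone *z, uint64_t len)
{
	uint64_t pos = z->start + z->wp;

	z->wp += len;
	return pos;
}

/* True if the second IO landed directly behind the first one. */
static bool contiguous(uint64_t first_pos, uint64_t first_len,
		       uint64_t second_pos)
{
	return first_pos + first_len == second_pos;
}
```

If extent A is written as two appends and another append is queued in between, contiguous() is false for the two halves of A, and a single (start, length) pair can no longer describe the extent.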
The normal (buffered, uncompressed, not pre-allocated) write path (= cow_file_range()) sometimes does not follow the rule. It can write only a part of an ordered extent when it is given a region to write, e.g., when called from fdatasync(). This commit introduces a dedicated (uncompressed buffered) data write path for ZONED mode. This write path CoWs the region and writes the whole region at once. Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 37d85c062f3a..46eba6c7792b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1350,6 +1350,29 @@ static int cow_file_range_async(struct btrfs_inode *inode, return 0; } +static noinline int run_delalloc_zoned(struct btrfs_inode *inode, + struct page *locked_page, u64 start, + u64 end, int *page_started, + unsigned long *nr_written) +{ + int ret; + + ret = cow_file_range(inode, locked_page, start, end, + page_started, nr_written, 0); + if (ret) + return ret; + + if (*page_started) + return 0; + + __set_page_dirty_nobuffers(locked_page); + account_page_redirty(locked_page); + extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL); + *page_started = 1; + + return 0; +} + static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes) { @@ -1820,17 +1843,24 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page { int ret; int force_cow = need_force_cow(inode, start, end); + int do_compress = inode_can_compress(inode) && + inode_need_compress(inode, start, end); + bool zoned = btrfs_fs_incompat(inode->root->fs_info, ZONED); if (inode->flags & BTRFS_INODE_NODATACOW && !force_cow) { + ASSERT(!zoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); } else if (inode->flags & BTRFS_INODE_PREALLOC && !force_cow) { + ASSERT(!zoned); ret = run_delalloc_nocow(inode, locked_page, start, end, 
page_started, 0, nr_written); - } else if (!inode_can_compress(inode) || - !inode_need_compress(inode, start, end)) { + } else if (!do_compress && !zoned) { ret = cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); + } else if (!do_compress && zoned) { + ret = run_delalloc_zoned(inode, locked_page, start, end, + page_started, nr_written); } else { set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags); ret = cow_file_range_async(inode, wbc, locked_page, start, end, From patchwork Thu Oct 1 18:36:35 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812307 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7EFC1139F for ; Thu, 1 Oct 2020 18:39:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5790320796 for ; Thu, 1 Oct 2020 18:39:26 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="quvwArtx" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733164AbgJASjY (ORCPT ); Thu, 1 Oct 2020 14:39:24 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733076AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577542; x=1633113542; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ugdU8mzAQKkaZkmLNMfFgDBsu6jBgINON9Q0E2NdmGw=; b=quvwArtxBtk8YM9bvTg+SyvJIWh/FAZsKOdqxzv61WuzIg2H141q3In3 jCEPQCvweGMTVrxe+XNPge2+8vl8rLGB4slMD8xv/B2MK0L6LaeNmObdX 
740cTn57m3g//ycgbeDCWUInqgLFf3xK5eZsh5mSRGKandb9g+e3UEQxh h+OtNL/1KC4517jnJ0xr/ZfbbEtiidGzKUvg+8m8mLcrH/peOikITN9tq 2Z2dWdmj6zYF4x7OyBDgYZgWrmbVfEyeJE6ejYLsQnys8qdXhY6f192Ju AjiJNp/N0vJS7iz+mI7ErqZF6r64/4LCEle9ZsLsRsOen+Q9in1RNZx+3 g==; IronPort-SDR: PQdfiJMWaa7q663J+l+0M3O77e//nKwym/AuSOVeduVciSzjnKKXLF8ggsSAGoGSXGd/bM/1eN OBsNzQDZJvpoIcSQTsmep+1Y+wTo6fjz/aU68M/xD5fJI0WvD/0QgygWlduC/QrQq2y9++w+Yk 2wswdmSWHqcqwVG0ZLzP9qOPQp+Sufgvxm8HUv4PwOTbiviIOVqyS66pk3L34xbGEHUIdQrSQs O5sFE1epWH8FCa1xskAF6X3+JyOshntZr/KTAgjF0pyfBqjimz/paJgC5FSoU40uV9N3AvKAJF abY= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036822" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:34 +0800 IronPort-SDR: ooAqbl7AIrENy7b+SaP4r4fhlKdEIiqMvG1cleg9lGd3L0ZMy3xRqEqQFwr2fS0EXVeYI05aTb uMeAJOU+Pi/w== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:31 -0700 IronPort-SDR: TMXv94CbeNa/FnHihQNlVK5h7WUjJt94bHFmjbBUUP1eCmTDwvS5DnsyDyf+II6hccn7eVOe1+ ZEWn+iXIeTEQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:33 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 28/41] btrfs: serialize meta IOs on ZONED mode Date: Fri, 2 Oct 2020 03:36:35 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org We cannot use zone append writing for metadata, because the B-tree nodes have references to each other using the logical address. Without knowing the address in advance, we cannot construct the tree in the first place. Thus, we need to serialize write IOs for metadata. 
We cannot simply put a mutex around allocation and submission because metadata blocks are allocated at an earlier stage to build up the B-trees. Thus, this commit adds zoned_meta_io_lock and holds it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this commit adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can otherwise be violated when writing back blocks in an unfinished transaction. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 1 + fs/btrfs/extent_io.c | 27 ++++++++++++++++++++++- fs/btrfs/zoned.c | 50 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 31 ++++++++++++++++++++++++++ 6 files changed, 111 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 401e9bcefaec..b2a8a3beceac 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -190,6 +190,7 @@ struct btrfs_block_group { */ u64 alloc_offset; u64 zone_unusable; + u64 meta_write_pointer; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e6f0fe1920e9..d021bc4a92cd 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -953,6 +953,8 @@ struct btrfs_fs_info { /* Type of exclusive operation running */ unsigned long exclusive_operation; + struct mutex zoned_meta_io_lock; + #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; struct rb_root block_tree; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c872f051b0a5..87c978fecaa2 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2652,6 +2652,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info) mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); + mutex_init(&fs_info->zoned_meta_io_lock); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); diff 
--git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index bbcdc8dfbd45..ed6a9fce016d 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -25,6 +25,7 @@ #include "backref.h" #include "disk-io.h" #include "zoned.h" +#include "block-group.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -4001,6 +4002,7 @@ int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc) { struct extent_buffer *eb, *prev_eb = NULL; + struct btrfs_block_group *cache = NULL; struct extent_page_data epd = { .bio = NULL, .extent_locked = 0, @@ -4035,6 +4037,7 @@ int btree_write_cache_pages(struct address_space *mapping, tag = PAGECACHE_TAG_TOWRITE; else tag = PAGECACHE_TAG_DIRTY; + btrfs_zoned_meta_io_lock(fs_info); retry: if (wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -4077,12 +4080,30 @@ int btree_write_cache_pages(struct address_space *mapping, if (!ret) continue; + if (!btrfs_check_meta_write_pointer(fs_info, eb, + &cache)) { + /* + * If for_sync, this hole will be filled with + * transaction commit. 
+ */ + if (wbc->sync_mode == WB_SYNC_ALL && + !wbc->for_sync) + ret = -EAGAIN; + else + ret = 0; + done = 1; + free_extent_buffer(eb); + break; + } + prev_eb = eb; ret = lock_extent_buffer_for_io(eb, &epd); if (!ret) { + btrfs_revert_meta_write_pointer(cache, eb); free_extent_buffer(eb); continue; } else if (ret < 0) { + btrfs_revert_meta_write_pointer(cache, eb); done = 1; free_extent_buffer(eb); break; @@ -4115,10 +4136,12 @@ int btree_write_cache_pages(struct address_space *mapping, index = 0; goto retry; } + if (cache) + btrfs_put_block_group(cache); ASSERT(ret <= 0); if (ret < 0) { end_write_bio(&epd, ret); - return ret; + goto out; } /* * If something went wrong, don't allow any metadata write bio to be @@ -4153,6 +4176,8 @@ int btree_write_cache_pages(struct address_space *mapping, ret = -EROFS; end_write_bio(&epd, ret); } +out: + btrfs_zoned_meta_io_unlock(fs_info); return ret; } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 9e1056e2c2c8..57bd6dbd8f45 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -991,6 +991,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) ret = -EIO; } + if (!ret) + cache->meta_write_pointer = cache->alloc_offset + cache->start; + kfree(alloc_offsets); free_extent_map(em); @@ -1122,3 +1125,50 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) kfree(logical); bdput(bdev); } + +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + struct btrfs_block_group *cache; + + if (!btrfs_fs_incompat(fs_info, ZONED)) + return true; + + cache = *cache_ret; + + if (cache && (eb->start < cache->start || + cache->start + cache->length <= eb->start)) { + btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + } + + if (!cache) + cache = btrfs_lookup_block_group(fs_info, eb->start); + + if (cache) { + *cache_ret = cache; + + if (cache->meta_write_pointer != eb->start) { + 
btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + return false; + } + + cache->meta_write_pointer = eb->start + eb->len; + } + + return true; +} + +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ + if (!btrfs_fs_incompat(eb->fs_info, ZONED) || !cache) + return; + + ASSERT(cache->meta_write_pointer == eb->start + eb->len); + cache->meta_write_pointer = eb->start; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index f6263a893a07..fc8012ebcc36 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -53,6 +53,11 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans); void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, struct bio *bio); void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret); +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -128,6 +133,18 @@ static inline void btrfs_record_physical_zoned(struct inode *inode, } static inline void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } +static inline bool +btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + return true; +} +static inline void +btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -230,4 +247,18 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device, return true; } +static inline void btrfs_zoned_meta_io_lock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return; + 
mutex_lock(&fs_info->zoned_meta_io_lock); +} + +static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return; + mutex_unlock(&fs_info->zoned_meta_io_lock); +} + #endif From patchwork Thu Oct 1 18:36:36 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812259 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C67CB1668 for ; Thu, 1 Oct 2020 18:39:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A93D32158C for ; Thu, 1 Oct 2020 18:39:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="F6xAoSK0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732431AbgJASjE (ORCPT ); Thu, 1 Oct 2020 14:39:04 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730079AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577542; x=1633113542; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=lF+1gtKcY/8Ddsnnjdki26oFBUL/Pm8XsVg1+Bp/B6Y=; b=F6xAoSK0iyvfHssbUUZ8ipBO3KOBW+zrZhLRLhyzzfusThC4AymRhgwH I6xBfcvp4+Q1weF0gdolTs5g9Y54Hzi/dR0kS5bq3NV72KMq30EUCf+BB TSnQh8b4phGEMX87SWOpelFd4Hxicx1Cc6diiqcnJZ6AXzvxsLkwScn84 nfw+GBaPy5yNKaXX291T0Lcykd/flhuHON92pyTGz3nDuRZDF5YAguY5+ g4wb2rqKoeGTQe/QfrDsPNUR740XghsA8xu/VIJ0CSIfPBal2YW0k1tKK I8qh69TWjmLCb2OTKeOyLdN4sHq/duS68MW3LfDf0w/MYTGPdhykBNxm1 Q==; IronPort-SDR: 
EOuIBfA0BYAlrNvQljdPI4hZ241vzjcN/wo9bYWIFRwDisrQJbwK6ss0NduttR7QLdT/D5vYNC wUUOZCx4q9ZCIZMdWhxhEh8jLP7ZUJK35djgRt5tx38hR/JM4ANhg8Uetlmi+1uEGgqShPyeg2 gXo4tb+trhKOo+xoVl7rO34+9kCEc928VeK3j7UfMEwIKj954K5SRuSV2JT3laYG/j/F0CFBcA asmcF0cWqhJzBfPlRRfshv/zb/uVqBqFh+eTXDwEPh2silsskqjY4YSzn5PJeg4y3Jj6gik2Y8 iFQ= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036823" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:35 +0800 IronPort-SDR: E6drWRHrV1IYdeo02n66apWQInNXe25pED2ktIT8/MOn8LSF4UgJDFuzWGtqtdzGTFVZmH9N32 xy4dGbCQiu5g== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:32 -0700 IronPort-SDR: K9tu9LXXogV7XQSVEG43BQKOPg49OjpZLovX8ruEDFYHwyVvJkSA983NtJFrP5bPMxMXbUbdOd 0TxQ19t3X0aA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:35 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota , Josef Bacik Subject: [PATCH v8 29/41] btrfs: wait existing extents before truncating Date: Fri, 2 Oct 2020 03:36:36 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When truncating a file, file buffers which have already been allocated but not yet written may be truncated. Truncating these buffers could cause breakage of a sequential write pattern in a block group if the truncated blocks are for example followed by blocks allocated to another file. To avoid this problem, always wait for write out of all unwritten buffers before proceeding with the truncate execution. 
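As a rough illustration of the problem (a user-space sketch, not btrfs code; zone_model and zone_model_write() are hypothetical names introduced only for this example), a sequential zone accepts a write only if it starts at the zone's write pointer. If the blocks allocated first are truncated away before they are written, the blocks allocated after them no longer line up with the write pointer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write-required zone. */
struct zone_model {
	uint64_t wp;	/* next writable offset within the zone */
};

/*
 * Accept a write only if it starts exactly at the write pointer.
 * Returns 0 on success, -1 if the write would be out of order.
 */
static int zone_model_write(struct zone_model *z, uint64_t pos, uint64_t len)
{
	assert(z);

	if (pos != z->wp)
		return -1;

	z->wp += len;
	return 0;
}
```

In this model, with file A allocated at offset 0 and file B at offset 4096, skipping A's write makes B's write fail. Writing A first, which is what waiting for the ordered extents is meant to guarantee, lets both writes succeed.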
Reviewed-by: Josef Bacik Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 46eba6c7792b..40704b61f582 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4953,6 +4953,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) btrfs_drew_write_unlock(&root->snapshot_lock); btrfs_end_transaction(trans); } else { + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + + if (btrfs_fs_incompat(fs_info, ZONED)) { + ret = btrfs_wait_ordered_range( + inode, + ALIGN(newsize, fs_info->sectorsize), + (u64)-1); + if (ret) + return ret; + } /* * We're truncating a file that used to have good data down to From patchwork Thu Oct 1 18:36:37 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812285 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 01567139F for ; Thu, 1 Oct 2020 18:39:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D63E4208C7 for ; Thu, 1 Oct 2020 18:39:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="pIeJ/7x2" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733113AbgJASjI (ORCPT ); Thu, 1 Oct 2020 14:39:08 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733085AbgJASjD (ORCPT ); Thu, 1 Oct 2020 14:39:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577543; x=1633113543; h=from:to:cc:subject:date:message-id:in-reply-to: 
references:mime-version:content-transfer-encoding; bh=NV5uXusiJdcbhwTsDSi3QoePq27Ek8RGGCfo9PXzbOs=; b=pIeJ/7x2BlS8H1OI0QARZfUuy241k1QNpU5n629PvtL2YkQBYmjDVTlI W4+FR+RGw2FzxxdznT3Ry6b6bnVbRFEoUJ6wxQ5gC3a5437CEMd9TVQYR I7C0sBrV2skr3/UI5NDgck2Xvx4QioU/BmgMkuG20ZhnxC+42/ZeTPYQm z1MZYvpBgcxmCEtznomUaTmicimGj7LRO/ZkyBn8aaZdx+zwq70pMN0aL uOwa3mK1P9FmNcSNqclCRX3LKyhwKvcX9C7pUAvtjlNhn1OFvwdgQTBnJ V6yd+j9Hz6mo6WFvXzLsKSvQ9pjcQeZpBIr3QXSsSYHv7eOFMZ48C6DBR g==; IronPort-SDR: Lfq83t8VEJHgC3xGrGSjHcLJK+f9X1i4X8UVoPLQmJZmZpPj4aybdM7m5ucYzZGPmdNbB7tY1P 9kSh6jYLnIcg8fsMGJ9Ao6AEaHaGorc6zjm3qwpCZjIsQC2oG2KqXDweQyNje6BoBN+iSf9kx7 tIvbK6dSOhu1uzTr6uoUQlmuOl8/FPkYmCQPZFNQWCwypxQS70F5hQ+pEyJaxmA8NxS+JreZQh DdxK+pFj8f7B2//LY/DX8lvF4wSjP6tKe9caovDXuEvDBpfjRi9xQ8mCI5ik4IHnkCG4P3JpLP BmY= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036826" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:37 +0800 IronPort-SDR: CXMtjfo3MG8sABznyQFz6O1BNVhMHI3oUnC7Fa4LOdDwiAdLwbeisV/WMkkivst2uYrGkCh1Iq QAlsotS17/cQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:33 -0700 IronPort-SDR: pDVH33KBuAZwEOR7oDJsbeyuV1LTz7tSX65ZnF5hmbdh0QqBngZS+nTkMn9YI8VqeLOeomFZUA M8qb+3mQSiJQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:36 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 30/41] btrfs: avoid async metadata checksum on ZONED mode Date: Fri, 2 Oct 2020 03:36:37 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org In ZONED, btrfs uses per-FS zoned_meta_io_lock to serialize the metadata 
write IOs. Even with this serialization, write bios sent from btree_write_cache_pages() can be reordered by the async checksum workers, as these workers are per CPU and not per zone. To preserve write bio ordering, disable async metadata checksum on ZONED. This does not lower performance with HDDs, as a single CPU core is fast enough to checksum a single zone write stream at the maximum possible bandwidth of the device. If multiple zones are being written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, so per-zone checksum serialization again does not affect performance. Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 87c978fecaa2..5ce5b18f9dc4 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -813,6 +813,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio, static int check_async_write(struct btrfs_fs_info *fs_info, struct btrfs_inode *bi) { + if (btrfs_fs_incompat(fs_info, ZONED)) + return 0; if (atomic_read(&bi->sync_writers)) return 0; if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags)) From patchwork Thu Oct 1 18:36:38 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812309 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6290B112E for ; Thu, 1 Oct 2020 18:39:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3DA7420B1F for ; Thu, 1 Oct 2020 18:39:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="Fs2mGwVS" Received: (majordomo@vger.kernel.org) by vger.kernel.org via 
listexpand id S1733159AbgJASjX (ORCPT ); Thu, 1 Oct 2020 14:39:23 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24779 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733086AbgJASjC (ORCPT ); Thu, 1 Oct 2020 14:39:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577543; x=1633113543; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=g83IDaetFeG2qxHEJsxkSMJbqm1nC0dkIReub0/vubg=; b=Fs2mGwVS0FzrXhpX/BejfYxnnthSTjAZkT6Q94UAAJo+HwBDlPuVxS15 chxQ/1NT0k15wUVi7R3gIsZvCc8NgfuGzm3BxN2BCXisZ7mW9vRgQtIVG AvZVNKK+QToQ6BArWgLNbiRzh9xFIPWrzac1gRBDyKnXlDt6XtgTZZi1u RXiBCuJi07p+4nnQpW0Rh/CkOvtbc5BkwGbrU85Vh1OLJOhpaRhHZBzsj 26KZaUMcSkkqIKQ8RRn4ezLixRaJXfCE1d/uo1yllk5sPsYkjRh3HWeRv qQNXy0JsBtGsiBCGc5XqwJ+Nr6y1eB4+I+t9FuXk4Djwl05cpMMzrRl8X g==; IronPort-SDR: fOUVVYRH/5C4mQ91W2t4p9mgStFe+UUwNfROQMtdrWT2Y2ndz0fM/cMMWkxuv6KYa1CiJxtTEC FRBE2UjRGkIF15DQ7VzVvGamQrDh7cfE8mLwjYXLPsFPjCDAZBltrqg40s4NGW7sn+oKngmNwy R9ITdGzbR6YfoAXxDzVYbNdk3/EMSkWXP7p6TM6Ir3oajYqTTfjq3K1jHuJSM+4XTDXUtgwyr/ FiM/brPc40ijcerU2/CjN5qPrvWzZxTDWljqib1KHB+9pGOUAdAlK6MQfcvD1nNRs3SBLaQ8NT AfA= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036827" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:38 +0800 IronPort-SDR: sTFXU9TuFoBFs5zylhquPJU+7Z/HJRnGdS4POUHjtfZ6RmZt74KF0zRHGDclxNu4gB1nubkv44 t1MA0XGnZpqg== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:34 -0700 IronPort-SDR: +RKVdMvDuJougfTXYcbz6Q8IodeFuIxy65WQijLy+k0YBsTbZRaxhysdJp9GUqCnbMX4viXJrl IuVHReHSWt4w== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:37 -0700 From: Naohiro Aota To: 
linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 31/41] btrfs: mark block groups to copy for device-replace Date: Fri, 2 Oct 2020 03:36:38 +0900 Message-Id: <83652d36a020f8c11e601d969cc8940a829020e9.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This is patch 1/4 to support device-replace in ZONED mode. We have two types of I/Os during the device-replace process. One is an I/O to "copy" (by the scrub functions) all the device extents on the source device to the destination device. The other one is an I/O to "clone" (by handle_ops_on_dev_replace()) new incoming write I/Os from users to the source device onto the target device. Cloning incoming I/Os can break the sequential write rule on the target device. When a write is mapped to the middle of a block group, the I/O is directed to the middle of a target device zone, which breaks the sequential write rule. However, the cloning function cannot simply be disabled, since incoming I/Os targeting already copied device extents must be cloned so that the I/O is also executed on the target device. We cannot use dev_replace->cursor_{left,right} to determine whether a bio is going to a not yet copied region. Since there is a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have a newly allocated device extent which is neither cloned nor copied. So the point is to copy only already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block groups as targets of copying. Then, handle_ops_on_dev_replace() and dev-replace can check the flag to do their job.
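The division of work between "copy" and "clone" can be sketched in user space as follows. This is only a simplified model under stated assumptions, not the kernel code: struct bg_model and the model_*() helpers are hypothetical names, locking is omitted, and a single stripe per block group is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "to_copy" bit of a block group. */
struct bg_model {
	uint64_t start;
	bool to_copy;
};

/* At replace start: mark every block group that already exists. */
static void model_mark_to_copy(struct bg_model *bgs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		bgs[i].to_copy = true;
}

/*
 * A write is cloned to the target device only when its block group
 * is not waiting to be copied; a pending copy will carry the data.
 */
static bool model_should_clone_write(const struct bg_model *bg)
{
	assert(bg);
	return !bg->to_copy;
}

/* Once the copy of a block group has finished, clear the mark. */
static void model_finish_copy(struct bg_model *bg)
{
	assert(bg);
	bg->to_copy = false;
}
```

In this model, a block group allocated after the replace has started is never marked, so writes to it are cloned from the beginning, while writes to a marked block group are cloned only after its copy has finished.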
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/dev-replace.c | 175 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/dev-replace.h | 3 + fs/btrfs/scrub.c | 17 ++++ 4 files changed, 196 insertions(+) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index b2a8a3beceac..e91123495d68 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -95,6 +95,7 @@ struct btrfs_block_group { unsigned int iref:1; unsigned int has_caching_ctl:1; unsigned int removed:1; + unsigned int to_copy:1; int disk_cache_state; diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 5e3554482af1..e86aff38aea4 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -22,6 +22,7 @@ #include "dev-replace.h" #include "sysfs.h" #include "zoned.h" +#include "block-group.h" /* * Device replace overview @@ -437,6 +438,176 @@ static char* btrfs_dev_name(struct btrfs_device *device) return rcu_str_deref(device->name); } +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info, + struct btrfs_device *src_dev) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_key found_key; + struct btrfs_root *root = fs_info->dev_root; + struct btrfs_dev_extent *dev_extent = NULL; + struct btrfs_block_group *cache; + struct extent_buffer *l; + struct btrfs_trans_handle *trans; + int slot; + int ret = 0; + u64 chunk_offset, length; + + /* Do not use "to_copy" on non-ZONED for now */ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return 0; + + mutex_lock(&fs_info->chunk_mutex); + + /* ensure we don't have pending new block group */ + while (fs_info->running_transaction && + !list_empty(&fs_info->running_transaction->dev_update_list)) { + mutex_unlock(&fs_info->chunk_mutex); + trans = btrfs_attach_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + mutex_lock(&fs_info->chunk_mutex); + if (ret == -ENOENT) + continue; + else + goto out; + } + + ret = btrfs_commit_transaction(trans); + 
mutex_lock(&fs_info->chunk_mutex); + if (ret) + goto out; + } + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + path->reada = READA_FORWARD; + path->search_commit_root = 1; + path->skip_locking = 1; + + key.objectid = src_dev->devid; + key.offset = 0ull; + key.type = BTRFS_DEV_EXTENT_KEY; + + while (1) { + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + break; + if (ret > 0) { + if (path->slots[0] >= + btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + break; + if (ret > 0) { + ret = 0; + break; + } + } else { + ret = 0; + } + } + + l = path->nodes[0]; + slot = path->slots[0]; + + btrfs_item_key_to_cpu(l, &found_key, slot); + + if (found_key.objectid != src_dev->devid) + break; + + if (found_key.type != BTRFS_DEV_EXTENT_KEY) + break; + + if (found_key.offset < key.offset) + break; + + dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent); + length = btrfs_dev_extent_length(l, dev_extent); + + chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent); + + cache = btrfs_lookup_block_group(fs_info, chunk_offset); + if (!cache) + goto skip; + + spin_lock(&cache->lock); + cache->to_copy = 1; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + +skip: + key.offset = found_key.offset + length; + btrfs_release_path(path); + } + + btrfs_free_path(path); +out: + mutex_unlock(&fs_info->chunk_mutex); + + return ret; +} + +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map *em; + struct map_lookup *map; + u64 chunk_offset = cache->start; + int num_extents, cur_extent; + int i; + + /* Do not use "to_copy" on non-ZONED for now */ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return true; + + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + return true; + } + spin_unlock(&cache->lock); + + em = 
btrfs_get_chunk_map(fs_info, chunk_offset, 1); + BUG_ON(IS_ERR(em)); + map = em->map_lookup; + + num_extents = cur_extent = 0; + for (i = 0; i < map->num_stripes; i++) { + /* we have more device extent to copy */ + if (srcdev != map->stripes[i].dev) + continue; + + num_extents++; + if (physical == map->stripes[i].physical) + cur_extent = i; + } + + free_extent_map(em); + + if (num_extents > 1 && cur_extent < num_extents - 1) { + /* + * Has more stripes on this device. Keep this BG + * readonly until we finish all the stripes. + */ + return false; + } + + /* last stripe on this device */ + spin_lock(&cache->lock); + cache->to_copy = 0; + spin_unlock(&cache->lock); + + return true; +} + static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, const char *tgtdev_name, u64 srcdevid, const char *srcdev_name, int read_src) @@ -478,6 +649,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, if (ret) return ret; + ret = mark_block_group_to_copy(fs_info, src_device); + if (ret) + return ret; + down_write(&dev_replace->rwsem); switch (dev_replace->replace_state) { case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED: diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h index 60b70dacc299..3911049a5f23 100644 --- a/fs/btrfs/dev-replace.h +++ b/fs/btrfs/dev-replace.h @@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info); void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info); int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info); int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace); +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical); #endif diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index aa1b36cf5c88..d0d7db3c8b0b 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3500,6 +3500,17 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + + if (sctx->is_dev_replace && 
btrfs_fs_incompat(fs_info, ZONED)) {
+			spin_lock(&cache->lock);
+			if (!cache->to_copy) {
+				spin_unlock(&cache->lock);
+				ro_set = 0;
+				goto done;
+			}
+			spin_unlock(&cache->lock);
+		}
+
 		/*
 		 * Make sure that while we are scrubbing the corresponding block
 		 * group doesn't get its logical address and its device extents
@@ -3631,6 +3642,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,

 		scrub_pause_off(fs_info);

+		if (sctx->is_dev_replace &&
+		    !btrfs_finish_block_group_to_copy(dev_replace->srcdev,
+						      cache, found_key.offset))
+			ro_set = 0;
+
+done:
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;

From patchwork Thu Oct 1 18:36:39 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812283
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 32/41] btrfs: implement cloning for ZONED device-replace
Date: Fri, 2 Oct 2020 03:36:39 +0900
X-Mailer: git-send-email 2.27.0

This is patch 2/4 to implement device-replace for ZONED mode.

In ZONED mode, a block group must be either copied (from the source
device to the destination device) or cloned (written to both devices).

This commit implements the cloning part. If a block group targeted by an
IO is marked to copy, we should not clone the IO to the destination
device, because the block group will eventually be copied by the replace
process.

This commit also handles cloning of device zone resets.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
 fs/btrfs/volumes.c     | 33 +++++++++++++++++++++++++++++++--
 fs/btrfs/zoned.c       | 11 +++++++++++
 3 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c0e4a577c61c..f44faaf7aca2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -35,6 +35,7 @@
 #include "discard.h"
 #include "rcu-string.h"
 #include "zoned.h"
+#include "dev-replace.h"

 #undef SCRAMBLE_DELAYED_REFS

@@ -1336,6 +1337,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;

 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1344,15 +1347,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,

 			req_q = bdev_get_queue(stripe->dev->bdev);
 			/* zone reset in ZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(
+					    dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(
+					dev_replace->tgtdev,
+					physical, length, &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;

+next:
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 924ba96dc8fa..af2ed4d3389f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5965,9 +5965,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }

+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+	bool ret;
+
+	/* non-ZONED mode does not use "to_copy" flag */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5980,6 +6000,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;

+		/*
+		 * a block group which have "to_copy" set will
+		 * eventually copied by dev-replace process. We can
+		 * avoid cloning IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6375,8 +6404,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,

 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}

 	*bbio_ret = bbio;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 57bd6dbd8f45..f10cc5f49962 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -17,6 +17,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "transaction.h"
+#include "dev-replace.h"

 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -892,6 +893,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;

 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -918,6 +921,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);

+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing =
+			btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev,
+						   physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
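
As a reading aid for the copy-or-clone rule in patch 32/41, here is a
minimal userspace sketch. It is not kernel code: the enum, the function
classify_write() and its boolean parameters are hypothetical stand-ins
for the state that the patch reads from fs_info, dev_replace and the
block group's "to_copy" bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of mapping one write while dev-replace may run. */
enum replace_action {
	WRITE_SRC_ONLY,		/* do not duplicate the write */
	WRITE_SRC_AND_TGT,	/* clone the write to the replace target */
};

/*
 * Decide whether a write should be cloned to the replace target.
 * zoned:           the filesystem is in ZONED mode
 * replace_ongoing: a device-replace is in progress
 * to_copy:         the block group is still waiting to be copied
 */
static enum replace_action classify_write(bool zoned, bool replace_ongoing,
					  bool to_copy)
{
	/* Only ZONED mode uses the "to_copy" flag. */
	assert(zoned || !to_copy);

	if (!replace_ongoing)
		return WRITE_SRC_ONLY;

	/* The replace process will copy this block group later. */
	if (zoned && to_copy)
		return WRITE_SRC_ONLY;

	return WRITE_SRC_AND_TGT;
}
```

With these assumptions, a write into a block group that is still marked
to copy is sent to the source only, and every other write during an
ongoing replace is duplicated.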
From patchwork Thu Oct 1 18:36:40 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812273
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 33/41] btrfs: implement copying for ZONED device-replace
Date: Fri, 2 Oct 2020 03:36:40 +0900
Message-Id: <7cbde84e957f5ac0b58c7f95403269a35f95373b.1601574234.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0

This is patch 3/4 to implement device-replace in ZONED mode.

This commit implements copying. To do so, it tracks the write pointer
during the device-replace process. Because device-replace copies only
the used extents on the source device, we have to fill the gaps between
them to honor the sequential write rule on the target device.

The device-replace process in ZONED mode must copy or clone every extent
on the source device exactly once. So, we need to ensure that
allocations started just before the dev-replace process have their
corresponding extent information in the B-trees.
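
The gap filling can be pictured with a small userspace sketch. It only
models the bookkeeping: struct replace_cursor, fill_gap() and
copy_extent() are hypothetical names, and the zero-fill is merely
counted instead of being issued to a device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-zone state of the replace target. */
struct replace_cursor {
	uint64_t write_pointer;	/* next byte offset the zone accepts */
	uint64_t zeroed_bytes;	/* padding accounted for so far */
};

/*
 * Move the write pointer up to "physical" by accounting for a zero-fill
 * of the hole in front of it. Returns the number of padding bytes.
 */
static uint64_t fill_gap(struct replace_cursor *c, uint64_t physical)
{
	uint64_t length = 0;

	if (c->write_pointer < physical) {
		length = physical - c->write_pointer;
		c->zeroed_bytes += length;
		c->write_pointer = physical;
	}
	return length;
}

/* Account for copying one used extent of "len" bytes at "physical". */
static void copy_extent(struct replace_cursor *c, uint64_t physical,
			uint64_t len)
{
	fill_gap(c, physical);
	/* Sequential write rule: the write must land on the write pointer. */
	assert(c->write_pointer == physical);
	c->write_pointer = physical + len;
}
```

For example, copying a 4 KiB extent at offset 0 and then an extent at
offset 16 KiB requires 12 KiB of zeros in between, so that the second
write starts exactly at the target's write pointer.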
finish_extent_writes_for_zoned() ensures this by waiting for the pending
writes to the block group and committing a transaction. It is basically
the code that was removed in commit 042528f8d840 ("Btrfs: fix block
group remaining RO forever after error during device replace").

Signed-off-by: Naohiro Aota
---
 fs/btrfs/scrub.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.c | 12 +++++++
 fs/btrfs/zoned.h |  7 ++++
 3 files changed, 105 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d0d7db3c8b0b..65e460670160 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -169,6 +169,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;

 	int			is_dev_replace;
+	u64			write_pointer;

 	struct scrub_bio	*wr_curr_bio;
 	struct mutex		wr_lock;
@@ -1623,6 +1624,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
 	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
 }

+static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
+{
+	int ret = 0;
+	u64 length;
+
+	if (!btrfs_fs_incompat(sctx->fs_info, ZONED))
+		return 0;
+
+	if (sctx->write_pointer < physical) {
+		length = physical - sctx->write_pointer;
+
+		ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev,
+						sctx->write_pointer, length);
+		if (!ret)
+			sctx->write_pointer = physical;
+	}
+	return ret;
+}
+
 static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage)
 {
@@ -1645,6 +1665,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	if (sbio->page_count == 0) {
 		struct bio *bio;

+		ret = fill_writer_pointer_gap(sctx,
+					      spage->physical_for_dev_replace);
+		if (ret) {
+			mutex_unlock(&sctx->wr_lock);
+			return ret;
+		}
+
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
 		sbio->dev = sctx->wr_tgtdev;
@@ -1706,6 +1733,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_fs_incompat(sctx->fs_info, ZONED))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }

 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -2973,6 +3004,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }

+static void sync_replace_for_zoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_fs_incompat(sctx->fs_info, ZONED))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->bios_in_flight) == 0);
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3105,6 +3151,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);

+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3292,6 +3346,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;

+			if (sctx->is_dev_replace)
+				sync_replace_for_zoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3414,6 +3471,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	return ret;
 }

+static int finish_extent_writes_for_zoned(struct btrfs_root *root,
+					  struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_trans_handle *trans;
+
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	btrfs_wait_block_group_reservations(cache);
+	btrfs_wait_nocow_writers(cache);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	return btrfs_commit_transaction(trans);
+}
+
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   struct btrfs_device *scrub_dev, u64 start, u64 end)
@@ -3569,6 +3645,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * group is not RO.
 		 */
 		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
+		if (!ret && sctx->is_dev_replace) {
+			ret = finish_extent_writes_for_zoned(root, cache);
+			if (ret) {
+				btrfs_dec_block_group_ro(cache);
+				scrub_pause_off(fs_info);
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
+
 		if (ret == 0) {
 			ro_set = 1;
 		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f10cc5f49962..7ff2a590c93f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1183,3 +1183,15 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
 }
+
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev,
+				    physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT,
+				    GFP_NOFS, 0);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index fc8012ebcc36..d5b2d31e6c91 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -58,6 +58,8 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_group **cache_ret);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -145,6 +147,11 @@ btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				struct extent_buffer *eb)
 {
 }
+static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
+					    u64 physical, u64 length)
+{
+	return -EOPNOTSUPP;
+}
 #endif

 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Thu Oct 1 18:36:41 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812275
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 34/41] btrfs: support dev-replace in ZONED mode
Date: Fri, 2 Oct 2020 03:36:41 +0900
X-Mailer: git-send-email 2.27.0

This is patch 4/4 to implement device-replace in ZONED mode.

Even after the copying is done, the write pointers of the source device
and the destination device may not be synchronized. For example, when
the last allocated extent is freed before the device-replace process
starts, that extent is not copied, which leaves a hole at the end.

This patch synchronizes the write pointers by writing zeros to the
destination device.
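
The arithmetic behind this synchronization can be sketched in userspace
as follows. This is an illustration under assumptions, not the kernel
implementation: struct zone_info, wp_sync_length() and ERR_TARGET_AHEAD
are hypothetical names, and the zone fields are taken to be in 512-byte
sectors while the physical offsets are in bytes.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9			/* 512-byte sectors */
#define ERR_TARGET_AHEAD (-1)		/* hypothetical error code */

/* Hypothetical subset of a zone report for the source device. */
struct zone_info {
	uint64_t start;	/* first sector of the zone */
	uint64_t wp;	/* write pointer, in sectors */
};

/*
 * Compute how many bytes of zeros the target needs so that its write
 * pointer (physical_pos) matches the source zone's write pointer,
 * expressed relative to physical_start on the target.
 * Returns 0 and sets *length, or ERR_TARGET_AHEAD if the target is
 * already past the source.
 */
static int wp_sync_length(const struct zone_info *src_zone,
			  uint64_t physical_start, uint64_t physical_pos,
			  uint64_t *length)
{
	uint64_t wp = physical_start +
		      ((src_zone->wp - src_zone->start) << SECTOR_SHIFT);

	assert(src_zone->wp >= src_zone->start);

	*length = 0;
	if (physical_pos > wp)
		return ERR_TARGET_AHEAD;

	*length = wp - physical_pos;
	return 0;
}
```

A result of zero length means the two write pointers already agree, and
a non-zero length is the hole that has to be padded with zeros.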

Signed-off-by: Naohiro Aota
---
 fs/btrfs/scrub.c | 36 +++++++++++++++++++++++++
 fs/btrfs/zoned.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h |  8 ++++++
 3 files changed, 113 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 65e460670160..2e607caa5ab9 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3019,6 +3019,31 @@ static void sync_replace_for_zoned(struct scrub_ctx *sctx)
 		   atomic_read(&sctx->bios_in_flight) == 0);
 }

+static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
+					u64 physical, u64 physical_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret = 0;
+
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+
+	mutex_lock(&sctx->wr_lock);
+	if (sctx->write_pointer < physical_end) {
+		ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
+						    physical,
+						    sctx->write_pointer);
+		if (ret)
+			btrfs_err(fs_info, "failed to recover write pointer");
+	}
+	mutex_unlock(&sctx->wr_lock);
+	btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3416,6 +3441,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (sctx->is_dev_replace && ret >= 0) {
+		int ret2;
+
+		ret2 = sync_write_pointer_for_zoned(sctx, base + offset,
+						    map->stripes[num].physical,
+						    physical_end);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret < 0 ? ret : 0;
 }

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7ff2a590c93f..f28a70eaa20a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -18,6 +18,7 @@
 #include "block-group.h"
 #include "transaction.h"
 #include "dev-replace.h"
+#include "space-info.h"

 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -1195,3 +1196,71 @@ int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 				    length >> SECTOR_SHIFT,
 				    GFP_NOFS, 0);
 }
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	unsigned int nofs_flag;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nofs_flag = memalloc_nofs_save();
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
+}
+
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index d5b2d31e6c91..d857538660f1 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -60,6 +60,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 			      u64 length);
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				  u64 physical_start, u64 physical_pos);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -152,6 +154,12 @@ static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 {
 	return -EOPNOTSUPP;
 }
+static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
+						u64 logical, u64 physical_start,
+						u64 physical_pos)
+{
+	return -EOPNOTSUPP;
+}
 #endif

 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Thu Oct 1 18:36:42 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812281
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v8 35/41] btrfs: enable relocation in ZONED mode
Date: Fri, 2 Oct 2020 03:36:42 +0900
Message-Id: <0b38315d936c12a25e56118c429514541585aa23.1601574234.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0

To serialize allocation and submit_bio, we introduced a mutex around
them. As a result, preallocation must be completely disabled to avoid a
deadlock.

Since the current relocation process relies on preallocation to move
file data extents, it must be handled in another way. In ZONED mode, we
just truncate the inode to the size that we wanted to preallocate. Then,
we flush dirty pages on the file before finishing the relocation
process. run_delalloc_zoned() will handle all the allocation and submit
the IOs to the underlying layers.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/relocation.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3602806d71bd..3fa60065b483 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2603,6 +2603,32 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	if (ret)
 		return ret;

+	/*
+	 * In ZONED mode, we cannot preallocate the file region. Instead, we
+	 * dirty and fiemap_write the region.
+ */ + + if (btrfs_fs_incompat(inode->root->fs_info, ZONED)) { + struct btrfs_root *root = inode->root; + struct btrfs_trans_handle *trans; + + end = cluster->end - offset + 1; + trans = btrfs_start_transaction(root, 1); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + inode->vfs_inode.i_ctime = current_time(&inode->vfs_inode); + i_size_write(&inode->vfs_inode, end); + ret = btrfs_update_inode(trans, root, &inode->vfs_inode); + if (ret) { + btrfs_abort_transaction(trans, ret); + btrfs_end_transaction(trans); + return ret; + } + + return btrfs_end_transaction(trans); + } + inode_lock(&inode->vfs_inode); for (nr = 0; nr < cluster->nr; nr++) { start = cluster->boundary[nr] - offset; @@ -2799,6 +2825,8 @@ static int relocate_file_extent_cluster(struct inode *inode, } } WARN_ON(nr != cluster->nr); + if (btrfs_fs_incompat(fs_info, ZONED) && !ret) + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1); out: kfree(ra); return ret; @@ -3434,8 +3462,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, struct btrfs_path *path; struct btrfs_inode_item *item; struct extent_buffer *leaf; + u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC; int ret; + if (btrfs_fs_incompat(trans->fs_info, ZONED)) + flags &= ~BTRFS_INODE_PREALLOC; + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -3450,8 +3482,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, btrfs_set_inode_generation(leaf, item, 1); btrfs_set_inode_size(leaf, item, 0); btrfs_set_inode_mode(leaf, item, S_IFREG | 0600); - btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS | - BTRFS_INODE_PREALLOC); + btrfs_set_inode_flags(leaf, item, flags); btrfs_mark_buffer_dirty(leaf); out: btrfs_free_path(path); From patchwork Thu Oct 1 18:36:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812295 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org 
[172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C3AA6112E for ; Thu, 1 Oct 2020 18:39:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A1837208C7 for ; Thu, 1 Oct 2020 18:39:21 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="cT+KNSt6" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733150AbgJASjT (ORCPT ); Thu, 1 Oct 2020 14:39:19 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24680 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730079AbgJASjH (ORCPT ); Thu, 1 Oct 2020 14:39:07 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577547; x=1633113547; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=shVFtZnTpmLC/FMiePjkuEnS2T84Svac1gaMW/ga3nw=; b=cT+KNSt6nETolEv3i9rNfwMZShwekdjBaQB2upJhVwv0IKCDRXvegVYT 6WuXA49Pdj0tP0Zn9KITN6Zzx9mcooACYXZC5FnjgF+FgKLtK0ODCEsWu T5mC45dwfyF8f9n1uVrKWru7L5tQ+uAH7duSZbV9SfPoBgXazs7kYK9hJ G/IG9IkF2/++Z9LI4vyNh8wA26JYvZF45JtV2CR7LgWEkpnr5j3Gtlqx+ k/EtmCnjPlxWhSXgd92W9v0JaVOf0rSS+UqjtIyiigWvtyPe+psKabmt4 JmoUVzxNdLLq+C49b9JIkLop5nM/R1QEHAp8u2pgDMcpmLB5s5wZ7IHBc A==; IronPort-SDR: 4bXM91bz94JBkS50mUue6hEEVC24+F+IenwchuSuRTQLZwVC8ayyRJrFA/Rp7UE5tz64Q7MMmq IpeJwvnwDXo6CrzBBsEZN/Na3fFevarfWVOinUclijCCXfODmoUAmQKsCbVvVr/6nwircRWIAg q7QB5jGuhV5eOmImn/WPx+ytBPUOCo4rqjks4qQkEoiDitODrEGcvfcvmAShuWSIY2ikPu/s1E C++0Z7zrxdyLRrA6t5F9ZhqRW1RW71S164gjE6mzCag+KdDAoCm+OKDVict0G+1LMOClZrSXB7 tC4= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036841" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:44 +0800 IronPort-SDR: 
eJT85jVFDAhWTO4lwNmpnaSMtrdCiz/fj5LoeYxewIm3EfL3OxtD6pad3RrDTtJ13WNLib4e6N VRT96mKcfwxg== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:40 -0700 IronPort-SDR: Acg0lP1uR9Pn9ccqUWB593gGmUa0AnugzrsbUUef/vdJOD0ZwUad7Lr/c8EWBY2IDEyv1/2pmW m2ZsT0Ic11AA== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:43 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota , Josef Bacik Subject: [PATCH v8 36/41] btrfs: relocate block group to repair IO failure in ZONED Date: Fri, 2 Oct 2020 03:36:43 +0900 Message-Id: <2e1473edbe7144719eaa72444fa24cc3ac5074d7.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When btrfs find a checksum error and if the file system has a mirror of the damaged data, btrfs read the correct data from the mirror and write the data to damaged blocks. This repairing, however, is against the sequential write required rule. We can consider three methods to repair an IO failure in ZONED mode: (1) Reset and rewrite the damaged zone (2) Allocate new device extent and replace the damaged device extent to the new extent (3) Relocate the corresponding block group Method (1) is most similar to a behavior done with regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessary degrades non-damaged data. Method (2) is much like device replacing but done in the same device. It is safe because it keeps the device extent until the replacing finish. However, extending device replacing is non-trivial. It assumes "src_dev->physical == dst_dev->physical". 
Also, the extent mapping replacing function would need to be extended to support replacing a device extent's position within one device.

Method (3) invokes relocation of the damaged block group, so it is straightforward to implement. It relocates all the mirrored device extents, so it is, potentially, a more costly operation than method (1) or (2). However, it relocates only the extents in use, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2).

To protect a block group from being relocated multiple times when multiple IO errors occur, this commit introduces the "relocating_repair" bit to show that the block group is already being relocated to repair IO failures. It also uses a new kthread, "btrfs-relocating-repair", so that the relocation process does not block the IO path.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/extent_io.c | 3 ++ fs/btrfs/scrub.c | 3 ++ fs/btrfs/volumes.c | 71 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + 5 files changed, 79 insertions(+) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index e91123495d68..50e5ddb0a19b 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -96,6 +96,7 @@ struct btrfs_block_group { unsigned int has_caching_ctl:1; unsigned int removed:1; unsigned int to_copy:1; + unsigned int relocating_repair:1; int disk_cache_state; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index ed6a9fce016d..b93c67e8ba1d 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2268,6 +2268,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, ASSERT(!(fs_info->sb->s_flags & SB_RDONLY)); BUG_ON(!mirror_num); + if (btrfs_fs_incompat(fs_info, ZONED)) + return btrfs_repair_one_zone(fs_info, logical); + bio = btrfs_io_bio_alloc(1); bio->bi_iter.bi_size = 0; map_length = length; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 
2e607caa5ab9..4c247a1618d0 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) have_csum = sblock_to_check->pagev[0]->have_csum; dev = sblock_to_check->pagev[0]->dev; + if (btrfs_fs_incompat(fs_info, ZONED) && !sctx->is_dev_replace) + return btrfs_repair_one_zone(fs_info, logical); + /* * We must use GFP_NOFS because the scrub task might be waiting for a * worker task executing this function and in turn a transaction commit diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index af2ed4d3389f..33380f20a206 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7975,3 +7975,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr) spin_unlock(&fs_info->swapfile_pins_lock); return node != NULL; } + +static int relocating_repair_kthread(void *data) +{ + struct btrfs_block_group *cache = (struct btrfs_block_group *) data; + struct btrfs_fs_info *fs_info = cache->fs_info; + u64 target; + int ret = 0; + + target = cache->start; + btrfs_put_block_group(cache); + + if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) { + btrfs_info(fs_info, + "skip relocating block group %llu to repair: EBUSY", + target); + return -EBUSY; + } + + mutex_lock(&fs_info->delete_unused_bgs_mutex); + + /* ensure Block Group still exists */ + cache = btrfs_lookup_block_group(fs_info, target); + if (!cache) + goto out; + + if (!cache->relocating_repair) + goto out; + + ret = btrfs_may_alloc_data_chunk(fs_info, target); + if (ret < 0) + goto out; + + btrfs_info(fs_info, "relocating block group %llu to repair IO failure", + target); + ret = btrfs_relocate_chunk(fs_info, target); + +out: + if (cache) + btrfs_put_block_group(cache); + mutex_unlock(&fs_info->delete_unused_bgs_mutex); + btrfs_exclop_finish(fs_info); + + return ret; +} + +int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + + /* do not attempt to repair in 
degraded state */ + if (btrfs_test_opt(fs_info, DEGRADED)) + return 0; + + cache = btrfs_lookup_block_group(fs_info, logical); + if (!cache) + return 0; + + spin_lock(&cache->lock); + if (cache->relocating_repair) { + spin_unlock(&cache->lock); + btrfs_put_block_group(cache); + return 0; + } + cache->relocating_repair = 1; + spin_unlock(&cache->lock); + + kthread_run(relocating_repair_kthread, cache, + "btrfs-relocating-repair"); + + return 0; +} diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index f8fc3debd5e0..7f1a38c820e3 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -584,5 +584,6 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, int btrfs_bg_type_to_factor(u64 flags); const char *btrfs_bg_type_to_raid_name(u64 flags); int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info); +int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical); #endif From patchwork Thu Oct 1 18:36:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812293 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E4258112E for ; Thu, 1 Oct 2020 18:39:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C40F620B1F for ; Thu, 1 Oct 2020 18:39:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="jMOAjghy" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733146AbgJASjS (ORCPT ); Thu, 1 Oct 2020 14:39:18 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733106AbgJASjI (ORCPT ); Thu, 1 Oct 2020 14:39:08 -0400 DKIM-Signature: v=1; 
a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577548; x=1633113548; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AoO3xZRWnSdnCylQAmdUKrRADOJvnmHa8U78NBbPy3Q=; b=jMOAjghyf8VQhX5rv9+l2GiOsRKB55pCcJ0iycOYRUkr/pqPdYG6STpK nJ0fkavvAWA3kfrheHtCErBwM5BtOs0TQXrTR2ug0uE4Bd0626xHlrdSk lC65A2pr26hsT75/3m9GEFjHKTa2ecOfJOVmXKYd5tZefWY+yM4EozH0G NL/VPml+TYu1WNE2ioW/MESo0ODicXyPgbPdLjpAni3mn3StnxuENyDi+ EtNgojOV905XfjztcyPk873cGDk/j2cStbeuDxYxWpMrO+FqkfKggHlIU IvO8hO4/8Uih0bNLpnx2WOlkIcPlplIQHzB5jVHYWwwtElfbNLrM/dbXO A==; IronPort-SDR: r2fhwohEKGv1uQzQpDQBjUBijgYOcgWvUqRAMZaGVsN0hWx4BdiTjSdSLMEgl8/dAq67dv2on6 0b1dCxdi4IQdseToKYsNHfiSKJhyni77Zhz0fh5pbO1HpoFAr7CTrdpo/K3aVL4W/C21ICVsap +x4rtSIDjU72ZPR0jjBDHsB1KiTf42af9o99ualk4JbiWS3QfS7WV0cZSzVwTGPAvORMk/7rcN FfWcWbjRDOyMZpy1j1mUSw+VwKpV47PO34H4U/1L28efm+FJ+u2BDhqWfZ052VLkdKXxRu2tWN itg= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036844" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:45 +0800 IronPort-SDR: /KN4EDoHedYFZfsRoZl0A2N4LC45fWN0Xl9gs4KkMFUZq7t/ShfOoSgW/5q4sFUN563muRDqLe NmYkDvNG09ZQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:41 -0700 IronPort-SDR: W9G3l2HZ23hI46uJSE9pqyCqNqiFbhvQOlPd+QyyYuRJc9dWaOw+OcjWYUqRLsRmL7uFyHSHI0 3NFOPLZK6EBg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:44 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 37/41] btrfs: split alloc_log_tree() Date: Fri, 2 Oct 2020 03:36:44 +0900 Message-Id: <82d118bccd4a795dc9c64a2fe74032d3ae43dba6.1601574234.git.naohiro.aota@wdc.com> 
X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This is a preparation for the next patch. This commit splits alloc_log_tree() into a part that allocates the tree structure (which remains in alloc_log_tree()) and a part that allocates the tree node (moved to btrfs_alloc_log_tree_node()). The latter is also exported to be used in the next patch.

Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 27 ++++++++++++++++++++++++--- fs/btrfs/disk-io.h | 2 ++ 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5ce5b18f9dc4..02b1f9b20bed 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1211,7 +1211,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *root; - struct extent_buffer *leaf; root = btrfs_alloc_root(fs_info, BTRFS_TREE_LOG_OBJECTID, GFP_NOFS); if (!root) @@ -1221,6 +1220,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, root->root_key.type = BTRFS_ROOT_ITEM_KEY; root->root_key.offset = BTRFS_TREE_LOG_OBJECTID; + return root; +} + +int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans, + struct btrfs_root *root) +{ + struct extent_buffer *leaf; + /* * DON'T set SHAREABLE bit for log trees. 
* @@ -1235,24 +1242,31 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, NULL, 0, 0, 0, BTRFS_NESTING_NORMAL); if (IS_ERR(leaf)) { btrfs_put_root(root); - return ERR_CAST(leaf); + return PTR_ERR(leaf); } root->node = leaf; btrfs_mark_buffer_dirty(root->node); btrfs_tree_unlock(root->node); - return root; + + return 0; } int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *log_root; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + kfree(log_root); + return ret; + } WARN_ON(fs_info->log_root_tree); fs_info->log_root_tree = log_root; return 0; @@ -1264,11 +1278,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_root *log_root; struct btrfs_inode_item *inode_item; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + kfree(log_root); + return ret; + } + log_root->last_trans = trans->transid; log_root->root_key.offset = root->root_key.objectid; diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index fee69ced58b4..b82ae3711c42 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -115,6 +115,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, extent_submit_bio_start_t *submit_bio_start); blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio, int mirror_num); +int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, From patchwork Thu Oct 1 18:36:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 
7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812339 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1D73B139F for ; Thu, 1 Oct 2020 18:39:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E8858207DE for ; Thu, 1 Oct 2020 18:39:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="N2Ra50fl" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733114AbgJASjz (ORCPT ); Thu, 1 Oct 2020 14:39:55 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24728 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733108AbgJASjJ (ORCPT ); Thu, 1 Oct 2020 14:39:09 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577548; x=1633113548; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=78tY7kw55+iG+sd683/R9VBKR5DRB7JfI7n0ke6usVo=; b=N2Ra50flcaTU5nHm9NOL07DjGbxu8me47JvswOHFQWpGm32y+Rv3TJbb yGXiWy9E3apA6yhU9y9EvjIINsZyhfzjPAj2Du+uyFAFRwChP+ZzkUFB9 p5S7iQprw9UsFV+diAr4WivDKP6R379ANWgv0yGWU8LJVN3J8RC13lRrk /vdw92i7vElcvTzRQAeKDikhTq69ilnzFJuIdVuMe80j8zH8NZ+tGURki /caFjZq72AErR3FtPKfh5eGgxSyQj+BHTe79qKykJTcSaQZm4ttRxKgk+ RBiOeb7MbcA2st2AiAuiE3anHdvQzDTp/W/a7A83sPGdzVWI6/43fT1Es Q==; IronPort-SDR: 0NABLiUZiYHvzpxebPlz5qi6ivUcfffGfCJbC4744mhI7rSMMOvYYv8z2VdkCCRTkHKSTvpbqL Wq4qKawU+frgtZ9l1rf5TnhJMvrvpCE31N1QEEinG8KCIP/Sz2ZXcZLd5KocAAAQOrJAC8Q6eu zT65TV3NJmh1NHuF3pfM3HyWT4oR76ddPU+s/iR0nhza0v2fntgLRM2XqFar2MqLNhFRGoIwzf hlU2BCyG9BCpohuMud0AxzGO6I+iYyNRQRAVRK45QIn6eQExbZpZOzZpnxTIwQbZw3i5yPnWAS tU0= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036847" Received: from 
uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:46 +0800 IronPort-SDR: rVGkRFHhsgPMNBCAbLX7zflAW7R+wAruiDr7f3OlzOice8N5Gahz06Q0nGpY0bQJIFTzZaEwKR 2LXWynZvT3nQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:43 -0700 IronPort-SDR: unafe/Ju30hVclqLalD9HwpuHffd0H/9/JXOmj0yppkNycTBj1F1ZkMEOHe6fPdGkt52UkapWE eKBf3T/1y44A== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:45 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group Date: Fri, 2 Oct 2020 03:36:45 +0900 Message-Id: <17f8b62a6fc896598378ecf88bdab5f6b3d3b9cc.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This is patch 1/3 to enable tree-log in ZONED mode.

The tree-log feature does not work in ZONED mode as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks, and btrfs writes and syncs the tree-log blocks to devices at fsync() time, which differs from the timing of a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes, which ZONED mode must avoid.

We can introduce a dedicated block group for tree-log blocks so that tree-log blocks and other metadata blocks form separate write streams. As a result, each write stream can be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns "treelog_bg" on demand at tree-log block allocation time.
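
The separation rule described above can be sketched as a small stand-alone C model. This is only an illustration under stated assumptions: `struct fs_model`, `struct bg_model`, `skip_block_group()` and `pick_block_group()` are hypothetical names that do not exist in btrfs, locking and the free-space check are left out, and a `treelog_bg` of 0 is taken to mean that no dedicated block group has been assigned yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the few fields this sketch needs. */
struct fs_model {
	uint64_t treelog_bg;	/* start of the dedicated group, 0 if unset */
};

struct bg_model {
	uint64_t start;
	uint64_t used;
	uint64_t reserved;
};

/* True when this block group must not serve the request at all. */
static bool skip_block_group(uint64_t treelog_bg, uint64_t bg_start,
			     bool for_treelog)
{
	if (treelog_bg == 0)
		return false;			/* nothing is dedicated yet */
	if (for_treelog)
		return bg_start != treelog_bg;	/* log blocks use only the log group */
	return bg_start == treelog_bg;		/* other blocks stay out of it */
}

/* Returns 0 if the group may be used, 1 if the caller should try another. */
static int pick_block_group(struct fs_model *fs, const struct bg_model *bg,
			    bool for_treelog)
{
	if (skip_block_group(fs->treelog_bg, bg->start, for_treelog))
		return 1;

	/* Invariant once the skip check has passed. */
	assert(!for_treelog || fs->treelog_bg == 0 ||
	       fs->treelog_bg == bg->start);

	/* A group that already holds data cannot become the log group. */
	if (for_treelog && fs->treelog_bg == 0 && (bg->used || bg->reserved))
		return 1;

	/* First tree-log allocation claims an empty group on demand. */
	if (for_treelog && fs->treelog_bg == 0)
		fs->treelog_bg = bg->start;

	return 0;
}
```

In this model, a tree-log request skips a group that is already in use, claims the first empty group it is offered, and from then on regular metadata requests are turned away from that group.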
This commit extends the zoned block allocator to use the block group. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 7 +++++ fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 65 ++++++++++++++++++++++++++++++++++++++---- 3 files changed, 69 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index d39fa80d3d90..64253d4a7bfc 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -939,6 +939,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, btrfs_return_cluster_to_free_space(block_group, cluster); spin_unlock(&cluster->refill_lock); + if (btrfs_fs_incompat(fs_info, ZONED)) { + spin_lock(&fs_info->treelog_bg_lock); + if (fs_info->treelog_bg == block_group->start) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); + } + path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d021bc4a92cd..81e2f5b78917 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -954,6 +954,8 @@ struct btrfs_fs_info { unsigned long exclusive_operation; struct mutex zoned_meta_io_lock; + spinlock_t treelog_bg_lock; + u64 treelog_bg; #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f44faaf7aca2..c4d382e5b45f 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3656,6 +3656,9 @@ struct find_free_extent_ctl { /* Allocation policy */ enum btrfs_extent_allocation_policy policy; + + /* Allocation is called for tree-log */ + bool for_treelog; }; @@ -3856,23 +3859,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, struct find_free_extent_ctl *ffe_ctl, struct btrfs_block_group **bg_ret) { + struct btrfs_fs_info *fs_info = block_group->fs_info; struct btrfs_space_info *space_info = block_group->space_info; struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; u64 start = block_group->start; u64 num_bytes = 
ffe_ctl->num_bytes; u64 avail; + u64 bytenr = block_group->start; + u64 log_bytenr; int ret = 0; + bool skip; ASSERT(btrfs_fs_incompat(block_group->fs_info, ZONED)); + /* + * Do not allow non-tree-log blocks in the dedicated tree-log block + * group, and vice versa. + */ + spin_lock(&fs_info->treelog_bg_lock); + log_bytenr = fs_info->treelog_bg; + skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) || + (!ffe_ctl->for_treelog && bytenr == log_bytenr)); + spin_unlock(&fs_info->treelog_bg_lock); + if (skip) + return 1; + spin_lock(&space_info->lock); spin_lock(&block_group->lock); + spin_lock(&fs_info->treelog_bg_lock); + + ASSERT(!ffe_ctl->for_treelog || + block_group->start == fs_info->treelog_bg || + fs_info->treelog_bg == 0); if (block_group->ro) { ret = 1; goto out; } + /* + * Do not allow currently using block group to be tree-log dedicated + * block group. + */ + if (ffe_ctl->for_treelog && !fs_info->treelog_bg && + (block_group->used || block_group->reserved)) { + ret = 1; + goto out; + } + avail = block_group->length - block_group->alloc_offset; if (avail < num_bytes) { ffe_ctl->max_extent_size = avail; @@ -3880,6 +3914,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, goto out; } + if (ffe_ctl->for_treelog && !fs_info->treelog_bg) + fs_info->treelog_bg = block_group->start; + ffe_ctl->found_offset = start + block_group->alloc_offset; block_group->alloc_offset += num_bytes; spin_lock(&ctl->tree_lock); @@ -3887,10 +3924,13 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, spin_unlock(&ctl->tree_lock); ASSERT(IS_ALIGNED(ffe_ctl->found_offset, - block_group->fs_info->stripesize)); + fs_info->stripesize)); ffe_ctl->search_start = ffe_ctl->found_offset; out: + if (ret && ffe_ctl->for_treelog) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); spin_unlock(&block_group->lock); spin_unlock(&space_info->lock); return ret; @@ -4140,7 +4180,12 @@ static int prepare_allocation(struct 
btrfs_fs_info *fs_info, return prepare_allocation_clustered(fs_info, ffe_ctl, space_info, ins); case BTRFS_EXTENT_ALLOC_ZONED: - /* nothing to do */ + if (ffe_ctl->for_treelog) { + spin_lock(&fs_info->treelog_bg_lock); + if (fs_info->treelog_bg) + ffe_ctl->hint_byte = fs_info->treelog_bg; + spin_unlock(&fs_info->treelog_bg_lock); + } return 0; default: BUG(); @@ -4184,6 +4229,7 @@ static noinline int find_free_extent(struct btrfs_root *root, struct find_free_extent_ctl ffe_ctl = {0}; struct btrfs_space_info *space_info; bool full_search = false; + bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID; WARN_ON(num_bytes < fs_info->sectorsize); @@ -4197,6 +4243,7 @@ static noinline int find_free_extent(struct btrfs_root *root, ffe_ctl.orig_have_caching_bg = false; ffe_ctl.found_offset = 0; ffe_ctl.hint_byte = hint_byte_orig; + ffe_ctl.for_treelog = for_treelog; ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED; /* For clustered allocation */ @@ -4271,8 +4318,15 @@ static noinline int find_free_extent(struct btrfs_root *root, struct btrfs_block_group *bg_ret; /* If the block group is read-only, we can skip it entirely. 
*/ - if (unlikely(block_group->ro)) + if (unlikely(block_group->ro)) { + if (btrfs_fs_incompat(fs_info, ZONED) && for_treelog) { + spin_lock(&fs_info->treelog_bg_lock); + if (block_group->start == fs_info->treelog_bg) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); + } continue; + } btrfs_grab_block_group(block_group, delalloc); ffe_ctl.search_start = block_group->start; @@ -4460,6 +4514,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, bool final_tried = num_bytes == min_alloc_size; u64 flags; int ret; + bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID; flags = get_alloc_profile_by_root(root, is_data); again: @@ -4483,8 +4538,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, sinfo = btrfs_find_space_info(fs_info, flags); btrfs_err(fs_info, - "allocation failed flags %llu, wanted %llu", - flags, num_bytes); + "allocation failed flags %llu, wanted %llu treelog %d", + flags, num_bytes, for_treelog); if (sinfo) btrfs_dump_space_info(fs_info, sinfo, num_bytes, 1); From patchwork Thu Oct 1 18:36:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812289 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B17331668 for ; Thu, 1 Oct 2020 18:39:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 922CD20796 for ; Thu, 1 Oct 2020 18:39:17 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="SQtRYAAl" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733143AbgJASjQ (ORCPT ); Thu, 1 Oct 2020 14:39:16 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24779 "EHLO 
esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733109AbgJASjI (ORCPT ); Thu, 1 Oct 2020 14:39:08 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577548; x=1633113548; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=n831lhAdxLYEN4MUkMzrkg2IF6/VSOBYsrPc+3Ofl2w=; b=SQtRYAAlLTdRyYKKLAdAz9a6bL40SaKYkzoMx0ZzysV+eisY1/9asBR/ MNMp0yPWdV4aaTAYoAb6FzsHfd+bdTFqnowNKuaV7VC4E2NS5+Tjxja3B q9FeZbRutdcSsxb8odoQxxl/ZrlGnwDPS0x4VtlmFyDWnz6sh0dtqLCmE yT25Z4/j79RFtbG7Oct5CeU3kaRYMsQpf6MrkepVeXMUiezw0jC9rGyC7 THD7/5U6L+2H9MN2u07pcAaiDwUPDF6ryP30soEfS1zbZWPe7lRtiR2DC nunDqEp4tHZWH4Idw7pzculkY+ogLiK7L8bgshJH3Xkp6o8hHjPslD4GR A==; IronPort-SDR: jRC5orpsQIkc5RJvdN7gNXfFtk99RIkm/ZCyek9PBSeNykQB0+jt44Uy1Wg8Aeq4jaMWynxgDP k6rotnOal9yVdsrfNynTq6rqacfjErfULoTROPgq5POBfD2UWNz12uNjpRbAlaHgsDJBF0gMj4 +jpilcn3z8pDRNdgH7veEY/CoZdfMpkhTfDYZqhCK23mkfmKzYkSh1LUpDrnv/85b7evQg0f0J 5iyFlyxF+YyZM/M1/R6l/DTF2fMWNdNSEEodEUafYzrxP95qYOwFQ+IVTnMi2JX0zV9hPkTafB icI= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036849" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:47 +0800 IronPort-SDR: biFGsTBWqGD1HYdcpfYnZjBQqHbWu7WllNzgtlEj4F0uSJKxV4jh6Gk3KZRgM4ET75DqeJNgmN rLtHsop2lE2g== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:44 -0700 IronPort-SDR: UB/1dVSEOc4P1WzFi9fMDpKB0fmiG7rCUPeP1Fo0vO6uPxFroJ1DYaGBKPoPuGK6DM7VgVcFAT Wfayqlk9tskg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:46 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 39/41] btrfs: 
serialize log transaction on ZONED mode Date: Fri, 2 Oct 2020 03:36:46 +0900 Message-Id: <2d84e29552ae6771c987992ec59bd11f3178532f.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This is patch 2/3 to enable tree-log in ZONED mode.

Since more than one log transaction per subvolume can be started simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at log transaction commit time. The nodes of the global log root tree (fs_info->log_root_tree) have the same mixed allocation problem.

This patch serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when another subvolume has already allocated the global log root tree.
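
The slot arithmetic behind the wait can be illustrated as follows. This is a minimal single-threaded sketch, assuming a hypothetical `struct log_model` that stands in for the `log_transid` and `log_commit[2]` fields of a root. The helper names are made up for illustration, and the real code sleeps in wait_log_commit() under the log mutex instead of returning a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-root log state used by the check. */
struct log_model {
	int log_transid;	/* id of the current log transaction */
	int log_commit[2];	/* non-zero while that slot is committing */
};

/*
 * Slot of the previous log transaction. For log_transid >= 1,
 * (log_transid + 1) % 2 equals (log_transid - 1) % 2, so this is the
 * slot that transaction "log_transid - 1" marks while it commits.
 */
static int prev_commit_slot(const struct log_model *log)
{
	assert(log->log_transid >= 0);
	return (log->log_transid + 1) % 2;
}

/* On a zoned file system, a starter must wait while that slot is busy. */
static bool must_wait_for_prev_commit(const struct log_model *log, bool zoned)
{
	return zoned && log->log_commit[prev_commit_slot(log)] != 0;
}
```

With `log_transid` equal to 5, the previous transaction is 4 and its slot is 0. A zoned file system waits while `log_commit[0]` is set, whereas a regular one proceeds immediately.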
Signed-off-by: Naohiro Aota --- fs/btrfs/tree-log.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5f585cf57383..42175b8d1bee 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -106,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, u64 dirid, int del_all); +static void wait_log_commit(struct btrfs_root *root, int transid); /* * tree logging is a special write ahead log used to make sure that @@ -140,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans, struct btrfs_log_ctx *ctx) { struct btrfs_fs_info *fs_info = root->fs_info; + bool zoned = btrfs_fs_incompat(fs_info, ZONED); int ret = 0; mutex_lock(&root->log_mutex); +again: if (root->log_root) { + int index = (root->log_transid + 1) % 2; + if (btrfs_need_log_full_commit(trans)) { ret = -EAGAIN; goto out; } + if (zoned && atomic_read(&root->log_commit[index])) { + wait_log_commit(root, root->log_transid - 1); + goto again; + } + if (!root->log_start_pid) { clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state); root->log_start_pid = current->pid; @@ -158,8 +168,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans, } } else { mutex_lock(&fs_info->tree_log_mutex); - if (!fs_info->log_root_tree) + if (zoned && fs_info->log_root_tree) { + ret = -EAGAIN; + mutex_unlock(&fs_info->tree_log_mutex); + goto out; + } else if (!fs_info->log_root_tree) { ret = btrfs_init_log_root_tree(trans, fs_info); + } mutex_unlock(&fs_info->tree_log_mutex); if (ret) goto out; @@ -193,14 +208,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans, */ static int join_running_log_trans(struct btrfs_root *root) { + bool zoned = btrfs_fs_incompat(root->fs_info, ZONED); int ret = -ENOENT; if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state)) return ret; mutex_lock(&root->log_mutex); +again: if (root->log_root) { + int 
index = (root->log_transid + 1) % 2; + ret = 0; + if (zoned && atomic_read(&root->log_commit[index])) { + wait_log_commit(root, root->log_transid - 1); + goto again; + } atomic_inc(&root->log_writers); } mutex_unlock(&root->log_mutex); From patchwork Thu Oct 1 18:36:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11812343 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D097B112E for ; Thu, 1 Oct 2020 18:39:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B151021481 for ; Thu, 1 Oct 2020 18:39:59 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="JtmizhqL" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733202AbgJASjz (ORCPT ); Thu, 1 Oct 2020 14:39:55 -0400 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:24722 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733116AbgJASjJ (ORCPT ); Thu, 1 Oct 2020 14:39:09 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1601577549; x=1633113549; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=OmAi2+NSsYKSgRbWdebZeHR8g9vYhF9/qIqK9TbTi2U=; b=JtmizhqLGtq7uThUo27tKFVUimZt9Lcm6MMLHewWvD+KcefxlsBK3UfG +7e3v8nJpgltSkXDLiaQC0cdTG1runN1kiM2I5UU2uHHO165dBXWrB3A7 aUf4qmYWsMvfTBfRW3Yhz7SjL9exqSZQ6oIHcSisHYmQcil7JfWhNG+X0 FbtwBpWSQQW5ekmZqOm8vEylVobA0nCCbAKFE6BatAg1RQ8uzdHLKFbRQ rc75diE5ssYutYtw6tiWKPyWWO5ZFi+uP18ulFUWoeWQLgWaFk+WcCzQL BMFtSWwEokftjeaiWmmLLsbqEH62viqn9ZjE8YtUzZfo705cQW2S4cRO3 g==; IronPort-SDR: 
zL9L6vN5LfZpQCgVwxb2W7efF2u2TexPl/36f3pJBaLNTetLn6xPKcJXRwl5XDBJtJKhBOPkL1 mKNF6RyTjFtfFnaZWc7DzQ6/clkdNdWw4VqL8SOYvCFEKnWNfdeHYa51DCZUs0DlYeWs4AKC5P 9UQLUewN6EJt/9++ry5XlFfJek/lWP5C4rTFDHCXYP6ufXKq4EfWm4i4vGb1Q85q7sjmYvB3M8 e6cCzcyv4iLzLFTZA7vifjQSVcLk5W3u9dDH0ARFW3k6BImHhvBwnoxfPi3IaxdYqgGEKh3/7H ZMs= X-IronPort-AV: E=Sophos;i="5.77,324,1596470400"; d="scan'208";a="150036853" Received: from uls-op-cesaip02.wdc.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 02 Oct 2020 02:38:48 +0800 IronPort-SDR: ayaJdEs4eVd5VyhXX0FD+hVTmeGOL9vZbfniUBy25ui4m5FO3ZN+BePbXr78uG4ifum+6NPtV/ HfWt+qAOm8BA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Oct 2020 11:24:45 -0700 IronPort-SDR: 8ptyzsGHOHhy1CtrmWpO/XObMg1y6Q6WyYQcobQfmlYsJDPnrJFJzcXlH/Wg7RXj41W6HhHJzD OvYExObw8mQg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 01 Oct 2020 11:38:48 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v8 40/41] btrfs: reorder log node allocation Date: Fri, 2 Oct 2020 03:36:47 +0900 Message-Id: <2999cb6cad58c822b1fa9d642605134d6a65f318.1601574234.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This is patch 3/3 to enable tree-log in ZONED mode.

The nodes of "fs_info->log_root_tree" and the nodes of "root->log_root" are allocated in a different order from the order in which they are written, so writing them causes unaligned write errors.

This patch reorders the allocation by delaying allocation of the root node of "fs_info->log_root_tree", so that the node buffers can go out sequentially to devices.
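
The effect of delaying the allocation can be modelled with a toy write pointer. This is a hedged sketch, not btrfs code: `struct zone_model`, `struct log_root_model`, `alloc_node()` and `ensure_node()` are hypothetical names, and the only assumption taken from the description above is that a zone hands out addresses strictly in allocation order.

```c
#include <assert.h>
#include <stdint.h>

/* Toy zone: every allocation lands at the current write pointer. */
struct zone_model {
	uint64_t write_pointer;
};

/* Toy log root: remembers whether its root node has an address yet. */
struct log_root_model {
	uint64_t node;
	int has_node;
};

static uint64_t alloc_node(struct zone_model *zone, uint64_t nodesize)
{
	uint64_t addr = zone->write_pointer;

	assert(nodesize > 0);
	zone->write_pointer += nodesize;
	return addr;
}

/* Allocate the root node only when it is first needed. */
static uint64_t ensure_node(struct zone_model *zone,
			    struct log_root_model *root, uint64_t nodesize)
{
	if (!root->has_node) {
		root->node = alloc_node(zone, nodesize);
		root->has_node = 1;
	}
	return root->node;
}
```

If the subvolume log node is allocated first and the node of the log root tree is allocated only when the log is synced, the addresses come out in the same order in which the two buffers are later written.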
Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 6 ------ fs/btrfs/tree-log.c | 19 +++++++++++++------ 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 02b1f9b20bed..0c041ad096ac 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1257,16 +1257,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *log_root; - int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); - ret = btrfs_alloc_log_tree_node(trans, log_root); - if (ret) { - kfree(log_root); - return ret; - } WARN_ON(fs_info->log_root_tree); fs_info->log_root_tree = log_root; return 0; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 42175b8d1bee..7a4bfedb2929 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3143,6 +3143,11 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]); root_log_ctx.log_transid = log_root_tree->log_transid; + mutex_lock(&fs_info->tree_log_mutex); + if (!log_root_tree->node) + btrfs_alloc_log_tree_node(trans, log_root_tree); + mutex_unlock(&fs_info->tree_log_mutex); + /* * Now we are safe to update the log_root_tree because we're under the * log_mutex, and we're a current writer so we're holding the commit @@ -3292,12 +3297,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans, .process_func = process_one_buffer }; - ret = walk_log_tree(trans, log, &wc); - if (ret) { - if (trans) - btrfs_abort_transaction(trans, ret); - else - btrfs_handle_fs_error(log->fs_info, ret, NULL); + if (log->node) { + ret = walk_log_tree(trans, log, &wc); + if (ret) { + if (trans) + btrfs_abort_transaction(trans, ret); + else + btrfs_handle_fs_error(log->fs_info, ret, NULL); + } } clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, From patchwork Thu Oct 1 18:36:48 2020 Content-Type: text/plain; charset="utf-8" 
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11812349
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik
Subject: [PATCH v8 41/41] btrfs: enable to mount ZONED incompat flag
Date: Fri, 2 Oct 2020 03:36:48 +0900
Message-Id: <95a1e2f5ca2cc551b946b91b13804ec1bd093675.1601574234.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To:
References:
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP, which enables btrfs to mount a file system
that has the ZONED flag set.

Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 81e2f5b78917..f5b78ae3baff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -302,7 +302,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_ZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
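
Note on the effect of the mask change above: a file system is only mountable when every incompat bit set in its superblock is also present in the supported mask, so adding the ZONED bit to BTRFS_FEATURE_INCOMPAT_SUPP is what lets a ZONED file system through. The following is a minimal user-space sketch of that check, not the kernel code. The EX_* names, the bit positions, and ex_can_mount() are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; NOT the kernel's definitions. */
#define EX_INCOMPAT_RAID1C34	(1ULL << 0)
#define EX_INCOMPAT_ZONED	(1ULL << 1)
#define EX_INCOMPAT_FUTURE	(1ULL << 2)	/* a feature this code does not know */

/* Supported masks before and after the ZONED bit is added. */
#define EX_INCOMPAT_SUPP_OLD	(EX_INCOMPAT_RAID1C34)
#define EX_INCOMPAT_SUPP_NEW	(EX_INCOMPAT_RAID1C34 | EX_INCOMPAT_ZONED)

/*
 * Return true when every incompat bit recorded in the superblock is
 * covered by the supported mask, i.e. no unknown feature is in use.
 */
static bool ex_can_mount(uint64_t sb_incompat, uint64_t supported)
{
	return (sb_incompat & ~supported) == 0;
}
```

With the old mask, ex_can_mount(EX_INCOMPAT_ZONED, EX_INCOMPAT_SUPP_OLD) is false; with the new mask the same superblock flags are accepted, while a bit outside the mask (EX_INCOMPAT_FUTURE here) is still rejected.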